# [Monte Carlo Methods and Applications](../index.html)

### CMU 21-387 | 15-327 | 15-627 | 15-860 FALL 2023

## Midterm (Fall 2023)

### Instructions

Your take-home midterm comprises the five problems below (four written, one coding). Each problem is worth the same amount (20%). The midterm must be handed in no later than **4:00pm EDT on Monday, October 9th.** Please note that _this deadline is different from the usual homework due date!_ You can hand in your exam solutions the same way you hand in regular assignments (via Gradescope). Unlike assignments, **the midterm is NOT collaborative**. You must complete these problems entirely on your own, without discussing them with your classmates.

Important note about the solutions: The solutions below are not the only possible solutions. They are also far more detailed, and far more complete, than we expected you to write for the exam. Do not worry if your solutions do not look like the reference! The reason for spelling these answers out in such great detail is (i) to make sure folks really understand the solutions, and (ii) to highlight the fact that real-world implementation of numerical methods is almost never as “cut and dried” as the pristine textbook exercises you're used to from your intro classes. Algorithms involving estimation, statistics, and floating-point numbers can inject a lot of complexity and uncertainty into the picture—but if you want to build systems that make useful predictions in the real world, you have to learn to navigate these messier problems.

That being said, we are not going to grade you to this same exacting standard: as long as you demonstrated clear thinking about the questions, and arrived at some reasonable (if not 100% perfect) answer, we will give you full credit.

### Problem 1: Checking for Bias and Consistency

Suppose you're given a black-box function `estimate(N)` which computes an estimate of the integral _I_ of a function _f_ with finite variance, using `N` samples. However, you do not know ahead of time whether this black box estimator is unbiased or even consistent. Sketch a strategy for checking, empirically, whether the `estimate` function satisfies these two properties. In other words, for each property describe an algorithm to check whether it holds to a reasonable degree of certainty (so, two algorithms total). In particular, you should be able to get more and more confidence about the answers by spending more and more compute time. **Important:** _you may assume that you know the true value of I_. You can also assume that, if the black-box function uses random sampling, it uses a different random seed upon each invocation. **Note:** your algorithm can be described in words and/or in pseudocode, but it _must be detailed enough that someone could implement it_ without knowing anything about Monte Carlo, probability, etc. In other words, the algorithm you describe should consist of only basic low-level programming constructs such as loops, arithmetic, etc. Answers that give high-level descriptions like _“…then compute the expected value…”_ will receive no points.

Consider the error of an \(N\)-sample estimator \(\widehat{I}_N\), given by the random variable \(e_N := \widehat{I}_N - I\), and the corresponding bias, given by the expected value of the error, \(b_N := \mathbb{E}[\widehat{I}_N] - I\). An estimator is **consistent** if the error \(e_N\) goes to zero (say, in probability) as \(N \to \infty\), and **unbiased** if the bias \(b_N\) is zero for every \(N\).

In other words, consistent means that we approach the answer as we take more samples, whereas unbiased means that for every fixed \(N\)—including just a single sample \(N=1\)—the estimate is correct _on average_: it has no systematic tendency to over- or underestimate the true value.

**Consistent.** To check if the function is consistent, we can simply run `estimate(N)` for increasingly large values of `N`, and see whether it appears to be approaching the (known) true value of the integral \(I\). For instance, we might express our test in pseudocode as follows:

```
testConsistent(M):                # use bigger M for a more reliable check
   for k in range( 0, M ):        # loop over powers of 10: N = 1, 10, ..., 10^(M-1)
      Ihat[k] = estimate(10^k)    # compute an estimate using 10^k samples
   LogLogPlot( abs(Ihat-I)/I )    # plot the relative error
```
If we see that the (log) relative error approaches zero at a predictable rate (e.g., the graph has some constant negative slope), then we can be quite certain that the estimator is consistent—more so as we increase M (and in turn, spend more computation). Of course, there are other ways to quantify and inspect convergence—a log-log plot is just one of many possibilities. Note that in our solution we run the estimator only for powers of 10, because running it for every value of N between 1 and \(10^{M-1}\) really doesn't provide much more information about whether the function is converging or not (just a lot of fine-grained detail that we don't actually need). Of course, this question wasn't about efficiency, and if you considered all the individual values, that's fine too.
Better than stopping at some arbitrary maximum number of samples might be to see if we can compute the true integral up to, say, the full precision of a 32- or 64-bit floating point number. Of course, since we don't have access to the function itself, we can never be 100% certain! But a positive result for this test is a very reliable indicator that all is well.
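For concreteness, here is one way the consistency check might look as runnable Python—a minimal sketch, not required for the exam. It assumes `estimate` is available as a callable and that the true value `I_true` is known, and uses NumPy and Matplotlib for the log-log plot.

```python
import numpy as np
import matplotlib.pyplot as plt

def test_consistent(estimate, I_true, M=7):
    """Run estimate(N) for N = 1, 10, ..., 10^(M-1) and plot the relative error."""
    Ns = [10**k for k in range(M)]
    errs = [abs(estimate(N) - I_true) / abs(I_true) for N in Ns]
    plt.loglog(Ns, errs, 'o-')
    plt.xlabel('number of samples N')
    plt.ylabel('relative error')
    plt.show()
```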

**Unbiased.** A reasonable procedure for checking for bias is to compute \(b_N = \mathbb{E}[\widehat{I}_N] - I\) using Monte Carlo estimation, i.e., by calling `estimate(N)` \(k\) times and checking that the resulting estimate of \(b_N\) becomes small as \(k\) becomes large (for many values of \(N\)).

```
testBiased(maxSamples, maxN):        # use bigger maxSamples, maxN for a more reliable check
  for N in range( 1, maxN+1 ):       # check whether b_N = 0 for many N
    # estimate b_N via Monte Carlo, using an increasing number of samples k
    sum_I_N[0] = 0
    for k in range( 1, maxSamples+1 ):
      sum_I_N[k] = sum_I_N[k-1] + estimate(N)
      b_N[k] = sum_I_N[k]/k - I      # k-sample estimate of b_N
    LogLogPlot( abs(b_N)/I )         # plot the relative error
    # verify that all log-log plots vanish as k becomes large
```
As in our consistency check, the log-log plots will indicate whether our numerical estimate of \(\mathbb{E}[\widehat{I}_N]\) is approaching the true integral \(I\) for each \(N\) in our test set. As before, a log-log plot is just one of many ways to inspect the behavior of the estimator; other ways of quantifying this behavior are perfectly fine.
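Again purely for illustration, a runnable version of this bias check might look like the following sketch (assuming, as above, a callable `estimate` and a known true value `I_true`; the loop bounds are arbitrary).

```python
import numpy as np
import matplotlib.pyplot as plt

def test_unbiased(estimate, I_true, max_N=8, max_samples=10000):
    """For each fixed N, average many independent calls to estimate(N) and
    check whether the running mean approaches the true value I_true."""
    ks = np.arange(1, max_samples + 1)
    for N in range(1, max_N + 1):
        runs = np.array([estimate(N) for _ in range(max_samples)])
        running_mean = np.cumsum(runs) / ks          # k-sample estimate of E[I_N]
        plt.loglog(ks, np.abs(running_mean - I_true) / abs(I_true), label=f'N={N}')
    plt.xlabel('number of repetitions k')
    plt.ylabel('relative bias estimate')
    plt.legend()
    plt.show()
```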
### Problem 2: Constantly Thinking About Markov Processes

Let \(X_0, X_1, \dots\) be an irreducible, time-homogeneous Markov chain on a finite state space \(\mathcal X\). Suppose \(f \colon \mathcal X \to \mathbb{R}\) is a function such that for all \(x \in \mathcal X\) we have \[ f(x) = \mathbb{E} ( f(X_1) ~|~ X_0 = x )\,. \] Must \(f\) be a constant function? If yes, prove it. If no, find an example of a non-constant function \(f\) and an irreducible Markov chain that satisfies the above property. **Note:** Here \(\mathbb{E} ( f(X_1) ~|~ X_0 = x ) = \sum_{y \in \mathcal X} P(x, y) f(y)\), where \(P\) is the transition matrix of the Markov chain.
Using the Markov property and time homogeneity we have \begin{align} \mathbb{E}( f(X_{n+1}) ~|~ X_0 = x ) &= \sum_{y \in \mathcal X} \mathbb{E}( f(X_{n+1}) ~|~ X_n = y ) \mathbb{P}( X_n = y ~|~ X_0 = x ) \\ &= \sum_{y \in \mathcal X} \mathbb{E}( f(X_{1}) ~|~ X_0 = y ) \mathbb{P}( X_n = y ~|~ X_0 = x ) \\ &= \sum_{y \in \mathcal X} f(y) \mathbb{P}( X_n = y ~|~ X_0 = x ) = \mathbb{E}( f(X_n) ~|~ X_0 = x )\,. \end{align} Thus, by induction, we must have \begin{equation} f(x) = \mathbb{E} ( f(X_n) ~|~ X_0 = x )\,, \end{equation} for every \(n \in \mathbb{N}\) and \(x \in \mathcal X\).

Let \(x_0 \in \mathcal X\) be a point where \(f\) attains its maximum (which exists because \(\mathcal X\) is finite), and let \(y \in \mathcal X\). By irreducibility, there exists \(n \in \mathbb{N}\) such that \(P^n(x_0, y) \gt 0\). If for this \(y\) we had \(f(y) \lt f(x_0)\), then (since \(f(y') \leq f(x_0)\) for every \(y'\)) we would have \begin{equation} f(x_0) = \sum_{y' \in \mathcal X} P^n(x_0, y') f(y') \lt \sum_{y' \in \mathcal X} P^n(x_0, y') f(x_0) = f(x_0)\,, \end{equation} which is impossible. Thus we must have \(f(y) = f(x_0)\), showing that \(f\) is constant.

**Alternate Solution:** The function \(f\) is a solution to the matrix equation \(Pf = f\), which is equivalent to saying \(f \in \operatorname{ker}(P - I)\), where \(I\) is the identity matrix. Thus, in order to show that the constant functions are the only ones with the desired property, it is enough to show that \( \dim \operatorname{ker}(P - I) = 1 \). Since the row rank and column rank of a matrix are equal, this is equivalent to showing that \( \dim \operatorname{ker}(P^T - I) = 1 \). If \(g \in \operatorname{ker}(P^T - I) \), then we certainly have \begin{equation} g^T P = g^T \,. \end{equation} Note that if \(g(x) \geq 0\) for all \(x\), and \(\sum g(x) = 1\), then \(\mu = g^T\) is a stationary distribution for \(X\). Since \(X\) is irreducible, the stationary distribution is unique and so there is only one such \(g\); call it \(\mu = g^T\). Of course, there may be functions \(g\) which are not necessarily non-negative, or whose values don't sum to 1, that still satisfy \(g^T P = g^T\). **In order to get full credit using this method, you need to realize this and show that any such function is a scalar multiple of \(\mu\).** We will do this as follows: We've shown before that \(\mu(x) \gt 0\) for all \(x \in \mathcal X\), and so we can choose \( \lambda = \max\limits_{x \in \mathcal X} -g(x) / \mu(x) \). The point of this choice is to guarantee \[ g(x) + \lambda \mu(x) \geq 0 \] for every \(x \in \mathcal X\). Thus if we set \[ \nu(x) = \frac{1}{Z} (g(x) + \lambda \mu(x)) \quad\text{where } Z = \sum_{y \in \mathcal X} (g(y) + \lambda \mu(y)) = \sum_{y \in \mathcal X} g(y) + \lambda \] (if \(Z = 0\) then \(g = -\lambda\mu\) and we are already done), we see that \(\nu\) is a stationary distribution for \(P\). Thus, by uniqueness of the stationary distribution, \(\nu = \mu\), which in turn implies \(g = Z\mu - \lambda\mu\) is a scalar multiple of \(\mu\), showing that \(\dim \operatorname{ker}(P^T - I) = 1\) and concluding the proof.
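Though not part of the proof, it can be reassuring to check the conclusion numerically. The sketch below (our own illustration, not part of the official solution) builds a random transition matrix with strictly positive entries—hence an irreducible chain—and verifies that the eigenvalue-1 eigenspace of \(P\) is one-dimensional and spanned by the constant vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)     # normalize rows: P is a transition matrix

# Right eigenvectors f with P f = lambda f
eigvals, eigvecs = np.linalg.eig(P)
is_one = np.isclose(eigvals, 1.0)
print("dim ker(P - I) =", int(is_one.sum()))        # expect 1
f = np.real(eigvecs[:, is_one][:, 0])
print("eigenvector for eigenvalue 1:", f / f[0])    # expect all ones (a constant function)
```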
### Problem 3: Sampling in the Dark
![The five sets](images/sets.svg)
Consider the sets described below:

1. The set of all black squares on a \(10 \times 10\) checkerboard, i.e., the union of all squares with index \((i,j) \in \{1, \ldots, 10\}^2\) such that \(i+j \equiv 1 \pmod{2}\).
2. The set of configurations in an Ising model on a \(10 \times 10\) grid where exactly 50% of the states are “spin up” and 50% are “spin down.” In other words, the set of functions \(\sigma: \{1, \ldots, 10\}^2 \to \{+1,-1\}\) such that \(\sum_{i=1}^{10} \sum_{j=1}^{10} \sigma(i,j) = 0\).
3. The set of all corners of a \(10 \times 10\) grid, i.e., the set \(\{0, \ldots, 10\}^2 \subset [0,10]^2\).
4. The circle of radius 5, centered on a \(10 \times 10\) grid, i.e., the set \(\{(x,y) \in [0,10]^2 | (x-5)^2 + (y-5)^2 = 5^2\}\).
5. The union of all circular disks of radius 1/2, centered on each cell of a \(10 \times 10\) grid, i.e., the set \(\cup_{i=1}^{10} \cup_{j=1}^{10} D_{ij}\) where \(D_{ij} := \{(x,y) \in [0,10]^2 | (i-\tfrac{1}{2}-x)^2 + (j-\tfrac{1}{2}-y)^2 \leq (1/2)^2\}\).

Suppose your friend is trying to draw uniformly random samples from these sets, but has access to them only via black-box functions `inSetN(p)` (for `N = 1, ..., 5`). Each function takes as input a point `p`, returning `True` if the point is in the set and `False` otherwise. In particular, the function in item (2) takes a point \(p \in \{+1,-1\}^{10 \times 10}\) (i.e., a \(10 \times 10\) array of values \(\pm 1\)), and the functions in items (1), (3), (4), and (5) take a point \(p \in [0,10]^2\), i.e., a pair of coordinates \((x,y)\) in the \(10 \times 10\) box. Importantly, your friend _does not have access to the descriptions above_, i.e., they do not know how these functions are defined, nor what the sets look like. All they know is (i) the domain of each function, and (ii) that the function returns true if and only if the given point is inside the region (and false otherwise). Give a strategy your friend could use to generate samples from these sets, and rank the cost (using this strategy) of sampling the five different sets from “easiest” to “hardest”, possibly with ties. You can ignore the cost of evaluating `inSet` (e.g., assume it's negligible relative to the cost of sample generation), and should also assume that all calculations are performed in fixed-precision floating point.

The short answer is that your friend can use rejection sampling, and if you ignore the implications of floating-point arithmetic (and instead imagine everything is computed using real numbers), then the ranking will be (5), (1), (2), (3)=(4).

In a bit more detail: to rejection sample sets (1) and (3)–(5), your friend can, for instance, draw a random point \(p = (x,y)\) with each coordinate sampled uniformly from \([0,10]\), and check whether `inSetN(p)` is true. If so, they can return \(p\); otherwise, they try again until the test passes. Likewise, for set (2) they can fill a \(10 \times 10\) array \(p\) with random values sampled uniformly from the set \(\{+1,-1\}\). (See the sketch below.)

Recall that the efficiency of rejection sampling depends purely on what fraction of the domain is covered by the set in question. Among the subsets of \([0,10]^2\), the circular disks of set (5) cover the biggest fraction (\(\tfrac{\pi}{4}\), or about 78.5%, of each square, hence the same fraction of the entire domain), followed by the black squares of the checkerboard (which cover 50% of the domain), followed by both (3) and (4), which cover 0% of the domain: neither a set of points nor a curve covers any amount of area in a two-dimensional domain. (If you want to get technical on this point—which you didn't need to do for the midterm!—these sets both have zero Lebesgue measure, which in this case is the right notion of “size” for talking about probability.) Finally, the number of Ising model states that are half spin-up and half spin-down is equal to the number of distinct ways of choosing 50 grid cells out of the total 100 cells, i.e., \[ n_{\text{equal}} := {100 \choose 50} = \frac{100!}{50!(100-50)!}.\] Since there are \(n_{\text{configurations}} := 2^{100}\) ways of assigning either \(+1\) or \(-1\) to each of 100 grid cells, the probability that a configuration chosen uniformly at random is an equal split is therefore \(n_{\text{equal}}/n_{\text{configurations}} \approx 0.079\) (i.e., about 8%). So, the chance that rejection sampling turns up an evenly-split Ising state is much lower than the chance of picking a point on the checkerboard, or in one of the circular disks—but still larger than zero!
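Here is a minimal Python sketch of the rejection samplers described above (the function names and the `in_set` callables are placeholders for the black boxes `inSetN`), along with the acceptance probability for the evenly-split Ising set.

```python
import random
import math

def rejection_sample_box(in_set, lo=0.0, hi=10.0):
    """Propose uniform points in [lo,hi]^2 until in_set accepts one."""
    while True:
        p = (random.uniform(lo, hi), random.uniform(lo, hi))
        if in_set(p):
            return p

def rejection_sample_ising(in_set):
    """Propose uniform +/-1 configurations on a 10x10 grid until one is accepted."""
    while True:
        sigma = [[random.choice((+1, -1)) for _ in range(10)] for _ in range(10)]
        if in_set(sigma):
            return sigma

# Acceptance probability for set (2): C(100,50) / 2^100 ~ 0.079
print(math.comb(100, 50) / 2**100)
```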

The longer, and much more difficult, question is: how does this algorithm play out in the real world (rather than the pristine-yet-highly-artificial Platonic mathematical universe), where it has to be implemented using finite-precision floating-point arithmetic? Now we no longer have an “easy out” on sets (3) and (4), since there are a finite number of points on the floating-point grid with coordinates between 0 and 10, and some (but not all) of the points in sets (3) and (4) also belong to this finite grid. Hence the real-world probability of picking a point from one of these sets is nonzero (but still very, very small). How small exactly, and which one is more probable?

One useful fact to know about floating-point numbers is that they can exactly represent integers up to a certain size. In particular, standard 32- and 64-bit floats can exactly represent integers with magnitude up to \(2^{24}\) and \(2^{53}\), respectively—certainly big enough to include all 121 corners of our \(10 \times 10\) grid! To determine the probability that we pick one of these points, we also need to know how many points total from the floating-point grid are contained in the region \([0,10]^2\). There are a variety of ways we could bound this number, but we can also compute it exactly via brute force, e.g., using the nextafter function in Python (starting at 0.0, and counting up to 10.0). For a 32-bit float we get about a billion values in the range \([0,10]\) (1,092,616,193), or about one quintillion within the square \([0,10]^2 \subset \mathbb{R}^2\).

Finally, to determine whether set (3) or (4) is bigger, we need to know how many points of the floating-point grid coincide with points of the circle of radius 5 centered on (5,5). Since the grid includes all integer coordinates, we already know there are at least some such points: \((10,5)\), \((5,10)\), \((0,5)\), and \((5,0)\). But are there more than 121? Even this requires some interpretation of the question at hand: do we want to know (i) how many points on the true circle \(\{(x,y) \in \mathbb{R}^2 | \sqrt{(x-5)^2 + (y-5)^2} = 5\}\) are contained in the floating-point grid, or (ii) how many pairs of floating point numbers x,y satisfy the condition (x-5.0)\*(x-5.0) + (y-5.0)\*(y-5.0) == 5.0\*5.0 in floating-point arithmetic (taking things like rounding errors into account)? While the first question is “interesting”, the second question will tell us something about the real-life behavior we'll actually observe on our computer—where the test inSet is also implemented in floating point. Here, to find such pairs we can just “guess and check.” For instance, we might again generate floating point values x in the interval [0,10] by incrementing from zero, and compute complementary floating point values y = 5. - sqrt(10.\*x-x\*x). In the absence of rounding errors, these two coordinates should satisfy the floating-point condition (ii) given above. Indeed, if we run this test in 32-bit floating point, we quickly find far more than 121 points on the floating-point circle. Even so, the acceptance probability for set (4) remains astronomically smaller than the roughly 8% chance of landing on an evenly-split Ising state (and smaller still if we go to 64-bit precision!).
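The following NumPy sketch (again, just an illustration under the assumption of 32-bit floats) reproduces both counts: it counts the float32 values in \([0,10]\) using the fact that positive IEEE-754 floats are ordered the same way as their bit patterns, and then does the “guess and check” search for points that pass the float32 circle test.

```python
import numpy as np

# Number of float32 values in [0, 10]: the bit pattern of 0.0 is 0, so the
# count is the bit pattern of 10.0f plus one (positive floats are ordered
# like their bit patterns).
n_line = int(np.array(10.0, dtype=np.float32).view(np.uint32)) + 1
print(n_line)        # 1,092,616,193 values in [0, 10]
print(n_line ** 2)   # ~1.2e18 candidate points in [0, 10]^2

# Guess and check: pick x, solve for y on the true circle, then evaluate the
# float32 predicate (x-5)^2 + (y-5)^2 == 25 exactly.
rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 10.0, size=10_000_000).astype(np.float32)
ys = np.float32(5.0) - np.sqrt(np.maximum(np.float32(0.0), np.float32(10.0) * xs - xs * xs))
hits = (xs - 5) ** 2 + (ys - 5) ** 2 == np.float32(25.0)
print(int(hits.sum()), "of", xs.size, "trial points pass the float32 circle test")
```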

Hence, if we take the floating point representation into account, the final ordering would be (5), (1), (2), (4), (3).

### Problem 4: Making π with Antithetic Sampling
![Quadrant of a circle with example of reflected point](images/antithetic.svg)
Consider integrating \(\pi\) using antithetic sampling, i.e., we want to integrate the indicator function \(f(p) := 1 - \text{step}(\|p\| - 1)\) over the square \(p \in [-1,1] \times [-1,1]\), where \(\text{step}(x)\) is the step function equal to zero for \(x \leq 0\) and one for \(x > 0\). Performing a reflection around the origin (\(x \mapsto -x\)) obviously doesn't make any difference, because the integrand \((f(x) + f(-x))/2\) will look just like the original integrand \(f(x)\). Suppose we instead integrate over just the upper-right quadrant \(p \in [0,1] \times [0,1]\), using pairs of samples \(f(p)\), \(f(-(p-c)+c)\), where \(c := (1/2,1/2)\) is the center of this quadrant. Does this scheme reduce variance relative to taking the same total number of samples with basic Monte Carlo? You must justify your answer in _quantitative_ terms.

Recall that the variance of a Monte Carlo estimator depends on (i) the variance of the integrand and (ii) the number of samples \(N\). Since we're interested in reducing the variance for the same total number of samples, we can assume \(N\) is fixed and simply inspect the variance of the two possible integrands: the original integrand, and the “symmetrized” function \(\bar{f}(p) := \tfrac{1}{2}(f(p)+f(\eta(p))),\) where \(\eta(p) := -(p-c)+c\) is the antipodal map relative to \(c\).

Since the original integrand \(f\) looks identical (up to reflections) in all four quadrants, its mean value \(\mu\) can be computed via \[ \mu(f) = \frac{1}{|[0,1]^2|} \int_{[0,1]^2} f(p)\ dp = \tfrac{1}{1} \cdot \tfrac{\pi}{4} = \tfrac{\pi}{4}. \] In other words: the function is equal to 1 over a quarter of a unit circle (area \(\pi/4\)), and zero elsewhere in a box of area 1, so the mean value over the quadrant (and hence the whole \(2 \times 2\) box) is \(\pi/4\). The variance \(\sigma^2\) of the original integrand is likewise given by the mean squared deviation from this average value: \[ \sigma^2(f) = \int_{[0,1]^2} (\mu(f)-f(p))^2\ dp = \underbrace{\tfrac{\pi}{4}(\mu(f) - 1)^2}_{\text{region covered by circle}} + \underbrace{(1-\tfrac{\pi}{4})(\mu(f) - 0)^2}_{\text{region not covered by circle}} = \tfrac{\pi}{16}(4-\pi) \approx 0.169. \]

We can easily perform an analogous calculation for the symmetrized function \(\bar{f}(p)\) by noting that this function is equal to \(\tfrac{1}{2}\) on two regions, each of area \(1-\tfrac{\pi}{4}\), for a total area of \(A_{\tfrac{1}{2}} := 2-\pi/2\), and equal to \(1\) on the remaining region of size \(A_{1} := 1 - A_{\tfrac{1}{2}} = \tfrac{\pi}{2}-1\). Hence the mean value over the square is given by \[ \mu(\bar{f}) = A_{\tfrac{1}{2}} \cdot \tfrac{1}{2} + A_{1} \cdot 1 = \tfrac{\pi}{4}. \]

The fact that the mean of the function has not changed should come as no surprise: by linearity of expectation we have \(\mathbb{E}[\tfrac{1}{2}(f+f \circ \eta)] = \tfrac{1}{2} \mathbb{E}[f] + \tfrac{1}{2}\mathbb{E}[f \circ \eta]\), and since \(f \circ \eta\) is merely a reflection of \(f\), its average is unchanged, i.e., the two halves simply add back up to the average of \(f\). However, determining the areas of the two piecewise constant regions of \(\bar{f}\) will be helpful in computing its variance. In particular, we have \[ \sigma^2(\bar{f}) = \int_{[0,1]^2} (\mu(\bar{f}) - \bar{f}(p))^2\ dp = A_{\tfrac{1}{2}} (\tfrac{\pi}{4} - \tfrac{1}{2})^2 + A_{1} (\tfrac{\pi}{4} - 1)^2 = \tfrac{1}{16}(4-\pi)(\pi-2) \approx 0.061. \] Finally, the variance of an antithetic estimator that uses \(N\) pairs of sample points is given by \(\sigma^2(\bar{f})/N\), whereas a basic Monte Carlo estimator that uses the same total number of sample points (\(2N\)) has variance \(\sigma^2(f)/(2N)\), giving a multiplicative reduction in variance of \(2 \sigma^2(\bar{f}) / \sigma^2(f) = 2 - \tfrac{4}{\pi} \approx 0.727\). Hence, even for the simple task of estimating \(\pi\), antithetic sampling does pay off—reducing the variance of the basic estimator by about 27%.
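As a sanity check (not required for the exam), the following short NumPy experiment estimates both variances and the reduction factor empirically; the numbers should come out near 0.169, 0.061, and 0.73.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Uniform samples in the upper-right quadrant [0,1]^2
p = rng.uniform(0.0, 1.0, size=(N, 2))
f    = (np.sum(p**2, axis=1) <= 1.0).astype(float)           # f(p): inside the unit disk?
f_r  = (np.sum((1.0 - p)**2, axis=1) <= 1.0).astype(float)   # f(eta(p)), eta(p) = (1-x, 1-y)
fbar = 0.5 * (f + f_r)

print(f.mean(), fbar.mean())      # both ~ pi/4 ~ 0.785
print(f.var(), fbar.var())        # ~ 0.169 and ~ 0.061
print(2 * fbar.var() / f.var())   # ~ 0.727: variance ratio at equal total sample count
```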

Another way to reach this same conclusion would be to show that the covariance of the two samples in each antithetic pair is negative. Indeed, \(\mathrm{Cov}(f(p), f(\eta(p))) = \mathbb{E}[f \cdot (f \circ \eta)] - \mu(f)^2 = (\tfrac{\pi}{2}-1) - \tfrac{\pi^2}{16} \approx -0.046\), and the per-pair variance \(\tfrac{1}{2}(\sigma^2(f) + \mathrm{Cov}) \approx 0.061\) agrees with \(\sigma^2(\bar{f})\) computed above.

### Problem 5: Rejection Sampling from a Multimodal Distribution

Download this Jupyter notebook: [download here](midterm.ipynb). In this coding task, you will implement two strategies, both based on rejection sampling, for sampling from a multimodal distribution described by a probability density function \(p(x)\). In particular, the two strategies are:

* Rejection sampling, using a single upper bound on all of \(p(x)\).
* Rejection sampling, using separate upper bounds on each mode.

For generating an equal number of samples \(N\), your code should be more efficient with the second strategy (meaning it rejects fewer candidates, and thus takes fewer trials to achieve the same number \(N\) of samples). Note: You don't have to empirically verify that your second strategy does better, but if you implement some way of checking it, you'll have more confidence that you've done well on the exam! ;-)) All instructions are contained within text cells in the notebook. When you're done implementing all the specified functions in the notebook, submit your completed notebook on Gradescope **with any output left visible in the notebook**.

A Jupyter notebook containing reference implementations: download here.

A common error in `multiBoundRejectionSample()` was not choosing which mode to sample from with probability proportional to the areas of your chosen regions (bounding boxes) enclosing the modes. For example, let's say you choose an upper bound of \(y_1\) on the interval \([0, x]\) and an upper bound of \(y_2\) on \([x, 1]\) (assuming a PDF with two modes supported on \([0,1]\)). This corresponds to rejection sampling using two regions: the box \([0, x] \times [0, y_1]\) and the box \([x, 1]\times[0, y_2]\). The first box has area \(A_1 = xy_1\), and the second box has area \(A_2 = (1-x)y_2\). So you must choose to sample from the \(i\)th mode with probability \(\frac{A_i}{A_1+A_2}\); otherwise your proposals are not uniform over the union of the two boxes, and the accepted points will not be uniformly distributed under \(p(x)\). Another common mistake was choosing to sample the first mode in proportion to \(\int_0^x p(x') dx'\), etc. But this is also incorrect: it skews the relative amount of probability mass that ends up in each mode. (See the sketch below.)
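For reference, here is a minimal sketch of a correct two-box sampler. It is our own illustration with hypothetical argument names (not the reference notebook's implementation): `p` is the density on \([0,1]\), `x_split` separates the two modes, and `y1`, `y2` are the per-mode upper bounds.

```python
import random

def multi_bound_rejection_sample(p, x_split, y1, y2):
    """Rejection-sample one x from a density p on [0,1], using the bounding
    boxes [0, x_split] x [0, y1] and [x_split, 1] x [0, y2]."""
    A1 = x_split * y1
    A2 = (1.0 - x_split) * y2
    while True:
        # Pick a box with probability proportional to its area, so that the
        # proposal (x, y) is uniform over the union of the two boxes.
        if random.random() < A1 / (A1 + A2):
            x = random.uniform(0.0, x_split)
            y = random.uniform(0.0, y1)
        else:
            x = random.uniform(x_split, 1.0)
            y = random.uniform(0.0, y2)
        if y < p(x):    # accept the proposal if it lies under the density
            return x
```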

Additionally, some students pre-computed a fixed number of samples to draw from each mode, namely \(N \cdot \text{[area under $p(x)$ in each mode]}\). This is a form of sample stratification, and is unnecessary but not technically incorrect.

