Important note about the solutions: The solutions below are not the only possible solutions. They are also far more detailed, and far more complete, than we expected you to write for the exam. Do not worry if your solutions do not look like the reference! The reason for spelling these answers out in such great detail is (i) to make sure folks really understand the solutions, and (ii) to highlight the fact that real-world implementation of numerical methods is almost never as “cut and dry” as the pristine textbook exercises you're used to from your intro classes. Algorithms involving estimation, statistics, and floating-point numbers can inject a lot of complexity and uncertainty into the picture—but if you want to build systems that make useful predictions in the real world, you have to learn to navigate these messier problems.
That being said, we are not going to grade you to this same exacting standard: as long as you demonstrated clear thinking about the questions, and arrived at some reasonable (if not 100% perfect) answer, we will give you full credit.
Consider the error of an \(N\)-sample estimator \(\widehat{I}_N\), given by the random variable \(e_N := \widehat{I}_N - I\), and the corresponding bias, given by the expected value of the error, \(b_N := \mathbb{E}[\widehat{I}_N] - I\). An estimator is:
Consistent. To check if the estimator is consistent, we can simply run estimate(N) for increasingly large values of N, and see whether it appears to be approaching the (known) true value of the integral \(I\). For instance, we might express our test in pseudocode as follows:
```
testConsistent(M):  # use a bigger M for a more reliable check
    for k in range( 0, M ):          # loop over powers of 10
        Ihat[k] = estimate(10^k)     # compute an estimate using 10^k samples
    LogLogPlot( abs(Ihat - I)/I )    # plot the relative error
```
If we see that the relative error approaches zero at a predictable rate (e.g., the log-log graph has some constant negative slope), then we can be quite certain that the estimator is consistent—more so as we increase M (and in turn, spend more computation). Of course, there are other ways to quantify and inspect convergence—a log-log plot is just one of many possibilities.
Note that in our solution we run the estimator only for powers of 10, because running it for every value of N between 1 and \(10^m\) really doesn't provide much more information about whether the function is converging or not (just a lot of fine-grained detail that we don't actually need). Of course, this question wasn't about efficiency, and if you considered all the individual values, that's fine too.
Unbiased.
A reasonable procedure for checking for bias is to compute \(b_N = \mathbb{E}[\widehat{I}_N] - I\) using Monte Carlo estimation, i.e., by calling estimate(N) \(k\) times, and checking that the resulting estimate of \(b_N\) becomes small as \(k\) becomes large (for many values of \(N\)).
```
testBiased(maxSamples, maxN):  # bigger maxSamples, maxN give a more reliable check
    for N in range( 1, maxN+1 ):  # check whether b_N = 0 for many N
        # estimate b_N via Monte Carlo, using an increasing number of samples k
        sum_I_N[0] = 0
        for k in range( 1, maxSamples+1 ):
            sum_I_N[k] = sum_I_N[k-1] + estimate(N)
            b_N[k] = sum_I_N[k]/k - I  # k-sample estimate of b_N
        LogLogPlot( abs(b_N)/I )       # plot the relative bias for this N
    # verify all log-log plots vanish as k becomes large
```
As in our consistency check, the log-log plots will indicate whether our numerical estimate of \(\mathbb{E}[\widehat{I}_N]\) is approaching the true integral \(I\) for each \(N\) in our test set.
As before, a log-log plot is just one of many ways to inspect the behavior of the estimator; other ways of quantifying this behavior are perfectly fine.
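The bias check can likewise be made concrete. The sketch below reuses the same hypothetical quarter-disk estimator as a stand-in (an assumption, not the exam's estimate), tracking the running Monte Carlo estimate of \(b_N\) for a single \(N\); in practice one would repeat this for many values of \(N\).

```python
import math
import random

I = math.pi / 4  # known true value of the integral

def estimate(N):
    # hypothetical stand-in estimator: Monte Carlo quarter-disk area
    hits = sum(1 for _ in range(N)
               if random.random()**2 + random.random()**2 <= 1.0)
    return hits / N

def biasEstimates(N, maxSamples):
    # running k-sample Monte Carlo estimate of b_N = E[I_N] - I
    total, b = 0.0, []
    for k in range(1, maxSamples + 1):
        total += estimate(N)
        b.append(total / k - I)
    return b

random.seed(1)
b_N = biasEstimates(10, 20000)  # estimate b_10 using up to 20000 trials
```

Since this stand-in estimator is unbiased, the final entries of `b_N` should hover near zero.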
The short answer is that your friend can use rejection sampling, and if you ignore the implications of floating-point arithmetic (and instead imagine everything is computed using real numbers), then the ranking will be (5), (1), (2), (3)=(4).
In a bit more detail: to rejection sample sets (1) and (3)–(5), your friend can for instance ask for a pair of random coordinates \(p = (x,y)\), each drawn uniformly from \([0,10]\), and check if inSet(p) is true. If so, they can return \(p\); otherwise, they can try again until the test passes. Likewise, for set (2) they can fill a \(10 \times 10\) array \(p\) with random values uniformly sampled from the set \(\{+1,-1\}\). Recall that the efficiency of rejection sampling depends purely on what fraction of the domain is covered by the set in question. Among the subsets of \([0,10]^2\), the circular disks of set (5) cover the biggest fraction (\(\tfrac{\pi}{4}\), or about 78%, of each square, hence the same fraction of the entire domain), followed by the black squares of the checkerboard (which cover 50% of the domain), followed by both (3) and (4), which cover 0% of the domain: neither a set of points nor a curve covers any amount of area in a two-dimensional domain. (If you want to get technical on this point—which you didn't need to do for the midterm!—these sets both have zero Lebesgue measure, which in this case is the right notion of “size” for talking about probability.)

Finally, the number of Ising model states that are half spin-up and half spin-down is equal to the number of distinct ways of choosing 50 grid cells out of the total 100 cells, i.e., \[ n_{\text{equal}} := {100 \choose 50} = \frac{100!}{50!(100-50)!}.\] Since there are \(n_{\text{configurations}} := 2^{100}\) ways of assigning either \(+1\) or \(-1\) to each of 100 grid cells, the probability that a configuration chosen uniformly at random is an equal split is \(n_{\text{equal}}/n_{\text{configurations}} \approx 0.079\) (i.e., about 8%). So, the chance that rejection sampling turns up an evenly-split Ising state is much lower than the chance of picking a point on the checkerboard, or in one of the circular disks—but still larger than zero!
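This probability is quick to verify exactly, e.g., using the Python standard library's `math.comb`, which computes the binomial coefficient with exact integer arithmetic:

```python
from math import comb

# probability that a uniformly random 10x10 Ising state has exactly
# 50 spins up and 50 down: C(100, 50) / 2^100, computed with exact
# integers before the final division
p_equal = comb(100, 50) / 2**100  # about 0.0796
```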
The longer, and much more difficult question, is: how does this algorithm play out in the real world (rather than the pristine-yet-highly-artificial Platonic mathematical universe), where it has to be implemented using finite precision floating-point arithmetic? Now, we no longer have an “easy out” on sets (3) and (4), since there are a finite number of points on the floating point grid with coordinates between 0 and 10, and some (but not all) of the points in sets (3) and (4) also belong to this finite grid. Hence the real-world probability of picking a point from one of these sets is nonzero (but still very, very small). How small exactly, and which one is more probable? One useful fact to know about floating point numbers is that they can exactly represent integers up to a certain size. In particular, standard 32- and 64-bit floats can exactly represent integers with magnitude up to \(2^{24}\) and \(2^{53}\), respectively. Certainly big enough to include all 100 corners of our \(10 \times 10\) grid! To determine the probability that we pick one of these 100 points, we also need to know how many points total from the floating point grid are contained in the region \([0,10]^2\). There are a variety of ways we could bound this number, but we can also compute it exactly via brute force, e.g., using the nextafter function in Python (starting at 0.0, and counting up to 10.0). For a 32-bit float we get about a billion values in the range \([0,10]\) (1,092,616,193), or about one quintillion within the square \([0,10]^2 \subset \mathbb{R}^2\). Finally, to determine whether set (3) or (4) is bigger we need to know how many points of the floating point grid coincide with points of the circle of radius 5 centered on (5,5). Since the grid includes all integer coordinates, we already know there are at least some such points: \((10,5)\), \((5,10)\), \((0,5)\), and \((5,0)\). But are there more than 100? 
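One faster alternative to literally looping with nextafter (which takes a billion iterations) uses the bit-level structure of IEEE-754: for nonnegative floats, numerical order matches the order of the underlying bit patterns, so the count follows directly from the bit pattern of 10.0. A sketch:

```python
import struct

def float32_bits(x):
    # reinterpret a 32-bit float as its unsigned-integer bit pattern
    return struct.unpack('<I', struct.pack('<f', x))[0]

# for nonnegative IEEE-754 floats, numerical order matches bit-pattern
# order, so the float32 values in [0, 10] are exactly the bit patterns
# 0 (i.e., 0.0) through float32_bits(10.0), inclusive
count = float32_bits(10.0) + 1  # 1,092,616,193, matching the count above
```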
Even this question requires some interpretation of the question at hand: do we want to know (i) how many points on the true circle \(\{(x,y) \in \mathbb{R}^2 | \sqrt{(x-5)^2 + (y-5)^2} = 5\}\) are contained in the floating-point grid, or (ii) how many pairs of floating point numbers x,y satisfy the condition (x-5.0)\*(x-5.0) + (y-5.0)\*(y-5.0) == 5.0\*5.0 in floating-point arithmetic (taking things like rounding errors into account)? While the first question is “interesting”, the second question will tell us something about the real-life behavior we'll actually observe on our computer—where the test inSet is also implemented in floating point. Here, to find such pairs we can just “guess and check.” For instance, we might again generate floating point values x in the interval [0,10] by incrementing from zero, and compute complementary floating point values y = 5. - sqrt(10.\*x-x\*x). In the absence of rounding errors, these two coordinates should satisfy the floating point condition (ii) given above. Indeed, if we run this test in 32-bit floating point, we quickly find far more than 100 points on the floating-point circle. (But still far fewer than the number of equal-parity Ising states—especially if we go to 64-bit precision!)
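A scaled-down version of this guess-and-check scan can be written with only the standard library, emulating 32-bit arithmetic by rounding to single precision after every operation (for these basic operations, a correctly rounded double result rounded again to single precision gives the correctly rounded single-precision result). The scan below tries only 100,001 candidate abscissae rather than every float32 in \([0,10]\):

```python
import math
import struct

def f32(v):
    # round a Python float (a double) to the nearest 32-bit float value
    return struct.unpack('<f', struct.pack('<f', v))[0]

def on_circle32(x, y):
    # evaluate (x-5)^2 + (y-5)^2 == 5^2 as in 32-bit arithmetic,
    # rounding after every intermediate operation
    dx, dy = f32(x - 5.0), f32(y - 5.0)
    return f32(f32(dx * dx) + f32(dy * dy)) == 25.0

# guess and check: for each candidate abscissa x, construct the
# complementary ordinate y = 5 - sqrt(10x - x^2) in float32 and test it
hits = 0
for i in range(100001):
    x = f32(i / 10000)                   # sample abscissae in [0, 10]
    t = f32(f32(10.0 * x) - f32(x * x))  # radicand 10x - x^2
    if t < 0.0:                          # rounding can push it negative
        continue
    y = f32(5.0 - f32(math.sqrt(t)))
    if on_circle32(x, y):
        hits += 1
```

The full experiment would replace this sampled scan with an exhaustive loop over all ~1.09 billion float32 values in \([0,10]\).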
Hence, if we take the floating point representation into account, the final ordering would be (5), (1), (2), (4), (3).
### Problem 4: Making π with Antithetic Sampling

Recall that the variance of a Monte Carlo estimator depends on (i) the variance of the integrand and (ii) the number of samples \(N\). Since we're interested in reducing the variance for the same total number of samples, we can assume \(N\) is fixed and simply inspect the variance of the two possible integrands: the original integrand, and the “symmetrized” function \(\bar{f}(p) := \tfrac{1}{2}(f(p)+f(\eta(p))),\) where \(\eta(p) := -(p-c)+c\) is the antipodal map relative to \(c\).
Since the original integrand \(f\) looks identical (up to reflections) in all four quadrants, its mean value \(\mu\) can be computed via \[ \mu(f) = \frac{1}{|[0,1]^2|} \int_{[0,1]^2} f(p)\ dp = \tfrac{1}{1} \cdot \tfrac{\pi}{4} = \tfrac{\pi}{4}. \] In other words: the function is equal to 1 over a quarter of a unit circle (area \(\pi/4\)), and zero elsewhere in a box of area 1, so the mean value over the quadrant (and hence the whole \(2 \times 2\) box) is \(\pi/4\). The variance \(\sigma^2\) of the original integrand is likewise given by the mean squared deviation from this average value: \[ \sigma^2(f) = \int_{[0,1]^2} (\mu(f)-f(p))^2\ dp = \underbrace{\tfrac{\pi}{4}(\mu(f) - 1)^2}_{\text{region covered by circle}} + \underbrace{(1-\tfrac{\pi}{4})(\mu(f) - 0)^2}_{\text{region not covered by circle}} = \tfrac{\pi}{16}(4-\pi) \approx 0.169. \]
We can easily perform an analogous calculation for the symmetrized function \(\bar{f}(p)\) by noting that this function is equal to \(\tfrac{1}{2}\) on two regions of area \(1-\tfrac{\pi}{4}\), for a total area of \(A_{\tfrac{1}{2}} := 2-\pi/2\), and equal to \(1\) on the remaining region of size \(A_{1} := 1 - A_{\tfrac{1}{2}} = \tfrac{\pi}{2}-1\). Hence the mean value over the square is given by \[ \mu(\bar{f}) = A_{\tfrac{1}{2}} \cdot \tfrac{1}{2} + A_{1} \cdot 1 = \tfrac{\pi}{4}. \]
The fact that the mean of the function has not changed should come as no surprise: by linearity of expectation we have \(\mathbb{E}[\tfrac{1}{2}(f+f \circ \eta)] = \tfrac{1}{2} \mathbb{E}[f] + \tfrac{1}{2}\mathbb{E}[f \circ \eta]\), and since \(f \circ \eta\) is merely a reflection of \(f\), its average is unchanged, i.e., both terms contribute \(\tfrac{1}{2}\mu(f)\). However, determining the areas of the two piecewise constant regions of \(\bar{f}\) will be helpful in computing its variance. In particular, we have \[ \sigma^2(\bar{f}) = \int_{[0,1]^2} (\mu(\bar{f}) - \bar{f}(p))^2\ dp = A_{\tfrac{1}{2}} (\tfrac{\pi}{4} - \tfrac{1}{2})^2 + A_{1} (\tfrac{\pi}{4} - 1)^2 = \tfrac{1}{16}(4-\pi)(\pi-2) \approx 0.061. \] Finally, the variance of an antithetic estimator that uses \(N\) pairs of sample points is given by \(\sigma^2(\bar{f})/N\), whereas a basic Monte Carlo estimator that uses the same total number of sample points (\(2N\)) has variance \(\sigma^2(f)/(2N)\), giving a multiplicative reduction in variance of \(2 \sigma^2(\bar{f}) / \sigma^2(f) = 2 - \tfrac{4}{\pi} \approx 0.727\). Hence, even for the simple task of estimating \(\pi\), antithetic sampling does pay off—reducing the variance of the basic estimator by about 27%. Another way to reach this same conclusion would be to show that the covariance of \(f\) and \(f \circ \eta\) is negative.
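As a sanity check, both closed-form variances (and their ratio) can be estimated numerically. The sketch below assumes \(f\) is the quarter-disk indicator on \([0,1]^2\) and \(c = (\tfrac{1}{2},\tfrac{1}{2})\), consistent with the region areas computed above:

```python
import math
import random

def f(x, y):
    # original integrand: indicator of the quarter disk x^2 + y^2 <= 1
    return 1.0 if x*x + y*y <= 1.0 else 0.0

def fbar(x, y):
    # symmetrized integrand, via the antipodal map about c = (1/2, 1/2)
    return 0.5 * (f(x, y) + f(1.0 - x, 1.0 - y))

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m)**2 for v in vals) / len(vals)

random.seed(2)
n = 200000
var_f    = variance([f(random.random(), random.random()) for _ in range(n)])
var_fbar = variance([fbar(random.random(), random.random()) for _ in range(n)])
ratio    = 2.0 * var_fbar / var_f   # should approach 2 - 4/pi
```

With this many samples, `var_f` and `var_fbar` land close to the analytic values \(\tfrac{\pi}{16}(4-\pi) \approx 0.169\) and \(\tfrac{1}{16}(4-\pi)(\pi-2) \approx 0.061\).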
### Problem 5: Rejection Sampling from a Multimodal Distribution

Download this Jupyter notebook: [download here](midterm.ipynb). In this coding task, you will implement two strategies, both based on rejection sampling, for drawing samples from a multimodal distribution described by a probability density function \(p(x)\). In particular, the two strategies are:

* Rejection sampling, using an upper bound on all of \(p(x)\).
* Rejection sampling, using separate upper bounds on each mode.

For generating an equal number of samples \(N\), your code should be more efficient with the second strategy (meaning it rejects fewer candidates, and thus takes fewer trials to achieve the same number \(N\) of samples). Note: you don't have to empirically verify that your second strategy does better, but if you implement some way of checking it, you'll have more confidence that you've done well on the exam! ;-) All instructions are contained within text cells in the notebook. When you're done implementing all the specified functions in the notebook, submit your completed notebook on Gradescope **with any output left visible in the notebook**.

A Jupyter notebook containing reference implementations: download here.
A common error in \(\tt{multiBoundRejectionSample()}\) was not choosing which mode to sample from with probability proportional to the areas of your chosen regions (bounding boxes) enclosing the modes. For example, let's say you choose an upper bound of \(y_1\) on the interval \([0, x]\) and an upper bound of \(y_2\) on \([x, 1]\) (assuming a PDF with two modes supported on \([0,1]\)). This corresponds to rejection sampling using two regions: the box \([0, x] \times [0, y_1]\) and the box \([x, 1]\times[0, y_2]\). The first box has area \(A_1 = xy_1\), and the second box has area \(A_2 = (1-x)y_2\). So you must choose to sample from the \(i\)th mode with probability \(\frac{A_i}{A_1+A_2}\). Otherwise you will not be sampling uniformly from the region under \(p(x)\). Another common mistake was choosing to sample the first mode with probability proportional to \(\int_0^x p(x')\ dx'\), etc. This is also incorrect: the acceptance rates in the two boxes generally differ, so weighting the modes by their true masses (rather than by box area) skews the relative frequency with which each mode is sampled.
Additionally, some students pre-computed a fixed number of samples to sample from each mode, namely equal to \(N \cdot \text{[area under $p(x)$ in each mode]}\). This is a form of sample stratification, and is unnecessary but not technically incorrect.
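To illustrate the correct selection step, here is a minimal sketch of a two-mode sampler. The density `p_example`, the split point, and the bounds are illustrative assumptions (not the notebook's actual setup), but the box-selection logic matches what \(\tt{multiBoundRejectionSample()}\) should do:

```python
import random

def multiBoundRejectionSample(p, x_split, y1, y2):
    # rejection sampling with one bounding box per mode:
    # box 1 = [0, x_split] x [0, y1], box 2 = [x_split, 1] x [0, y2]
    A1 = x_split * y1              # area of the first box
    A2 = (1.0 - x_split) * y2      # area of the second box
    while True:
        # crucial step: pick a box with probability proportional to its area
        if random.random() < A1 / (A1 + A2):
            x = random.uniform(0.0, x_split)
            y = random.uniform(0.0, y1)
        else:
            x = random.uniform(x_split, 1.0)
            y = random.uniform(0.0, y2)
        if y < p(x):               # accept candidates under the graph of p
            return x

# illustrative density (an assumption for this sketch): piecewise constant,
# with mass 0.75 on [0, 0.5) and mass 0.25 on [0.5, 1]
def p_example(x):
    return 1.5 if x < 0.5 else 0.5

random.seed(3)
samples = [multiBoundRejectionSample(p_example, 0.5, 1.5, 0.5)
           for _ in range(20000)]
frac_left = sum(1 for s in samples if s < 0.5) / len(samples)
```

Because both bounds here are tight, every candidate is accepted, and the left mode is sampled with frequency \(A_1/(A_1+A_2) = 0.75\)—exactly its true mass under \(p\).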