Summary of Monte Carlo Sampling

  • Want to sample from an un-normalized PDF $\pi_u$. (Write $\pi = \pi_u / Z$ for the normalized density, where $Z$ is the normalizing constant.)

  • Discrete time: Find a Markov process $X$ with stationary distribution $\pi$

    • Irreducible + Aperiodic implies $\dist(X_n) \to \pi$ as $n \to \infty$.

    • Construct $X$ using a proposal mechanism and Metropolis-Hastings.

  • Continuous time: Let $X$ solve $dX_t = b(X_t) \, dt + \sigma(X_t) \, dW_t$.

    • Find $b, \sigma$ so that $\mathcal L^* \pi_u = 0$. Implies stationary distribution of $X$ is $\pi$.

    • One choice: $\sigma = \sqrt{2} I$, $b = - \grad U$, $U = -\ln \pi_u$ (Langevin MC).

    • Simulate $X$: $X_{n+1} = X_n - \grad U(X_n) \tau + \sqrt{2\tau} \zeta_{n+1}$, where $\zeta_{n+1} \sim N(0, I)$, $\tau = $ time step.

  • MALA: Run the LMC update, but instead of accepting $X_{n+1}$ directly, add a Metropolis accept / reject step. (A code sketch of both updates follows this list.)

    • Accept the proposed point with probability $\min\set{1, A(x, y)}$, where $\displaystyle A(x, y) = \frac{ \pi_u(y) Q(y, x) }{\pi_u(x) Q(x, y)}$, and $\displaystyle Q(x, y) \propto \exp\paren[\Big]{ \frac{- \abs{y - x + \tau \grad U(x)}^2 }{4 \tau } }$ is the density of the proposal above.

    • MALA moves to regions of high probability faster than LMC.
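
A minimal sketch of the LMC and MALA updates above (illustrative, not the lecture's code). Here `log_pi_u` returns $\ln \pi_u$ and `grad_U` returns $\grad U = -\grad \ln \pi_u$; both are supplied by the user.

```python
import numpy as np

def lmc_step(x, grad_U, tau, rng):
    """One unadjusted Langevin (LMC) step: x - tau grad U(x) + sqrt(2 tau) N(0, I)."""
    return x - tau * grad_U(x) + np.sqrt(2 * tau) * rng.standard_normal(x.shape)

def mala_step(x, log_pi_u, grad_U, tau, rng):
    """One MALA step: propose with the LMC update, then Metropolis accept / reject."""
    y = lmc_step(x, grad_U, tau, rng)

    def log_q(a, b):
        # log of the Gaussian proposal density Q(a, b), up to an additive constant
        return -np.sum((b - a + tau * grad_U(a)) ** 2) / (4 * tau)

    log_A = log_pi_u(y) + log_q(y, x) - log_pi_u(x) - log_q(x, y)
    return y if np.log(rng.uniform()) < log_A else x   # accept with probability min(1, A)
```

For a standard Gaussian target one could take `log_pi_u = lambda x: -0.5 * x @ x`, `grad_U = lambda x: x`, and iterate `x = mala_step(x, log_pi_u, grad_U, 0.01, np.random.default_rng())`.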

Application: Probit model

  • Given $N$ feature vectors $z_1, \dots, z_N \in \R^d$, with labels $y_1, \dots, y_N \in \set{0, 1}$.

  • Given $z \in \R^d$, want to predict the label.

  • Predict $Y = 1$ with probability $\varPhi( \beta \cdot z )$, where $\varPhi = $ CDF of $N(0, 1)$, and $\beta$ will be chosen.

  • Postulate some prior distribution for $\beta$, say $p(\beta)$.

  • Compute the posterior $\pi$ using Bayes' rule: $\displaystyle \pi_u(\beta) = p(\beta) \prod_{i=1}^N \varPhi( z_i \cdot \beta )^{y_i} (1 - \varPhi( z_i \cdot \beta ))^{1 - y_i} \,. $

  • Sample from $\pi_u$ using LMC / MALA (see the sketch after this list).

    • Posterior mean $\bar \beta \approx \frac{1}{M} \sum_{i = 1}^M \beta_i$, where $\beta_1, \dots, \beta_M$ are samples (approximately) distributed according to $\pi$, produced by the chain.

    • Given $z \in \R^d$, predict $Y = 1$ with probability $\varPhi( \bar \beta \cdot z )$.

  • The model is only good if the points with label $0$ and $1$ can be separated by a hyperplane.
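
A sketch of the corresponding potential $U = -\ln \pi_u$ and its gradient, assuming (for concreteness) a standard Gaussian prior $p(\beta)$; the function names are illustrative. These can be fed directly to the LMC / MALA steps sketched earlier.

```python
import numpy as np
from scipy.stats import norm

def make_probit_potential(Z, y):
    """Return U(beta) = -ln pi_u(beta) and grad U for the probit posterior.

    Z : (N, d) array of feature vectors z_i;  y : (N,) array of 0 / 1 labels;
    the prior p(beta) is taken to be N(0, I)."""
    def U(beta):
        t = Z @ beta
        log_lik = y * norm.logcdf(t) + (1 - y) * norm.logcdf(-t)   # log Phi and log(1 - Phi)
        return 0.5 * beta @ beta - log_lik.sum()

    def grad_U(beta):
        t = Z @ beta
        w = y * np.exp(norm.logpdf(t) - norm.logcdf(t)) \
            - (1 - y) * np.exp(norm.logpdf(t) - norm.logcdf(-t))
        return beta - Z.T @ w

    return U, grad_U
```

The posterior mean $\bar \beta$ is then the average of the chain (after discarding a burn-in), and predictions use $\varPhi(\bar \beta \cdot z)$.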

Application: Simulated Annealing.

  • Given $F \colon \mathcal X \to \R$. Want to maximize or minimize $F$.

  • Stochastic Hill Climb / Valley descent.

    • Say we’re trying to maximize $F$. Start at a point $x \in \mathcal X$.

    • Choose a close by point $y$ randomly.

    • If $F(y) > F(x)$, move to $y$. Otherwise stay at $x$.

    • Repeat until your computational budget is exhausted, or until $F$ isn’t increasing much.

    • Works if $F$ is “nice”, but can get stuck at local minima / maxima.

  • Simulated Annealing.

    • Sample from $\pi_u \propto e^{-\beta F}$. (Sometimes people say $e^{-F / T}$ instead, and call $T$ temperature).

    • Send $\beta \to \infty$ (or $T \to 0$). All samples will concentrate at global minima of $F$.

    • More expensive than hill climb, but less likely to get stuck at local minima. (A minimal sketch of the annealing loop follows this list.)

    • To maximize $F$ instead, use simulated annealing on $-F$.
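
A minimal sketch of the annealing loop, assuming a user-supplied symmetric proposal `neighbor` and an increasing schedule `betas` (both names are illustrative, not from the notes).

```python
import math
import random

def simulated_annealing(F, x0, neighbor, betas, rng=random):
    """Minimize F by Metropolis sampling from exp(-beta F) along an increasing beta schedule.

    neighbor(x, rng) proposes a nearby point; the proposal is assumed symmetric,
    so the acceptance probability is min(1, exp(-beta (F(y) - F(x))))."""
    x, Fx = x0, F(x0)
    best, F_best = x, Fx
    for beta in betas:                       # e.g. a geometric schedule
        y = neighbor(x, rng)
        Fy = F(y)
        if Fy <= Fx or rng.random() < math.exp(-beta * (Fy - Fx)):
            x, Fx = y, Fy
            if Fx < F_best:
                best, F_best = x, Fx
    return best, F_best
```

As $\beta \to \infty$ the acceptance step degenerates to the valley descent above (only downhill moves are accepted).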

Why does Simulated Annealing work?

Let $\displaystyle Z(\beta) = \sum_{x \in \mathcal X} e^{-\beta F(x)} $, $\displaystyle\quad \pi_\beta(x) = \frac{e^{-\beta F(x)}}{Z(\beta)}$.

  1. What happens to $\pi_\beta$ as $\beta \to 0$?

    • $\pi_\beta \to $ uniform on all of $\mathcal X$.
  2. What happens to $\pi_\beta$ as $\beta \to \infty$?

    • $\pi_\beta \to $ uniform on $\set{\text{all global minima of $F$}}$.

Convergence of $\pi_\beta$ as $\beta \to \infty$

  • Remember $\displaystyle Z(\beta) = \sum_{x \in \mathcal X} e^{-\beta F(x)} $, $\displaystyle\quad \pi_\beta(x) = \frac{e^{-\beta F(x)}}{Z(\beta)}$.

  • Say $F$ attains a strict global minimum at $x_0 \in \mathcal X$.

  • $\displaystyle \pi_\beta(y) = \frac{e^{-\beta F(y)}}{Z(\beta)} = \frac{e^{-\beta F(y)}}{e^{-\beta F(x_0)} + \sum_{x \neq x_0} e^{-\beta F(x)} } = \frac{e^{-\beta (F(y) - F(x_0))}}{1 + \sum_{x \neq x_0} e^{-\beta (F(x) - F(x_0))} } $

  • $\displaystyle \lim_{\beta \to \infty} \pi_\beta(y) = \begin{cases} 1 & y = x_0\,,\\ 0 & y \neq x_0\,. \end{cases} $
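
As a concrete check of this limit (a two-point example added for illustration): take $\mathcal X = \set{x_0, x_1}$ with $F(x_0) = 0$ and $F(x_1) = 1$. Then
\begin{equation}
\pi_\beta(x_0) = \frac{1}{1 + e^{-\beta}}\,, \qquad \pi_\beta(x_1) = \frac{e^{-\beta}}{1 + e^{-\beta}}\,,
\end{equation}
which converges to $(1, 0)$ as $\beta \to \infty$, and to $(\tfrac12, \tfrac12)$ (the uniform distribution) as $\beta \to 0$.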

Tempering

  • Tempering is an algorithm to sample from a density proportional to $e^{-F}$.

  • Simulated Annealing: Sample from $e^{-\beta F} / Z(\beta)$, and send $\beta \to \infty$.

  • For MALA / LMC, you can sample from a density proportional to $e^{-\beta F}$ by discretizing the SDE \begin{equation} dX_t = -\grad F(X_t) \, dt + \sqrt{ \frac{2}{\beta} } \, dW_t \end{equation}

  • Simulated Annealing: Fix $T$ (final time). Choose $\beta = \beta(t)$ so that $\beta(0)$ is very small, $\beta(T)$ is very large.

    • Usually want to increase $\beta$ faster as $\beta$ becomes larger.
  • Tempering: Instead of choosing $\beta(T)$ to be very large, choose $\beta(T) = 1$.

    • Tempering delivers samples from the density proportional to $e^{-F}$, whereas simulated annealing delivers the minima of $F$.

    • Tempering tries to avoid getting stuck in local minima.

    • The worst cases of tempering and plain LMC are equally bad, but in many practical situations tempering helps. (A sketch of the annealed / tempered Langevin step follows this list.)
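
A sketch of discretizing the SDE above with a time-dependent $\beta$; the schedule `beta_of` is an assumption (the notes do not fix one).

```python
import numpy as np

def annealed_langevin(grad_F, x0, tau, n_steps, beta_of, rng=None):
    """Discretize dX = -grad F(X) dt + sqrt(2 / beta(t)) dW with time step tau.

    Simulated annealing: choose beta_of so that it is huge at the final time.
    Tempering: choose beta_of so that it equals 1 at the final time (samples target exp(-F))."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    for n in range(n_steps):
        beta = beta_of(n * tau)               # annealing / tempering schedule
        x = x - tau * grad_F(x) + np.sqrt(2 * tau / beta) * rng.standard_normal(x.shape)
    return x
```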

Travelling Salesman Problem

Example of Randomly Chosen Cities

  • Given $N$ points on the plane (cities).

  • Find the shortest route that travels through each city exactly once and returns to the starting point.

  • NP-hard problem.

  • Simulated Annealing was originally introduced to study this.

  • Better numerical algorithms now exist, e.g. using ant colony optimization.

What if you just tried random tours?

  • The costs of random tours fit a bell curve almost perfectly (shown in orange in the figure).

  • Minimum cost (in this example) is about $5.5$.

  • The odds of finding it by random sampling are about $5.66 \cdot 10^{-32}$.

  • At random, this would take about 10,000 times the age of the universe on a fast computer.

Greedy nearest neighbor algorithm

  • Start at a random point. Travel to the nearest point that hasn’t yet been visited.

  • Performs badly in some situations, but gets within about 25% of the minimum for most configurations.

Greedy nearest neighbor, with a valley descent

  • Start with the greedy nearest neighbor tour.

  • Randomly swap two cities (or cycle three cities). If this makes the route shorter, use the new route instead.

  • Repeat as long as you can afford.

  • This is a valley descent, where you’re minimizing the cost of a route.

Greedy nearest neighbor, with simulated annealing

  • Sample tours $\sigma$ from the distribution proportional to $e^{-\beta C(\sigma)}$ using Metropolis-Hastings. ($C(\sigma) = $ cost of the tour $\sigma$.)

    • Proposal mechanism: Given a tour $\sigma$, randomly swap two cities (or cycle three cities).

    • Will sometimes accept longer tours.

  • Start with $\beta$ small, and send $\beta \to \infty$ along some schedule. (A sketch of this loop follows.)
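
A sketch of this loop with the two-city swap as the proposal; the geometric $\beta$ schedule (`beta0`, `growth`) and recomputing the full tour cost at every step are simplifications made here.

```python
import math
import random

def tour_cost(tour, pts):
    """C(sigma): total length of the closed tour visiting pts in the order given by tour."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal_tsp(pts, tour, n_steps, beta0=1.0, growth=1.0005, rng=random):
    """Metropolis-Hastings on tours targeting exp(-beta C), with beta growing geometrically."""
    cost, beta = tour_cost(tour, pts), beta0
    for _ in range(n_steps):
        i, j = rng.sample(range(len(tour)), 2)
        new = tour[:]
        new[i], new[j] = new[j], new[i]              # proposal: swap two cities (symmetric)
        new_cost = tour_cost(new, pts)
        # sometimes accept a longer tour, with probability exp(-beta (new_cost - cost))
        if new_cost <= cost or rng.random() < math.exp(-beta * (new_cost - cost)):
            tour, cost = new, new_cost
        beta *= growth                               # send beta to infinity along the schedule
    return tour, cost
```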

More aggressive annealing schedule

  • How you send $\beta \to \infty$ makes a difference.

  • Usually you want $\beta$ to increase faster as it becomes larger.

  • (Or you want $T = 1/\beta$ to decrease to $0$ exponentially.)

What if we started with a random route?

  • Don’t start with the greedy nearest neighbor route; start with a random one.

  • A valley descent (hill climb) works pretty well.

  • Simulated annealing works a bit better.

  • In high dimensional problems you usually only get good results if the cost function is “nice”.

TSP Art

  • Stipple an image, and solve TSP on it.

Stippled image

TSP Art Greedy

  • Vanilla greedy looks bad; can always remove loops and get a lower cost.

Figures: greedy tour vs. loop-erased greedy tour.

Simulated Annealing vs Hill Climb (comparison figures)

Cracking Substitution Ciphers

  • Alice and Bob decide on a key:

    • E.g. Replace A with K, replace B with Z and so on.
  • Alice encodes her message with this key, and sends it to Bob.

  • 28 symbols in the alphabet (26 upper case letters, space, period).

  • A brute force key search requires trying $28! \approx 3 \cdot 10^{29}$ combinations.

  • Even with $10^{12}$ tries per second, this takes billions of years!

  • With some cleverness, messages in normal language can be cracked in seconds.

Maximize fitness

  • Normal language isn’t random strings. Some letter combinations occur more often than others.

  • Download a few large collections of books. (I used 5 from Project Gutenberg.)

  • Compute frequencies of ngrams:

    • How often do two letter combinations AA, AB, AC etc. occur?

    • How often do three letter combinations AAA, AAB, AAC etc. occur?

    • (Using 3- or 4-letter combinations is good enough; you don’t need to go higher. A counting sketch appears after this list.)

  • Use this to define a fitness function $F$:

    • Given any string of letters $S$, the fitness $F(S) \in \R$.

    • $F$ is high for natural language strings, and low for “random” strings.

  • Not telling you the exact definition (it’s for you to find on the homework).

    • Once you figure it out, it’s just a few lines of code to implement!
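
A sketch of the n-gram counting step (the corpus cleaning is an assumption made here); the fitness function built from these counts is left for the homework, as noted above.

```python
import re
from collections import Counter

def ngram_counts(text, n=3):
    """Count how often each n-letter combination occurs in a corpus.

    The alphabet is the 28 symbols above: upper case letters, space, period."""
    cleaned = re.sub(r"\s+", " ", text.upper())      # newlines / tabs -> single space
    cleaned = re.sub(r"[^A-Z. ]", "", cleaned)       # keep only the 28 symbols
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
```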

Example of values of a fitness function I used

$F$ String
-396.925 asdf wer b ewar pou bewp ear bewapoiu b awer wapeoi b asdlk;j wav asdlfk jwe bva
-180.199 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor i
-50.641 A Markov chain or Markov process is a stochastic model describing a sequence of
-47.801 The Monte Carlo method uses random sampling to solve computational problems that

Cracking Substitution Ciphers

  • Say your fitness function is $F$ and $\mathcal C$ is the coded message.

  • We will crack the message by finding the key $\sigma$ that maximizes $F( \operatorname{decode}( \mathcal C, \sigma ) )$.

  • Option 1: Stochastic Hill Climb:

    • Start with a random guess $\sigma$.

    • Change it slightly by swapping two symbols and generate a new key $\tau$. That is, pick two letters randomly (say J and R). If $\sigma$ replaced J with A and R with T, then $\tau$ will replace J with T and R with A.

    • If $F( \operatorname{decode}( \mathcal C, \tau ) ) > F( \operatorname{decode}( \mathcal C, \sigma ) )$, then replace $\sigma$ with $\tau$.

    • Repeat for as long as you can, or until the message looks readable.

  • Option 2: Use Simulated Annealing:

    • Use Metropolis-Hastings to randomly sample keys $\sigma$ with probability proportional to $\exp\paren{+\beta F(\operatorname{decode}(\mathcal C, \sigma)) }$, and send $\beta \to \infty$.
  • Both work because the function we’re maximizing is nice. Simulated annealing works a bit better. (A sketch of the key-swap Metropolis step follows this list.)
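
A sketch of the key-swap Metropolis step common to both options; the fitness function `F` and the `decode` routine are treated as given black boxes (defining $F$ is part of the homework), and the signatures are illustrative.

```python
import math
import random

def crack(cipher_text, F, decode, alphabet, n_steps, beta, rng=random):
    """Metropolis-Hastings over keys, targeting probabilities proportional to exp(beta F).

    Here beta is held fixed for brevity; for simulated annealing, increase it along a schedule."""
    key = list(alphabet)
    rng.shuffle(key)                              # random initial guess for the key sigma
    fit = F(decode(cipher_text, key))
    for _ in range(n_steps):
        i, j = rng.sample(range(len(key)), 2)     # pick two symbols and swap their images
        new = key[:]
        new[i], new[j] = new[j], new[i]
        new_fit = F(decode(cipher_text, new))
        # the swap proposal is symmetric: accept with probability min(1, exp(beta (new_fit - fit)))
        if new_fit >= fit or rng.random() < math.exp(beta * (new_fit - fit)):
            key, fit = new, new_fit
    return key, fit
```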

Example

Iterations  $F$  Decoded message
0 -470.33 . R.SLDBERVW.LREPDS.SLD.GS.FEXFNS. LD.XRPFZO.G.LFAD.SDEMRP.LDFBM.LGP.PDZ GRZ.LDB
100 -374.97 STFS OPUEFHJSOFEMP S OPSR SNEANB STOPSAFMNLCSRSONVPS PEIFMSOPNUISORMSMPLTRFLSOPU
200 -328.22 RE SANUCEJZ AECMNS SAN IS PCGPHS RAN GEMPOB I APXN SNCTEM ANPUT AIM MNORIEO ANU
300 -301.30 CO SANULOVF AOLMNS SAN IS ELHE.S CAN HOMERT I AEZN SNLGOM ANEUG AIM MNRCIOR ANU
400 -294.07 CO SANULOVG AOLMNS SAN IS ELHE.S CAN HOMERK I AEQN SNLDOM ANEUD AIM MNRCIOR ANU
800 -222.94 TO NSERLOCK SOLMEN NSE IN ALUAYN TSE UOMAF. I SAWE NELDOM SEARD SIM MEFTIOF SER
1,300 -34.58 TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAZE SELDOM HEARD HIM MENTION HER
3,500 -24.30 TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER