Summary of Monte Carlo Sampling

  • Want to sample from an un-normalized PDF $\pi_u$. (Write $\pi = \pi_u / Z$ for the normalized density, where $Z$ is the normalizing constant.)

  • Discrete time: Find a Markov process $X$ with stationary distribution $\pi$

    • Irreducible + Aperiodic implies $\dist(X_n) \to \pi$ as $n \to \infty$.

    • Construct $X$ using a proposal mechanism and Metropolis-Hastings.

  • Continuous time: Let $X$ solve $dX_t = b(X_t) \, dt + \sigma(X_t) \, dW_t$.

    • Find $b, \sigma$ so that $\mathcal L^* \pi_u = 0$. Implies stationary distribution of $X$ is $\pi$.

    • One choice: $\sigma = \sqrt{2} I$, $b = - \grad U$, $U = -\ln \pi_u$ (Langevin MC).

    • Simulate $X$: $X_{n+1} = X_n - \grad U(X_n) \tau + \sqrt{2\tau} \zeta_{n+1}$, where $\zeta_{n+1} \sim N(0, I)$, $\tau = $ time step.

  • MALA: Run the LMC update, but instead of accepting $X_{n+1}$ directly, add a Metropolis accept / reject step. (A code sketch of both updates follows this list.)

    • Accept the proposed point with probability $\min\set{1, A(x, y)}$, where $\displaystyle A(x, y) = \frac{ \pi_u(y) Q(y, x) }{\pi_u(x) Q(x, y)}$, and $\displaystyle Q(x, y) \propto \exp\paren[\Big]{ \frac{- \abs{y - x + \tau \grad U(x)}^2 }{4 \tau } }$ is the density of the proposal above.

    • MALA moves to regions of high probability faster than LMC.
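
A minimal sketch of the LMC and MALA updates above (illustrative, not the lecture's code). Here `log_pi_u` returns $\ln \pi_u$ and `grad_U` returns $\grad U = -\grad \ln \pi_u$; both are supplied by the user.

```python
import numpy as np

def lmc_step(x, grad_U, tau, rng):
    """One unadjusted Langevin (LMC) step: x - tau grad U(x) + sqrt(2 tau) N(0, I)."""
    return x - tau * grad_U(x) + np.sqrt(2 * tau) * rng.standard_normal(x.shape)

def mala_step(x, log_pi_u, grad_U, tau, rng):
    """One MALA step: propose with the LMC update, then Metropolis accept / reject."""
    y = lmc_step(x, grad_U, tau, rng)

    def log_q(a, b):
        # log of the Gaussian proposal density Q(a, b), up to an additive constant
        return -np.sum((b - a + tau * grad_U(a)) ** 2) / (4 * tau)

    log_A = log_pi_u(y) + log_q(y, x) - log_pi_u(x) - log_q(x, y)
    return y if np.log(rng.uniform()) < log_A else x   # accept with probability min(1, A)
```

For a standard Gaussian target one could take `log_pi_u = lambda x: -0.5 * x @ x`, `grad_U = lambda x: x`, and iterate `x = mala_step(x, log_pi_u, grad_U, 0.01, np.random.default_rng())`.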

Application: Probit model

  • Given $N$ feature vectors $z_1, \dots, z_N \in \R^d$, with labels $y_1, \dots, y_N \in \set{0, 1}$.

  • Given $z \in \R^d$, want to predict the label.

  • Predict $Y = 1$ with probability $\varPhi( \beta \cdot z )$, where $\varPhi = $ CDF of $N(0, 1)$, and $\beta$ will be chosen.

  • Postulate some prior distribution for $\beta$, say $p(\beta)$.

  • Compute the posterior $\pi$ using Bayes' rule: $\displaystyle \pi_u(\beta) = p(\beta) \prod_{i=1}^N \varPhi( z_i \cdot \beta )^{y_i} (1 - \varPhi( z_i \cdot \beta ))^{1 - y_i} \,. $

  • Sample from $\pi_u$ using LMC / MALA (see the sketch after this list).

    • Posterior mean $\bar \beta \approx \frac{1}{M} \sum_{i = 1}^M \beta_i$, where $\beta_1, \dots, \beta_M$ are samples (approximately) distributed according to $\pi$, produced by the chain.

    • Given $z \in \R^d$, predict $Y = 1$ with probability $\varPhi( \bar \beta \cdot z )$.

  • The model is only good if the points with label $0$ and $1$ can be separated by a hyperplane.
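
A sketch of the corresponding potential $U = -\ln \pi_u$ and its gradient, assuming (for concreteness) a standard Gaussian prior $p(\beta)$; the function names are illustrative. These can be fed directly to the LMC / MALA steps sketched earlier.

```python
import numpy as np
from scipy.stats import norm

def make_probit_potential(Z, y):
    """Return U(beta) = -ln pi_u(beta) and grad U for the probit posterior.

    Z : (N, d) array of feature vectors z_i;  y : (N,) array of 0 / 1 labels;
    the prior p(beta) is taken to be N(0, I)."""
    def U(beta):
        t = Z @ beta
        log_lik = y * norm.logcdf(t) + (1 - y) * norm.logcdf(-t)   # log Phi and log(1 - Phi)
        return 0.5 * beta @ beta - log_lik.sum()

    def grad_U(beta):
        t = Z @ beta
        w = y * np.exp(norm.logpdf(t) - norm.logcdf(t)) \
            - (1 - y) * np.exp(norm.logpdf(t) - norm.logcdf(-t))
        return beta - Z.T @ w

    return U, grad_U
```

The posterior mean $\bar \beta$ is then the average of the chain (after discarding a burn-in), and predictions use $\varPhi(\bar \beta \cdot z)$.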

Application: Simulated Annealing.

  • Given $F \colon \mathcal X \to \R$. Want to maximize or minimize $F$.

  • Stochastic Hill Climb / Valley descent.

    • Say we’re trying to maximize $F$. Start at a point $x \in \mathcal X$.

    • Choose a close by point $y$ randomly.

    • If $F(y) > F(x)$, move to $y$. Otherwise stay at $x$.

    • Repeat until your computational budget is exhausted, or until $F$ isn’t increasing much.

    • Works if $F$ is “nice”, but can get stuck at local minima / maxima.

  • Simulated Annealing.

    • Sample from $\pi_u \propto e^{-\beta F}$. (Sometimes people say $e^{-F / T}$ instead, and call $T$ temperature).

    • Send $\beta \to \infty$ (or $T \to 0$). All samples will concentrate at global minima of $F$.

    • More expensive than hill climb, but less likely to get stuck at local minima. (A minimal sketch of the annealing loop follows this list.)

    • To maximize $F$ instead, use simulated annealing on $-F$.
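
A minimal sketch of the annealing loop, assuming a user-supplied symmetric proposal `neighbor` and an increasing schedule `betas` (both names are illustrative, not from the notes).

```python
import math
import random

def simulated_annealing(F, x0, neighbor, betas, rng=random):
    """Minimize F by Metropolis sampling from exp(-beta F) along an increasing beta schedule.

    neighbor(x, rng) proposes a nearby point; the proposal is assumed symmetric,
    so the acceptance probability is min(1, exp(-beta (F(y) - F(x))))."""
    x, Fx = x0, F(x0)
    best, F_best = x, Fx
    for beta in betas:                       # e.g. a geometric schedule
        y = neighbor(x, rng)
        Fy = F(y)
        if Fy <= Fx or rng.random() < math.exp(-beta * (Fy - Fx)):
            x, Fx = y, Fy
            if Fx < F_best:
                best, F_best = x, Fx
    return best, F_best
```

As $\beta \to \infty$ the acceptance step degenerates to the valley descent above (only downhill moves are accepted).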

Why does Simulated Annealing work?

Let $\displaystyle Z(\beta) = \sum_{x \in \mathcal X} e^{-\beta F(x)} $, $\displaystyle\quad \pi_\beta(x) = \frac{e^{-\beta F(x)}}{Z(\beta)}$.

  1. What happens to $\pi_\beta$ as $\beta \to 0$?

    • $\pi_\beta \to $ uniform on all of $\mathcal X$.
  2. What happens to $\pi_\beta$ as $\beta \to \infty$?

    • $\pi_\beta \to $ uniform on $\set{\text{all global minima of $F$}}$.

Convergence of $\pi_\beta$ as $\beta \to \infty$

  • Remember $\displaystyle Z(\beta) = \sum_{x \in \mathcal X} e^{-\beta F(x)} $, $\displaystyle\quad \pi_\beta(x) = \frac{e^{-\beta F(x)}}{Z(\beta)}$.

  • Say $F$ attains a strict global minimum at $x_0 \in \mathcal X$.

  • $\displaystyle \pi_\beta(y) = \frac{e^{-\beta F(y)}}{Z(\beta)} = \frac{e^{-\beta F(y)}}{e^{-\beta F(x_0)} + \sum_{x \neq x_0} e^{-\beta F(x)} } = \frac{e^{-\beta (F(y) - F(x_0))}}{1 + \sum_{x \neq x_0} e^{-\beta (F(x) - F(x_0))} } $

  • $\displaystyle \lim_{\beta \to \infty} \pi_\beta(y) = \begin{cases} 1 & y = x_0\,,\\ 0 & y \neq x_0\,. \end{cases} $
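
As a concrete check of this limit (a two-point example added for illustration): take $\mathcal X = \set{x_0, x_1}$ with $F(x_0) = 0$ and $F(x_1) = 1$. Then
\begin{equation}
\pi_\beta(x_0) = \frac{1}{1 + e^{-\beta}}\,, \qquad \pi_\beta(x_1) = \frac{e^{-\beta}}{1 + e^{-\beta}}\,,
\end{equation}
which converges to $(1, 0)$ as $\beta \to \infty$, and to $(\tfrac12, \tfrac12)$ (the uniform distribution) as $\beta \to 0$.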

Tempering

  • Tempering is an algorithm to sample from a density proportional to $e^{-F}$.

  • Simulated Annealing: Sample from $e^{-\beta F} / Z(\beta)$, and send $\beta \to \infty$.

  • For MALA / LMC, you can sample from a density proportional to $e^{-\beta F}$ by discretizing the SDE \begin{equation} dX_t = -\grad F(X_t) \, dt + \sqrt{ \frac{2}{\beta} } \, dW_t \end{equation}

  • Simulated Annealing: Fix $T$ (final time). Choose $\beta = \beta(t)$ so that $\beta(0)$ is very small, $\beta(T)$ is very large.

    • Usually want to increase $\beta$ faster as $\beta$ becomes larger.
  • Tempering: Instead of choosing $\beta(T)$ to be very large, choose $\beta(T) = 1$.

    • Tempering delivers samples from the density proportional to $e^{-F}$, whereas simulated annealing delivers the minima of $F$.

    • Tempering tries to avoid getting stuck in local minima.

    • The worst cases of tempering and plain LMC are equally bad, but in many practical situations tempering helps. (A sketch of the annealed / tempered Langevin step follows this list.)
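
A sketch of discretizing the SDE above with a time-dependent $\beta$; the schedule `beta_of` is an assumption (the notes do not fix one).

```python
import numpy as np

def annealed_langevin(grad_F, x0, tau, n_steps, beta_of, rng=None):
    """Discretize dX = -grad F(X) dt + sqrt(2 / beta(t)) dW with time step tau.

    Simulated annealing: choose beta_of so that it is huge at the final time.
    Tempering: choose beta_of so that it equals 1 at the final time (samples target exp(-F))."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    for n in range(n_steps):
        beta = beta_of(n * tau)               # annealing / tempering schedule
        x = x - tau * grad_F(x) + np.sqrt(2 * tau / beta) * rng.standard_normal(x.shape)
    return x
```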

Travelling Salesman Problem

Example of Randomly Chosen Cities

  • Given $N$ points on the plane (cities).

  • Find the shortest route that travels through each city exactly once and returns to the starting point.

  • NP-hard problem.

  • Simulated Annealing was originally introduced to study this.

  • Better numerical algorithms now exist, e.g. using ant colony optimization.

What if you just tried random tours?

  • The costs of random tours fit a bell curve almost perfectly (shown in orange in the figure).

  • Minimum cost (in this example) is about $5.5$.

  • The odds of finding it by random sampling are about $5.66 \cdot 10^{-32}$.

  • At random, this would take about 10,000 times the age of the universe on a fast computer.

Greedy nearest neighbor algorithm

  • Start at a random point. Travel to the nearest point that hasn’t yet been visited.

  • Performs badly in some situations, but gets within about 25% of the minimum for most configurations.

Greedy nearest neighbor, with a valley descent

  • Start with the greedy nearest neighbor tour.

  • Randomly swap two cities (or cycle three cities). If this makes the route shorter, use the new route instead.

  • Repeat as long as you can afford.

  • This is a valley descent, where you’re minimizing the cost of a route.

Greedy nearest neighbor, with simulated annealing

  • Sample tours $\sigma$ from the distribution proportional to $e^{-\beta C(\sigma)}$ using Metropolis-Hastings. ($C(\sigma) = $ cost of the tour $\sigma$.)

    • Proposal mechanism: Given a tour $\sigma$, randomly swap two cities (or cycle three cities).

    • Will sometimes accept longer tours.

  • Start with $\beta$ small, and send $\beta \to \infty$ along some schedule. (A sketch of this loop follows.)
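
A sketch of this loop with the two-city swap as the proposal; the geometric $\beta$ schedule (`beta0`, `growth`) and recomputing the full tour cost at every step are simplifications made here.

```python
import math
import random

def tour_cost(tour, pts):
    """C(sigma): total length of the closed tour visiting pts in the order given by tour."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal_tsp(pts, tour, n_steps, beta0=1.0, growth=1.0005, rng=random):
    """Metropolis-Hastings on tours targeting exp(-beta C), with beta growing geometrically."""
    cost, beta = tour_cost(tour, pts), beta0
    for _ in range(n_steps):
        i, j = rng.sample(range(len(tour)), 2)
        new = tour[:]
        new[i], new[j] = new[j], new[i]              # proposal: swap two cities (symmetric)
        new_cost = tour_cost(new, pts)
        # sometimes accept a longer tour, with probability exp(-beta (new_cost - cost))
        if new_cost <= cost or rng.random() < math.exp(-beta * (new_cost - cost)):
            tour, cost = new, new_cost
        beta *= growth                               # send beta to infinity along the schedule
    return tour, cost
```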

More aggressive annealing schedule

  • How you send $\beta \to \infty$ makes a difference.

  • Usually you want $\beta$ to increase faster as it becomes larger.

  • (Or you want $T = 1/\beta$ to decrease to $0$ exponentially.)

What if we started with a random route?

  • Don’t start with the greedy nearest neighbor route; start with a random one.

  • A valley descent (hill climb) works pretty well.

  • Simulated annealing works a bit better.

  • In high dimensional problems you usually only get good results if the cost function is “nice”.

TSP Art

  • Stipple an image, and solve TSP on it.

Stippled image

TSP Art Greedy

  • Vanilla greedy looks bad; can always remove loops and get a lower cost.

Figures: greedy tour vs. loop-erased greedy tour.

Simulated Annealing vs Hill Climb (comparison figures)

Cracking Substitution Ciphers

  • Alice and Bob decide on a key:

    • E.g. Replace A with K, replace B with Z and so on.
  • Alice encodes her message with this key, and sends it to Bob.

  • 28 symbols in the alphabet (26 upper case letters, space, period).

  • A brute force key search requires trying $28! \approx 3 \cdot 10^{29}$ combinations.

  • Even with $10^{12}$ tries per second, this takes billions of years!

  • With some cleverness, messages in normal language can be cracked in seconds.

Maximize fitness

  • Normal language isn’t random strings. Some letter combinations occur more often than others.

  • Download a few large collections of books. (I used 5 from Project Gutenberg.)

  • Compute frequencies of ngrams:

    • How often do two letter combinations AA, AB, AC etc. occur?

    • How often do three letter combinations AAA, AAB, AAC etc. occur?

    • (Using 3- or 4-letter combinations is good enough; you don’t need to go higher. A counting sketch appears after this list.)

  • Use this to define a fitness function $F$:

    • Given any string of letters $S$, the fitness $F(S) \in \R$.

    • $F$ is high for natural language strings, and low for “random” strings.

  • Not telling you the exact definition (it’s for you to find on the homework).

    • Once you figure it out, it’s just a few lines of code to implement!
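
A sketch of the n-gram counting step (the corpus cleaning is an assumption made here); the fitness function built from these counts is left for the homework, as noted above.

```python
import re
from collections import Counter

def ngram_counts(text, n=3):
    """Count how often each n-letter combination occurs in a corpus.

    The alphabet is the 28 symbols above: upper case letters, space, period."""
    cleaned = re.sub(r"\s+", " ", text.upper())      # newlines / tabs -> single space
    cleaned = re.sub(r"[^A-Z. ]", "", cleaned)       # keep only the 28 symbols
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
```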

Example of values of a fitness function I used

$F$ String
-396.925 asdf wer b ewar pou bewp ear bewapoiu b awer wapeoi b asdlk;j wav asdlfk jwe bva
-180.199 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor i
-50.641 A Markov chain or Markov process is a stochastic model describing a sequence of
-47.801 The Monte Carlo method uses random sampling to solve computational problems that

Cracking Substitution Ciphers

  • Say your fitness function is $F$ and $\mathcal C$ is the coded message.

  • We will crack the message by finding the key $\sigma$ that maximizes $F( \operatorname{decode}( \mathcal C, \sigma ) )$.

  • Option 1: Stochastic Hill Climb:

    • Start with a random guess $\sigma$.

    • Change it slightly by swapping two symbols and generate a new key $\tau$. That is, pick two letters randomly (say J and R). If $\sigma$ replaced J with A and R with T, then $\tau$ will replace J with T and R with A.

    • If $F( \operatorname{decode}( \mathcal C, \tau ) ) > F( \operatorname{decode}( \mathcal C, \sigma ) )$, then replace $\sigma$ with $\tau$.

    • Repeat for as long as you can, or until the message looks readable.

  • Option 2: Use Simulated Annealing:

    • Use Metropolis-Hastings to randomly sample keys $\sigma$ with probability proportional to $\exp\paren{+\beta F(\operatorname{decode}(\mathcal C, \sigma)) }$, and send $\beta \to \infty$.
  • Both work because the function we’re maximizing is nice. Simulated annealing works a bit better. (A sketch of the key-swap Metropolis step follows this list.)
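
A sketch of the key-swap Metropolis step common to both options; the fitness function `F` and the `decode` routine are treated as given black boxes (defining $F$ is part of the homework), and the signatures are illustrative.

```python
import math
import random

def crack(cipher_text, F, decode, alphabet, n_steps, beta, rng=random):
    """Metropolis-Hastings over keys, targeting probabilities proportional to exp(beta F).

    Here beta is held fixed for brevity; for simulated annealing, increase it along a schedule."""
    key = list(alphabet)
    rng.shuffle(key)                              # random initial guess for the key sigma
    fit = F(decode(cipher_text, key))
    for _ in range(n_steps):
        i, j = rng.sample(range(len(key)), 2)     # pick two symbols and swap their images
        new = key[:]
        new[i], new[j] = new[j], new[i]
        new_fit = F(decode(cipher_text, new))
        # the swap proposal is symmetric: accept with probability min(1, exp(beta (new_fit - fit)))
        if new_fit >= fit or rng.random() < math.exp(beta * (new_fit - fit)):
            key, fit = new, new_fit
    return key, fit
```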

Example

Iterations  $F$  Decoded message
0 -470.33 . R.SLDBERVW.LREPDS.SLD.GS.FEXFNS. LD.XRPFZO.G.LFAD.SDEMRP.LDFBM.LGP.PDZ GRZ.LDB
100 -374.97 STFS OPUEFHJSOFEMP S OPSR SNEANB STOPSAFMNLCSRSONVPS PEIFMSOPNUISORMSMPLTRFLSOPU
200 -328.22 RE SANUCEJZ AECMNS SAN IS PCGPHS RAN GEMPOB I APXN SNCTEM ANPUT AIM MNORIEO ANU
300 -301.30 CO SANULOVF AOLMNS SAN IS ELHE.S CAN HOMERT I AEZN SNLGOM ANEUG AIM MNRCIOR ANU
400 -294.07 CO SANULOVG AOLMNS SAN IS ELHE.S CAN HOMERK I AEQN SNLDOM ANEUD AIM MNRCIOR ANU
800 -222.94 TO NSERLOCK SOLMEN NSE IN ALUAYN TSE UOMAF. I SAWE NELDOM SEARD SIM MEFTIOF SER
1,300 -34.58 TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAZE SELDOM HEARD HIM MENTION HER
3,500 -24.30 TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER