Want to sample from an un-normalized PDF $\pi_u$. (Write $\pi = \pi_u / Z$ for the normalized density.)
Discrete time: Find a Markov process $X$ with stationary distribution $\pi$
Irreducible + Aperiodic implies $\dist(X_n) \to \pi$ as $n \to \infty$.
Construct $X$ using a proposal mechanism and Metropolis Hastings.
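As a concrete sketch, here is Metropolis–Hastings with a symmetric random-walk proposal (the target density, step size, and function names below are illustrative choices, not part of the notes):

```python
import numpy as np

def metropolis_hastings(log_pi_u, x0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the
    acceptance ratio reduces to pi_u(y) / pi_u(x)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for n in range(n_steps):
        y = x + step * rng.standard_normal(x.size)           # propose a nearby point
        if np.log(rng.uniform()) < log_pi_u(y) - log_pi_u(x):
            x = y                                             # accept
        samples[n] = x                                        # (on rejection, stay at x)
    return samples

# Example: un-normalized density pi_u(x) proportional to exp(-x^4 / 4)
samples = metropolis_hastings(lambda x: -np.sum(x**4) / 4, x0=[0.0], n_steps=10_000)
```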
Continuous time: Let $X$ solve $dX_t = b(X_t) \, dt + \sigma(X_t) \, dW_t$.
Find $b, \sigma$ so that $\mathcal L^* \pi_u = 0$. Implies stationary distribution of $X$ is $\pi$.
One choice: $\sigma = \sqrt{2} I$, $b = - \grad U$, $U = -\ln \pi_u$ (Langevin MC).
Simulate $X$: $X_{n+1} = X_n - \grad U(X_n) \tau + \sqrt{2\tau} \zeta_{n+1}$, where $\zeta_{n+1} \sim N(0, I)$, $\tau = $ time step.
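A short sketch of this iteration (the double-well potential $U$ is just an illustrative example):

```python
import numpy as np

def lmc(grad_U, x0, n_steps, tau=1e-2, rng=None):
    """Unadjusted Langevin: X_{n+1} = X_n - grad U(X_n) tau + sqrt(2 tau) zeta_{n+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for n in range(n_steps):
        x = x - tau * grad_U(x) + np.sqrt(2 * tau) * rng.standard_normal(x.size)
        samples[n] = x
    return samples

# Example: U(x) = (x^2 - 1)^2, a double well, so grad U(x) = 4 x (x^2 - 1)
samples = lmc(lambda x: 4 * x * (x**2 - 1), x0=[0.0], n_steps=50_000)
```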
MALA: Do LMC; but instead of accepting $X_{n+1}$ directly, add a Metropolis accept / reject step.
Accept the proposal with probability $\min\set{1, A(x, y)}$, where $\displaystyle A(x, y) = \frac{ \pi_u(y) Q(y, x) }{\pi_u(x) Q(x, y)}$ and $\displaystyle Q(x, y) \propto \exp\paren[\Big]{ \frac{- \abs{y - x + \tau \grad U(x)}^2 }{4 \tau } }$.
MALA moves to regions of high probability faster than LMC.
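A sketch of MALA, with the accept / reject step written in log space (the example potential is again illustrative):

```python
import numpy as np

def mala(U, grad_U, x0, n_steps, tau=1e-2, rng=None):
    """LMC proposal plus a Metropolis accept / reject step."""
    rng = np.random.default_rng() if rng is None else rng
    # log Q(x, y) = -|y - x + tau grad U(x)|^2 / (4 tau), up to an additive constant
    log_q = lambda x, y: -np.sum((y - x + tau * grad_U(x))**2) / (4 * tau)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for n in range(n_steps):
        y = x - tau * grad_U(x) + np.sqrt(2 * tau) * rng.standard_normal(x.size)
        # log A = log [ pi_u(y) Q(y, x) / (pi_u(x) Q(x, y)) ] with pi_u = e^{-U}
        log_A = U(x) - U(y) + log_q(y, x) - log_q(x, y)
        if np.log(rng.uniform()) < log_A:
            x = y
        samples[n] = x
    return samples

# Example: U(x) = (x^2 - 1)^2
samples = mala(lambda x: np.sum((x**2 - 1)**2),
               lambda x: 4 * x * (x**2 - 1), x0=[0.0], n_steps=50_000)
```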
Given $N$ feature vectors $z_1, \dots, z_N \in \R^d$, with labels $y_1, \dots, y_N \in \set{0, 1}$.
Given $z \in \R^d$, want to predict the label.
Predict $Y = 1$ with probability $\varPhi( \beta \cdot z )$, where $\varPhi = $ CDF of $N(0, 1)$, and $\beta$ will be chosen.
Postulate some prior distribution for $\beta$, say $p(\beta)$.
Compute posterior $\pi$ using Bayes rule: $\displaystyle \pi_u(\beta) = p(\beta) \prod_{i=1}^N \varPhi( z_i \cdot \beta )^{y_i} (1 - \varPhi( z_i \cdot \beta ))^{1 - y_i} \,. $
Sample from $\pi_u$ using LMC / MALA.
Posterior mean $\bar \beta \approx \frac{1}{M} \sum_{i = 1}^M \beta_i$, where $\beta_1, \dots, \beta_M$ are samples from $\pi$ (e.g. generated by LMC / MALA).
Given $z \in \R^d$, predict $Y = 1$ with probability $\varPhi( \bar \beta \cdot z )$.
The model is only good if the points with labels $0$ and $1$ can be separated by a hyperplane.
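A sketch of the whole pipeline on synthetic data (the data, the $N(0, I)$ prior on $\beta$, and the choice of plain LMC are all illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data (illustrative): N feature vectors in R^d, labels from a "true" beta
N, d = 200, 3
Z = rng.standard_normal((N, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=N) < norm.cdf(Z @ beta_true)).astype(float)

def log_pi_u(beta):
    """Un-normalized log posterior with a N(0, I) prior on beta.
    (Not needed for plain LMC, but would be used in a MALA accept / reject step.)"""
    t = Z @ beta
    return -0.5 * beta @ beta + np.sum(y * norm.logcdf(t) + (1 - y) * norm.logcdf(-t))

def grad_log_pi_u(beta):
    t = Z @ beta
    # phi(t)/Phi(t) and phi(t)/Phi(-t), computed in log space for stability
    w = y * np.exp(norm.logpdf(t) - norm.logcdf(t)) \
        - (1 - y) * np.exp(norm.logpdf(t) - norm.logcdf(-t))
    return -beta + Z.T @ w

# LMC with U = -log pi_u, so the drift b = -grad U = grad log pi_u
tau, n_steps = 1e-3, 20_000
beta = np.zeros(d)
samples = np.empty((n_steps, d))
for n in range(n_steps):
    beta = beta + tau * grad_log_pi_u(beta) + np.sqrt(2 * tau) * rng.standard_normal(d)
    samples[n] = beta

beta_bar = samples[n_steps // 2:].mean(axis=0)    # posterior mean, discarding burn-in
predict = lambda z: norm.cdf(z @ beta_bar)        # P(Y = 1 | z)
```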
Given $F \colon \mathcal X \to \R$. Want to maximize or minimize $F$.
Stochastic Hill Climb / Valley descent.
Say we’re trying to maximize $F$. Start at a point $x \in \mathcal X$.
Choose a close by point $y$ randomly.
If $F(y) > F(x)$, move to $y$. Otherwise stay at $x$.
Repeat until your computational budget is exhausted, or until $F$ isn’t increasing much.
Works if $F$ is “nice” – could get stuck at local minima / maxima.
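A sketch of the stochastic hill climb on $\R^d$, using a Gaussian perturbation as the “close by point” (both choices are illustrative):

```python
import numpy as np

def hill_climb(F, x0, n_steps, step=0.1, rng=None):
    """Stochastic hill climb: propose a nearby point, move only if F increases."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        y = x + step * rng.standard_normal(x.size)   # random nearby point
        if F(y) > F(x):
            x = y
    return x

# Example: maximize F(x) = -|x - 1|^2 (maximum at x = (1, 1))
x_best = hill_climb(lambda x: -np.sum((x - 1)**2), x0=[0.0, 0.0], n_steps=5_000)
```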
Simulated Annealing.
Sample from $\pi_u \propto e^{-\beta F}$. (Sometimes people say $e^{-F / T}$ instead, and call $T$ temperature).
Send $\beta \to \infty$ (or $T \to 0$). All samples will concentrate at global minima of $F$.
More expensive than hill climb; but avoids getting stuck at local minima.
To maximize $F$ instead, use simulated annealing on $-F$.
Let $\displaystyle Z(\beta) = \sum_{x \in \mathcal X} e^{-\beta F(x)} $, $\displaystyle\quad \pi_\beta(x) = \frac{e^{-\beta F(x)}}{Z(\beta)}$.
What happens to $\pi_\beta$ as $\beta \to 0$?
What happens to $\pi_\beta$ as $\beta \to \infty$?
Say $F$ attains a strict minimum at $x_0 \in \mathcal X$.
$\displaystyle \pi_\beta(y) = \frac{e^{-\beta F(y)}}{Z(\beta)} = \frac{e^{-\beta F(y)}}{e^{-\beta F(x_0)} + \sum_{x \neq x_0} e^{-\beta F(x)} } = \frac{e^{-\beta (F(y) - F(x_0))}}{1 + \sum_{x \neq x_0} e^{-\beta (F(x) - F(x_0))} } $
$\displaystyle \lim_{\beta \to \infty} \pi_\beta(y) = \begin{cases} 1 & y = x_0\,,\\ 0 & y \neq x_0\,,\\ \end{cases} $
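A quick numerical check of both limits on a toy state space (the values of $F$ below are made up):

```python
import numpy as np

F = np.array([0.0, 0.3, 0.5, 1.0, 2.0])   # toy cost on a 5-point state space; strict minimum at index 0

def pi_beta(beta):
    w = np.exp(-beta * (F - F.min()))      # factor out e^{-beta F(x_0)}, as in the computation above
    return w / w.sum()                     # = e^{-beta F} / Z(beta)

print(pi_beta(0.0))    # beta -> 0: uniform distribution on the state space
print(pi_beta(50.0))   # beta -> infinity: concentrates on the minimizer, approx (1, 0, 0, 0, 0)
```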
Tempering is an algorithm to sample from a density proportional to $e^{-F}$.
Simulated Annealing: Sample from $e^{-\beta F} / Z(\beta)$, and send $\beta \to \infty$.
For MALA / LMC, you can sample from a density proportional to $e^{-\beta F}$ by discretizing the SDE \begin{equation} dX_t = -\grad F(X_t) \, dt + \sqrt{ \frac{2}{\beta} } \, dW_t \end{equation}
Simulated Annealing: Fix $T$ (final time). Choose $\beta = \beta(t)$ so that $\beta(0)$ is very small, $\beta(T)$ is very large.
Tempering: Instead of choosing $\beta(T)$ to be very large, choose $\beta(T) = 1$.
Tempering delivers samples from $e^{-F}$, whereas simulated annealing delivers minima of $F$.
Tempering tries to avoid getting stuck in local minima.
The worst cases of tempering and plain LMC are equally bad; but in many practical situations, tempering helps.
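A sketch of tempering via the discretized SDE above, with an illustrative geometric schedule for $\beta(t)$:

```python
import numpy as np

def tempered_lmc(grad_F, x0, n_steps, tau=1e-3, beta0=0.01, betaT=1.0, rng=None):
    """Discretize dX = -grad F dt + sqrt(2 / beta(t)) dW with a geometric beta schedule.
    betaT = 1 gives tempering (samples from e^{-F}); betaT >> 1 gives simulated annealing."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for beta in np.geomspace(beta0, betaT, n_steps):
        x = x - tau * grad_F(x) + np.sqrt(2 * tau / beta) * rng.standard_normal(x.size)
    return x

# Example: F(x) = (x^2 - 1)^2 has two wells; at small beta the noise is large enough to hop between them
x = tempered_lmc(lambda x: 4 * x * (x**2 - 1), x0=[0.0], n_steps=50_000)
```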
Given $N$ points on the plane (cities).
Find a route that travels through each city exactly once, and returns to the starting point.
NP-hard problem.
Simulated Annealing was originally introduced to study this.
Better numerical algorithms now exist, based on ant colony optimization.
Cost fits perfectly on a bell curve (shown in orange).
Minimum cost (in this example) is about $5.5$
Odds of finding it randomly $5.66 \cdot 10^{-32}$.
Finding it randomly would take about 10,000 times the age of the universe on a fast computer.
Start at a random point. Travel to the nearest point that hasn’t yet been visited.
Performs badly in some situations, but gets within 25% of the minimum for most configurations.
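A sketch of the greedy nearest neighbor tour on random cities (the coordinates and helper names are illustrative):

```python
import numpy as np

def tour_length(cities, tour):
    """Total length of the closed tour (returns to the starting city)."""
    pts = cities[list(tour) + [tour[0]]]
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def greedy_tour(cities, start=0):
    """Repeatedly travel to the nearest city that hasn't been visited yet."""
    tour, visited = [start], {start}
    while len(tour) < len(cities):
        dists = np.linalg.norm(cities - cities[tour[-1]], axis=1)
        dists[list(visited)] = np.inf          # never revisit a city
        nxt = int(np.argmin(dists))
        tour.append(nxt)
        visited.add(nxt)
    return tour

cities = np.random.default_rng(0).uniform(size=(50, 2))   # 50 random cities in the unit square
tour = greedy_tour(cities)
print(tour_length(cities, tour))
```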
Start with the greedy nearest neighbor tour.
Randomly swap two cities (or cycle three cities). If it makes the route shorter, use this route instead.
Repeat as long as you can afford.
Valley descent, where you’re minimizing the cost of a route.
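A sketch of this valley descent with random two-city swaps (the starting tour here is arbitrary; per the notes you would start from the greedy tour):

```python
import numpy as np

def tour_length(cities, tour):
    pts = cities[list(tour) + [tour[0]]]
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def valley_descent(cities, tour, n_steps, rng=None):
    """Swap two random cities; keep the swap only if the route gets shorter."""
    rng = np.random.default_rng() if rng is None else rng
    tour = list(tour)
    cost = tour_length(cities, tour)
    for _ in range(n_steps):
        i, j = rng.choice(len(tour), size=2, replace=False)
        tour[i], tour[j] = tour[j], tour[i]
        new_cost = tour_length(cities, tour)
        if new_cost < cost:
            cost = new_cost                      # shorter: keep the swapped route
        else:
            tour[i], tour[j] = tour[j], tour[i]  # not shorter: undo the swap
    return tour, cost

cities = np.random.default_rng(0).uniform(size=(50, 2))
tour, cost = valley_descent(cities, range(50), n_steps=20_000)
```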
Sample tours $\sigma$ from the distribution proportional to $e^{-\beta C(\sigma)}$ using Metropolis Hastings. ($C(\sigma) = $ cost of the tour $\sigma$.)
Proposal mechanism: Given a tour $\sigma$, randomly swap two cities (or cycle three cities).
Will sometimes accept longer tours.
Start with $\beta$ small, and send $\beta \to \infty$ along some schedule.
How you send $\beta \to \infty$ makes a difference.
Usually you want $\beta$ to increase faster as it becomes larger.
(Or you want $T = 1/\beta$ to decrease to $0$ exponentially.)
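A sketch of simulated annealing for the tour problem, with Metropolis acceptance and an illustrative geometric $\beta$ schedule:

```python
import numpy as np

def tour_length(cities, tour):
    pts = cities[list(tour) + [tour[0]]]
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def anneal_tsp(cities, n_steps, beta0=1.0, betaT=500.0, rng=None):
    """Metropolis-Hastings on tours with target proportional to e^{-beta C(sigma)};
    beta increases geometrically, i.e. T = 1/beta decreases to 0 exponentially."""
    rng = np.random.default_rng() if rng is None else rng
    tour = list(rng.permutation(len(cities)))      # start from a random tour
    cost = tour_length(cities, tour)
    for beta in np.geomspace(beta0, betaT, n_steps):
        i, j = rng.choice(len(tour), size=2, replace=False)
        tour[i], tour[j] = tour[j], tour[i]        # proposal: swap two cities
        new_cost = tour_length(cities, tour)
        # accept with probability min(1, e^{-beta (new - old)}): longer tours are sometimes accepted
        if np.log(rng.uniform()) < -beta * (new_cost - cost):
            cost = new_cost
        else:
            tour[i], tour[j] = tour[j], tour[i]    # reject: undo the swap
    return tour, cost

cities = np.random.default_rng(0).uniform(size=(50, 2))
tour, cost = anneal_tsp(cities, n_steps=100_000)
```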
Don’t start with the greedy nearest neighbor route; start with a random one.
A valley descent (hill climb) works pretty well.
Simulated annealing works a bit better.
In high dimensional problems you usually only get good results if the cost function is “nice”.
Alice and Bob decide on a key: replace A with K, replace B with Z, and so on.
Alice encodes her message with this key, and sends it to Bob.
28 symbols in the alphabet. (Upper case letters, space, period)
Brute force key guessing requires trying about $10^{29}$ combinations.
Even with $10^{12}$ tries per second, this would take a few billion years!
Can crack (normal language) messages with some cleverness in seconds.
Normal language isn’t random strings. Some letter combinations occur more often than others.
Download a few large collections of books. (I used 5 from Project Gutenberg.)
Compute frequencies of ngrams:
How often do two letter combinations AA, AB, AC, etc. occur?
How often do three letter combinations AAA, AAB, AAC, etc. occur?
(Good enough to use 3 / 4 letter combinations. Don’t need to go too high.)
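A possible sketch of the ngram frequency computation (the corpus file name is a placeholder, and this is only the frequency table, not the fitness function $F$, which is left for the homework):

```python
from collections import Counter
import re

ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ .")   # 28 symbols: letters, space, period

def clean(text):
    """Map raw text to the 28-symbol alphabet: uppercase, collapse whitespace, drop the rest."""
    text = re.sub(r"\s+", " ", text.upper())
    return "".join(c for c in text if c in ALPHABET)

def ngram_freqs(text, n=2):
    """Relative frequencies of n-letter combinations in the (cleaned) text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Example usage (the corpus file name is a placeholder):
# freqs = ngram_freqs(clean(open("gutenberg_corpus.txt").read()), n=3)
```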
Use this to define a fitness function $F$:
Given any string of letters $S$, the fitness $F(S) \in \R$.
$F$ is high for natural language strings, and low for “random” strings.
Not telling you the exact definition (it’s for you to find on the homework).
$F$ | String |
---|---|
-396.925 | asdf wer b ewar pou bewp ear bewapoiu b awer wapeoi b asdlk;j wav asdlfk jwe bva |
-180.199 | Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor i |
-50.641 | A Markov chain or Markov process is a stochastic model describing a sequence of |
-47.801 | The Monte Carlo method uses random sampling to solve computational problems that |
Say your fitness function is $F$ and $\mathcal C$ is the coded message.
We will crack the message by finding the key $\sigma$ that maximizes $F( \operatorname{decode}( \mathcal C, \sigma ) )$.
Option 1: Stochastic Hill Climb:
Start with a random guess $\sigma$.
Change it slightly by swapping two symbols and generate a new key $\tau$.
That is, pick two letters randomly (say J and R).
If $\sigma$ replaced J with A and R with T, then $\tau$ will replace J with T and R with A.
If $F( \operatorname{decode}( \mathcal C, \tau ) ) > F( \operatorname{decode}( \mathcal C, \sigma ) )$, then replace $\sigma$ with $\tau$.
Repeat for as long as you can, or until the message looks readable.
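A sketch of the Option 1 loop, keeping the fitness function $F$ as a black box (its definition is left for the homework); the helper names are hypothetical:

```python
import random
import string

ALPHABET = string.ascii_uppercase + " ."              # 28 symbols

def decode(ciphertext, key):
    """key maps each cipher symbol to a plain symbol, e.g. key['J'] = 'A'."""
    return "".join(key[c] for c in ciphertext)

def crack(ciphertext, fitness, n_steps):
    """Stochastic hill climb over keys; `fitness` is the (homework) function F."""
    key = dict(zip(ALPHABET, random.sample(ALPHABET, len(ALPHABET))))  # random initial key
    best = fitness(decode(ciphertext, key))
    for _ in range(n_steps):
        a, b = random.sample(ALPHABET, 2)             # pick two symbols at random
        key[a], key[b] = key[b], key[a]               # swap their images to get the new key
        score = fitness(decode(ciphertext, key))
        if score > best:
            best = score                              # better fitness: keep the swap
        else:
            key[a], key[b] = key[b], key[a]           # otherwise undo it
    return key
```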
Option 2: Use Simulated Annealing:
Both work because the function we’re maximizing is nice. Simulated annealing works a bit better.
N Iters | $F$ | Decoded message |
---|---|---|
0 | -470.33 | . R.SLDBERVW.LREPDS.SLD.GS.FEXFNS. LD.XRPFZO.G.LFAD.SDEMRP.LDFBM.LGP.PDZ GRZ.LDB |
100 | -374.97 | STFS OPUEFHJSOFEMP S OPSR SNEANB STOPSAFMNLCSRSONVPS PEIFMSOPNUISORMSMPLTRFLSOPU |
200 | -328.22 | RE SANUCEJZ AECMNS SAN IS PCGPHS RAN GEMPOB I APXN SNCTEM ANPUT AIM MNORIEO ANU |
300 | -301.30 | CO SANULOVF AOLMNS SAN IS ELHE.S CAN HOMERT I AEZN SNLGOM ANEUG AIM MNRCIOR ANU |
400 | -294.07 | CO SANULOVG AOLMNS SAN IS ELHE.S CAN HOMERK I AEQN SNLDOM ANEUD AIM MNRCIOR ANU |
800 | -222.94 | TO NSERLOCK SOLMEN NSE IN ALUAYN TSE UOMAF. I SAWE NELDOM SEARD SIM MEFTIOF SER |
1,300 | -34.58 | TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAZE SELDOM HEARD HIM MENTION HER |
3,500 | -24.30 | TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER |