Markov Chain Monte Carlo
1. Motivation
Statistical inference consists in learning about what we do not observe based on what we observe.
1.1. Monte Carlo Method
In its general form, Monte Carlo estimates the following expectation:

$$\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim p(x)$$
The only tricky issue is that the randomness involved is the pseudo-randomness of computer simulation, rather than randomness of real-world phenomena. Thus it is a good idea to use terminology that emphasizes the difference.
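As a small illustrative sketch (the integrand and the sampling distribution below are my own choices, not from the text), we can approximate $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$, whose true value is $1$, by averaging over samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_expectation(f, sampler, n_samples=100_000):
    """Approximate E[f(X)] by averaging f over samples drawn from p(x)."""
    samples = sampler(n_samples)
    return np.mean(f(samples))

# Example: E[x^2] for x ~ N(0, 1) equals the variance, i.e. 1.
estimate = monte_carlo_expectation(lambda x: x**2,
                                   lambda n: rng.normal(0.0, 1.0, size=n))
print(estimate)  # ~1.0; the error shrinks like 1/sqrt(N)
```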
1.2. Bayesian Inference
Bayes rule:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta')\, p(\theta')\, d\theta'}$$
The problem here is the integral in the denominator. We can choose the prior and the likelihood so that they are conjugate; with conjugate distributions, the marginal likelihood has a closed form and consequently the posterior can be computed exactly. However, in most cases the model is too complicated ⇒ it is intractable to compute the marginal likelihood in high dimensions ⇒ we approximate the posterior with Monte Carlo and MCMC.
After approximating the posterior distribution of the parameters $\theta$, we can predict the probability of observing a new sample $x_{\text{new}}$ by marginalizing over $\theta$:

$$p(x_{\text{new}} \mid X) = \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid X)\, d\theta$$

This distribution is known as the posterior predictive distribution.
In cases where the model is simple and conjugate priors are used, the posterior and the integral above can be solved analytically. Otherwise, we can use the Monte Carlo method to approximate this probability by taking samples from the posterior distribution:

$$p(x_{\text{new}} \mid X) \approx \frac{1}{N} \sum_{i=1}^{N} p(x_{\text{new}} \mid \theta_i), \qquad \theta_i \sim p(\theta \mid X)$$
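As a hedged sketch of this approximation (the Gaussian model with known noise scale and the synthetic posterior samples below are assumptions for illustration only):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical posterior samples of a Gaussian mean, e.g. produced by MCMC.
posterior_samples = rng.normal(loc=2.0, scale=0.1, size=5_000)  # theta_i ~ p(theta | X)
known_sigma = 1.0                                               # assumed known noise scale

def posterior_predictive(x_new, theta_samples, sigma):
    """p(x_new | X) ~ (1/N) * sum_i p(x_new | theta_i)."""
    return np.mean(norm.pdf(x_new, loc=theta_samples, scale=sigma))

print(posterior_predictive(2.5, posterior_samples, known_sigma))
```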
1.3. Markov Chains
A sequence $X_1, X_2, \ldots$ of random elements of some set is a Markov chain if the conditional distribution of $X_{n+1}$ given $X_1, \ldots, X_n$ depends on $X_n$ only. The set in which the $X_n$ take values is called the state space of the Markov chain. A Markov chain has stationary transition probabilities if the conditional distribution of $X_{n+1}$ given $X_n$ does not depend on $n$. This is the main kind of Markov chain of interest in MCMC. Some kinds of adaptive MCMC have non-stationary transition probabilities.
The joint distribution of a Markov chain is determined by:

- The marginal distribution of $X_1$, called the initial distribution
- The conditional distribution of $X_{n+1}$ given $X_n$, called the transition probability distribution
If we define $\mathcal{S} = \{s_1, \ldots, s_M\}$ as our state space, and $\mathcal{S}$ is a discrete state space with cardinality $M$, then we can fully define the transition probability distribution as a stochastic transition matrix $P \in \mathbb{R}^{M \times M}$ as follows:

$$P_{ij} = \Pr(X_{n+1} = s_j \mid X_n = s_i)$$

Realize that $P_{ij} \ge 0$ and $\sum_{j} P_{ij} = 1$ for every row $i$.
The Markov chain might have a stationary distribution $\pi$:

$$\pi P = \pi$$
Thanks to the Perron-Frobenius theorem, which is stated as follows:

Let $A$ be a real matrix with all its elements positive, $A_{ij} > 0$. Then the top eigenvalue $\lambda_{\max}$ is unique and real (all other eigenvalues are smaller in absolute value), the corresponding top eigenvector $v$ has all its elements positive, and the top eigenvalue satisfies the following inequalities:

$$\min_i \sum_j A_{ij} \le \lambda_{\max} \le \max_i \sum_j A_{ij}$$
The transition matrix has an eigenvalue of $1$, which helps us find the stationary probability distribution over the Markov states. Specifically, if $P$ is the stochastic matrix of the Markov chain, we must have $\sum_j P_{ij} = 1$ for all rows $i$, which makes $\lambda_{\max} = 1$. Then let $\pi$ be the corresponding left eigenvector, normalized so that $\sum_i \pi_i = 1$, which is our stationary distribution:

$$\pi P = \pi$$

or, written componentwise,

$$\pi_j = \sum_i \pi_i P_{ij}$$
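As a quick numerical sketch (the 3-state transition matrix below is made up for illustration), the stationary distribution can be read off as the left eigenvector of $P$ with eigenvalue $1$, normalized to sum to $1$:

```python
import numpy as np

# A made-up 3-state stochastic matrix: each row is P(next state | current state) and sums to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Left eigenvectors of P are (right) eigenvectors of P.T; pick the one with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()          # normalize so the entries sum to 1

print(pi)                   # stationary distribution
print(pi @ P)               # equals pi, i.e. pi P = pi
```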
In order to generalize this definition beyond finite state spaces, we denote the transition operator by:

$$T_n(x' \mid x) = p(X_{n+1} = x' \mid X_n = x)$$

In MCMC, the Markov chain is usually a homogeneous Markov chain, whose transition functions are the same for all $n$: $T_n = T$.

If we use the transition operator, the definition of a stationary (invariant) distribution becomes:

$$\pi(x') = \int \pi(x)\, T(x' \mid x)\, dx$$
1.4. Properties of Markov Chains
To engineer a Markov chain, we need some useful asymptotic properties of Markov chains.
Definition of irreducible Markov chain:

We say that a Markov chain is irreducible if, for each pair of states $i$ and $j$, there exists an $n$ (possibly depending on $i$ and $j$) such that:

$$\Pr(X_n = j \mid X_0 = i) > 0$$

Verbally, a chain is irreducible if it is possible to eventually get from any state $i$ to any other state $j$ in a finite number of steps.
Definition of aperiodic Markov chain:

An irreducible Markov chain is called aperiodic if its period is equal to 1, or equivalently, if $\gcd\{n \ge 1 : \Pr(X_n = i \mid X_0 = i) > 0\} = 1$ for some (and hence every) state $i$.
Theorem: All states of an irreducible MC are of the same type with regard to persistence (recurrence) and periodicity [5]
Theorem: An irreducible Markov chain that has at least one non-null persistent state has a (unique) stationary distribution [7, 8, 9]
Theorem: A finite, irreducible Markov chain (which necessarily has at least one non-null persistent state [9]) has a unique stationary distribution [4]
Theorem: A finite, irreducible, aperiodic Markov chain has a limiting distribution equal to its unique stationary distribution [6]
Theorem: A non-null persistent, irreducible, aperiodic Markov chain has a limiting distribution equal to its unique stationary distribution
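To see the last two theorems in action (a minimal sketch reusing the same hypothetical 3-state chain, which is finite, irreducible, and aperiodic), iterating $p_{n+1} = p_n P$ from any initial distribution converges to the unique stationary distribution:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])     # finite, irreducible, aperiodic (illustrative)

p = np.array([1.0, 0.0, 0.0])       # an arbitrary initial distribution: start in state 0
for _ in range(200):
    p = p @ P                       # p_{n+1} = p_n P

print(p)                            # close to the unique stationary distribution, whatever p_0 was
```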
2. How To Sample?
The only remaining problem is how to sample $x \sim p(x)$. There are some standard methods, as well as more intelligent and efficient ones.
2.1. Inverse Transform Sampling
Inverse transform sampling is a straightforward method for generating random samples from any distribution by using its inverse cumulative distribution function (ICDF). It is easy to note that the CDF $F_X(x) = \Pr(X \le x)$ is non-decreasing and takes values in $[0, 1]$.
(Figure: illustration of inverse transform sampling. Source: https://en.wikipedia.org/wiki/Inverse_transform_sampling)
By sampling a uniformly distributed random variable as a value of the cumulative distribution function and inverting that value, we can get the desired sample $x = F_X^{-1}(u)$. Here is the proof that the inverse-transformed uniform random variable has the same CDF as the desired distribution. Let $U \sim \mathrm{Uniform}(0, 1)$ and $Y = F_X^{-1}(U)$; then

$$\Pr(Y \le x) = \Pr(F_X^{-1}(U) \le x) = \Pr(U \le F_X(x)) = F_X(x)$$
The intuition behind inverse transform sampling is simple: re-scale a uniform random variable so that it has the distribution we want.
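As a concrete sketch (the exponential distribution is chosen here because its CDF inverts in closed form; this choice is mine, not the text's): for $\mathrm{Exp}(\lambda)$ we have $F(x) = 1 - e^{-\lambda x}$, so $F^{-1}(u) = -\ln(1 - u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_exponential(rate, n):
    """Inverse transform sampling for Exp(rate): F(x) = 1 - exp(-rate * x)."""
    u = rng.uniform(0.0, 1.0, size=n)     # u ~ Uniform(0, 1)
    return -np.log(1.0 - u) / rate        # x = F^{-1}(u)

samples = sample_exponential(rate=2.0, n=100_000)
print(samples.mean())                      # ~ 1/rate = 0.5
```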
2.2. Acceptance-Rejection Sampling
Rejection sampling uses a proxy proposal distribution with density function $q(x)$, chosen together with a constant $M$ such that $p(x) \le M q(x)$ for all $x$, to generate candidate samples, and an acceptance-rejection criterion to filter out unwanted samples so that the accepted samples follow our desired distribution with density function $p(x)$. The algorithm is described as follows:
- Sample an observation $x$ from $q(x)$ and a uniformly distributed sample $u \sim \mathrm{Uniform}(0, 1)$
- Check whether $u < \dfrac{p(x)}{M q(x)}$. If it is true, then accept this sample as a sample from the distribution $p(x)$; else, reject it and return to step 1.
The algorithm will take on average $M$ iterations to get one sample from $p(x)$.
Proof
We need to show that the conditional probability satisfies $p(x \mid \text{accept}) = p(x)$. Firstly we compute the reverse conditional probability $\Pr(\text{accept} \mid x) = \dfrac{p(x)}{M q(x)}$. Using Bayes' theorem with the fact that $\Pr(\text{accept}) = \int \Pr(\text{accept} \mid x)\, q(x)\, dx = \dfrac{1}{M}$:

$$p(x \mid \text{accept}) = \frac{\Pr(\text{accept} \mid x)\, q(x)}{\Pr(\text{accept})} = \frac{\frac{p(x)}{M q(x)}\, q(x)}{1 / M} = p(x)$$
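A hedged sketch of the algorithm (the Beta(2, 2) target, the Uniform(0, 1) proposal, and the bound M = 1.5 are my illustrative choices; the Beta(2, 2) density $6x(1 - x)$ peaks at 1.5, so $p(x) \le M q(x)$ holds):

```python
import numpy as np

rng = np.random.default_rng(3)

def p_target(x):
    """Target density: Beta(2, 2), i.e. 6 * x * (1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, M=1.5):
    """Draw n samples from p_target using a Uniform(0, 1) proposal q with p <= M * q."""
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 1.0)          # step 1: x ~ q(x), u ~ Uniform(0, 1)
        u = rng.uniform(0.0, 1.0)
        if u < p_target(x) / (M * 1.0):    # step 2: accept with probability p(x) / (M q(x))
            out.append(x)                  # accepted samples follow p_target
    return np.array(out)

samples = rejection_sample(10_000)
print(samples.mean())                       # ~0.5, the Beta(2, 2) mean; about M draws per accept
```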
2.3. Markov Chain Monte Carlo
We need a Markov chain with the following conditions:

- It has the desired sampling distribution $\pi(x)$ as the Markov chain's stationary distribution. This first condition can be satisfied by constraining the transition operator with a condition called the detailed balance condition.
- From any starting point, the chain reaches the stationary distribution and can visit all possible states. This condition is known as ergodicity.

These conditions guarantee that realizations of a single Markov chain can approximate the integrals without sampling from the actual distribution $\pi(x)$ directly. They are justified by the following fundamental theorem.
Fundamental Theorem to Design a Chain with a Known Invariant Distribution [1]
If a homogeneous Markov chain on a finite state space with transition operator $T(x' \mid x)$ has $\pi(x)$ as an invariant distribution and

$$\nu = \min_{x} \min_{x' : \pi(x') > 0} \frac{T(x' \mid x)}{\pi(x')} > 0,$$

then the Markov chain is ergodic, i.e.:

- If $a(x)$ is a real-valued function on the state space of the Markov chain, then the expectation of $a$ with respect to the distribution $p_n(x)$ of the chain at step $n$, written $\mathbb{E}_{p_n}[a]$, converges to its expectation with respect to $\pi$, written $\mathbb{E}_{\pi}[a]$.
- For all $x$, regardless of the initial distribution $p_0(x)$, the limiting distribution is our unique invariant distribution: $\lim_{n \to \infty} p_n(x) = \pi(x)$.
The detailed balance condition is stated as follows: a transition operator $T$ satisfies the detailed balance condition if there exists a distribution $\pi$ such that

$$\pi(x)\, T(x' \mid x) = \pi(x')\, T(x \mid x')$$

(and such a $\pi$ is a stationary distribution of the chain: integrating both sides over $x$ gives $\int \pi(x)\, T(x' \mid x)\, dx = \pi(x') \int T(x \mid x')\, dx = \pi(x')$, which is exactly the invariance condition).

Therefore, we only need to find a transition operator that satisfies $\pi(x)\, T(x' \mid x) = \pi(x')\, T(x \mid x')$.

Remember that this is only a sufficient condition for making the desired distribution a stationary distribution: if the stationary distribution of the Markov chain is not unique, or the stationary distribution is not the limiting distribution, then the detailed balance condition alone is not enough. Fortunately, ergodicity implies that the homogeneous Markov chain has a unique stationary distribution (irreducibility) and that this distribution is also the limiting distribution (aperiodicity).
2.4. Metropolis-Hastings Algorithm
Let $\pi(x) = \dfrac{\tilde{\pi}(x)}{Z}$ be our desired distribution to sample from. We wish to have $\pi(x)$ as our stationary distribution in MCMC. A proposal distribution randomly walks from a point $x$ to another point $x'$ with probability $q(x' \mid x)$, and then the Metropolis-Hastings algorithm accepts the new sample point if that point yields a high unnormalized density $\tilde{\pi}(x')$.
Metropolis-Hastings:

- Sample $x_0$ from an arbitrary probability distribution $p_0(x)$
- for $t = 0, 1, \ldots, N - 1$ do
    - Propose a random-walk point $x' \sim q(x' \mid x_t)$
    - The acceptance rate of the new point $x'$ is $\alpha = \min\!\left(1, \dfrac{\tilde{\pi}(x')\, q(x_t \mid x')}{\tilde{\pi}(x_t)\, q(x' \mid x_t)}\right)$
    - Sample $u \sim \mathrm{Uniform}(0, 1)$
    - If $u < \alpha$ then assign the new sample point to the random-walk point, $x_{t+1} = x'$; else keep the old point, $x_{t+1} = x_t$
- Collect the $N$ points $x_1, \ldots, x_N$ to create a histogram (an empirical approximation) of the distribution $\pi(x)$
We now replace $x_t$ with $x$ for easier manipulation with a general original point. Note that for the term $\dfrac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}$, we do not need to compute the normalized density $\pi$ exactly, which is computationally intractable in the first place; we can replace it with the unnormalized term $\tilde{\pi}$ without any difference, due to the cancellation of the normalizing constant between numerator and denominator: $\dfrac{\pi(x')}{\pi(x)} = \dfrac{\tilde{\pi}(x') / Z}{\tilde{\pi}(x) / Z} = \dfrac{\tilde{\pi}(x')}{\tilde{\pi}(x)}$.
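Below is a minimal sketch of the algorithm above (the unnormalized target $\tilde{\pi}(x) = e^{-x^2/2}$ and the Gaussian random-walk proposal are illustrative assumptions; since the proposal is symmetric, the $q$ terms cancel in the acceptance rate):

```python
import numpy as np

rng = np.random.default_rng(4)

def pi_tilde(x):
    """Unnormalized target density: a standard normal up to its normalizing constant."""
    return np.exp(-0.5 * x**2)

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    """Random-walk Metropolis-Hastings; the symmetric proposal gives q(x|x')/q(x'|x) = 1."""
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        x_prop = x + rng.normal(0.0, step)                 # propose x' ~ q(x' | x)
        alpha = min(1.0, pi_tilde(x_prop) / pi_tilde(x))   # acceptance rate
        if rng.uniform() < alpha:                          # accept with probability alpha...
            x = x_prop
        samples[t] = x                                     # ...otherwise keep the old point
    return samples

samples = metropolis_hastings(50_000)
print(samples.mean(), samples.std())   # roughly 0 and 1 for the standard normal target
```

In practice one also discards an initial burn-in portion of the chain before using the samples, since the early iterations still depend on the starting point.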
We will prove that Metropolis-Hastings satisfies the ergodicity conditions and the detailed balance condition with the desired probability $\pi(x)$ as its stationary distribution. Firstly, the transition function of Metropolis-Hastings is

$$T(x' \mid x) = q(x' \mid x)\, \alpha(x', x) + \delta(x' - x) \left(1 - \int q(x'' \mid x)\, \alpha(x'', x)\, dx''\right), \qquad \alpha(x', x) = \min\!\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$

since in the case $x' = x$ there are two possibilities: the proposal distribution gives a new point that equals exactly $x$, which happens with probability $q(x \mid x)$ and is always accepted ($\alpha(x, x) = 1$); meanwhile, the proposal distribution may give a new point $x''$ that is different from the original one but is rejected, which happens with probability $1 - \alpha(x'', x)$. For $x' \ne x$ we have

$$\pi(x)\, T(x' \mid x) = \pi(x)\, q(x' \mid x) \min\!\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right) = \min\!\left(\pi(x)\, q(x' \mid x),\ \pi(x')\, q(x \mid x')\right) = \pi(x')\, q(x \mid x') \min\!\left(1, \frac{\pi(x)\, q(x' \mid x)}{\pi(x')\, q(x \mid x')}\right) = \pi(x')\, T(x \mid x')$$

For $x' = x$, we only need to prove $\pi(x)\, T(x \mid x) = \pi(x)\, T(x \mid x)$, which is obvious.
This property of Metropolis-Hastings is exactly the detailed balance condition. Using the fundamental theorem above with the stationary distribution $\pi$, we have $\nu = \min_x \min_{x'} \dfrac{T(x' \mid x)}{\pi(x')} > 0$, because the proposal density is always greater than $0$ and $\pi(x_t) > 0$ for all sampled points $x_t$. Indeed, assume $x_t$ is the first sample such that $\pi(x_t) = 0$ or $\tilde{\pi}(x_t) = 0$; then the proposal point $x' = x_t$ could only be accepted with probability $\alpha = \min\!\left(1, \dfrac{\tilde{\pi}(x_t)\, q(x_{t-1} \mid x_t)}{\tilde{\pi}(x_{t-1})\, q(x_t \mid x_{t-1})}\right) = 0$, meaning $x_t$ can not be accepted.
3. References
[1] https://www.robots.ox.ac.uk/~fwood/teaching/C19_hilary_2015_2016/mcmc.pdf
[3] https://www.youtube.com/watch?v=ZjrJpkD3o1w
[4] http://faculty.washington.edu/harin/markovchain_proofs.pdf
[5] https://www.ece.iastate.edu/~namrata/EE527_Spring08/l4c.pdf
[6] https://www.probabilitycourse.com/chapter11/11_2_6_stationary_and_limiting_distributions.php
[7] http://www.columbia.edu/~ks20/stochastic-I/stochastic-I-MCII.pdf
[8] https://www.commit.tu-berlin.de/fileadmin/fg302/FSP-markov-chains-smaple.pdf
[9] https://bookdown.org/jkang37/stochastic-process-lecture-notes/lecture09.html