Variational Inference

Posted by Huy Phan on 2022-07-10


1. Motivation

Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that can be used to estimate the posterior distributions of model parameters.

In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models.

  1. For simple models, an analytical formula for the likelihood function can typically be derived.

    In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms
    for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain.

  2. However, for more complex models, an analytical formula might be elusive or the likelihood function might be computationally very costly to evaluate. Variational inference, with its idea of optimizing an alternative distribution that approximates the true distribution, has contributed greatly to solving this hard problem.

2. Variational Inference

Source: https://gregorygundersen.com/blog/2021/04/16/variational-inference/


2.1. Basic Theory

Before introducing the problem and variational inference as a solution to it, we present some common concepts from information theory.

Kullback-Leibler divergence

Source: https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#/media/File:KL-Gauss-Example.png


The KL-divergence, also known as the relative entropy, measures the discrepancy between two probability densities. Consider two distributions $P$ and $Q$ with densities $p(x)$ and $q(x)$; conventionally, $P$ represents the data or observations and $Q$ represents a hypothesis or model of the data. The KL-divergence is commonly used as a measure of similarity between the densities:

$$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$

In information theory, $D_{KL}(p \,\|\, q)$ can be interpreted as the excess number of bits incurred when a code optimized for samples from $Q$ is used to encode samples from $P$. This interpretation follows from the fact that the entropy $H(p)$ is the average number of bits needed to encode samples from $P$, while the cross-entropy $H(p, q)$ is the expected number of bits needed when the code optimized for $Q$ is used to encode samples from $P$:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = -\int p(x) \log q(x)\, dx + \int p(x) \log p(x)\, dx$$

Since we use the codebook optimized for $Q$ to encode samples from $P$, the cross-entropy is at least as large as the entropy: $H(p, q) \ge H(p)$.

For this reason, the KL-divergence is non-negative:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) \ge 0$$

To show that the KL-divergence behaves like a distance between two distributions, we can also use its convexity. For two pairs of densities $(p_1, q_1)$ and $(p_2, q_2)$ and a mixing weight $\lambda \in [0, 1]$, the convex combinations of these densities satisfy

$$D_{KL}\big(\lambda p_1 + (1-\lambda) p_2 \,\big\|\, \lambda q_1 + (1-\lambda) q_2\big) \le \lambda\, D_{KL}(p_1 \,\|\, q_1) + (1-\lambda)\, D_{KL}(p_2 \,\|\, q_2)$$

Indeed, the proof relies on the fact that $\log$ is a concave function (Jensen's inequality).

The KL-divergence is also asymmetric: in general,

$$D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$$

An important property of the KL-divergence is that $D_{KL}(p \,\|\, q) = 0$ if and only if $p = q$ almost everywhere.
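
To make these properties concrete, here is a minimal numerical sketch (my own illustration, not from the cited sources) that computes the KL-divergence between two discrete distributions and checks non-negativity and asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # by convention 0 * log 0 = 0, so sum only over the support of p
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))   # non-negative
print(kl_divergence(q, p))   # differs from the line above: KL is asymmetric
print(kl_divergence(p, p))   # exactly 0 only when the two distributions coincide
```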

Recall Bayesian inference from the MCMC post: we compute the posterior distribution of the parameters after observing the data via Bayes' theorem, $p(\theta \mid D) = p(D \mid \theta)\, p(\theta) / p(D)$. In this context, we replace $\theta$ and $D$ with a latent variable $z$ governing the data distribution and an observed variable $x$. We have the formula

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} = \frac{p(x, z)}{\int p(x, z)\, dz} \tag{1}$$

Since $p(x) = \int p(x, z)\, dz$ is intractable to compute, the posterior distribution $p(z \mid x)$ is also hard to evaluate. While MCMC uses samples from a Markov chain to estimate the exact posterior, variational inference takes the optimization route.
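
To see why the evidence is typically intractable, consider a standard example along the lines of the Bayesian mixture model used in [2]: $K$ unit-variance Gaussians with means $\mu_k \sim \mathcal{N}(0, \sigma^2)$, cluster assignments $c_n$ drawn uniformly, and observations $x_n \mid c_n, \mu \sim \mathcal{N}(\mu_{c_n}, 1)$. The evidence is

$$p(x) = \int p(\mu) \prod_{n=1}^{N} \sum_{c_n=1}^{K} p(c_n)\, p(x_n \mid c_n, \mu)\, d\mu$$

Each individual Gaussian integral is tractable, but expanding the product over the sums yields $K^N$ terms, so evaluating $p(x)$ exactly is infeasible for even moderately sized data sets.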

In general, VI aims to solve the minimization problem

$$q^*(z) = \arg\min_{q \in \mathcal{Q}} D_{KL}\big(q(z) \,\|\, p(z \mid x)\big) \tag{2}$$

$q^*(z)$ then becomes our final approximation of the posterior distribution $p(z \mid x)$. So the question is: why don't we use the forward KL-divergence $D_{KL}(p(z \mid x) \,\|\, q(z))$? The answer is that the forward KL is intractable to compute, since it involves an expectation over the posterior distribution, which is exactly what we wanted to compute in the first place. Using the reverse KL has its own drawbacks, which will be discussed further in section 2.2. Despite avoiding the expectation over the posterior, the reverse KL is still intractable. To be clear, we have the following derivation:

$$D_{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(x, z)] + \log p(x)$$

The term $\log p(x)$ is called the evidence, and it requires us to compute the denominator in (1), since $p(x) = \int p(x, z)\, dz$. Fortunately, we are only treating $q$ as the optimization variable, so $\log p(x)$ is a constant. An alternative function to be optimized is therefore proposed; it is called the ELBO.

Evidence Lower Bound (ELBO)

The evidence lower bound is a function of $q$ such that minimizing $D_{KL}(q(z) \,\|\, p(z \mid x))$ is equivalent to maximizing the ELBO. Specifically, we will maximize

$$\mathrm{ELBO}(q) = \mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)] = \log p(x) - D_{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$

Remember that $q$ only models the latent variable and not the data. However, the maximization process already incorporates the data, because the log-likelihood enters the ELBO: $\mathrm{ELBO}(q) = \mathbb{E}_{q}[\log p(x \mid z)] - D_{KL}\big(q(z) \,\|\, p(z)\big)$.

The ELBO owes its name to the fact that $\log p(x) \ge \mathrm{ELBO}(q)$, since $D_{KL}(q(z) \,\|\, p(z \mid x)) \ge 0$.

Finally, this is the ultimate objective of variational inference: finding the optimal $q^*$ to estimate the posterior distribution,

$$q^*(z) = \arg\max_{q \in \mathcal{Q}} \mathrm{ELBO}(q) \tag{3}$$
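
As a numerical sanity check of the identity $\log p(x) = \mathrm{ELBO}(q) + D_{KL}(q(z) \,\|\, p(z \mid x))$, here is a small sketch (a toy example of my own, not from the references) on a conjugate Gaussian model where the evidence and the posterior are available in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model: z ~ N(0, 1), x | z ~ N(z, 1), with a single observation x0.
x0 = 1.5
log_evidence = norm.logpdf(x0, loc=0.0, scale=np.sqrt(2.0))  # p(x) = N(x; 0, 2)
post_mean, post_var = x0 / 2.0, 0.5                          # p(z | x) = N(x0/2, 1/2)

def elbo(m, s, n_samples=200_000):
    """Monte Carlo estimate of the ELBO for a Gaussian q(z) = N(m, s^2)."""
    z = rng.normal(m, s, size=n_samples)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x0, z, 1.0)  # log p(z) + log p(x | z)
    log_q = norm.logpdf(z, m, s)
    return np.mean(log_joint - log_q)

def kl_gauss(m1, s1, m2, v2):
    """Closed-form D_KL(N(m1, s1^2) || N(m2, v2))."""
    return 0.5 * (np.log(v2 / s1**2) + (s1**2 + (m1 - m2) ** 2) / v2 - 1.0)

m, s = 0.3, 1.2  # an arbitrary, deliberately suboptimal variational distribution
print(elbo(m, s))                                           # ELBO(q)
print(log_evidence - kl_gauss(m, s, post_mean, post_var))   # log p(x) - KL: matches up to MC noise
```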

2.2. Properties of Variational Inference

Suppose $p$ is the true underlying distribution of the data and $q$ is our model. Without loss of generality, $p$ is held fixed and the optimization process refines $q$ toward the form of $p$. Recall that the KL-divergence is asymmetric: optimizing the forward KL-divergence $D_{KL}(p \,\|\, q)$ yields a different optimal density than optimizing the reverse KL-divergence $D_{KL}(q \,\|\, p)$. Specifically, the reverse KL-divergence tends to underestimate the support of the true data distribution and exhibits mode-seeking behavior, whereas the forward KL is mass-covering. This can be seen intuitively in the figure below.

Source Tim Vieira's blog, figure by John Winn.

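
The mode-seeking versus mass-covering behavior can also be reproduced numerically. Below is a small sketch (my own illustration, not taken from the figure's source) that fits a single Gaussian $q$ to a bimodal mixture $p$ on a grid, once by minimizing the forward KL and once by minimizing the reverse KL:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target p(x): mixture of two well-separated Gaussians.
grid = np.linspace(-10, 10, 2001)
dx = grid[1] - grid[0]
p = 0.5 * norm.pdf(grid, -3, 1) + 0.5 * norm.pdf(grid, 3, 1)

def kl(a, b):
    """Discretized D_KL(a || b) on the grid."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dx

def fit(objective):
    """Fit a single Gaussian q = N(m, s^2) by minimizing the given objective."""
    def loss(theta):
        m, log_s = theta
        return objective(norm.pdf(grid, m, np.exp(log_s)))
    res = minimize(loss, x0=[0.5, 0.0], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

m_fwd, s_fwd = fit(lambda q: kl(p, q))  # forward KL: mass-covering
m_rev, s_rev = fit(lambda q: kl(q, p))  # reverse KL: mode-seeking

print(f"forward KL fit: mean={m_fwd:.2f}, std={s_fwd:.2f}")  # wide Gaussian covering both modes
print(f"reverse KL fit: mean={m_rev:.2f}, std={s_rev:.2f}")  # narrow Gaussian locked onto one mode
```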

2.3. Mean-field Variational Inference

We now introduce a popular variational family, the mean-field variational family, in which the latent variables are mutually independent and each latent variable $z_j$ is governed by its own density function $q_j(z_j)$ with its own parameters controlling the shape of that density. These parameters are often called variational parameters. A member of the mean-field variational family can be described by the following formula:

$$q(z) = \prod_{j=1}^{m} q_j(z_j)$$

Note that we have not defined the parametric form of these density functions: the mean-field family only imposes independence between the latent variables and relaxes other constraints, balancing tractability and generality. With the factorization assumption, the entropy of the approximating distribution is simply the sum of the entropies of the individual factors, which makes the alternating optimization easier (discussed in section 2.4).

It is easier to optimize over members of the mean-field variational family, but this comes at the cost of approximation accuracy, especially when the correlations between the latent variables are not negligible. Note that we omit the variational parameters to simplify the notation and to preserve the parametric or non-parametric generality of the factorized density functions.
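
As a quick illustration of the factorization (a sketch of my own, not tied to any particular model), the log-density and the entropy of a fully factorized Gaussian decompose into per-coordinate sums:

```python
import numpy as np
from scipy.stats import norm

# Mean-field Gaussian q(z) = prod_j N(z_j; m_j, s_j^2) over m = 4 latent variables.
m = np.array([0.0, 1.0, -2.0, 0.5])
s = np.array([1.0, 0.5, 2.0, 1.5])
z = np.array([0.2, 0.8, -1.5, 0.0])

# The joint log-density is the sum of the factor log-densities.
log_q_joint = np.sum(norm.logpdf(z, m, s))

# The entropy of q is the sum of per-factor Gaussian entropies 0.5 * log(2*pi*e*s_j^2).
entropy = np.sum(0.5 * np.log(2 * np.pi * np.e * s**2))

print(log_q_joint, entropy)
```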

Source: http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/lecture10_2x2.pdf


In the above figure, the red distribution, produced by a member of the mean-field variational family, fails to capture the correlation structure of the true distribution (in green) of the latent variables.

2.4. Coordinate Ascent VI algorithm (CAVI)

In mean-field variational inference, where the mean-field variational family is used, the most commonly used algorithm for solving the optimization problem described in equation (3) is coordinate ascent variational inference (CAVI). This algorithm is described in detail in Bishop (2006) [1]. Like alternating optimization, CAVI iteratively optimizes each factor of the mean-field variational density while keeping the other factors fixed. The optimization path of CAVI in the latent variable space therefore looks like a staircase.

Source: http://www.adeveloperdiary.com/data-science/machine-learning/introduction-to-coordinate-descent-using-least-squares-regression/


The main update formula of CAVI is based on the exponentiated expected log of the joint distribution of the latent variables $z$ and the observed variables $x$. Specifically, keeping the other variational factors fixed, we set the factorized density of each latent variable $z_j$ proportional to this exponentiated expected log joint:

$$q_j^*(z_j) \propto \exp\big\{\mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]\big\} \tag{4}$$

where $\mathbb{E}_{-j}[\cdot]$ denotes the expectation with respect to all other factors $q_{-j}$, which are kept fixed while optimizing factor $q_j$, and the star indicates that the object we are optimizing here is the density function $q_j$ itself, not its parameters. Later, in most cases of parametric variational inference, we will directly optimize variational parameters such that $q_j$ is proportional to the quantity described in equation (4).

Now, how does equation (4) come about? Recall the optimization problem of VI, $q^* = \arg\max_{q} \mathrm{ELBO}(q)$. In ordinary calculus we mostly care about optimizing over a set of variables; here we are dealing with optimization over a set of functions. The tool I will use to solve this optimization problem is the calculus of variations. This method was developed in the 18th century by Leonhard Euler and many other phenomenal mathematicians. The field uses variations, which are small changes in functions, to find the maxima and minima of functionals: mappings from a set of functions to the real numbers. In this case, the functional to be maximized is $\mathrm{ELBO}(q)$ and the set of functions is $\{q_1, \dots, q_m\}$.

Under the CAVI assumption, we fix the other factors and look for the optimal factor $q_j$; the functional then becomes

$$\mathrm{ELBO}(q) = \int q_j(z_j)\, \mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]\, dz_j - \sum_{i} \int q_i(z_i) \log q_i(z_i)\, dz_i$$

However, since we are fixing $q_{-j}$, the functional to be optimized reduces to

$$\mathcal{F}[q_j] = \int q_j(z_j)\, \mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]\, dz_j - \int q_j(z_j) \log q_j(z_j)\, dz_j + \text{const} \tag{5}$$

Now, the calculus of variations is particularly interested in one kind of functional that admits a closed-form optimality condition, provided by the so-called Euler-Lagrange equation. Consider a functional of the form

$$F[q] = \int G\big(z, q(z), q'(z)\big)\, dz \tag{6}$$

We introduce the functional derivative, which describes how much a functional changes when we make a small change $\epsilon\,\eta(z)$ to the function:

$$F[q + \epsilon \eta] = F[q] + \epsilon \int \frac{\delta F}{\delta q(z)}\, \eta(z)\, dz + O(\epsilon^2)$$

We can prove that, for a functional of the form in (6),

$$\frac{\delta F}{\delta q(z)} = \frac{\partial G}{\partial q} - \frac{d}{dz}\left(\frac{\partial G}{\partial q'}\right)$$

(see Bishop 2006 [1] or my upcoming blog about the calculus of variations and its applications in engineering).

Using this fact, we can plug equation (5) into this condition and solve for $q_j$. Indeed, since the integrand in (5) does not depend on $q_j'$, setting the functional derivative to zero gives

$$\mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)] - \log q_j(z_j) - 1 = 0 \quad\Longrightarrow\quad q_j^*(z_j) \propto \exp\big\{\mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]\big\}$$

which is exactly the CAVI update in equation (4).

In summary, we obtain the final coordinate ascent variational inference algorithm [2]:

  Initialize the variational factors $q_j(z_j)$
  While the ELBO has not converged do
     for $j = 1, \dots, m$ do
       Set $q_j(z_j) \propto \exp\big\{\mathbb{E}_{-j}[\log p(z_j, z_{-j}, x)]\big\}$
     end
     Compute $\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x, z)] - \mathbb{E}_q[\log q(z)]$
  end
  return $q(z) = \prod_{j} q_j(z_j)$
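
To make the algorithm concrete, here is a minimal sketch of CAVI for the classic univariate Gaussian with unknown mean and precision (the example treated in section 10.1.3 of Bishop [1]); the priors and variable names below are my own choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a univariate Gaussian with unknown mean and precision.
true_mean, true_precision = 2.0, 4.0
x = rng.normal(true_mean, 1.0 / np.sqrt(true_precision), size=200)
N, xbar = len(x), np.mean(x)

# Priors: mu | tau ~ N(mu0, (lambda0 * tau)^-1), tau ~ Gamma(a0, b0).
mu0, lambda0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Mean-field factors: q(mu) = N(mu_N, 1/lam_N), q(tau) = Gamma(a_N, b_N).
E_tau = 1.0  # initial guess for E_q[tau]
for _ in range(50):  # iterate the coordinate updates until (practical) convergence
    # Update q(mu) with q(tau) held fixed.
    mu_N = (lambda0 * mu0 + N * xbar) / (lambda0 + N)
    lam_N = (lambda0 + N) * E_tau
    # Update q(tau) with q(mu) held fixed.
    E_dev_data = np.sum((x - mu_N) ** 2 + 1.0 / lam_N)   # E_q[ sum_n (x_n - mu)^2 ]
    E_dev_prior = (mu_N - mu0) ** 2 + 1.0 / lam_N        # E_q[ (mu - mu0)^2 ]
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (E_dev_data + lambda0 * E_dev_prior)
    E_tau = a_N / b_N

print(f"E_q[mu]  = {mu_N:.3f}  (true mean {true_mean})")
print(f"E_q[tau] = {E_tau:.3f}  (true precision {true_precision})")
```

Each pass updates one factor while the other is held fixed, exactly as in the pseudocode above; the expectations needed for each update are available in closed form thanks to the mean-field factorization.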

We have introduced the motivation and basic theory of variational inference. In the following posts, we will continue with its application to the Variational Auto-encoder and show how coordinate ascent variational inference (CAVI) is actually used in the Gaussian Mixture Model.

To be continued.

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop - Pattern Recognition And Machine Learning - Springer 2006.pdf

[2] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. arXiv:1601.00670.