Variational Inference
1. Motivation
Approximate Bayesian computation (ABC) constitutes a class of computational methods rooted in Bayesian statistics that can be used to estimate the posterior distributions of model parameters.
In all model-based statistical inference, the likelihood function is of central importance, since it expresses the probability of the observed data under a particular statistical model, and thus quantifies the support data lend to particular values of parameters and to choices among different models.
For simple models, an analytical formula for the likelihood function can typically be derived.
In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain samples from the desired distribution by recording states from the chain.
However, for more complex models, an analytical formula might be elusive, or the likelihood function might be computationally very costly to evaluate. Variational inference, built on the idea of optimizing an alternative distribution to approximate the true one, has greatly contributed to solving this hard problem.
2. Variational Inference
Source: https://gregorygundersen.com/blog/2021/04/16/variational-inference/
2.1. Basic Theory
Before introducing the problem and variational inference as a solution to it, we present some common concepts from information theory.
Kullback-Leibler divergence
Source: https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#/media/File:KL-Gauss-Example.png
The KL-divergence, also known as the relative entropy, measures how one probability distribution differs from another. Consider two distributions $P$ and $Q$ with densities $p(x)$ and $q(x)$; normally, $P$ represents the data or observations and $Q$ represents a hypothesis or model of the data. The KL-divergence is commonly used as a measure of similarity between densities:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
In information theory, it can be interpreted as the excess number of bits incurred when a code optimized for samples from $Q$ is used to encode samples from $P$. This follows from the fact that the entropy $H(P)$ is the average number of bits needed to encode samples from $P$, while the cross-entropy $H(P, Q)$ is the expected number of bits needed when the code built for $Q$ is used to encode samples from $P$:
$$D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P).$$
Since we use the codebook from $Q$ to encode $P$, the cross-entropy is at least as large as the entropy, $H(P, Q) \ge H(P)$. For this reason, the KL-divergence is non-negative:
$$D_{\mathrm{KL}}(P \,\|\, Q) \ge 0.$$
To see that the KL-divergence behaves like a distance between two distributions, we need its convexity. Given two arbitrary densities $p$ and $q$, the mixture $r_\lambda = \lambda p + (1 - \lambda)\, q$ with $\lambda \in [0, 1]$ is a convex combination of these densities. By increasing $\lambda$ we can move $r_\lambda$ closer to $p$, and the convexity of the KL-divergence (it is jointly convex in its two arguments) gives
$$D_{\mathrm{KL}}(r_\lambda \,\|\, p) \le \lambda\, D_{\mathrm{KL}}(p \,\|\, p) + (1 - \lambda)\, D_{\mathrm{KL}}(q \,\|\, p) = (1 - \lambda)\, D_{\mathrm{KL}}(q \,\|\, p),$$
so the divergence shrinks as $r_\lambda$ approaches $p$. Indeed, the convexity can be proved from the log-sum inequality, which relies on the fact that $\log$ is a concave function.
The KL-divergence is, however, asymmetric: in general,
$$D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P).$$
An important property of the KL-divergence is that it equals 0 if and only if $P = Q$ (almost everywhere).
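As a quick numerical sanity check of these properties, the following Python sketch compares two small discrete distributions; the particular probability vectors and the helper `kl` are hypothetical choices made for illustration, not part of the derivation above.

```python
# Minimal check of non-negativity, asymmetry, and D_KL = 0 iff P = Q,
# using two hypothetical discrete distributions.
import numpy as np

def kl(p, q):
    """Discrete KL-divergence D_KL(P || Q) = sum_i p_i * log(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # "data" distribution P
q = np.array([0.4, 0.4, 0.2])   # "model" distribution Q

print(kl(p, q))   # positive: non-negativity
print(kl(q, p))   # a different positive number: asymmetry
print(kl(p, p))   # exactly 0: the divergence vanishes only when P = Q
```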
Recall Bayesian inference from the MCMC post: we compute the posterior distribution of the parameters $\theta$ after observing the data $x$ with Bayes' theorem, $p(\theta \mid x) = p(x \mid \theta)\, p(\theta) / p(x)$. In this context, we replace $\theta$ and $x$ with the latent variables $z$ governing the data distribution and the observed variables $x$. We have the formula
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} = \frac{p(x, z)}{p(x)}. \tag{1}$$
Since the evidence $p(x) = \int p(x, z)\, dz$ is intractable to compute, the posterior distribution is also hard to evaluate. While MCMC uses samples from a Markov chain to estimate the exact posterior, variational inference takes the optimization route.
In general, VI aims to solve the minimization problem
$$q^*(z) = \operatorname*{arg\,min}_{q \in \mathcal{Q}} D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big), \tag{2}$$
and $q^*(z)$ becomes our final approximation of the posterior distribution $p(z \mid x)$. So the question is: why don't we use the forward KL-divergence $D_{\mathrm{KL}}\big(p(z \mid x) \,\|\, q(z)\big)$? The answer is that the forward KL is intractable to compute, since it involves an expectation over the posterior distribution, which is exactly what we want to compute in the first place. Using the reverse KL has its own drawbacks; these will be discussed in section 2.2. Despite avoiding the expectation over the posterior distribution, the reverse KL is still intractable. To be clear, we have the following derivation:
$$D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(x, z)] + \log p(x).$$
The term $\log p(x)$ is called the evidence, and it requires us to compute the denominator in (1), since $p(x) = \int p(x, z)\, dz$. Fortunately, we are only considering $q$ as the optimization variable, so $\log p(x)$ is a constant. An alternative objective function is therefore proposed, called the ELBO.
Evidence Lower Bound (ELBO)
The evidence lower bound is a function of $q$ such that optimizing $D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big)$ is equivalent to optimizing the ELBO. Specifically, we will maximize the ELBO:
$$\mathrm{ELBO}(q) = \mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)].$$
Remember that $q(z)$ only models the latent variables $z$ and not the data. However, the maximization process already incorporates the data information $x$, by introducing the log-likelihood into the ELBO:
$$\mathrm{ELBO}(q) = \mathbb{E}_{q}[\log p(x \mid z)] - D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big).$$
The ELBO gets its name from the fact that it is a lower bound on the evidence, $\log p(x) \ge \mathrm{ELBO}(q)$, since
$$\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) \quad \text{and} \quad D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) \ge 0.$$
Finally, this is the ultimate objective of variational inference, finding the optimal $q^*$ to estimate the posterior distribution:
$$q^*(z) = \operatorname*{arg\,max}_{q \in \mathcal{Q}} \mathrm{ELBO}(q). \tag{3}$$
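To see the lower-bound identity in action, here is a minimal numerical sketch for a toy model with a single binary latent variable; the prior, likelihood, and variational distribution below are hypothetical numbers chosen only to verify that $\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x))$.

```python
# Toy check of log p(x) = ELBO(q) + KL(q || p(z|x)) for one binary latent z
# and one fixed observation x (all numbers are hypothetical).
import numpy as np

prior = np.array([0.6, 0.4])        # p(z) for z in {0, 1}
lik = np.array([0.2, 0.9])          # p(x | z) at the observed x
joint = prior * lik                 # p(x, z)
evidence = joint.sum()              # p(x)
posterior = joint / evidence        # p(z | x)

q = np.array([0.3, 0.7])            # an arbitrary variational distribution q(z)

elbo = np.sum(q * (np.log(joint) - np.log(q)))       # E_q[log p(x,z)] - E_q[log q(z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))     # D_KL(q || p(z|x))

print(np.log(evidence), elbo + kl)   # the two values agree
print(elbo <= np.log(evidence))      # True: the ELBO lower-bounds the evidence
```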
2.2. Properties of Variational Inference
Suppose $p$ is the true underlying distribution of the data and $q$ is our model. Without loss of generality, $q$ starts out as a flat (constant) function, and the optimization process refines $q$ toward the form of $p$. Recall that the KL-divergence is asymmetric: optimizing the forward KL-divergence $D_{\mathrm{KL}}(p \,\|\, q)$ yields a different optimal density than optimizing the reverse KL-divergence $D_{\mathrm{KL}}(q \,\|\, p)$. Specifically, the reverse KL-divergence underestimates the spread of the true data distribution and exhibits mode-seeking behavior, whereas the forward KL-divergence covers all of the mass. This can be reasoned intuitively from the figure below.
Source: Tim Vieira's blog, figure by John Winn.
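For a rough numerical counterpart of that figure (a sketch of my own, not the original), the snippet below fits a single Gaussian $q$ to a bimodal mixture $p$ by grid search, once minimizing the reverse KL and once the forward KL; the mixture, the grid, and the helper `kl` are all hypothetical choices.

```python
# Mode-seeking (reverse KL) vs. mass-covering (forward KL) on a bimodal target.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)   # bimodal "true" density

def kl(a, b):
    """Grid approximation of KL(a || b), with a small floor to avoid log(0)."""
    a, b = np.maximum(a, 1e-300), np.maximum(b, 1e-300)
    return float(np.sum(a * np.log(a / b)) * dx)

best_rev = best_fwd = (np.inf, None, None)
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 5, 46):
        q = norm.pdf(x, mu, sigma)
        rev, fwd = kl(q, p), kl(p, q)
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)

print("reverse KL(q||p): mu=%.2f, sigma=%.2f (locks onto one mode)" % best_rev[1:])
print("forward KL(p||q): mu=%.2f, sigma=%.2f (spreads over both modes)" % best_fwd[1:])
```

The reverse-KL fit sits on one of the two modes, while the forward-KL fit centers between them with a much larger variance, matching the mode-seeking versus mass-covering intuition.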
2.3. Mean-field Variational Inference
We now introduce a popular variational family called the mean-field variational family, in which the latent variables are mutually independent and each latent variable $z_j$ has its own density function $q_j(z_j)$, with distinct parameters governing the shape of that density. These parameters are sometimes called global or variational parameters. A member of the mean-field variational family can be described by the following formula:
$$q(z) = \prod_{j=1}^{m} q_j(z_j).$$
Note that we have not defined the parametric form of these density functions: the mean-field family only imposes independence between the latent variables and relaxes other constraints, balancing tractability and generality. With the factorized assumption, the entropy of the approximating distribution is simply the sum of the entropies of the individual factors, which makes the alternating optimization easier (discussed in section 2.4), as the quick check below illustrates.
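This is a one-off numerical check of that additivity; the two discrete factors are hypothetical.

```python
# Entropy of a factorized q(z1, z2) = q1(z1) q2(z2) equals the sum of factor entropies.
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

q1 = np.array([0.2, 0.8])
q2 = np.array([0.5, 0.3, 0.2])
q_joint = np.outer(q1, q2)          # factorized joint over (z1, z2)

print(entropy(q_joint), entropy(q1) + entropy(q2))   # the two values match
```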
It is easier to optimize over any member of the mean-field variational family, but this comes at the cost of approximation accuracy, especially when the correlations between the hidden variables are not negligible. Note that we omit the variational parameters to simplify the equations and to preserve the parametric or non-parametric generality of the factorized density functions.
Source:http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/lecture10_2x2.pdf
In the above figure, the red distribution, produced by a member of the mean-field variational family, fails to capture the true distribution (in green) of the latent variables.
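As a small quantitative counterpart to the figure, the sketch below uses the standard closed-form result for a Gaussian target (see Bishop 2006 [1], chapter 10): the reverse-KL mean-field factors match the means but take their precisions from the diagonal of the precision matrix, so they underestimate the marginal variances whenever the latent variables are correlated. The covariance numbers are hypothetical.

```python
# Mean-field fit to a correlated 2-D Gaussian: correct means, too-narrow marginals.
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])            # true covariance with strong correlation
Lambda = np.linalg.inv(Sigma)             # precision matrix

true_marginal_var = np.diag(Sigma)        # variances of the true marginals
mean_field_var = 1.0 / np.diag(Lambda)    # variances of the optimal mean-field factors

print(true_marginal_var)   # [1. 1.]
print(mean_field_var)      # [0.19 0.19] -- much narrower than the truth
```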
2.4. Coordinate Ascent VI algorithm (CAVI)
In mean-field variational inference, where the mean-field variational family is used, the most commonly used algorithm for solving the optimization problem in equation (3) is coordinate ascent variational inference (CAVI). This algorithm is described in detail in Bishop (2006) [1]. Like alternating optimization, CAVI iteratively optimizes each factor $q_j(z_j)$ of the mean-field variational density while holding the other factors fixed. The optimization path of CAVI in the space of variational factors therefore looks like a staircase.
The main update formula of CAVI is based on the exponentiated expected log of the joint distribution of the latent variables $z$ and the observed variables $x$. Specifically, keeping the other variational factors $q_{-j}$ fixed, we set the factorized density of each latent variable $z_j$ proportional to the exponentiated expected log of the joint distribution of the latent variables and the observed variables:
$$q_j^*(z_j) \propto \exp\Big\{ \mathbb{E}_{q_{-j}}\big[ \log p(z_j, z_{-j}, x) \big] \Big\}, \tag{4}$$
where $q_{-j}$ denotes all other factors, which are held fixed while optimizing factor $q_j$, and the star denotes that the object we are optimizing here is the density function $q_j$ itself, not its parameters. Later on, in most cases of parametric variational inference, we will directly optimize the variational parameters $\phi_j$ such that $q_j(z_j; \phi_j)$ is proportional to the quantity described in equation (4).
Now, how does equation (4) come about? Recall the VI optimization problem in equation (3). In calculus, we mostly care about optimizing over a set of variables; here we are optimizing over a set of functions. The tool I will use to solve this optimization problem is the calculus of variations, a method developed in the 17th and 18th centuries by Leonhard Euler and many other phenomenal mathematicians. This field uses variations, which are small changes in functions, to find maxima and minima of functionals: mappings from a set of functions to the real numbers. In this case, the functional to be maximized is the ELBO and the set of functions is the set of variational factors $\{q_1, \dots, q_m\}$.
Under the CAVI assumption, we fix the other factors $q_{-j}$ and find the optimal factor $q_j$; as a function of $q_j$, the functional becomes
$$\mathrm{ELBO}(q_j) = \mathbb{E}_{q_j}\Big[ \mathbb{E}_{q_{-j}}[\log p(z_j, z_{-j}, x)] \Big] - \mathbb{E}_{q_j}[\log q_j(z_j)] + \text{const}.$$
However, since we are fixing $q_{-j}$, optimizing the functional can be reduced to
$$\max_{q_j} \; \int q_j(z_j)\, \mathbb{E}_{q_{-j}}\big[\log p(z_j, z_{-j}, x)\big]\, dz_j \;-\; \int q_j(z_j) \log q_j(z_j)\, dz_j. \tag{5}$$
Now, the calculus of variations is particularly interested in one kind of functional that admits a closed-form stationarity condition, provided by the Euler-Lagrange equation. Consider a functional of the form
$$F[y] = \int G\big(y(x), y'(x), x\big)\, dx. \tag{6}$$
By introducing the functional derivative, which measures how much a functional $F[y]$ changes when we make a small change $\epsilon\,\eta(x)$ to the function $y(x)$, we can prove that a functional of the form (6) is stationary when (see Bishop 2006 [1] or my upcoming blog about the calculus of variations and its applications in engineering)
$$\frac{\partial G}{\partial y} - \frac{d}{dx}\left( \frac{\partial G}{\partial y'} \right) = 0.$$
Using this fact, we can apply the condition to equation (5) and solve for $q_j$. Indeed, since the integrand in (5) does not involve $q_j'$, adding a Lagrange multiplier $\lambda$ for the normalization constraint $\int q_j(z_j)\, dz_j = 1$ and setting the derivative of the integrand with respect to $q_j$ to zero gives
$$\mathbb{E}_{q_{-j}}\big[\log p(z_j, z_{-j}, x)\big] - \log q_j(z_j) - 1 + \lambda = 0 \;\;\Longrightarrow\;\; q_j^*(z_j) \propto \exp\Big\{ \mathbb{E}_{q_{-j}}\big[\log p(z_j, z_{-j}, x)\big] \Big\},$$
which is exactly the update in equation (4).
In summary, we arrive at the final algorithm of coordinate ascent variational inference [2]:
- While the ELBO has not converged do
  - for $j = 1, \dots, m$ do
    - Set $q_j(z_j) \propto \exp\big\{ \mathbb{E}_{q_{-j}}[\log p(z_j, z_{-j}, x)] \big\}$
  - end
  - Compute $\mathrm{ELBO}(q) = \mathbb{E}_{q}[\log p(x, z)] - \mathbb{E}_{q}[\log q(z)]$
- end
- return $q(z) = \prod_{j=1}^{m} q_j(z_j)$
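To make the loop above concrete, here is a minimal CAVI sketch for the conjugate model of a univariate Gaussian with unknown mean and precision under a Normal-Gamma prior (the mean-field example in Bishop 2006 [1], section 10.1.3). The synthetic data, the hyperparameters, and the convergence check on $\mathbb{E}[\tau]$ (used here as a cheap stand-in for monitoring the ELBO) are my own simplifications, not code from [2].

```python
# CAVI for x_i ~ N(mu, 1/tau), p(mu | tau) = N(mu0, 1/(lambda0*tau)), p(tau) = Gamma(a0, b0),
# with the mean-field family q(mu, tau) = q(mu) q(tau).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)    # synthetic data (hypothetical)
N, xbar = x.size, x.mean()

mu0, lambda0, a0, b0 = 0.0, 1.0, 1.0, 1.0       # prior hyperparameters (hypothetical)

# Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
mu_n, lam_n, a_n, b_n = 0.0, 1.0, 1.0, 1.0

for it in range(100):
    e_tau = a_n / b_n                           # E_q[tau] under the current q(tau)

    # Update q(mu) with q(tau) held fixed, following equation (4).
    mu_n = (lambda0 * mu0 + N * xbar) / (lambda0 + N)
    lam_n = (lambda0 + N) * e_tau

    # Update q(tau) with q(mu) held fixed, following equation (4).
    e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n   # E_q[mu], E_q[mu^2]
    a_n = a0 + (N + 1) / 2.0
    b_n = b0 + 0.5 * (np.sum(x**2) - 2 * e_mu * np.sum(x) + N * e_mu2
                      + lambda0 * (e_mu2 - 2 * e_mu * mu0 + mu0**2))

    if abs(a_n / b_n - e_tau) < 1e-10:          # convergence proxy
        break

print("E_q[mu]  ~", mu_n)          # close to the sample mean
print("E_q[tau] ~", a_n / b_n)     # close to 1 / sample variance
print("iterations:", it + 1)
```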
We have introduced the motivation and basic theory of variational inference. In the following sections, we will continue with its application to the Variational Auto-encoder and show how coordinate ascent variational inference (CAVI) is actually used in the Gaussian Mixture Model.
To be continued.
[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[2] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A Review for Statisticians. arXiv:1601.00670.