Anh Phan
April 24, 2026
Generate data
set.seed(42) # set seed for reproducibility
n <- 100 # number of samples
mu_true <- 2.0
sigma_true <- 1.0
y <- rnorm(n, mu_true, sigma_true)
In statistics, there are two major paradigms: Bayesian and frequentist statistics. The main difference between the two lies in the definition of probability.
In frequentist statistics, probability is the long-run frequency of an event as the number of samples goes to infinity (i.e., it is an unknown fixed value that we can estimate by collecting a lot of data).
In Bayesian statistics, probability is a degree of belief in the occurrence of an event, which can be updated as we observe more data.
The probability of an event under the Bayesian definition can be expressed as follows
\[ P(\theta|D) = \frac{P(D|\theta) P(\theta)}{P(D)} \tag{1}\]
Where:
\(P(\theta)\) is the prior, since it represents our “prior belief” about the probability of the event.
\(P(D|\theta)\) is the likelihood, representing how likely it is to obtain the observed data under the assumption of our prior belief.
\(P(\theta|D)\) is the posterior, since it represents our “updated belief” about the probability of the event after observing some data \(D\).
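As a small, self-contained illustration of Equation 1 (this toy example, including the coin hypotheses and numbers, is mine, not from the original data), suppose we are unsure whether a coin is fair (\(\theta = 0.5\)) or biased towards heads (\(\theta = 0.8\)):

```r
# Prior belief: both hypotheses equally likely
prior <- c(fair = 0.5, biased = 0.5)
# Observed data D: 8 heads out of 10 flips
likelihood <- c(fair   = dbinom(8, size = 10, prob = 0.5),
                biased = dbinom(8, size = 10, prob = 0.8))
# Bayes' rule: posterior = likelihood * prior / P(D)
unnormalized <- likelihood * prior
posterior <- unnormalized / sum(unnormalized)  # sum(unnormalized) is P(D)
round(posterior, 3)  # fair ≈ 0.127, biased ≈ 0.873
```

Observing mostly heads shifts belief towards the biased hypothesis — exactly the "updated belief" described above.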
This blog focuses on the process of parameter inference under the Bayesian paradigm.
Suppose we have a set of observed data \(D\) and a parameter \(\theta\) we need to infer. Under the Bayesian paradigm, we can compute the probability distribution of \(\theta\) using the formula below
\[ P(\theta|D) = \frac{P(D|\theta) P(\theta)}{P(D)} \]
With a slightly different interpretation compared to Equation 1, as follows:
\(P(\theta)\) is our prior assumption about the probability distribution of \(\theta\) before seeing the data.
\(P(D|\theta)\) is the likelihood, representing how likely it is to obtain the observed data under the assumed distribution for \(\theta\).
\(P(\theta|D)\) is the posterior, representing our updated belief about \(\theta\) after observing data \(D\).
The issue with this formulation lies in the need to compute \(P(D)\). To do so, we need to marginalize the parameter out of the joint density:
\[ P(D) = \int_\Theta P(D, \theta)\,d\theta = \int_\Theta P(D|\theta) P(\theta)\, d\theta \]
That is to say, we need to integrate over all possible values in the parameter space \(\Theta\). In most realistic models, this integral either has no closed form (with a caveat: see the note on conjugate priors below) or requires exponential time to compute.
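To make the cost concrete, here is a sketch (the model and example values are mine) of computing \(P(D)\) by numerical integration for a one-parameter model. With \(d\) parameters, a grid of \(g\) points per dimension needs on the order of \(g^d\) evaluations, which is why this route quickly becomes infeasible:

```r
# One-dimensional example: y ~ Normal(mu, 1), prior mu ~ Normal(0, 2)
set.seed(1)
y_demo <- rnorm(20, mean = 1.5, sd = 1)
# Integrand of the marginal likelihood: P(D | mu) * P(mu)
integrand <- function(mu) {
  sapply(mu, function(m) prod(dnorm(y_demo, mean = m, sd = 1)) * dnorm(m, 0, 2))
}
# Marginalize mu out numerically over a wide range
p_d <- integrate(integrand, lower = -20, upper = 20)$value
p_d  # a tiny but positive number; tractable only because theta is 1-D
```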
To address this problem, there are two main approaches to Bayesian inference:
Markov chain Monte Carlo (MCMC) - a sampling-based method
Variational Inference (VI) - an approximation-based method
A common method to derive a closed form expression for a posterior is by using a conjugate prior.
The idea is to select functional forms for the likelihood \(P(D|\theta)\) and the prior \(P(\theta)\) such that the prior \(P(\theta)\) and the resulting posterior \(P(\theta|D)\) belong to the same probability distribution family.
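For instance (a standard textbook pairing, sketched here in R with made-up counts), a Beta prior is conjugate to a Binomial likelihood, so the posterior update reduces to simple arithmetic:

```r
# Beta(a, b) prior on the probability of heads
a <- 1; b <- 1              # Beta(1, 1), i.e. flat on [0, 1]
heads <- 7; n_flips <- 10   # observed data
# Conjugacy: posterior is Beta(a + heads, b + n_flips - heads)
a_post <- a + heads
b_post <- b + n_flips - heads
a_post / (a_post + b_post)  # posterior mean: 8 / 12 ≈ 0.667
```

No integral was needed; conjugacy turns updating the posterior into counting.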
A breakdown of the name Markov chain Monte Carlo:
Monte Carlo refers to the process of sampling to estimate some quantity
Markov chain refers to a chain of values where the value at each step depends only on that of the previous step
Put in the context of parameter inference, MCMC refers to the process of deriving the distribution of parameters by sampling a chain of parameter values, where each newly sampled value depends on the previous one.
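A minimal sketch of the two ingredients (the details here are mine): a chain in which each value depends only on the previous one, and samples used to estimate a quantity:

```r
set.seed(42)
n_steps <- 1000
chain <- numeric(n_steps)          # chain[1] starts at 0
for (t in 2:n_steps) {
  # Markov property: the next value depends only on the current one
  chain[t] <- chain[t - 1] + rnorm(1)
}
# Monte Carlo: estimate a quantity (here, the mean) from the samples
mean(chain)
```

In MCMC, the proposal and acceptance steps are designed so that the chain's samples follow the posterior, rather than wandering freely as this plain random walk does.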
Sampling process
The general flow of MCMC is as follows: start from an initial value \(\theta_0\); at each step \(t\), propose a new value \(\theta_t\) from a proposal distribution \(Q(\theta_{t-1})\), then accept or reject it according to an acceptance rule \(A(\theta_t, \theta_{t-1})\); repeat until enough samples have been collected.
To perform the sampling process, some questions we may encounter are:
How should we determine the parameter space?
How should we propose the new value? (i.e. What is the formulation for \(Q(\theta_{t-1})\)?)
How do we decide whether to accept the proposed parameter? (i.e. What is the formulation for \(A(\theta_t, \theta_{t-1})\)?)
How should we determine the parameter space?
The parameter space depends on the model or the problem context.
The prior distribution \(P(\theta)\) reflects our knowledge of how values are distributed within that parameter space. If no assumption can be made, we can start from a non-informative prior (i.e., a flat distribution such as \(\text{Beta}(1,1)\)).
The choice of prior distribution can have a large impact on both the result and the time to convergence.
One of the biggest critiques of Bayesian statistics is this need for a strong prior assumption.
What is the formulation for \(Q(\theta_{t-1})\) and \(A(\theta_t, \theta_{t-1})\)?
The answers to these two questions both depend on the algorithm we decide to use. But the goal remains the same: to draw values \(\theta_t\) that favor regions where \(P(\theta_t|D)\) is high.
Since \(P(D)\) is a constant, we have
\[ P(\theta_t | D) \propto P(D|\theta_t) P(\theta_t) \]
which means we can work with the unnormalized product \(P(D|\theta_t) P(\theta_t)\) instead, and never need \(P(D)\).
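This is also why MCMC implementations only ever evaluate the unnormalized (log) posterior: any constant shared by numerator and denominator, including \(\log P(D)\), cancels in the ratio. A small sketch with made-up numbers:

```r
# Unnormalized log posterior: log likelihood + log prior
log_unnorm <- function(log_lik, log_prior) log_lik + log_prior
# Ratio of posteriors for two candidate thetas, without the constant
ratio_plain <- exp(log_unnorm(-3.1, -0.5) - log_unnorm(-4.0, -0.7))
# Subtracting any shared constant (e.g. log P(D)) changes nothing
log_p_d <- -12.3
ratio_shifted <- exp((log_unnorm(-3.1, -0.5) - log_p_d) -
                     (log_unnorm(-4.0, -0.7) - log_p_d))
all.equal(ratio_plain, ratio_shifted)  # TRUE
```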
Some popular algorithms include the following
Metropolis-Hastings
Hamiltonian Monte Carlo
No-U-Turn Sampler
For this blog, we follow the random-walk Metropolis-Hastings algorithm to demonstrate the process of MCMC.
The implementation is separated into the following sections:
Step 1: Data generation
For this example, the data is sampled from a normal distribution \(y \sim \text{Normal}(\mu = 2, \sigma = 1)\).
Step 2: Formulate the likelihood and prior
In general, a Bayesian model has two main components:
The model we think the data is generated from. From the model, we can derive the formulation for the likelihood (i.e., \(P(D|\theta)\)).
The prior distribution of the parameters of interest (i.e., \(P(\theta)\)).
In this example, we assume the data is generated from a Normal distribution, but we only know that the parameters lie in the range 0-5. This leads to:
Model: \(y \sim \mathcal{N}(\mu, \sigma^2)\); the likelihood is then given by the normal density function.
Parameters of interest
The mean \(\mu\), with a non-informative prior distribution \(\mu \sim \text{Uniform}(0,5)\)
The standard deviation \(\sigma\), with a non-informative prior distribution \(\sigma \sim \text{Uniform}(0,5)\)
# Note that we are computing the log-likelihood here.
# It is generally better to work on the log scale, to avoid underflow
# from multiplying many small probabilities.
likelihood <- function(y, mu_est, sigma_est) {
sum(dnorm(y, mean = mu_est, sd = sigma_est, log=TRUE), na.rm=TRUE)
}
# Note that this is also a log probability (the log prior)
p_prior <- function(mu, sigma, min=0, max=5) {
dunif(mu, min=min, max=max, log=TRUE) + dunif(sigma, min=min, max=max, log=TRUE)
}
Step 3: Implement the Metropolis-Hastings Random Walk algorithm
Under the Metropolis-Hastings random walk algorithm:
\(\theta_t\) is proposed from a Normal distribution centered at the current value, i.e., \(Q(\theta_{t-1}) = \mathcal{N}(\theta_{t-1}, \sigma_{propose}^2)\), where \(\sigma_{propose}\) is a tuning parameter of the sampler.
Whether \(\theta_t\) is accepted is determined by the “acceptance probability” \(AP(\theta_t, \theta_{t-1}) = \min\left(1, \frac{P(D|\theta_t)P(\theta_t)}{P(D|\theta_{t-1}) P(\theta_{t-1})}\right)\), i.e., \(A(\theta_t, \theta_{t-1}) \sim \text{Bernoulli}(AP(\theta_t, \theta_{t-1}))\)
library(tidyverse)
library(patchwork)
# ------ Function to decide whether the proposal should be accepted ---------
accept_proposal <- function(
data,
curr_mu, proposed_mu,
curr_sigma, proposed_sigma,
likelihood, p_prior){
# compute (log of) acceptance ratio
acceptance_ratio <- likelihood(data, proposed_mu, proposed_sigma) + p_prior(proposed_mu, proposed_sigma) -
(likelihood(data, curr_mu, curr_sigma) + p_prior(curr_mu, curr_sigma))
accept_threshold <- runif(1, min = 0, max = 1)
exp(acceptance_ratio) > accept_threshold
}
# ------ Implement MH Random Walk algorithm ---------
random_walk_mh <- function(y,
iter=20000, burn_in = 1000, prop_sd = 0.6){
T_iter <- iter
# standard deviation of the normal distribution used to propose new values
mu_est <- numeric(T_iter)
sigma_est <- numeric(T_iter)
# sample the first value for mu and sigma
mu_est[1] <- runif(1, min = 0, max = 5)
sigma_est[1] <- runif(1, min = 0, max = 5)
for (t in 2:T_iter) {
# sample new value for mu, sigma
curr_mu <- mu_est[t-1]
proposed_mu <- rnorm(1, mean = curr_mu, sd = prop_sd)
curr_sigma <- sigma_est[t-1]
proposed_sigma <- rnorm(1, mean = curr_sigma, sd = prop_sd)
if(accept_proposal(
y, curr_mu, proposed_mu,
curr_sigma, proposed_sigma,
likelihood, p_prior
)){
mu_est[t] <- proposed_mu
sigma_est[t] <- proposed_sigma
}else{
mu_est[t] <- curr_mu
sigma_est[t] <- curr_sigma
}
}
data.frame(
mu_posterior = mu_est,
sigma_posterior = sigma_est
) %>% mutate(
iter = 1:n(),
label = if_else(iter <= burn_in, "burn-in", "posterior")
)
}
mcmc_out <- suppressWarnings(random_walk_mh(y))
Step 4: Visualize the output
For this example we will visualize two things:
The traceplot (i.e., samples for each parameter during each iteration)
The sampled posterior distribution for parameters
# -------- Traceplot ----------
mu_traceplot <- ggplot(mcmc_out, aes(x = iter, y = mu_posterior, color = label)) +
geom_line() +
scale_color_manual(values = c("grey70", "steelblue"),
labels = c("burn-in", "posterior")) +
labs(title = "Trace of mu (Metropolis–Hastings)",
x = "Iteration", y = expression(mu)) +
theme_minimal()
sigma_traceplot <- ggplot(mcmc_out, aes(x = iter, y = sigma_posterior, color = label)) +
geom_line() +
scale_color_manual(values = c("grey70", "steelblue"),
labels = c("burn-in", "posterior")) +
labs(title = "Trace of sigma (Metropolis–Hastings)",
x = "Iteration", y = expression(sigma)) +
theme_minimal()
mu_traceplot / sigma_traceplot
# -------- Histogram ----------
mu_samples <- mcmc_out %>%
filter(label == "posterior") %>%
ggplot() +
geom_histogram(aes(x = mu_posterior), bins=60,
fill = "steelblue", alpha = 0.5, color = "white") +
geom_vline(xintercept = mu_true, linetype = "dashed") +
labs(title = "Samples vs True value",
x = "mu", y = "Density") +
theme_minimal()
sigma_samples <- mcmc_out %>%
filter(label == "posterior") %>%
ggplot() +
geom_histogram(aes(x = sigma_posterior), bins=60,
fill = "steelblue", alpha = 0.5, color = "white") +
geom_vline(xintercept = sigma_true, linetype = "dashed") +
labs(title = "Samples vs True value",
x = "sigma", y = "Density") +
theme_minimal()
mu_samples / sigma_samples
# ------ Compare estimated parameters from posterior with true parameters -------
mcmc_mu <- mcmc_out %>% filter(label == "posterior") %>% pull(mu_posterior)
mcmc_sigma <- mcmc_out %>% filter(label == "posterior") %>% pull(sigma_posterior)
cat(sprintf("MH mu ≈ %.3f | Actual mu = %.3f\n", median(mcmc_mu), mu_true))
cat(sprintf("MH sigma ≈ %.3f | Actual sigma = %.3f\n", median(mcmc_sigma), sigma_true))
MH mu ≈ 2.030 | Actual mu = 2.000
MH sigma ≈ 1.058 | Actual sigma = 1.000
Several packages have been developed to facilitate fitting a Bayesian model using MCMC. Some MCMC packages in different programming languages are listed below.
R - mcmc
Python - PyMC
Estimating parameters of hierarchical models
pMCMC is used for estimating parameters of stochastic models
This section is still a work in progress; please proceed with caution and take the information with a (large) grain of salt =))))
The following section is closely based on a previous work by Blei et al. (2017).
The intended purpose of this blog is to summarize the content and provide reproducible code/model implementations in R. For a more detailed explanation on the topic, please refer to the source material.
The general idea is to propose a family of variational distributions \(Q\) (i.e., a parameterized family of distributions over the latent variables) and then find the member \(q^*(\theta)\) of that family that is “closest” to the posterior distribution. Thus, it is important to choose \(Q\) to be flexible enough to capture a density close to \(P(\theta|D)\), yet simple enough for efficient optimization.
Wait, that’s a lot of assumptions, isn’t it?
One may notice a glaring issue with the proposed idea of Variational Inference: it only finds a density \(q^*(\theta)\) close to the target \(P(\theta|D)\), as opposed to MCMC, which is guaranteed to produce (asymptotically) exact samples from the target density.
So, why use VI at all?
The goal of VI is to address one of the biggest problems of MCMC: scalability. When the dataset is large or the model is complex, sampling (MCMC) may take much longer to converge.
It is also worth pointing out that VI is known to generally underestimate the variance of the posterior density, which, depending on the problem at hand, may or may not be acceptable.
Another key point of VI is that it reframes inference as an optimization problem, which allows it to fit naturally into Machine Learning / Deep Learning frameworks.
The general flow of Variational Inference is as follows:
Compute the target function (ELBO) given current \(q(.)\)
Update the parameters of \(q(.)\) to optimize the target function
To perform the optimization process, we need to know the following:
How should we select the family of distribution \(Q\)?
What is the target function ELBO?
How should we update the parameters of \(q(.)\)?
What is the target function ELBO?
As previously mentioned, our goal is to derive \(q^*(\theta)\) that is close to the target (posterior) distribution \(P(\theta|D)\). This “closeness” between two distributions is quantified by the KL divergence; \(q^*(\theta)\) can thus be formulated as follows.
\[ q^*(\theta) = \underset{q(\theta) \in Q }{\text{argmin}} \text{ KL}(q(\theta) || P(\theta|D)) \]
Where \(\text{KL}(q(\theta)||P(\theta|D)) = \mathbb{E}[\log q(\theta)] - \mathbb{E}[\log P(\theta|D)]\), with all expectations computed with respect to \(q(\theta)\). Expanding the conditional, we derive that \(\text{KL}(q(\theta)||P(\theta|D)) = \mathbb{E}[\log q(\theta)] - \mathbb{E}[\log P(\theta,D)] + \log P(D)\).
At this point, you may notice an issue: this formulation also requires knowing \(P(D)\), which is exactly the problem we discussed in the Bayesian inference section above.
Instead, we can maximise another target function, referred to as the evidence lower bound (ELBO), formulated below.
\[ \text{ELBO}(q) = \mathbb{E}[\log P(\theta, D)] - \mathbb{E}[\log q(\theta)] \]
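To see why maximizing the ELBO is the right move, rearrange the KL expansion above:

```latex
\log P(D) = \text{ELBO}(q) + \text{KL}(q(\theta)\,\|\,P(\theta|D))
```

Since the KL divergence is non-negative, the ELBO is a lower bound on the log evidence (hence its name); and since \(\log P(D)\) does not depend on \(q\), maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior.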
How should we select the family of distribution \(Q\)? How should we update \(q(.)\)?
The variational family \(Q\) generally depends on two factors:
Assumption about the latent variables
The optimization method
The optimization methods used for VI (i.e., the algorithms to update \(q(.)\)) fall into two main groups:
Closed-form updates - in which case conjugacy is required
Gradient-based updates - in which case \(q(.)\) is usually required to be differentiable
Okay, can I get an example of a variational family \(Q\)?
Mean-field variational family
This is one of the most commonly used variational families. It can be used for both Closed form and Gradient based optimization. The assumption here is that the distributions for the latent variables are independent.
Given \(n\) latent variables
\[ q(\theta) = \prod_{i=1}^{n} q_i(\theta_i) \]
Where each factor \(q_i(\theta_i)\) is typically chosen from the exponential family of distributions: Gaussian, Gamma, Beta, Bernoulli, etc.
It is worth mentioning that, generally speaking, the variational family is NOT a model of the data (there is no \(D\) in the formulation). It is the process of maximizing the ELBO that connects the variational family to the data.
The implementation is separated into the following sections:
Step 1: Data generation
For this example, the data is generated from a mixture of \(k\) Gaussian distributions.
We first generate the \(k\) means, one per mixture component, from a common normal distribution: \(\mu_i \sim \mathcal{N}(0, \sigma_{prior}^2)\) where \(i = 1, 2, …, k\)
Then sample \(n\) data points as follows:
Assign each data point to a cluster \(c_j \sim \text{Categorical}(1/k, ..., 1/k)\) where \(j = 1, 2, …, n\)
Sample the value of the data point: \(y_j \sim \mathcal{N}(\mu_{c_j}, 1)\) where \(j = 1, 2, …, n\)
set.seed(55) # set seed for reproducibility
n <- 1000 # number of samples
k <- 5 # set k, aka number of clusters
# sample the mean of a Gaussian mixtures
sd_prior <- 4
cluster_mus <- rnorm(k, mean=0, sd=sd_prior)
# sample which cluster each data points belong to
data_cluster <- sample(x = 1:k, size = n, replace = TRUE)
# finally, sample data points given the cluster assignment (data_cluster)
y <- rnorm(n, mean = cluster_mus[data_cluster], sd = 1)
Visualize data points
Step 2: Formulate the model
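To make the likelihood and priors explicit, the generative process from Step 1 can be written as follows (using \(k\), \(n\), and \(\sigma_{prior}\) as defined in the data-generation code):

```latex
\mu_i \sim \mathcal{N}(0, \sigma_{prior}^2), \qquad i = 1, \dots, k
c_j \sim \text{Categorical}(1/k, \dots, 1/k), \qquad j = 1, \dots, n
y_j \mid c_j, \boldsymbol{\mu} \sim \mathcal{N}(\mu_{c_j}, 1), \qquad j = 1, \dots, n
```

The latent variables are the component means \(\mu_i\) and the cluster assignments \(c_j\); the observation noise is fixed at 1, matching the `sd = 1` used when sampling `y`.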
Step 3: Formulate the variational family and ELBO
In this example, we will use the previously mentioned mean-field variational family, which takes the following form under the assumption that there are \(m\) latent variables.
\[ q(\theta) = \prod_{i=1}^{m} q_i(\theta_i) \]
The parameters that the data depends on are:
The mean of each component of the mixture, \(\mu_i\), where \(i = 1, 2, …, k\)
The assignment of each data point to a cluster, \(c_j\), where \(j = 1, 2, …, n\)
Thus, we need the following two types of variational factors:
\(q_i(\mu_i)\) which we formulate as a Gaussian distribution
\(q_j(c_j)\) which we formulate as a vector of \(k\) probabilities
The full variational distribution is then formulated as follows
\[ q(\mu, c) = \prod_{i = 1}^k q(\mu_i) \prod_{j = 1}^n q(c_j) \]
It is crucial to keep in mind the different types of parameters we refer to in the context of variational inference:
Parameters refers to the parameters of the model assumed to generate the data
Variational parameters refers to the parameters of the variational distribution
\[ \text{ELBO}(q) = \mathbb{E}[\log P(\theta, D)] - \mathbb{E}[\log q(\theta)] \]
# expected log-likelihood term of the ELBO (up to additive constants)
expected_log_likelihood <- function(y, q_cluster, m_k, sd_k){
# per data point likelihood
likelihood <- sapply(1:nrow(q_cluster), \(curr_row){
# per cluster likelihood term
sapply(1:ncol(q_cluster), \(curr_col){
q_cluster[curr_row, curr_col]*(
# E[log N(y; mu_k, 1)] up to a constant: y*E[mu] - E[mu^2]/2
y[curr_row]*m_k[curr_col] - (m_k[curr_col]**2 + sd_k[curr_col]**2)/2
)
})
})
sum(likelihood)
}
expected_q_mu <- function(m_k, sd_k){
sum(dnorm(m_k, m_k, sd_k, log=TRUE))
}
expected_q_ci <- function(q_cluster){
sum(q_cluster*log(q_cluster))
}
compute_elbo <- function(y, q_cluster, m_k, sd_k, sigma_prior=1){
# Expected log p(mu), i.e. expected prior for mu term
sum(-(sd_k^2 + m_k^2)/(2*sigma_prior^2)) +
# compute the expected log p(c_i), i.e. expected prior for c term
sum(log(1/k)*length(y)) +
# likelihood
expected_log_likelihood(y, q_cluster, m_k, sd_k) -
# compute log q(m_k, sd_k), i.e. entropy term
expected_q_mu(m_k, sd_k) -
# compute log q(q_cluster), i.e. entropy term
expected_q_ci(q_cluster)
}
Step 4: Implement the CAVI algorithm
cavi <- function(y, k = 5, max_iter = 30,
prior_sd = 1, threshold = 0.1){
# instantiate parameters for q(mu_i)
m_k <- rnorm(k)
sd_k <- rgamma(k, 5)
# instantiate q(c_j)
q_cluster <- matrix(1/k, nrow = length(y), ncol = k)
# keep track of ELBO to check convergence
elbo_trace <- numeric(length = max_iter+1)
elbo_trace <- compute_elbo(y, q_cluster, m_k, sd_k)
for (iter in 1:max_iter){
# loop through each data point
# update the q factor for the cluster assignment (i.e. q(c_i))
for (j in 1:length(y)){
# update q(c_j)
q_cluster[j,] <- exp(y[j]*m_k - 0.5*(sd_k^2+m_k^2))
# normalize rowwise
q_cluster[j,] <- q_cluster[j,]/sum(q_cluster[j,])
}
# print(q_cluster[1:5, 1:k])
# loop through each cluster
# update the q factor of that latent variable
new_sd <- rep(0, k)
new_m <- rep(0, k)
for (i in 1:k){
# update parameters for q(mu_i)
# the CAVI update gives the variance of q(mu_i);
# take the square root to store a standard deviation
new_sd[i] <- sqrt(1/(1/prior_sd^2 + sum(q_cluster[,i])))
new_m[i] <- sum(q_cluster[,i] * y)/(1/prior_sd^2 + sum(q_cluster[,i]))
}
sd_k <- new_sd
m_k <- new_m
# compute elbo to check convergence
elbo_trace[iter+1] <- compute_elbo(y, q_cluster, m_k, sd_k)
# early termination if converged
if ((elbo_trace[iter+1] - elbo_trace[iter]) < threshold) break
}
# drop unfilled entries if the loop terminated early
elbo_trace <- elbo_trace[1:(iter + 1)]
# return a list that includes:
# - a data.frame of fitted variational parameters
# - a data.frame of the computed ELBO per iteration
list(
variational_params = data.frame(
m = m_k,
sd_k = sd_k
),
elbos = data.frame(
iter = 0:(length(elbo_trace) - 1),
elbo = elbo_trace
)
)
}
Visualize ELBOs
Compare variational parameters and real mus
m sd_k
1 0.3539948 0.005051128
2 -4.3303392 0.005333891
3 -7.0118296 0.004471144
4 0.3539947 0.005051128
5 0.4751592 0.005052748
Variational inference is widely applied in Machine Learning, and especially in Deep Learning, to quantify uncertainty.
Variational Auto Encoder
this stackoverflow discussion clearly explains VAE from a Bayesian perspective
also refer to this blog to see the relationship between variational inference and variational auto-encoders
//TBD:
Frequentist approach (?):
Provide a framework to quantify uncertainty in deep learning
Variational inference
Stochastic gradient descent (SGD)
Monte-Carlo drop out
Laplace approximation
Deep ensembles
Variational Encoder as the variational family
//TBD: