Bayesian inference

Overview of methods for Bayesian inference, and its potential application in deep probabilistic programming
Author

Anh Phan

Published

April 24, 2026

1 Background

In statistics, there are 2 major paradigms: Bayesian and frequentist statistics. The main difference between these 2 paradigms lies in the definition of probability.

  • In frequentist statistics, probability is the long-run frequency of an event as the number of trials goes toward infinity (i.e. it is an unknown fixed value that we can estimate by collecting a lot of data).

    • A probability for an event \(E\) in the frequentist definition can be represented as follows

      \[ P(E) = \lim_{n \to \infty} \frac{n_E}{n} \]

      where \(n_E\) is the number of times \(E\) occurs in \(n\) trials.
  • In Bayesian statistics, probability is a degree of belief in the occurrence of an event, which can be updated as we observe more data.

    • A probability for an event in the Bayesian definition can be represented as follows

      \[ P(\theta|D) = \frac{P(D|\theta) P(\theta)}{P(D)} \tag{1}\]

      Where:

      • \(P(\theta)\) is the prior, since it represents our “prior belief” about the probability of the event

      • \(P(D|\theta)\) is the likelihood, representing how likely it is to obtain the observed data under the assumption of our prior belief.

      • \(P(\theta|D)\) is the posterior, since it represents our “updated belief” about the probability of the event after observing some data \(D\).
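As a concrete illustration of Equation 1 (a hypothetical diagnostic-test example, not from the text above; all the numbers are made up for illustration):

```r
# Hypothetical example: a test that detects a condition with 1% prevalence.
p_theta <- 0.01            # P(theta): prior probability of having the condition
p_d_given_theta <- 0.99    # P(D|theta): probability of a positive test if present
p_d_given_not <- 0.05      # P(D|not theta): false-positive rate

# P(D): marginal probability of a positive test, summing over both cases
p_d <- p_d_given_theta * p_theta + p_d_given_not * (1 - p_theta)

# P(theta|D): posterior probability of the condition given a positive test
p_posterior <- p_d_given_theta * p_theta / p_d
round(p_posterior, 3)  # → 0.167
```

Even with a 99% sensitive test, the posterior is only about 17%, because the prior (1% prevalence) pulls the belief down.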

This blog focuses on the process of parameter inference under the Bayesian paradigm.

1.1 Bayesian inference

Suppose we have a set of observed data \(D\) and a parameter \(\theta\) we need to infer. Under the Bayesian paradigm, we can compute the probability distribution of \(\theta\) using the formula below

\[ P(\theta|D) = \frac{P(D|\theta) P(\theta)}{P(D)} \]

With a slightly different interpretation compared to Equation 1, as follows

  • \(P(\theta)\) is our prior assumption about the probability distribution of \(\theta\) before seeing the data.

  • \(P(D|\theta)\) is the likelihood, representing how likely it is to obtain the observed data under the assumed distribution for \(\theta\).

  • \(P(\theta|D)\) is the posterior, representing our updated belief about \(\theta\) after observing data \(D\).

The issue with this formulation lies in the need to compute \(P(D)\). To do so, we need to marginalize the parameter out of the joint density, as formulated below

\[ P(D) = \int_\Theta P(D, \theta)d\theta = \int_\Theta P(D|\theta) P(\theta) d\theta \]

That is to say, we need to integrate over all possible values in the parameter space ( \(\Theta\) ). In most realistic models, it is impossible to derive this integral in closed form (with a caveat: see the note on conjugate priors below) or it requires exponential time to compute.
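To make the integral concrete: for a one-dimensional parameter it can still be computed by numerical quadrature. A minimal sketch (assuming a single hypothetical observation `y_obs` from a Normal likelihood with a Uniform(0, 5) prior on its mean):

```r
# P(D) = integral over Theta of P(D|theta) * P(theta) d(theta)
# Here D is one observation y_obs, theta is the mean of a Normal(theta, 1)
# likelihood, and the prior on theta is Uniform(0, 5).
y_obs <- 2.3

marginal_likelihood <- integrate(
  f = function(theta) dnorm(y_obs, mean = theta, sd = 1) * dunif(theta, 0, 5),
  lower = 0, upper = 5
)$value

marginal_likelihood
```

In higher dimensions this quadrature approach blows up combinatorially, which is exactly why the sampling and approximation methods below are needed.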

To address this problem, there are 2 main approaches to Bayesian inference:

  • Markov chain Monte Carlo (MCMC) - a sampling-based method

  • Variational Inference (VI) - an approximation-based method

A common method to derive a closed form expression for a posterior is by using a conjugate prior.

The idea is to select functional forms for the likelihood \(P(D|\theta)\) and the prior \(P(\theta)\) such that the prior \(P(\theta)\) and the resulting posterior \(P(\theta|D)\) belong to the same family of probability distributions.
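A minimal sketch of conjugacy in R (the Beta prior with a Binomial likelihood is a textbook conjugate pair; the numbers here are illustrative):

```r
# Beta(a, b) prior on a coin's probability of heads
a <- 1; b <- 1            # flat Beta(1, 1) prior

# Observe 7 heads in 10 flips (Binomial likelihood)
heads <- 7; flips <- 10

# Conjugacy: the posterior is again a Beta distribution, in closed form,
# so no integral over the parameter space is ever needed
a_post <- a + heads
b_post <- b + flips - heads

# Posterior mean of the heads probability
a_post / (a_post + b_post)   # → 8/12 ≈ 0.667
```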

2 MCMC

2.1 The general idea of MCMC

A breakdown of the name Markov chain Monte Carlo:

  • Markov chain refers to a chain of values where the value at each step depends on that of the previous step

  • Monte Carlo refers to the process of sampling to estimate some value

Put into the context of parameter inference, MCMC refers to the process of deriving the distribution of parameters by sampling a chain of parameter values, where each newly sampled value depends on the previous one.

2.2 The process of MCMC

Sampling process

The general flow of MCMC is as follows

  1. Initialization: Propose an initial value for the parameter, \(\theta_0\)
  2. From then on, repeat the following steps until convergence (or the maximum number of iterations is reached):
    • Propose a new value for the parameter \(\theta_{t}\) depending on the previously accepted value \(\theta_{t-1}\), formulated as \(\theta_{t} \sim Q(\theta_{t-1})\)
    • Decide whether to accept the newly proposed value \(\theta_t\), formulated as \(A(\theta_t, \theta_{t-1})\).
      • If accepted ( \(A(.)= 1\) ), add \(\theta_t\) to the chain
      • If rejected ( \(A(.)= 0\) ), keep the previous value (i.e., add \(\theta_{t-1}\) to the chain as \(\theta_t\))

To perform the sampling process, some questions we may encounter are:

  • How should we determine the parameter space?

  • How should we propose the new value? (i.e. What is the formulation for \(Q(\theta_{t-1})\)?)

  • How do we decide whether to accept the proposed parameter? (i.e. What is the formulation for \(A(\theta_t, \theta_{t-1})\)?)

How should we determine the parameter space?

The parameter space depends on the model or the problem context.

The prior distribution \(P(\theta)\) reflects our knowledge regarding how the values are distributed within that parameter space. If no assumption can be made, we can start from a non-informative distribution (i.e. a flat distribution such as \(\text{Beta}(1,1)\) ).

The choice of prior distribution can have a large impact on the result as well as on the time to convergence.

One of the biggest critiques of Bayesian statistics is the need for a strong prior assumption.

What is the formulation for \(Q(\theta_{t-1})\) and \(A(\theta_t, \theta_{t-1})\)?

The answer to these 2 questions depends on the algorithm we decide to use. But the goal remains the same: to draw values of \(\theta_t\) in proportion to their posterior probability \(P(\theta_t|D)\), so that high-probability regions of the parameter space are visited more often.

Since \(P(D)\) is a constant, we can conclude that

\[ P(\theta_t | D) \propto P(D|\theta_t) P(\theta_t) \]

Which means that we can work with \(P(D|\theta_t) P(\theta_t)\) instead, without ever computing \(P(D)\).

Some popular algorithms include the following

  • Metropolis-Hastings

  • Hamiltonian Monte Carlo

  • No-U-Turn Sampler

2.3 Implementation for MCMC

2.3.1 Manual implementation in R

For this blog, we follow the random-walk Metropolis-Hastings algorithm to demonstrate the process of MCMC.

The implementation is separated into the following sections:

  1. Data generation
  2. Formulate likelihood and prior
  3. Implement the MH Random Walk algorithm
  4. Visualize the output

Step 1: Data generation

For this example, the data is sampled from a normal distribution \(y \sim \text{Normal}(\mu = 2, \sigma = 1)\).

Generate data
set.seed(42) # set seed for reproducibility
n <- 100 # number of samples

mu_true <- 2.0
sigma_true <- 1.0

y <- rnorm(n, mu_true, sigma_true)

Step 2: Formulate the likelihood and prior

In general, there are 2 main components for a Bayesian model:

  • The model we think the data is generated from. From the model, we can derive the formulation for the likelihood (i.e., \(P(D|\theta)\)).

  • The prior distribution of the parameters of interest (i.e., \(P(\theta)\)).

In this example, we assume the data is generated from a Normal distribution, but we only know that the parameters lie in the range of 0-5. This leads to:

  • Model: \(y \sim \mathcal{N}(\mu, \sigma^2)\). And the likelihood is given by the normal distribution density function.

  • Parameters of interest

    • The mean \(\mu\), with a non-informative prior distribution \(\mu \sim \text{Uniform}(0,5)\)

    • The standard deviation \(\sigma\), with a non-informative prior distribution \(\sigma \sim \text{Uniform}(0,5)\)

Functions to compute likelihood and prior
# Note that here we are computing the log likelihood
# Generally, it is better to compute on the log scale (to avoid numerical underflow from multiplying many small probabilities), and that is what we do here
likelihood  <- function(y, mu_est, sigma_est) {
  sum(dnorm(y, mean = mu_est, sd = sigma_est, log=TRUE), na.rm=TRUE)
}

# Note that here we are computing the log probability
p_prior <- function(mu, sigma, min=0, max=5) {
  dunif(mu, min=min, max=max, log=TRUE) + dunif(sigma, min=min, max=max, log=TRUE)
}

Step 3: Implement the Metropolis-Hastings Random Walk algorithm

Under Metropolis-Hastings Random Walk algorithm:

  • \(\theta_t\) is sampled from a Normal distribution centred at the previous value, i.e., \(Q(\theta_{t-1}) = \mathcal{N} (\theta_{t-1}, \sigma_{propose}^2)\) where \(\sigma_{propose}\) is a parameter of the sampling process.

  • Whether \(\theta_t\) is accepted is determined by a probability called the “acceptance probability”, formulated as \(AP(\theta_t, \theta_{t-1}) = \min\left(1, \frac{P(D|\theta_t)P(\theta_t)}{P(D|\theta_{t-1}) P(\theta_{t-1})}\right)\), i.e., \(A(\theta_t, \theta_{t-1}) \sim \text{Bernoulli}(AP(\theta_t, \theta_{t-1}))\)

Code for MH Random Walk algorithm
library(tidyverse)
library(patchwork)

# ------ Function to decide whether the proposal should be accepted --------- 
accept_proposal <- function(
    data,
    curr_mu, proposed_mu,
    curr_sigma, proposed_sigma,
    likelihood, p_prior){
  
  # compute (log of) acceptance ratio
  acceptance_ratio <- likelihood(data, proposed_mu, proposed_sigma) + p_prior(proposed_mu, proposed_sigma) - 
    (likelihood(data, curr_mu, curr_sigma) + p_prior(curr_mu, curr_sigma))
  
  accept_threshold <- runif(1, min = 0, max = 1)
  
  exp(acceptance_ratio) > accept_threshold
}

# ------ Implement MH Random Walk algorithm --------- 
random_walk_mh <- function(y, 
                           iter=20000, burn_in = 1000, prop_sd = 0.6){
  T_iter   <- iter
  # prop_sd is the standard deviation of the normal
  # distribution used to propose new parameter values
  
  mu_est <- numeric(T_iter)
  sigma_est <- numeric(T_iter)
  
  # sample the first value for mu and sigma
  mu_est[1] <- runif(1, min = 0, max = 5)
  sigma_est[1] <- runif(1, min = 0, max = 5)
  
  
  for (t in 2:T_iter) {
    # sample new value for mu, sigma
    curr_mu <- mu_est[t-1]
    proposed_mu <- rnorm(1, mean = curr_mu, sd = prop_sd)
    
    curr_sigma <- sigma_est[t-1]
    proposed_sigma <- rnorm(1, mean = curr_sigma, sd = prop_sd)
    
    if(accept_proposal(
      y, curr_mu, proposed_mu,
      curr_sigma, proposed_sigma,
      likelihood, p_prior
    )){
      mu_est[t] <- proposed_mu
      sigma_est[t] <- proposed_sigma
    }else{
      mu_est[t] <- curr_mu
      sigma_est[t] <- curr_sigma
    }
  }
  
  data.frame(
    mu_posterior = mu_est,
    sigma_posterior = sigma_est
  ) %>% mutate(
    iter = 1:n(),
    label = if_else(iter <= burn_in, "burn-in", "posterior")
  )
}

mcmc_out <- suppressWarnings(random_walk_mh(y))
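A quick sanity check on an MH chain (not part of the implementation above, but standard practice) is its acceptance rate; for random-walk MH, a rate roughly in the 0.2-0.5 range is often considered healthy. A small helper, assuming a numeric vector of sampled values where a rejected proposal repeats the previous value:

```r
# In random-walk MH, a rejected proposal repeats the previous value,
# so consecutive duplicates in the chain mark rejections
acceptance_rate <- function(chain) {
  mean(diff(chain) != 0)
}

# e.g. acceptance_rate(mcmc_out$mu_posterior)
```

If the rate is near 0 the proposal standard deviation (`prop_sd`) is likely too large; near 1, likely too small.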

Step 4: Visualize the output

For this example we will visualize 2 things:

  • The traceplot (i.e., samples for each parameter during each iteration)

  • The sampled posterior distribution for parameters

Code for visualizing MCMC output
# -------- Traceplot ---------- 
mu_traceplot <- ggplot(mcmc_out, aes(x = iter, y = mu_posterior, color = label)) +
  geom_line() +
  scale_color_manual(values = c("grey70", "steelblue"),
                     labels = c("burn-in", "posterior")) +
  labs(title = "Trace of mu (Metropolis–Hastings)",
       x = "Iteration", y = expression(mu)) +
  theme_minimal()

sigma_traceplot <- ggplot(mcmc_out, aes(x = iter, y = sigma_posterior, color = label)) +
  geom_line() +
  scale_color_manual(values = c("grey70", "steelblue"),
                     labels = c("burn-in", "posterior")) +
  labs(title = "Trace of sigma (Metropolis–Hastings)",
       x = "Iteration", y = expression(sigma)) +
  theme_minimal()

mu_traceplot / sigma_traceplot

Code for visualizing MCMC output
# -------- Histogram ----------
mu_samples <- mcmc_out %>% 
  filter(label == "posterior") %>% 
  ggplot() +
  geom_histogram(aes(x = mu_posterior), bins=60,
                 fill = "steelblue", alpha = 0.5, color = "white") +
  geom_vline(xintercept = mu_true, linetype = "dashed") +
  labs(title = "Samples vs True value",
       x = "mu", y = "Count") +
  theme_minimal()

sigma_samples <- mcmc_out %>% 
  filter(label == "posterior") %>% 
  ggplot() +
  geom_histogram(aes(x = sigma_posterior), bins=60,
                 fill = "steelblue", alpha = 0.5, color = "white") +
  geom_vline(xintercept = sigma_true, linetype = "dashed") +
  labs(title = "Samples vs True value",
       x = "sigma", y = "Count") +
  theme_minimal()

mu_samples / sigma_samples

Code for visualizing MCMC output
# ------ Compare estimated parameters from posterior with true parameters -------
mcmc_mu <- mcmc_out %>% filter(label == "posterior") %>% pull(mu_posterior)
mcmc_sigma <- mcmc_out %>% filter(label == "posterior") %>% pull(sigma_posterior)

cat(sprintf("MH mu ≈ %.3f  |  Actual mu = %.3f\n", median(mcmc_mu), mu_true))
MH mu ≈ 2.030  |  Actual mu = 2.000
Code for visualizing MCMC output
cat(sprintf("MH sigma ≈ %.3f  |  Actual sigma   = %.3f\n", median(mcmc_sigma), sigma_true))
MH sigma ≈ 1.058  |  Actual sigma   = 1.000
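Beyond point estimates, the posterior samples also give credible intervals directly via empirical quantiles; a small helper (not in the original code, shown here as a sketch that could be applied to `mcmc_mu` or `mcmc_sigma`):

```r
# Equal-tailed credible interval from posterior samples
credible_interval <- function(samples, level = 0.95) {
  alpha <- 1 - level
  quantile(samples, probs = c(alpha / 2, 1 - alpha / 2))
}

# e.g. credible_interval(mcmc_mu)
```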

2.4 Some extensions and applications of MCMC

  • Estimating parameters for hierarchical models

  • Particle MCMC (pMCMC), used for estimating parameters of stochastic models

3 Variational inference

Warning

This section is still a work in progress, so please proceed with caution and take the information with a (lot of) grain(s) of salt =))))

Disclaimer

The following section is closely based on a previous work by Blei et al. (2017).

The intended purpose of this blog is to summarize the content and provide reproducible code/model implementations in R. For a more detailed explanation on the topic, please refer to the source material.

3.1 The general idea of VI

The general idea is to propose a family of variational distributions \(Q\) (i.e., a parameterization of a distribution of latent variables) then find the member distribution of that family \(q^*(\theta)\) that is “closest” to the posterior distribution. Thus, it is important to choose \(Q\) flexible enough to capture a density close to \(P(\theta| D)\) but simple enough for efficient optimization.

Wait, that’s a lot of assumptions, isn’t it?

One may notice one glaring issue with the proposed idea of Variational Inference: it only finds a density \(q^*(\theta)\) close to the target \(P(\theta|D)\), as opposed to MCMC, which is guaranteed to produce (asymptotically) exact samples from the target density.

So, why use VI at all?

The goal of VI is to address one of the biggest problems of MCMC: scalability. For problems with large datasets or more complex models, sampling (MCMC) may take much longer to converge.

It is also worth pointing out that VI is known to generally underestimate the variance of the posterior density, which, depending on the problem at hand, may or may not be acceptable.

Another key point of VI is that it reframes inference as an optimization problem, which allows it to be adapted to machine learning/deep learning frameworks.

3.2 The process of Variational Inference

The general flow of Variational Inference is as follows:

  1. Select a family of variational distributions \(Q\)
  2. Initialization: propose a set of variational parameters for the variational distribution \(q(.)\)
  3. From then on, repeat the following steps until convergence (or the maximum number of iterations is reached):
    • Compute the target function (ELBO) given the current \(q(.)\)

    • Update the parameters of \(q(.)\) to optimize the target function

To perform the optimization process, we need to know the following:

  • How should we select the family of distribution \(Q\)?

  • What is the target function ELBO?

  • How should we update the parameters of \(q(.)\)?

What is the target function ELBO ?

As previously mentioned, our goal is to derive \(q^*(\theta)\) close to the target (posterior) distribution \(P(\theta|D)\). This so-called “closeness” between 2 distributions is quantified by the KL divergence; \(q^*(\theta)\) can thus be formulated as follows.

\[ q^*(\theta) = \underset{q(\theta) \in Q }{\text{argmin}} \text{ KL}(q(\theta) || P(\theta|D)) \]

Where \(\text{KL}(q(\theta)||P(\theta|D)) = \mathbb{E}[\log q(\theta)] - \mathbb{E}[\log P(\theta|D)]\), with all expectations computed with respect to \(q(\theta)\). Expanding the conditional, we derive that \(\text{KL}(q(\theta)||P(\theta|D)) = \mathbb{E}[\log q(\theta)] - \mathbb{E}[\log P(\theta,D)] + \log P(D)\).

At this point, you may notice an issue: this formulation also requires knowing \(P(D)\), which is the very problem we discussed previously in Section 1.1.

Instead, we can maximise another target function, referred to as the evidence lower bound (ELBO), formulated below.

\[ \text{ELBO}(q) = \mathbb{E}[\text{log}P(\theta, D)] - \mathbb{E}[\text{log}q(\theta)] \]
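Why maximizing the ELBO works: rearranging the KL expansion above gives

\[ \log P(D) = \text{ELBO}(q) + \text{KL}(q(\theta) || P(\theta|D)) \]

Since \(\log P(D)\) is constant with respect to \(q\), maximizing the ELBO is equivalent to minimizing the KL divergence, without ever computing \(P(D)\). And since the KL divergence is non-negative, the ELBO is indeed a lower bound on the log evidence \(\log P(D)\), hence the name.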

How should we select the family of distribution \(Q\)? How should we update \(q(.)\)?

The variational family \(Q\) generally depends on 2 factors

  • Assumption about the latent variables

  • The optimization method

The optimization methods used for VI (i.e., the algorithm to update \(q(.)\)) can be categorized into the following main groups

  • Closed form update - in which case conjugacy is required

    • Algorithm(s): Coordinate ascent variational inference (CAVI)
  • Gradient based update - in which case \(q(.)\) is usually required to be differentiable

    • Algorithm(s): Stochastic variational inference (SVI)

Okay, can I get an example of a variational family \(Q\)?

Mean-field variational family

This is one of the most commonly used variational families. It can be used for both closed-form and gradient-based optimization. The assumption here is that the latent variables are mutually independent.

Given \(n\) latent variables

\[ q(\theta) = \prod_{i=1}^{n} q_i(\theta_i) \]

Where each \(q_i(\theta_i)\) is typically chosen from the exponential family of distributions, including Gaussian, Gamma, Beta, Bernoulli, etc.

It is worth mentioning that, generally speaking, the variational family is NOT a model of the data (there is no \(D\) in the formulation). It is the process of maximizing the ELBO that connects the variational family to the data.
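A minimal sketch of the mean-field factorization in R (a hypothetical two-variable example with one Gaussian and one Gamma factor, chosen only to illustrate independence; none of these names appear in the text above):

```r
# Mean-field: the joint variational density factorizes into independent terms,
# so the joint log density is just a sum of per-variable log densities
log_q <- function(theta1, theta2) {
  dnorm(theta1, mean = 0, sd = 1, log = TRUE) +      # q1: Gaussian factor
    dgamma(theta2, shape = 2, rate = 1, log = TRUE)  # q2: Gamma factor
}

log_q(0.5, 1.2)
```

This factorized form is what makes the expectations inside the ELBO tractable term by term.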

3.3 Implementation for Variational Inference

3.3.1 Manual implementation in R

The implementation is separated into the following sections:

  1. Data generation
  2. Formulate the model
  3. Formulate the variational family and ELBO
  4. Implement the CAVI algorithm
  5. Visualize the output

Step 1: Data generation

For this example, the data is generated from a mixture of \(k\) Gaussian distributions

  • We first generate the \(k\) means, one per mixture component, from a common normal distribution: \(\mu_i \sim \mathcal{N}(0, \sigma_{prior}^2)\) where \(i = 1, 2, …, k\)

  • Then sample \(n\) data points as follows:

    • Assign each data point to a cluster \(c_j \sim \text{Categorical}(1/k, ..., 1/k)\) where \(j = 1, 2, …, n\)

    • Sample a value for each data point: \(y_j \sim \mathcal{N}(\mu_{c_j}, 1)\) where \(j = 1, 2, …, n\)

Simulate data
set.seed(55) # set seed for reproducibility
n <- 1000 # number of samples
k <- 5 # set k, aka number of clusters

# sample the means of the Gaussian mixture components
sd_prior <- 4
cluster_mus <- rnorm(k, mean=0, sd=sd_prior)

# sample which cluster each data points belong to 
data_cluster <- sample(x = 1:k, size = n, replace = TRUE)

# finally, sample data points given the cluster assignment (data_cluster)
y <- rnorm(n, mean = cluster_mus[data_cluster], sd = 1)

Visualize data points

Code for plotting
ggplot(
  data.frame(
    y = y,
    data_cluster = as.factor(data_cluster)
  )
) +
  geom_jitter(
    aes(
      x = data_cluster, y = y, color = data_cluster
    )
  )

Code for plotting
ggplot(
  data.frame(
    y = y,
    data_cluster = as.factor(data_cluster)
  )
) +
  geom_density(
    aes(
      x = y, color = data_cluster
    )
  )

Step 2: Formulate the model

Functions to compute log priors
log_mu_prior <- function(mu, m_k, sd_k){
  dnorm(mu, mean = m_k, sd = sd_k, log = TRUE)
}

log_c_prior <- function(y, k){
  log(1/k)*length(y)
}

Step 3: Formulate the variational family and ELBO

In this example, we will use the previously mentioned mean-field variational family, which takes the following form under the assumption that there are \(m\) latent variables.

\[ q(\theta) = \prod_{i=1}^{m} q_i(\theta_i) \]

The parameters that the data depend on are

  • The mean of each component of the mixture, \(\mu_i\), where \(i = 1, 2, …,k\)

  • The assignment of each data point to a cluster, \(c_j\), where \(j = 1,2,…, n\)

Thus, we would need the following 2 types of variational distributions:

  • \(q_i(\mu_i)\), which we formulate as a Gaussian distribution

  • \(q_j(c_j)\), which we formulate as a vector of \(k\) probabilities

The full variational distribution is then formulated as follows

\[ q(\mu, c) = \prod_{i = 1}^k q(\mu_i) \prod_{j = 1}^n q(c_j) \]

It is crucial to keep in mind the different types of parameters we are referring to in the context of variational inference

  • Parameters refer to the parameters of the model assumed to generate the data

    • In this example, these are \(\mu_{1:k}\) and \(c_{1:n}\).
  • Variational parameters refer to the parameters of the variational distribution

    • In this example, these are the parameters of each variational factor \(q(\mu_i)\) and \(q(c_j)\)

Recall the target function to optimize:

\[ \text{ELBO}(q) = \mathbb{E}[\text{log}P(\theta, D)] - \mathbb{E}[\text{log}q(\theta)] \]

Functions to compute the expected values
# compute the expected log likelihood of the data under q
expected_log_likelihood <- function(y, q_cluster, m_k, sd_k){
  
  # per data point likelihood
  likelihood <- sapply(1:nrow(q_cluster), \(curr_row){
    
    # per cluster likelihood term: E_q[y*mu - mu^2/2],
    # dropping terms constant in the variational parameters
    sapply(1:ncol(q_cluster), \(curr_col){
      q_cluster[curr_row, curr_col]*(
        y[curr_row]*m_k[curr_col] - (m_k[curr_col]^2 + sd_k[curr_col]^2)/2
      )
    })

  })
  
  sum(likelihood)
}

# expected value of log q(mu) under q (up to an additive constant)
expected_q_mu <- function(m_k, sd_k){
  sum(dnorm(m_k, m_k, sd_k, log=TRUE))
}

expected_q_ci <- function(q_cluster){
  # treat 0 * log(0) as 0 to guard against numerical underflow
  sum(ifelse(q_cluster > 0, q_cluster*log(q_cluster), 0))
}


compute_elbo <- function(y, q_cluster, m_k, sd_k, sigma_prior=1){
  k <- ncol(q_cluster) # number of clusters, inferred from the assignment matrix
  # Expected log p(mu), i.e. expected prior for mu term
  sum(-(sd_k^2 + m_k^2)/(2*sigma_prior^2)) +
  # compute the expected log p(c_j), i.e. expected prior for c term
  log_c_prior(y, k) +
  # likelihood
  expected_log_likelihood(y, q_cluster, m_k, sd_k) -
  # compute log q(m_k, sd_k), i.e. entropy term 
  expected_q_mu(m_k, sd_k)  - 
  # compute log q(q_cluster), i.e. entropy term
  expected_q_ci(q_cluster)
}

Step 4: Implement the CAVI algorithm

Function to implement CAVI
cavi <- function(y, k = 5, max_iter = 30, 
                 prior_sd = 1, threshold = 0.1){
  
  # instantiate parameters for q(mu_i) 
  m_k <- rnorm(k)
  sd_k <- rgamma(k, 5)
  
  # instantiate q(c_j)
  q_cluster <- matrix(1/k, nrow = length(y), ncol = k) 
  
  # keep track of ELBO to check convergence
  elbo_trace <- numeric(length = max_iter+1)
  elbo_trace[1] <- compute_elbo(y, q_cluster, m_k, sd_k)
  
  for (iter in 1:max_iter){
    
    # loop through each data point 
    # update the q factor for the cluster assignment (i.e. q(c_i))
    for (j in 1:length(y)){
      # update q(c_j)
      q_cluster[j,] <- exp(y[j]*m_k - 0.5*(sd_k^2+m_k^2))
      
      # normalize rowwise
      q_cluster[j,] <- q_cluster[j,]/sum(q_cluster[j,])
    }
    
    # print(q_cluster[1:5, 1:k])
    
    # loop through each cluster
    # update the q factor of that latent variable 
    new_sd <- rep(0, k)
    new_m <- rep(0, k)
    for (i in 1:k){
      # update parameters for q(mu_i); the CAVI update yields the posterior
      # variance, so take the square root to store a standard deviation
      new_sd[i] <- sqrt(1/(1/prior_sd^2 + sum(q_cluster[,i])))
      new_m[i] <- sum(q_cluster[,i] * y)/(1/prior_sd^2 + sum(q_cluster[,i]))
    }
    sd_k <- new_sd
    m_k <- new_m
    
    # compute elbo to check convergence
    elbo_trace[iter+1] <- compute_elbo(y, q_cluster, m_k, sd_k)
    
    # early termination if converged
    if((elbo_trace[iter+1] - elbo_trace[iter]) < threshold) break
  }
  
  # truncate the ELBO trace in case of early termination
  elbo_trace <- elbo_trace[1:(iter+1)]
  
  # return a list that includes
  # - data.frame of fitted variational parameters
  # - data.frame of computed elbo per iteration
  list(
    variational_params = data.frame(
      m = m_k,
      sd_k = sd_k
    ),
    elbos = data.frame(
      iter = 0:(length(elbo_trace) - 1),
      elbo = elbo_trace
    )
  )
}

Visualize ELBOs

Code for plotting ELBOs
cavi_out <- cavi(y)

ggplot(cavi_out$elbos) +
  geom_line(
    aes(
      x = iter,
      y = elbo
    ),
    color = "cornflowerblue"
  )

Compare variational parameters and real mus

Code for plotting
cavi_out$variational_params
           m        sd_k
1  0.3539948 0.005051128
2 -4.3303392 0.005333891
3 -7.0118296 0.004471144
4  0.3539947 0.005051128
5  0.4751592 0.005052748
Code for plotting
ggplot(
  data.frame(
    y = y,
    data_cluster = as.factor(data_cluster)
  )
) +
  geom_density(
    aes(
      x = y, color = data_cluster
    )
  ) +
  geom_vline(
    aes(
      xintercept = m
    ),
    color = "red",
    linetype = "dashed",
    data = cavi_out$variational_params
  )
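To compare the fit quantitatively rather than by eye, a small helper (not in the original code; `nearest_gap` is a hypothetical name) can match each true cluster mean to its nearest fitted variational mean:

```r
# For each true mean, find the closest fitted mean and report the absolute gap.
# Matching by nearest neighbour handles the fact that mixture components
# come back in arbitrary order (label switching).
nearest_gap <- function(true_means, fitted_means) {
  sapply(true_means, function(mu) min(abs(fitted_means - mu)))
}

# e.g. nearest_gap(cluster_mus, cavi_out$variational_params$m)
```

Small gaps for every true mean indicate each component was recovered; a large gap flags a component that collapsed onto another (as the repeated rows in the fitted table above suggest).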

3.4 Applications of Variational Inference

Variational inference is widely applied in machine learning, and especially deep learning, to quantify uncertainty

  • Variational Auto-Encoder (VAE)

    • this stackoverflow discussion clearly explains the VAE from a Bayesian perspective

    • also refer to this blog to see the relationship between variational inference and variational auto-encoders

//TBD:

4 Deep Probabilistic Programming

//TBD:

Frequentist approach (?):

  • Conformal prediction

4.1 So, why consider Bayesian statistics at all?

Provide a framework to quantify uncertainty in deep learning

4.2 How Bayesian inference is adapted for deep learning

Variational inference

  • Stochastic gradient descent (SGD)

  • Monte Carlo dropout

  • Laplace approximation

  • Deep ensembles

  • Variational Encoder as the variational family

4.3 Application of probabilistic deep learning

//TBD:

References

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77. https://doi.org/10.1080/01621459.2017.1285773.