12  Samplers and inference

library(monty)

Many quantities of interest in uncertain systems can be expressed as integrals that weight outcomes according to their probability. A common approach to computing these integrals is Monte Carlo estimation, where we draw samples from our distribution of interest and take their mean as an approximation. This method has been central to probabilistic inference and has driven the development of sophisticated sampling algorithms since the advent of modern computing in the 1950s.
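
As a tiny illustration in plain R (nothing monty-specific yet), the expectation of \(x^2\) under a standard normal distribution, which is exactly 1, can be approximated by the mean of squared draws:

x <- rnorm(1e5)
mean(x^2)  # Monte Carlo estimate of E[x^2]; the exact value is 1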

The monty package supports several sampling algorithms designed to handle a variety of target distributions and leverage any available information (such as gradients). At the moment, monty provides the following samplers:

  1. Random-Walk Metropolis-Hastings (RWMH) – a simple and robust MCMC sampler which does not require gradient information.
  2. Adaptive Metropolis-Hastings – an extension of RWMH that can adapt its proposal distribution on the fly, improving efficiency over time.
  3. Hamiltonian Monte Carlo (HMC) – a gradient-based sampler that can drastically reduce autocorrelation in the chain, particularly for high-dimensional or strongly correlated targets.
  4. Parallel Tempering – a technique for tackling multimodal or complex distributions by running multiple “tempered” chains and exchanging states between them.
  5. Nested Models – an interface for combining models or applying hierarchical structures (not a “sampler” in the strict sense, but monty provides mechanisms to nest or combine models in more complex ways).

These samplers can all be accessed via constructors (e.g. monty_sampler_random_walk(), monty_sampler_hmc(), etc.) and used with the general-purpose function monty_sample(). By choosing the sampler that best suits your distribution - whether it is unimodal or multimodal, gradient-accessible or not - you can often achieve more efficient exploration of your parameter space.

12.1 Sampling without monty

In this section we are going to look at an example where we can sample from a monty model without using monty's samplers. The idea is then to compare this ideal situation with the less favourable cases where we have to resort to Monte Carlo methods to sample.

Imagine that we have a simple 2D (bivariate) Gaussian monty_model with some positive correlation:

# Set up a simple 2D Gaussian model, with correlation
VCV <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
m <- monty_example("gaussian", VCV)
m
#> 
#> ── <monty_model> ───────────────────────────────────────────────────────────────
#> ℹ Model has 2 parameters: 'a' and 'b'
#> ℹ This model:
#> • can compute gradients
#> • can be directly sampled from
#> ℹ See `?monty_model()` for more information

This monty_model is differentiable and can be directly sampled from. Direct sampling is a very low-level operation whose meaning depends on the model (for a Bayesian model it might mean sampling from the prior distribution, for example), so we advise using it carefully.
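
For example, we can draw one pair of values directly from this distribution, using the monty RNG machinery that we will also use later in this chapter:

rng <- monty_rng_create()
monty_model_direct_sample(m, rng)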

In that particularly simple case, we can even visualise its density over a grid:

a <- seq(-4, 4, length.out = 1000)
b <- seq(-3, 3, length.out = 1000)

# Evaluate the model density over the grid (exponentiated log-density)
z <- matrix(NA_real_,
            nrow = length(a),
            ncol = length(b))
for (i in seq_along(a)) {
  for (j in seq_along(b)) {
    z[i, j] <- exp(monty_model_density(m, c(a[i], b[j])))
  }
}
image(a, b, z, xlab = "a", ylab = "b",
      main = "2D Gaussian model with correlation")

As we are dealing with a bivariate normal distribution, we can use a trick to sample easily from it: we draw samples from a standard normal distribution and multiply them by the Cholesky factor of the variance-covariance matrix of our bivariate normal distribution:

n_samples <- 1000
standard_samples <- matrix(rnorm(2 * n_samples), ncol = 2)
samples <- standard_samples %*% chol(VCV)

These samples align well with the density:

image(a, b, z, xlab = "a", ylab = "b",
      main = "2D Gaussian model with samples")
points(samples, pch = 19, col = "#00000055")

They are i.i.d. and show no sign of autocorrelation, which can be checked using the acf() function: successive samples (i.e. samples lagged by one or more steps) are uncorrelated.

acf(samples[, 1],
    main = "Autocorrelation of i.i.d. samples")

12.2 Sampling with monty

However most of the time, in practical situations such as Bayesian modelling, we are not able to sample using simple functions, and we have to use an MCMC sampler. monty samplers exploit the properties of the underlying monty model (e.g. availability of gradient) to draw samples. While they differ, a key commonality is that they are based on a chain of Monte Carlo draws and are thus characterised by their number of steps in the chain, n_steps. Also, they are built using a constructor of the form monty_sampler_name() and then samples are generated using the monty_sample() function.

12.2.1 Random-Walk Metropolis-Hastings

Random-Walk Metropolis-Hastings (RWMH) is one of the most straightforward Markov chain Monte Carlo (MCMC) algorithms. At each iteration, RWMH proposes a new parameter vector by taking a random step - usually drawn from a symmetric distribution (e.g. multivariate normal with mean zero) - from the current parameter values. This new proposal is then accepted or rejected based on the Metropolis acceptance rule, which compares the density of the proposed point with that of the current point.

The random-walk nature of the proposal means that tuning the step size and directionality (defined by a variance-covariance matrix in multiple dimensions) is crucial. If steps are too large, many proposals will be rejected; if they are too small, the chain will move slowly through parameter space, leading to high autocorrelation in the samples. RWMH can be a good starting point for problems where gradient information is unavailable or when simpler methods suffice.
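
To make the acceptance rule concrete, here is a minimal hand-written RWMH loop for the 2D Gaussian model above. This is an illustrative sketch only, not monty's implementation, and it uses an independent-normal proposal rather than a full variance-covariance matrix:

n_steps <- 1000
theta <- c(0, 0)
chain <- matrix(NA_real_, n_steps, 2)
log_d <- monty_model_density(m, theta)
for (i in seq_len(n_steps)) {
  proposal <- theta + rnorm(2, sd = 0.3)     # symmetric random-walk proposal
  log_d_new <- monty_model_density(m, proposal)
  if (log(runif(1)) < log_d_new - log_d) {   # Metropolis acceptance rule
    theta <- proposal
    log_d <- log_d_new
  }
  chain[i, ] <- theta                        # current state (repeated if rejected)
}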

The monty_sampler_random_walk() function allows us to define an RWMH sampler by passing the variance-covariance matrix of the proposal distribution as an argument:

vcv <- diag(2) * 0.1
sampler_rw <- monty_sampler_random_walk(vcv = vcv)

Once the sampler is built, the generic monty_sample() function can be used to generate samples:

res_rw <- monty_sample(m, sampler_rw, n_steps = 1000)
#> ✔ Sampled 1000 steps across 1 chain in 49ms

This produces a chain of length n_steps that can be visualised:

plot(res_rw$pars[1, , 1],
     type = "l",
     ylab = "Value of sample",
     main = "MCMC chain from RWMH sampler")

Samples can also be visualised over the density:

image(a, b, z, xlab = "a", ylab = "b",
      main = "2D Gaussian model with RWMH samples")
points(t(res_rw$pars[, , 1]), pch = 19, col = "#00000055")

And we can look at the autocorrelation of the chain:

acf(res_rw$pars[1, , 1], 200,
    main = "Autocorrelation of the RWMH samples")

We can see from the autocorrelation function that we have to wait almost 100 steps for the algorithm to produce a “new” (i.e. not correlated) sample, meaning a lot of calculation is “wasted” just trying to move away from previous samples. Improving the number of uncorrelated samples is at the heart of many strategies to improve the efficiency of Monte Carlo sampling algorithms. In particular, the adaptive Metropolis-Hastings sampler offers tools to automatically optimise the proposal distribution.
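
To put a rough number on this, we can sketch an effective sample size estimate from the chain's autocorrelation. This is a crude calculation for illustration only; dedicated packages such as coda provide more careful estimators:

chain <- res_rw$pars[1, , 1]
rho <- acf(chain, lag.max = 200, plot = FALSE)$acf[-1]  # drop lag 0
keep <- cumsum(rho <= 0) == 0                           # keep leading positive lags only
n_eff <- length(chain) / (1 + 2 * sum(rho[keep]))
n_eff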

12.2.2 Adaptive Metropolis-Hastings Sampler

TODO: Content about the adaptive Metropolis-Hastings sampler to be placed here.

12.2.3 Nested models

TODO: Content about nested models to be placed here.

12.2.4 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) leverages gradient information of the log-density to make proposals in a more directed manner than random-walk approaches. By treating the parameters as positions in a physical system and introducing “momentum” variables, HMC uses gradient information to simulate Hamiltonian dynamics to propose new states. This often allows the sampler to traverse the parameter space quickly, reducing random-walk behaviour and yielding lower autocorrelation.

Under the hood, HMC introduces auxiliary “momentum” variables that share the same dimensionality as the parameters. At each iteration, the momentum is drawn from a suitable distribution (typically multivariate normal) that decides the initial (random) direction of the move. The algorithm then performs a series of leapfrog steps that evolve both the parameter and momentum variables forward in “fictitious time.” These leapfrog steps use the gradient of the log-posterior to move through parameter space in a way that avoids the random, diffusive behaviour of simpler samplers. Crucially, at the end of these steps, a Metropolis-style acceptance step ensures that the correct target distribution is preserved. For a more complete introduction, see Neal (2011).
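
To make the leapfrog scheme concrete, here is a sketch of a single HMC proposal written by hand, using monty_model_gradient() to evaluate the gradient of the log-density. This is an illustration of the scheme only, not monty's implementation:

hmc_propose <- function(model, theta, epsilon, n_leapfrog) {
  p <- rnorm(length(theta))                                  # draw momentum
  p <- p + epsilon / 2 * monty_model_gradient(model, theta)  # initial half step
  for (i in seq_len(n_leapfrog)) {
    theta <- theta + epsilon * p                             # full position step
    if (i < n_leapfrog) {
      p <- p + epsilon * monty_model_gradient(model, theta)  # full momentum step
    }
  }
  p <- p + epsilon / 2 * monty_model_gradient(model, theta)  # final half step
  list(theta = theta, p = -p)  # acceptance then compares density plus kinetic energy
}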

Tuning HMC involves setting parameters such as the step size (often denoted \(\epsilon\)) and the number of leapfrog (or integration) steps. A small step size improves the accuracy of the leapfrog integrator but increases computational cost, whereas too large a step size risks poor exploration and lower acceptance rates. Likewise, the number of leapfrog steps determines how far the chain moves in parameter space before proposing a new sample. Balancing these factors can initially be more involved than tuning simpler methods like random-walk Metropolis. However, modern approaches - such as the No-U-Turn Sampler (NUTS) (Hoffman and Gelman 2014) - dynamically adjust these tuning parameters, making HMC more user-friendly and reducing the need for extensive manual calibration. Although NUTS is not yet available in monty, it is a high priority on our development roadmap.

By leveraging gradient information, HMC is often able to navigate complex, high-dimensional, or strongly correlated posteriors far more efficiently than random-walk-based approaches. Typically, it exhibits substantially lower autocorrelation in the resulting chains, meaning fewer samples are needed to achieve a given level of accuracy in posterior estimates. As a result, HMC has become a default choice in many modern Bayesian libraries (e.g. Stan, PyMC). Nevertheless, HMC’s reliance on smooth gradient information limits its applicability to models where the target density is differentiable - discontinuities can pose significant challenges. Moreover, while HMC can drastically reduce random-walk behaviour, the additional computations required for gradient evaluations and Hamiltonian integration mean that it is not always faster in absolute wall-clock terms, particularly for models with costly gradient functions.

12.2.4.1 An Illustrative Example: The Bendy Banana

To see HMC in action, we can compare it against a random-walk Metropolis sampler on a two-dimensional “banana”-shaped distribution. Our model takes two parameters alpha and beta, and is based on two successive simple draws, with one conditional on the other, so \(\beta \sim Normal(0, 1)\) and \(\alpha \sim Normal(\beta^2, \sigma)\), with \(\sigma\) the standard deviation of the conditional draw.

We include this example within the package; here we create a model with \(\sigma = 0.5\):

m <- monty_example("banana", sigma = 0.5)
m
#> 
#> ── <monty_model> ───────────────────────────────────────────────────────────────
#> ℹ Model has 2 parameters: 'alpha' and 'beta'
#> ℹ This model:
#> • can compute gradients
#> • can be directly sampled from
#> • accepts multiple parameters
#> ℹ See `?monty_model()` for more information

We can plot a visualisation of its density by computing the density over a grid. Normally this is not feasible (the parameter space is usually too large), but here the model is small enough to illustrate:

a <- seq(-2, 6, length.out = 1000)
b <- seq(-2.5, 2.5, length.out = 1000)
z <- outer(a, b, function(alpha, beta) {
  exp(monty_model_density(m, rbind(alpha, beta)))
})
image(a, b, z, xlab = "alpha", ylab = "beta")

In this particular case we can also easily generate samples directly from the model, so we know what a good sampler should produce:

rng <- monty_rng_create()
s <- vapply(seq(200), function(x) monty_model_direct_sample(m, rng), numeric(2))
image(a, b, z, xlab = "alpha", ylab = "beta")
points(s[1, ], s[2, ], pch = 19, col = "#00000055")

It is also possible to compute the contour containing 95% of the probability mass, using the relationship between the standard bivariate normal distribution and the banana-shaped distribution defined above. We can check that roughly 10 of our 200 samples fall outside this 95% contour:

theta <- seq(0, 2 * pi, length.out = 10000)
z95 <- local({
  sigma <- 0.5
  r <- sqrt(qchisq(.95, df = 2))
  x <- r * cos(theta)
  y <- r * sin(theta)
  cbind(x^2 + y * sigma, x)
})
image(a, b, z, xlab = "alpha", ylab = "beta")
lines(z95[, 1], z95[, 2])
points(s[1, ], s[2, ], pch = 19, col = "#00000055")

12.2.4.1.1 Random-Walk Metropolis on the Banana

It is not generally possible to directly sample from a density (otherwise MCMC and similar methods would not exist!). In these cases, we need to use a sampler based on the density and - if available - the gradient of the density.

We can start with a basic random-walk sampler to see how it performs on the banana distribution:

sampler_rw <- monty_sampler_random_walk(vcv = diag(2) * 0.01)
res_rw <- monty_sample(m, sampler_rw, 1000)
#> ✔ Sampled 1000 steps across 1 chain in 46ms
plot(t(drop(res_rw$pars)), type = "l", col = "#0000ff66",
     xlim = range(a), ylim = range(b))
lines(z95[, 1], z95[, 2])

As we can see, this is not great, exhibiting strong random-walk behaviour as it slowly explores the surface (this is over 1,000 steps).

Another way to view this is by plotting parameters over steps:

matplot(t(drop(res_rw$pars)), lty = 1, type = "l", col = c(2, 4),
        xlab = "Step", ylab = "Value")

We could try improving the samples here by finding a better variance-covariance matrix (VCV), but the banana shape means a single VCV will not be ideal across the whole space.
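
One rough heuristic (a sketch only, not something monty does automatically) is to base the proposal on the empirical covariance of the direct samples we drew earlier:

vcv_emp <- cov(t(s))  # empirical covariance of the 200 direct samples
sampler_rw_emp <- monty_sampler_random_walk(vcv = vcv_emp)
res_rw_emp <- monty_sample(m, sampler_rw_emp, 1000)

This may mix somewhat better, but no single matrix matches the local, curved correlation structure of the banana everywhere.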

12.2.4.1.2 Hamiltonian Monte Carlo on the Banana

Now let’s try the Hamiltonian Monte Carlo (HMC) sampler, which uses gradient information to move more efficiently in parameter space:

sampler_hmc <- monty_sampler_hmc(epsilon = 0.1, n_integration_steps = 10)
res_hmc <- monty_sample(m, sampler_hmc, 1000)
plot(t(drop(res_hmc$pars)), type = "l", col = "#0000ff33",
     xlim = range(a), ylim = range(b))
lines(z95[, 1], z95[, 2])

Or viewed over steps:

matplot(t(drop(res_hmc$pars)), lty = 1, type = "l", col = c(2, 4),
        xlab = "Step", ylab = "Value")

Clearly, HMC outperforms the random walk, rapidly exploring the full banana-shaped region with far less random wander.

12.2.5 Parallel Tempering Sampler

Parallel tempering (also known as replica exchange MCMC) is a technique designed to improve the mixing of Markov chain Monte Carlo methods, particularly in situations where the posterior distribution is multimodal or otherwise challenging to explore.

Let’s for example define a simple Bayesian model where the likelihood is a mixture model:

ex_mixture <- function(x) log(dnorm(x, mean = 5) / 2 + dnorm(x, mean = -5) / 2)
likelihood <- monty_model_function(ex_mixture, allow_multiple_parameters = TRUE)

and the prior is a normal distribution with a wider variance:

prior <- monty_dsl({
  x ~ Normal(0, 10)
})

and the posterior is simply the product of the two (or the sum of their log-densities):

posterior <- likelihood + prior
x <- seq(-10, 10, length.out = 1001)
y <- exp(posterior$density(rbind(x)))
plot(x, y / sum(y) / diff(x)[[1]], col = "red", type = 'l', ylab = "density")

Let's first see how the random-walk sampler introduced earlier behaves on this bimodal posterior:

vcv <- matrix(1.5)
sampler_rw <- monty_sampler_random_walk(vcv = vcv)

Once the sampler is built, the generic monty_sample() function can be used to generate samples:

res_rw <- monty_sample(posterior, sampler_rw, n_steps = 1000)
plot(res_rw$pars[1, , ],
     ylim = c(-8, 8),
     ylab = "Value",
     main = "RW samples (1 chain)")
abline(h = -5, lty = 3, col = "red")
abline(h = 5, lty = 3, col = "red")

As we can see, the RW sampler does not manage to move outside of the mode that it starts from. MCMC theory tells us that it will eventually reach the other side of the distribution, whether with a (low probability) large jump or by accepting several consecutive small disadvantageous steps in the right direction. In practice, it can mean that the second mode might never be explored in the finite amount of time allowed for sampling.

One might think that running multiple chains from different initial values could solve the problem. Suppose then that we run three chains, starting at values -5, 0 and 5:

res_rw3 <- monty_sample(posterior, sampler_rw, n_steps = 1000, n_chains = 3,
                        initial = rbind(c(-5, 0, 5)))
matplot(res_rw3$pars[1, , ], type = "p", pch = 21,
        col = c("blue", "green", "purple"),
        ylim = c(-8, 8), ylab = "Value", main = "RW samples (3 chains)")
abline(h = -5, lty = 3, col = "red")
abline(h = 5, lty = 3, col = "red")

We can see that with our three chains we are exploring both modes, but with \(\frac{1}{3}\) of our samples from one mode and \(\frac{2}{3}\) from the other, which we know is not representative of the true distribution! So even if running multiple chains helps identify and explore more than one local mode, they cannot tell you anything about the relative weights of those modes if each chain ends up exploring only a single mode. Furthermore, there is no guarantee that simply running multiple chains from different starting points will even identify all modes, particularly in higher dimensions. We need a better way to make use of multiple chains, and that's where parallel tempering comes in.

The core idea behind parallel tempering is to run multiple chains at different “temperatures” - where the posterior is effectively flattened or softened at higher temperatures - allowing the “hot” chains to move more freely across parameter space. Periodically, the states of the chains are swapped according to an acceptance rule that preserves the correct stationary distribution. The hope is that the higher-temperature chains will quickly explore multiple modes, and then swap states with the lower-temperature chains, thereby transferring information about other modes and improving overall exploration.
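
To see what “flattening” means, one common tempering scheme raises the target density to a power \(\beta \le 1\) (the inverse temperature): \(\beta = 1\) recovers the target, while smaller \(\beta\) gives a flatter, easier-to-explore distribution. Here is a sketch of that idea applied to our mixture posterior; this illustrates the concept, not how monty constructs its rungs internally:

x <- seq(-10, 10, length.out = 1001)
log_d <- posterior$density(rbind(x))
plot(x, exp(log_d) / sum(exp(log_d)), type = "l",
     ylab = "scaled density", main = "Tempered versions of the posterior")
for (beta in c(0.5, 0.2, 0.05)) {
  d <- exp(beta * log_d)        # "hotter" rung: flattened version of the posterior
  lines(x, d / sum(d), lty = 2)
}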

This can also be beneficial if your model can efficiently evaluate multiple points in parallel (via parallelisation or vectorisation). Even though parallel tempering runs multiple chains, the computational cost can be partly offset by improved mixing or by leveraging multiple CPU cores.

In monty, parallel tempering is implemented in the function monty_sampler_parallel_tempering() based on the non-reversible swap approach (Syed et al. 2022). Currently, this sampler uses a random-walk Metropolis step (the same as in monty_sampler_random_walk()) within each chain for local exploration, but we plan to allow other samplers for local exploration in future releases.

s <- monty_sampler_parallel_tempering(n_rungs = 10, vcv = matrix(1.5))

Here, n_rungs specifies the number of additional chains (beyond the base chain) to run at different temperatures (so total chains = n_rungs + 1). The argument vcv is the variance-covariance matrix that will be passed to the underlying random-walk sampler. An optional base argument can be used to specify a “reference” model if your original model cannot be automatically decomposed into prior and posterior components, or if you want an alternative easy-to-sample-from distribution for the hottest rung.

res_pt <- monty_sample(posterior, s, 1000, n_chains = 1)
plot(res_pt$pars[1,,],
     ylim = c(-8, 8),
     ylab = "Value",
     main = "PT samples")
abline(h = -5, lty = 3, col = "red")
abline(h = 5, lty = 3, col = "red")

hist(c(res_pt$pars), freq = FALSE,
     xlab = "Values of samples",
     main = "Parallel tempering samples")
x <- seq(min(res_pt$pars), max(res_pt$pars), length.out = 1001)
y <- exp(posterior$density(rbind(x)))
lines(x, y / sum(y) / diff(x)[[1]], col = "red")

Note

A parallel tempering sampler performs more total computations than a single-chain sampler because it runs multiple chains simultaneously (n_rungs + 1 chains) with only the “coldest” chain targeting the distribution of interest. However, there are scenarios where this additional cost pays off:

  1. Parallelisation: If your model can be efficiently evaluated across multiple cores, the wall-clock time may not be much larger than for a single chain - though CPU usage will naturally be higher.

  2. Vectorisation: In R, if the density calculations can be vectorised, then evaluating multiple chains in one go (in a vectorised manner) may not cost significantly more than evaluating a single chain.

  3. Multimodality: In models with multiple distinct modes, standard samplers often get “stuck” in a single mode. Parallel tempering can swap states between chains at different temperatures, making it more likely to fully explore all modes, thereby improving posterior exploration and reducing the risk of biased inference.