Posterior Samplers for Turing.jl

Prompted by a question on the Turing.jl Slack about which Bayesian sampling algorithms to use for which kinds of problems, I compiled a quick off-the-cuff summary of my opinions on specific samplers and how and when to use them. Take these with a grain of salt, as I have more experience with some than with others; in any case, the nice thing about a framework like Turing is that you can switch out samplers easily and test for yourself which works best for your application.

To get a good visual sense of how different samplers explore a parameter space, the animations page by Chi Feng is a great resource.

The following list covers mainly the samplers included by default in Turing. There’s a lot of work in Bayesian computation on other algorithms, and on alternative implementations of these ones, which could lead to different conclusions.

  1. Metropolis-Hastings (MH): Explores the space with random proposals, accepting or rejecting each move. Extremely simple, extremely slow, but can “work” in most models. Mainly worth a try if everything else fails.

  2. HMC/NUTS: Gradient-based exploration, which requires the log posterior to be differentiable in the parameters. When that holds it’s fast, so it is almost always the right choice if you can make your model differentiable. It is sometimes so much better that it’s worth reformulating a non-differentiable model just to use it, e.g. by marginalizing out discrete parameters. The differences between NUTS and the default HMC algorithm are relatively minor. (Basic usage appears in the first sketch after this list.)

  3. Gibbs sampling: A “coordinate-ascent”-like sampler which updates blocks of parameters by drawing from their conditional distributions. It was historically popular for factorizable models in which conjugacy makes the conditional distributions available in closed form. It’s still useful when you can do this, but slow for general models. Its main use now is for combining samplers, for example HMC for the differentiable parameters and something else for the non-differentiable ones (see the Gibbs sketch after this list).

  4. SMC/“Particle Filtering”: A method based on importance sampling: it draws a population of particles from an initial distribution, then repeatedly reweights and updates them. It is designed to work well when the proposal distribution and updates are close to the targets, and the number of particles should be large for reasonable accuracy. Turing’s implementation does this parameter by parameter, starting at the prior and updating, which is close to what you want for the most common intended use: state space models with sequential structure, the main case where I would even consider this. That said, tuning the proposals is really important, and more customizable SMC methods are useful whenever you have a computationally tractable approximate posterior that you want to update toward the exact posterior. That tends to be model-specific, though, and not a good use case for generic PPLs.

  5. Particle Gibbs (PG), or “Conditional SMC”: Like SMC, but modified so that it can be used as a valid transition step within an MCMC scheme. Its main use that I can see is as a component of a Gibbs sampler, where it can handle a discrete parameter while HMC handles the rest (this combination appears in the Gibbs sketch below). Because it is a repeated transition step rather than a one-shot approximation, the number of particles doesn’t have to be overwhelmingly large, but too small a number can still cause problems.

  6. Stochastic Gradient methods (SGLD/SGHMC/SGNHT): Gradient-based methods that subsample the data to get cheaper but noisier gradient estimates, approximating their full-gradient counterparts (SGHMC approximates HMC; SGLD approximates Langevin dynamics, which also uses gradients but is simpler and usually slightly worse than HMC). These are designed for large-data applications where a full pass through the data at each iteration may be infeasible. They are popular for Bayesian neural networks, where optimization methods also rely on data subsampling. (A sketch of the SGLD update appears after this list.)

  7. Variational Inference: Not a sampler per se. It posits a parametric family for the shape of the posterior and then optimizes the fit to the posterior according to a computationally feasible criterion (i.e., one which doesn’t require computing the normalizing constant in Bayes’ rule), letting you optimize instead of sample. In general this has no guarantee of reaching the true posterior, no matter how long you run it, but if you want a slightly wrong answer very fast it can be a good choice. It’s also popular for Bayesian neural networks and other “big” models like high-dimensional topic models (see the VI sketch after this list).
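To make the interchangeability concrete, here is a minimal sketch of swapping samplers on a toy model. The model and data are invented for illustration; the `sample(model, alg, n)` calls follow Turing’s documented interface, though defaults can change between releases.

```julia
using Turing

# Toy model: normal data with unknown mean and scale (hypothetical example).
@model function demo(x)
    μ ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1); lower=0)
    x .~ Normal(μ, σ)
end

model = demo(randn(100) .+ 2.0)  # fake data centered at 2

# Same model, three different samplers:
chain_mh   = sample(model, MH(), 10_000)   # random proposals: simple but slow
chain_nuts = sample(model, NUTS(), 1_000)  # gradient-based: usually far more efficient
chain_smc  = sample(model, SMC(), 1_000)   # particle-based importance sampling
```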
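For combining samplers within Gibbs, here is a hedged sketch in the spirit of the mixture-model examples in the Turing docs: PG updates the discrete cluster assignments while HMC updates the continuous means. The `Gibbs(HMC(...), PG(...))` constructor style matches older Turing releases; newer versions pair samplers with variable names differently, so check the current docs.

```julia
using Turing

# Hypothetical two-component mixture with discrete assignments k.
@model function mixture(x)
    μ ~ filldist(Normal(0, 5), 2)       # continuous component means
    k = Vector{Int}(undef, length(x))
    for i in eachindex(x)
        k[i] ~ Categorical([0.5, 0.5])  # discrete, non-differentiable
        x[i] ~ Normal(μ[k[i]], 1.0)
    end
end

model = mixture(randn(50))

# PG handles the discrete k; HMC handles the continuous μ.
chain = sample(model, Gibbs(PG(20, :k), HMC(0.05, 10, :μ)), 1_000)
```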
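Turing exports SGLD and SGHMC samplers directly, but to show what “subsampled gradients plus injected noise” means, here is a minimal sketch of a single SGLD update in plain Julia. The function names and minibatching scheme are my own invention for illustration, not Turing’s API.

```julia
# One SGLD step: θ ← θ + (ε/2)·ĝ + N(0, ε·I), where ĝ estimates the gradient
# of the log posterior from a random minibatch, scaled up to the full data size.
function sgld_step(θ, grad_logprior, grad_loglik, data, batchsize, ε)
    batch = data[rand(1:length(data), batchsize)]  # subsample the data
    ĝ = grad_logprior(θ) .+
        (length(data) / batchsize) .* sum(grad_loglik(θ, xi) for xi in batch)
    return θ .+ (ε / 2) .* ĝ .+ sqrt(ε) .* randn(length(θ))  # step + noise
end
```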
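Finally, a hedged VI sketch following the ADVI interface from Turing’s variational inference tutorial; the `vi`/`ADVI` signatures have changed across releases, so treat this as illustrative rather than definitive.

```julia
using Turing

@model function demo(x)
    μ ~ Normal(0, 1)
    x .~ Normal(μ, 1.0)
end

model = demo(randn(100))

# ADVI(samples_per_step, max_iters) fits a mean-field Gaussian approximation
# by optimizing the ELBO, instead of sampling.
q = vi(model, ADVI(10, 1_000))

# Draw cheaply from the fitted approximation rather than running MCMC.
posterior_draws = rand(q, 1_000)
```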
