Methods: Introduction to Bayesian Analysis
Introduction
So far, my posts have analysed the Sliced datasets using a mixture of traditional statistical models and machine learning methods, but a New Year brings a new approach. In this series of posts, I will return to those same datasets and reanalyse them using what statisticians call Bayesian models and data scientists sometimes call Probabilistic Programming Languages (PPLs).
When I reanalyse the data, I want to concentrate on the models and the R code without repeatedly explaining the logic of Bayes theorem or the ideas behind the various Markov Chain Monte Carlo (MCMC) algorithms. To free my analysis posts from the need for continual justification, this post presents an overview of Bayesian methods and their terminology and I have written three preliminary methods posts that explain,
- MCMC algorithms
- R Software for Bayesian Analysis
- Assessing MCMC output
If you are already familiar with Bayesian analysis, then you can skip these methods posts and jump right into the analyses.
Bayesian Terminology
For me, a model is an equation or a piece of code that generates data. A model could be deterministic or probabilistic, but, for the purposes of these posts, I am only interested in the latter. A model has to be completely defined. So a normal distribution with a mean of 6 and a standard deviation of 0.5 is a probabilistic model, but a normal distribution with an unspecified mean and/or standard deviation is a family, or collection, of models.
Models are made specific by specifying the values for their parameters, so you can think of a set of parameters as indexing a particular model within a family.
I use \(\theta\), which is usually a vector, as the name for a set of parameters. I use \(y\) to stand for any set of data generated by a model, but when I want to emphasise that I am referring to the actual observed data, I call it \(y^*\).
The Basic Idea of Bayesian Analysis
Traditional statistical analysis is based on models that tell us the set of probabilities that should be used when generating data
\[
P(data \ | \ model) = P( y \ | \ \theta)
\]
The numerical value of this probability for the observed set of data, \(y^*\)is called the Likelihood
\[
Likelihood = P( observed \ data \ | \ model) = P(y^* \ | \ \theta)
\]
The likelihood is just a number.
Two fully specified models indexed by \(\theta_1\) and \(\theta_2\) are compared by the ratio of their likelihoods, \[ Likelihood \ Ratio = \frac{P(y^* \ | \ \theta_1)}{P(y^* \ | \ \theta_2)} \] The model that is more likely to have generated the observed data is preferred. This leads naturally to the idea of maximum likelihood; of all possible values of \(\theta\), which is most likely to have generated the observed data.
I have glossed over a couple of technical points. Firstly, if the data are continuous, then the probabilities become probability densities and secondly, because interest is in the ratio of likelihoods, terms that cancel are often dropped from the definition of a likelihood, so that it is common to write, \[ Likelihood \propto P(y^* \ | \ \theta) \]
The Bayesian argument is that, when the aim is to compare models, it makes more sense to reverse the conditioning and use, \[ P(model \ | \ observed \ data) = P( \theta \ | \ y^*) \] Given an observed set of data, what is the probability that it was generated by the model indexed by \(\theta\)? For a Bayesian, the observed data are given and the objective is to say something about the model and not the other way around.
Bayes theorem tells us how the two approaches are related \[ P( \theta \ | \ y^* ) = \frac{P( y^* \ | \ \theta) P(\theta)}{P(y^*)} \] Each of the terms in this expression has a name, so in words, \[ Posterior = \frac{Likelihood \ \text{x} \ Prior}{Marginal \ Likelihood} \]
The argument for preferring the posterior distribution to the likelihood as the basis of statistical inference is a strong one, but Bayes theorem highlights the two big problems of Bayesian inference.
- Bayesian inference is only possible if you can specify the prior
- Bayesian inference is only possible if you can calculate the marginal likelihood
The prior describes the probability that a particular model will be used for data generation based on what was known before the data were seen. This always boils down to a subjective judgement and it requires extra mental gymnastics because real data are not generated by a model.
The marginal likelihood is the probability of the observed data averaged over the entire family of models that are indexed by \(\theta\), which requires firstly, that you can specify all possible models and secondly, that you can do the computation. When \(\theta\) is continuous, the quantity that you need to calculate can be expressed as an integral. \[ p(y^*) = \int P(y^* \ | \ \theta) P( \theta) d\theta \] When \(\theta\) is discrete, the integral becomes a sum over the possible values of \(\theta\).
For many years, traditionalists would express unease about using a subjective prior, but the real reason that Bayesian analysis was not more widely used was that the computation of the marginal likelihood was too difficult. Then, along came Markov Chain Monte Carlo (MCMC) and Bayesian analysis took off.
Prediction
Machine learning usually uses a model for prediction, so before we get into the mechanics of computing posterior probabilities and marginal likelihoods, let’s see how a Bayesian would tackle a prediction problem. This will also involve complex calculations, but once we establish a method for finding posteriors, all of the other computations will follow.
Let’s suppose that we want to predict future data, \(y\) given an observed set of data \(y^*\), when we do not know for certain which model (value of \(\theta\)) is being used to generate the data. In a Bayesian analysis, we obtain the predictive distribution of future data by averaging over all possible models weighted by their posterior probabilities.
\[
P( y \ | \ y^*) = \int P(y \ | \ \theta, y^*) \ P(\theta \ | \ y^*) d\theta
\]
Almost certainly, we will assume that once we know the model (i.e. the exact \(\theta\)), future data, \(y\), is independent of past data, \(y^*\). So, we can rewrite the predictive distribution as \[ P( y \ | \ y^*) = \int P(y \ | \ \theta) \ P(\theta \ | \ y^*) d\theta \]
We see that, for a Bayesian, a prediction is a distribution and not just a number. A Bayesian believes that future data will look like a random draw from \(P( y \ | \ y^*)\). If you had to make a guess at a single number (point prediction), then the choice that you would make would depend on the cost associated with making a mistake.
Machine learning problems usually specify a loss function, which they seek to minimise, but we could equally well work in terms of minus the loss, variously called the reward function or the utility function, which we would seek to maximise. Bayesian statisticians usually prefer maximising utility, but I’ll stick to minimising the loss as it is closer to machine learning.
If the loss associated with a wrong prediction is quadratic, that is, the loss goes up with the square of the difference between the point prediction and the actual future observation, then the expected loss is minimised by the mean of the predictive distribution. If the loss is linear, then we minimise the expected loss by using the distribution’s median as the point prediction and if the loss is a zero-one function (all errors are equally bad), then the prediction that minimises the expected loss is the mode of the predictive distribution.
Generally, if L is the loss, we seek a point prediction, \(Y\), to minimise \[ \int L(y - Y) \ P( y \ | \ y^*) dy \]
Specifying the Prior
It is hard enough to select a family of models in advance of seeing the data, but to rank each member of the family in terms of its probability is almost impossible. Yet that is what the prior requires.
For most problems, the prior that we specify for \(\theta\) will be very vague or imprecise. This just says that before we collect the data, we usually know very little about the data generating mechanism. Most of the information about the model will come from the data. This is not a problem, but a statement of the way things are. As a result, we specify the prior as best we can, knowing that for most problems the prior will not be critical.
However, for small datasets, the prior will have a stronger influence on the posterior and there are models that include parameters that are poorly estimated even by large datasets, so, again, the prior will be influential. Obviously, you cannot try different priors and then choose the one that most appeals. Whatever prior you choose, that is the one that you have to use, but it is still useful to know if the analysis is sensitive to that choice, which you can discover by trying a range of other priors.
My personal opinion is that data scientists should spend time thinking about their priors rather than just opting for a standard “vague” prior. I hope that the logic of this position will come through in the Sliced analyses.
Computation and the other methods posts
Substituting the formula for the marginal likelihood into Bayes theorem we get \[ P( \theta \ | \ y^* ) = \frac{P(y^* \ | \ \theta) P( \theta)}{\int P(y^* \ | \ \theta) P( \theta) d\theta} \] and for years Bayesian analysis was restricted to problems for which the integral was analytically simple, or of low enough dimension that it could be approximated numerically.
The break-through came when statisticians adopted iterative simulation algorithms, many of which had been developed for solving problems in physics. The basic idea is to simulate a sequence of values of \(\theta\) using an algorithm in which the next value depends on the current value (a Markov Chain). The trick is to design the algorithm is such a way that values of \(\theta\) in the chain occur with probability equal to the posterior probability that is being represented.
The most commonly used MCMC algorithms are Metropolis-Hastings samplers, Slice samplers and the Hamiltonian Monte Carlo (HMC) samplers. I describe these algorithms in the first of my Bayesian methods posts.
There are several programs and R packages that are specifically designed for general Bayesian analysis, they include the BUGS family of programs, the nimble package, Greta and Stan. I describe these programs in a second methods post and I will use them all in my analyses of the Sliced datasets.
The Bayesian programs all provide a language for specifying the desired family of models and a range of MCMC algorithms that can be used to obtain samples from the posterior or the predictive distribution. They also have the internal logic needed to recommend sensible samplers for that model.
Many thousands of samples may be needed to represent a multidimensional posterior, so the full MCMC computation can be time consuming. R is a slow language, so Bayesian computation in R would not be practical and the programs either export the computation to faster software, or they translate the R code into a faster language such as C++ and then compile the code before running the samplers.
Having obtained an MCMC sample, it is important to check that it is an adequate representation of the target distribution and then we need to summarise and communicate what the sample tells us about the model. The inspection and summary of an MCMC sample is described in the third of my methods posts.