By Kevin Gray and Andrew Gelman
Marketing scientist Kevin Gray asks Columbia University Professor Andrew Gelman to spell out the ABCs of Bayesian statistics.
KG: Most marketing researchers have heard of Bayesian statistics but know little about it. Can you briefly explain in layperson’s terms what it is and how it differs from the ‘ordinary’ statistics most of us learned in college?
AG: Bayesian statistics uses the mathematical rules of probability to combine data with ‘prior information’ to give inferences which – if the model being used is correct – are more precise than would be obtained by either source of information alone.
Classical statistical methods avoid prior distributions. In classical statistics, you might include a predictor in your model (for example), exclude it, or pool it as part of some larger set of predictors to get a more stable estimate. These are pretty much your only choices. In Bayesian inference you assign a prior distribution representing the range of values the coefficient can plausibly take. You can reproduce the classical methods using Bayesian inference, but you can also do much more: by using what is called an ‘informative prior’, you can partially constrain a coefficient, striking a compromise between noisy least-squares estimation and setting it to zero entirely. This turns out to be a powerful tool in many problems.
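To make the idea of an informative prior concrete, here is a minimal sketch in Python (not from the interview; the coefficient estimate, standard error, and prior scale are all made-up numbers). Under a conjugate normal model, the posterior estimate is a precision-weighted compromise between the noisy least-squares estimate and the prior’s center at zero:

```python
import numpy as np

# Hypothetical numbers for illustration only.
beta_hat = 2.4   # noisy least-squares estimate of a coefficient
se = 1.5         # its standard error
tau = 1.0        # prior scale: coefficient assumed a priori ~ Normal(0, tau^2)

# Precisions (inverse variances) of the two sources of information.
prec_data = 1 / se**2
prec_prior = 1 / tau**2

# Posterior mean and sd under the conjugate normal model:
# a weighted average of the data estimate and the prior center (zero).
post_mean = (prec_data * beta_hat + prec_prior * 0.0) / (prec_data + prec_prior)
post_sd = np.sqrt(1 / (prec_data + prec_prior))

print(f"least-squares estimate: {beta_hat:.2f} (se {se:.2f})")
print(f"posterior estimate:     {post_mean:.2f} (sd {post_sd:.2f})")
# A very wide prior (large tau) reproduces the classical answer;
# tau -> 0 forces the coefficient to zero; values in between partially pool.
```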
KG: Could you give us a quick overview of its history and how it has developed over the years?
AG: The theory of Bayesian inference originates with its namesake, Thomas Bayes, an 18th-century English cleric, but it really took off in the late 18th century with the work of the French mathematician and physicist Pierre-Simon Laplace. Bayesian methods were used for a long time after that to solve specific problems in science, but it was in the mid-20th century that they were proposed as a general statistical tool.
Within statistics, Bayesian and related methods have become gradually more popular over the past several decades, often developed in different applied fields, such as animal breeding in the 1950s, educational measurement in the 1960s and 1970s, spatial statistics in the 1980s, and marketing and political science in the 1990s. Eventually a sort of critical mass developed in which Bayesian models and methods that had been developed in different applied fields became recognized as more broadly useful.
Another factor that has fostered the spread of Bayesian methods is progress in computing speed and improved computing algorithms. Except in simple problems, Bayesian inference requires difficult mathematical calculations – high-dimensional integrals – which are often most practically computed using stochastic simulation (computation using random numbers). This is the so-called Monte Carlo method, which was developed systematically by the mathematician Stanislaw Ulam and others when trying out designs for the hydrogen bomb in the 1940s and then rapidly picked up in the worlds of physics and chemistry. The potential for these methods to solve otherwise intractable statistics problems became apparent in the 1980s, and since then each decade has seen big jumps in the sophistication of algorithms, the capacity of computers to run these algorithms in real time, and the complexity of the statistical models that practitioners are now fitting to data.
Now, don’t get me wrong – computational and algorithmic advances have become hugely important in non-Bayesian statistical and machine learning methods as well. Bayesian inference has moved, along with statistics more generally, away from simple formulas toward simulation-based algorithms.
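To illustrate the kind of stochastic simulation described above, here is a toy random-walk Metropolis sampler in Python, one of the simplest Monte Carlo algorithms for drawing from a posterior distribution known only up to a normalizing constant. The data, the assumed measurement scale, and the tuning choices are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a small sample whose mean we want to infer.
y = np.array([1.2, 0.7, 2.3, 1.8, 0.9])
sigma = 1.0      # measurement sd, assumed known for simplicity
prior_sd = 10.0  # weak Normal(0, prior_sd^2) prior on the mean

def log_post(mu):
    """Unnormalized log posterior: log likelihood plus log prior."""
    loglik = -0.5 * np.sum((y - mu) ** 2) / sigma**2
    logprior = -0.5 * mu**2 / prior_sd**2
    return loglik + logprior

# Random-walk Metropolis: propose a jump, accept with probability
# min(1, posterior ratio).  The chain's draws approximate the posterior.
draws = np.empty(20_000)
mu = 0.0
for i in range(draws.size):
    proposal = mu + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_post(proposal) - log_post(mu):
        mu = proposal
    draws[i] = mu

posterior_sample = draws[5_000:]  # discard warmup draws
print("posterior mean ~", posterior_sample.mean())
print("posterior sd   ~", posterior_sample.std())
```

Modern software such as Stan uses far more sophisticated versions of this idea, but the basic logic is the same: replace an intractable integral with averages over simulated draws.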
KG: What are its key strengths in comparison with Frequentist methods? Are there things that only Bayesian statistics can provide? What are its main drawbacks?
AG: I wouldn’t say there’s anything that only Bayesian statistics can provide. When Bayesian methods work best, it’s by providing a clear set of paths connecting data, mathematical/statistical models, and the substantive theory of the variation and comparison of interest. From this perspective, the greatest benefits of the Bayesian approach come not from default implementations, valuable as they can be in practice, but from the active process of model building, checking, and improvement. In classical statistics, improvements in methods often seem distressingly indirect – you try a new test that’s supposed to capture some subtle aspect of your data, or you restrict your parameters or smooth your weights, in some attempt to balance bias and variance. Under a Bayesian approach, all the tuning parameters are supposed to be interpretable in real-world terms, which implies – or should imply – that improvements in a Bayesian model come from, or supply, improvements in understanding of the underlying problem under study. The drawback of this Bayesian approach is that it can require a bit of a commitment to construct a model that might be complicated, and you can end up putting effort into modeling aspects of the data that maybe aren’t so relevant for your particular inquiry.
KG: Are there misunderstandings about Bayesian methods that you often encounter?
AG: Yes, but that’s a whole subject in itself – I have written papers on the topic! The only thing I’ll say here is that Bayesian methods are often characterized as ‘subjective’ because the user must choose a ‘prior distribution’, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more ‘subjective’ than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more ‘objective’ than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information.
KG: Do you think Bayesian methods will one day mostly replace Frequentist statistics?
AG: There’s room for lots of methods. What’s important in any case is what problems they can solve. We use the methods we already know and then learn something new when we need to go further. Bayesian methods offer a clarity that comes from the explicit specification of a so-called ‘generative model’ – a probability model of the data-collection process and a probability model of the underlying parameters. But construction of these models can take work, and it makes sense to me that for problems where you have a simpler model that does the job, you just go with that.
Looking at the comparison from the other direction, when it comes to big problems with streaming data, Bayesian methods are useful but the Bayesian computation can in practice only be approximate. And once you enter the zone of approximation, you can’t cleanly specify where the modeling approximation ends and the computing approximation begins. At that point, you need to evaluate any method, Bayesian or otherwise, by looking at what it does to the data, and the best available method for any particular problem might well be set up in a non-Bayesian way.
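As a rough illustration of what a ‘generative model’ means in this sense, the following Python sketch writes down one probability model for the underlying parameters and one for the data-collection process, then uses them to simulate fake survey data. The logistic-regression setup and all the numbers are hypothetical; the point is that the same specification that generates fake data is the model you would then fit:

```python
import numpy as np

rng = np.random.default_rng(7)

n_respondents = 500

# Probability model for the underlying parameters (the 'prior').
intercept = rng.normal(0.0, 1.0)
slope = rng.normal(0.0, 0.5)

# Probability model for the data-collection process (the 'likelihood'):
# a logistic regression for a yes/no survey response given a covariate x.
x = rng.normal(size=n_respondents)
p_yes = 1 / (1 + np.exp(-(intercept + slope * x)))
y = rng.binomial(1, p_yes)

print(f"simulated parameters: intercept={intercept:.2f}, slope={slope:.2f}")
print(f"simulated data: {y.sum()} 'yes' responses out of {n_respondents}")
```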
KG: Does Bayesian inference have a special role in Big Data or the Internet of Things?
AG: Yes, I think so. The essence of Bayesian statistics is the combination of information from multiple sources. We call this data and prior information, or hierarchical modeling, or dynamic updating, or partial pooling, but in any case it’s all about putting together data to understand a larger structure. Big data, or data coming from the so-called Internet of Things, are inherently messy – scraped data rather than random samples, observational data rather than randomized experiments, available data rather than constructed measurements. So statistical modeling is needed to put data from these different sources on a common footing. I see this in the analysis of internet surveys, where we use multilevel Bayesian models and non-random samples to make inferences about the general population, and the same ideas occur over and over again in modern messy-data settings.
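As a small illustration of the partial pooling mentioned above, the following Python sketch shrinks noisy group-level means (think of survey respondents grouped by region) toward an overall mean, with the amount of shrinkage determined by each group’s sample size. The groups, the data, and the variance components are hypothetical; in a real multilevel model those variance components would themselves be estimated from the data:

```python
import numpy as np

group_means = np.array([2.0, 0.5, 3.5, 1.0])  # raw mean per group (made up)
group_sizes = np.array([200, 5, 8, 150])       # very unequal sample sizes
sigma = 2.0                                    # within-group (residual) sd
tau = 0.7                                      # between-group sd

grand_mean = np.average(group_means, weights=group_sizes)

# Precision-weighted compromise between each group's own mean and the
# grand mean: small groups are pulled strongly toward the overall estimate,
# well-measured groups mostly keep their own.
prec_data = group_sizes / sigma**2
prec_prior = 1 / tau**2
pooled = (prec_data * group_means + prec_prior * grand_mean) / (prec_data + prec_prior)

for raw, n, est in zip(group_means, group_sizes, pooled):
    print(f"n={n:4d}  raw mean={raw:4.1f}  partially pooled={est:4.2f}")
```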
KG: What are the developments in Bayesian statistics that might have an impact on the behavioral and social sciences in the next few years?
AG: Lots of directions here. From the modeling direction, we have problems such as polling where our samples are getting worse and worse, less and less representative, and we need to do more and more modeling to make reasonable inferences from sample to population. For decision making we need causal inference, which typically requires modeling to adjust for differences between so-called treatment and control groups in observational studies. And just about any treatment effect we care about will vary depending on scenario. The challenge here is to estimate this variation, while accepting that in practice we will have a large residue of uncertainty. We’re no longer in the situation where ‘p < .05’ can be taken as a sign of a discovery. We need to accept uncertainty and embrace variation. And that’s true no matter how ‘big’ our data are.
KG: Thank you, Andrew!
Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy. Andrew Gelman is Professor of Statistics and Political Science and Director of the Applied Statistics Center at Columbia University. Professor Gelman is also one of the principal developers of the Stan software, which is widely used for Bayesian analysis.