In this post, I’d like to lay out a few questions and concerns I have about Bayesianism and Bayesian decision theory as a normative theory of inductive inference. As a positive theory of what people actually do, psychology is full of demonstrations of cases where people do not use Bayesian reasoning (the entire “heuristics and biases” literature), which is interesting but not my target. There are no new ideas here, just a summary of some old concerns which merit more consideration, and not even necessarily the most important ones, which are better covered elsewhere.
My main concerns are, effectively, computational. As I understand computer science, processing information requires real resources (mostly time, but also energy, space, etc.), and so any theory of reasoning which mandates treating as equivalent statements whose equivalence can only be demonstrated by computation is effectively ignoring real, unavoidable costs, which must have some impact on decisions. Further, as I understand it, there is no way to get around this by simply adding this cost as a component of the decision problem.1 The problem is that determining this cost and reasoning over it is itself computationally nontrivial, so the costs of that determination must also be taken into account; but determining those is also costly, ad infinitum. It may be that there is some way around this infinite regress via a fixed point argument of some kind, though it is not clear that the limit of such an argument would retain the properties of Bayesian reasoning.
The question of these processing costs becomes more interesting to the extent that they are quantitatively nontrivial. As somebody who spends hours running and debugging MCMC samplers and does a lot of reading about Bayesian computation, my takeaway from this literature is that the limits are fundamental. In particular, there are classes of distributions for which the Bayesian update step is hard, across a variety of hardness classes. This includes many distributions where the update step is NP-complete, so that our best understanding of P vs NP suggests the time to perform the update can be exponential in the size of the problem (sampling from spin glass models is an archetypal example, though really any unrestricted distribution over long strings of discrete bits will do). A kind of trivial example is a prior placing mass 1 on a single point, in which case the hardness reduces to that of the corresponding deterministic computation, and so encompasses every standard problem in computer science. Beyond exponential time (which can mean running times longer than the known history of the universe for problem sizes human beings face every day, like drawing inferences from the state of a high resolution image), some integration problems may even be uncomputable in the Turing sense, and so not just wildly impractical but impossible to implement on any physical substrate (at least if the Church-Turing hypothesis is correct). Amusingly, this extends to the problem above of determining the costs of practical transformations, since determining whether a given computation halts in finite time is itself the classic example of an uncomputable problem.
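To make the scaling concrete, here is a minimal sketch of what the exact update looks like when the state is a string of bits under a spin-glass-style prior. Everything here (the function names, the toy couplings, the squared-error likelihood) is my own illustration, not anyone’s proposed algorithm; the point is only that the normalizing sum visits every one of the 2^n configurations.

```python
import itertools
import numpy as np

def exact_posterior(n_bits, log_prior, log_lik, data):
    """Exact Bayesian update over all 2**n_bits binary configurations.

    The normalizing constant requires a sum over every configuration,
    so the cost grows as 2**n_bits: fine for a dozen bits, hopeless for
    the millions of bits in a high resolution image.
    """
    configs = list(itertools.product([0, 1], repeat=n_bits))
    log_joint = np.array([log_prior(c) + log_lik(data, c) for c in configs])
    log_joint -= log_joint.max()              # stabilize before exponentiating
    post = np.exp(log_joint)
    return configs, post / post.sum()

# Toy spin-glass-style prior with random symmetric couplings.
rng = np.random.default_rng(0)
n = 12                                        # 2**12 = 4096 states; try n = 60 and wait forever
J = rng.normal(size=(n, n)); J = (J + J.T) / 2

def log_prior(c):
    s = 2 * np.asarray(c) - 1                 # bits {0,1} -> spins {-1,+1}
    return float(s @ J @ s)

def log_lik(data, c):
    return -np.sum((np.asarray(c) - data) ** 2)   # noisy observation of the bits

data = rng.integers(0, 2, size=n).astype(float)
configs, post = exact_posterior(n, log_prior, log_lik, data)
print("posterior mode:", configs[int(np.argmax(post))])
```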
So, exact Bayesianism for all conceivable problems is physically impossible, which makes it slightly less compelling as a normative goal. What about approximations? This obviously depends on what counts as a reasonable approximation; if one accepts the topology in which all decisions are equivalent, then sure, “approximate” Bayesianism is possible. If one moves to stronger senses of approximation, such as requiring computation to within a constant factor (where this makes sense), there are inapproximability results suggesting that for many problems one still faces exponential costs. Alternately, one could think about approximation in the limit of infinite time or information; this gets into the literature on Bayesian asymptotics, though with the goal exactly reversed. Rather than attempting to show that Bayes converges to a fixed truth in the limit, one would try to show that a feasible decision procedure converges to Bayes in the limit. For the former goal, impossibility results are available in the general case, with positive results, like Schwartz’s theorem and its quantitative extensions (notes and monograph), relying on compactness conditions that are more or less unsurprising given information-theoretic minimax lower bounds on what cannot be learned in a frequentist sense. For the other direction (whatever that might mean precisely), I don’t know what results have been shown, though given the computational limitations in worst-case prior-likelihood settings, I expect that no universally applicable procedure is available.
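For reference, the flavor of the positive result, stated loosely and from memory rather than from the linked notes: if the data are i.i.d. from a density $f_0$ lying in the Kullback-Leibler support of the prior $\Pi$,

$$\Pi\left\{ f : \int f_0 \log \frac{f_0}{f} < \varepsilon \right\} > 0 \quad \text{for every } \varepsilon > 0,$$

and there exist uniformly consistent tests separating $f_0$ from the complement of a neighborhood $U$, then $\Pi(U^c \mid X_1, \ldots, X_n) \to 0$ almost surely under $f_0$; the quantitative extensions replace the testing condition with entropy (compactness-type) bounds on the model.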
What if we restrict our demands from Bayesianism for arbitrary priors and likelihoods to Bayesian methods for some reasonable prior or class of priors? In small enough restricted cases, this seems obviously feasible: we can all do Beta-Bernoulli updating at minimal cost, which is great if the only information we ever receive is a single yes/no bit. If we want Bayesianism to be a general theory of logical decision making, it probably has to go beyond that. Some people like the idea of Solomonoff induction, which proposes Bayesian inference with a “universal prior” over all computable distributions, avoiding the noncomputability problem in some sense. The prior places mass on programs decreasing exponentially in their length in bits, in the Kolmogorov complexity sense. Aside from the problem that Kolmogorov complexity is itself uncomputable, so the procedure is not computable and runs into the issues above all over again, there are some additional questions.
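As a reminder of just how cheap the conjugate case is, here is the whole of Beta-Bernoulli updating (a toy sketch of my own; the numbers are arbitrary):

```python
def beta_bernoulli_update(alpha, beta, x):
    """Conjugate update: a Beta(alpha, beta) prior and a Bernoulli observation
    x in {0, 1} give a Beta(alpha + x, beta + 1 - x) posterior, so exact
    Bayes costs one addition per bit observed."""
    return alpha + x, beta + (1 - x)

alpha, beta = 1.0, 1.0              # uniform prior on the success probability
for x in [1, 0, 1, 1]:              # the only information we ever receive: single bits
    alpha, beta = beta_bernoulli_update(alpha, beta, x)
print("posterior mean:", alpha / (alpha + beta))   # 4/6 after three ones and one zero
```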
This exponentially decreasing tail condition seems to embed the space of all programs into a hyperrectangle obeying summability conditions sufficient to satisfy Schwartz’s theorem for frequentist consistency of Bayesian inference. Hyperrectangle priors are fairly well studied in Bayesian nonparametrics: lower bounds come from the standard Assouad’s lemma construction, and upper bounds are known and in fact reasonably small: by Brown and Low’s equivalence results, the problem is equivalent to estimating a Hölder-smooth function, for which an appropriately integrated Brownian motion prior provides near-minimax rates. This seems to say that universal induction, as a frequentist problem, is slightly easier than one of the easier single problems in Bayesian nonparametrics. That seems… a little strange, maybe. One way to look at this is to accept it, and either say that the infinities contemplated by nonparametric inference are the absurd thing, or marvel that a simple Gaussian Process regression is at least as hard as learning all laws governing the behavior of all activity in the universe and be grateful that it only takes cubic time. The other alternative is to suggest that the prior, while covering the entire space in principle, satisfies a tightness condition so restrictive that it effectively confines you to a topologically meager set of programs, ruling out the vast majority of the space (in the sense of Baire category), in the same way that any two Gaussian process priors with distinct covariance functions are mutually singular. In this sense it is an incredibly strong restriction and hard to justify ex ante, certainly at least as contestable as justifying an ex-ante fixed smoothing parameter for your GP prior. (If you’ve ever run one of these, you know that’s a dig: people make so many ugly blurry maps.)
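To ground the “cubic time” remark (and the fixed-smoothing-parameter dig): the expensive step in a plain GP regression is factorizing the n-by-n kernel matrix. This is a generic numpy sketch with a squared-exponential kernel and a lengthscale fixed ex ante, not anyone’s particular implementation.

```python
import numpy as np

def gp_posterior_mean(X, y, X_star, lengthscale=0.2, noise=0.1):
    """Posterior mean of GP regression with a squared-exponential kernel.
    The Cholesky factorization of the n x n kernel matrix is the O(n^3) step."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

    K = k(X, X) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                                  # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return k(X_star, X) @ alpha

X = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * X) + 0.1 * np.random.default_rng(0).normal(size=50)
print(gp_posterior_mean(X, y, np.array([0.25, 0.75])))          # roughly 1 and -1
```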
Alternatives might arise in fixed tractable inference procedures, or in combinations of tractable procedures and models, though all of these have quite a few limitations. MCMC has the same hardness problems as above if you ask for “eventual” convergence, and fairly odd properties if run for a fixed duration, including nondeterministic outcomes and a high probability that those outcomes exhibit various results often called logical fallacies or biases (which I suppose is not surprising, since common definitions of biases or fallacies appear to essentially require Bayesian reasoning to begin with); a toy illustration of the fixed-budget point is below. Variational inference has related issues: even with the variational approximation, optimizing the modified objective may still be costly or even arbitrarily hard in some cases. Various neuroscientists seem to have taken up some forms of variational inference as a descriptive model of brain activity. Without expertise in neuroscience, I will leave well enough alone and say it seems like something that merits further empirical exploration. But to somebody who runs variational inference in practice, with mixed and sometimes surprising results and computational improvements that don’t always suggest the issue of cost is resolved, it doesn’t seem like a full solution either. I’m happy my model takes an hour rather than two days to run; I’m not sure this makes the method a compelling basis for decision-making.
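To illustrate the fixed-budget point: the same random-walk Metropolis sampler, stopped after the same number of steps, reports different answers on different seeds, and with a multimodal target tends to report whichever mode it stumbled into first. This is a toy of my own construction, not a claim about any particular software.

```python
import numpy as np

def metropolis_mean(log_post, n_steps, seed, x0=0.0, step=1.0):
    """Random-walk Metropolis run for a fixed number of steps; returns the
    running posterior-mean estimate, which is itself a random variable."""
    rng = np.random.default_rng(seed)
    x, lp, samples = x0, log_post(x0), []
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return np.mean(samples)

# Bimodal target: equal modes at -4 and +4, so the true posterior mean is 0.
log_post = lambda x: np.logaddexp(-0.5 * (x - 4) ** 2, -0.5 * (x + 4) ** 2)
print([round(metropolis_mean(log_post, 200, seed=s), 2) for s in range(5)])
```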
I was going to extend this to say something about Bayes Nash equilibrium, but my problems with that concept are largely distinct, coming from the “equilibrium” rather than the “Bayes”2 but I think I’ve conveyed my basic concerns. I don’t know that I have a compelling alternative, except to note that while an acceptable and actually feasible theory of decision making may well have internal states of some kind, I see no reason it has to have “beliefs” of any kind, at least as objects which reduce in some way to classical truth values over statements. One can simply have actions, which may on occasion correspond to binary decisions over sets that could in principle be assigned a truth value, though usually they won’t. This seems connected to the idea of lazy evaluation in nonparametric Bayes, which permits computations consistent with Bayes rule over a high-dimensional space to be retrieved via query without maintaining the full set of possible responses to such queries in memory (a sketch of what I have in mind is below). But this is only tractable, while still keeping the results consistent with Bayesian inference, for fairly limited classes of problems involving conjugacy. More generally, a theory which fully incorporates computational costs will likely have to await further developments in characterizing those costs, a still not fully solved problem in computer science.
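By lazy evaluation I mean something like the following toy (my own construction, using a GP purely because conjugacy makes it work): function values are drawn only when queried, each conditioned on the values already drawn, so the answers are jointly consistent with the full prior without the whole function ever being held in memory.

```python
import numpy as np

class LazyGP:
    """Sample a Gaussian process lazily: each query is a conjugate Gaussian
    update conditioned on the values already drawn, so the sequence of
    answers is a valid draw from the joint GP prior."""
    def __init__(self, lengthscale=1.0, seed=0):
        self.ls, self.rng = lengthscale, np.random.default_rng(seed)
        self.X, self.f = [], []                  # queried locations and sampled values

    def _k(self, a, b):
        return np.exp(-0.5 * (a - b) ** 2 / self.ls**2)

    def query(self, x_new):
        if not self.X:
            val = self.rng.normal()              # unconditional draw, unit prior variance
        else:
            X, f = np.array(self.X), np.array(self.f)
            K = self._k(X[:, None], X[None, :]) + 1e-9 * np.eye(len(X))
            k_star = self._k(X, x_new)
            w = np.linalg.solve(K, k_star)
            mean, var = w @ f, max(1.0 - k_star @ w, 1e-12)
            val = mean + np.sqrt(var) * self.rng.normal()
        self.X.append(x_new)
        self.f.append(val)
        return val

gp = LazyGP()
print([round(gp.query(x), 3) for x in [0.0, 0.1, 5.0]])   # nearby queries give correlated answers
```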
This is something like what theories of “rational inattention” do. However, in these theories information-processing costs are assessed over a channel whose input and output are both still represented as random variables. The agent is assumed to sit on one side of this channel, and so is effectively still dealing with information in fully probabilistic form, over which the optimization criterion still requires reasoning to be Bayesian. That is to say, rational inattention is a theory of the information available to an agent, not a theory of the reasoning and decision-making process given available information.↩
Roughly, even a computationally unlimited Bayesian agent could not reason itself to Bayes Nash equilibrium, unless it had the “right” priors. Where these priors come from is left unspoken (except that in the model they are “true”), which is a practical problem that drives a lot of differences between applied computational mechanism design, which is forced to answer this question, and the theory we teach our grad students.↩