On Online Learning for Economic Forecasts

Jérémy Fouliard, Michael Howell, and Hélène Rey have just released an update of their working paper applying methods from the field of Online Learning to forecasting financial crises, demonstrating impressive performance in a difficult forecasting domain using some techniques that appear to be underappreciated in econometrics. Francis Diebold provides discussion and perspectives. This work is interesting to me as I spent much of the earlier part of this year designing and running a course on economic forecasting which attempted to offer a variety of perspectives beyond the traditional econometric approach, including that of Online Learning.1 This perspective has been widely applied by machine learning practitioners and the businesses that employ them, particularly major web companies like Google, Yahoo, and Microsoft, but has not seen widespread take-up by more traditional economic forecasting consumers and practitioners like central banks and government institutions.

The essence of the online learning approach lies less in particular algorithms (though there are many) than in a reconsideration of the evaluation framework. Consider a quantity to be forecast \(y_{t+h}\), like an indicator equal to 1 in the presence of a financial crisis. A forecasting rule \(f(.)\) depending on currently available data \(\mathcal{Y}_t\) produces a forecast \(\widehat{y}_{t+h}=f(\mathcal{Y}_t)\) which can be evaluated ex post according to a loss function \(\ell(y_{t+h},\widehat{y}_{t+h})\) measuring how close the forecast was to being correct. Since we don’t know the true value \(y_{t+h}\) until it is observed, to choose a forecast we must instead come up with a criterion which compares rules. Traditional econometric forecasting looks at measures of statistical risk,

\[E[\ell(y_{t+h},\widehat{y}_{t+h})]\]

where expectation is taken with respect to a (usually unknown) true probability distribution. Online learning, in contrast, aims to provide estimators which have low regret over sequences of outcomes \(\{y_{t+h}\}_{t=1}^{T}\) relative to a comparison class \(\mathcal{F}\) of possible rules,

\[\text{Regret}(\{\widehat{y}_{t+h}\}_{t=1}^{T})=\sum_{t=1}^{T} \ell(y_{t+h},\widehat{y}_{t+h})-\underset{f\in\mathcal{F}}{\inf}\sum_{t=1}^{T}\ell(y_{t+h},f(\mathcal{Y}_{t}))\]
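To make the criterion concrete, here is a minimal sketch (not anything from the paper) of how regret would be computed ex post when the comparison class is a finite set of comparator forecast sequences; the squared-error loss and all function and variable names here are illustrative assumptions of mine.

```python
import numpy as np

def regret(y, y_hat, expert_forecasts, loss=lambda y, f: (y - f) ** 2):
    """Cumulative loss of the procedure minus that of the best comparator in hindsight.

    y                : (T,) realized outcomes, e.g. a 0/1 crisis indicator
    y_hat            : (T,) the procedure's forecasts
    expert_forecasts : (T, K) forecasts from each of K comparator rules
    """
    y, y_hat, F = np.asarray(y), np.asarray(y_hat), np.asarray(expert_forecasts)
    own_loss = loss(y, y_hat).sum()                       # sum_t of l(y_t, yhat_t)
    comparator_losses = loss(y[:, None], F).sum(axis=0)   # one cumulative loss per rule
    return own_loss - comparator_losses.min()
```

A low-regret procedure guarantees, before seeing the data, that this quantity grows slowly in \(T\) no matter which sequence of outcomes is realized.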

This criterion looks a little odd from the perspective of traditional forecasting rules: Diebold has expressed skepticism. First, regret is a relative rather than absolute standard; for it even to be defined you need to take a stand on the rules you might compare to, which can look somewhat arbitrary. If you choose a class of rules that predict poorly, a low-regret procedure will not do well in an absolute sense. Second, there is no expectation or probability, just a sequence of outcomes. Diebold frames this as ex ante vs ex post, since the regret cannot be computed until after the data is observed, while risk can be computed without seeing the data. But this does not accord with how regret is applied in the theoretical literature. Risk can be computed only with respect to a probability measure, which has to come from somewhere. One can build a model and ask that this be the “true” probability measure describing the sequence generating the data, but this is also unknown. To get ex ante results for a procedure, one needs to take a stand on a model or class of models. Then one can evaluate results either uniformly over models in the class (this is the classic statistical approach, used implicitly in much of the forecasting literature, like Diebold’s textbook) or on average over models, where the distribution over which one averages is called a prior distribution and leads to Bayesian forecasting. In the online learning context, in contrast, one usually seeks guarantees which apply for any sequence of outcomes, as opposed to on average over a distribution. So the results are still ex ante, with the difference being whether one needs to take a stance on a model class or a comparison class. There are reasons why one might prefer either approach. For one, standard decision theory requires the use of probability in “rational” decision making. But the probabilistic framework is often extremely restrictive in terms of the guarantees it provides on the types of situations in which a procedure will perform well. In general, one must have a model which is correctly specified, or at least not too badly misspecified.

Especially in areas where the economics is still in dispute, our confidence that the models available to us encompass all likely outcomes should perhaps not be so high. This is the content of the Queen’s question to which the title of the FHR paper refers: many or most economists before the financial crisis were using models which did not assign a particularly high probability to such an event. For that reason, a procedure which allows us to perform reasonably over any sequence of events, not just those likely with respect to a particular model class, may be particularly desirable; a procedure with a low-regret guarantee will do so, and be known to do so ex ante, as long as there is some comparator which performed well ex post. Ideally, we would like to remove that latter caveat, but as economists like to say, there is no free lunch. One can instead do analysis based on risk without assuming one has an approximately correct model; this is the content of statistical learning theory. But this usually involves introducing a comparison class of models \(\mathcal{F}\) and studying a relative criterion, the oracle risk \(E[\ell(y_{t+h},\widehat{y}_{t+h})]-\underset{f\in\mathcal{F}}{\inf}E[\ell(y_{t+h},f(\mathcal{Y}_t))]\), or variants thereof.2 This requires both a comparison class and some restrictions on distributions to get uniformity; Vapnik considered the i.i.d. case, which is unsuitable for most time series forecasting applications; extensions need some version of stationarity and/or weak dependence, or strong conditions on the class of nonstationary processes allowed, which can be problematic when one does not know what kind of distribution shifts are likely to occur.

This brings us to the content of the forecast procedures used: FHR apply classic Prediction with Expert Advice algorithms, like versions of Exponential Weights (closely related to the “Hedge” algorithm of Freund and Schapire 1997) and Online Gradient Descent (Zinkevich 2003), which take a set of forecasts and form a convex combination of them, with weights that update after each round of predictions. Diebold notes that these are essentially versions of model averaging procedures which allow for time-varying weights, which are frequently studied by econometricians, complaining that “ML types seem to think they invented everything”. To this I have two responses. First, on a credit attribution level, the online learning perspective originates in studies of sequential decision theory and game theory by people like Wald and Blackwell, squarely in the economics tradition, and the techniques became ubiquitous in the ML field through “Prediction, Learning, and Games” by Cesa-Bianchi and Lugosi, the latter of whom is in an Economics department. So if one wants to claim credit for these ideas for economics, go right ahead. Second, there are noteworthy distinctions between these ideas and statistical approaches to forecast combination. In particular, the uniformity over sequences of the regret criterion ensures not only that the weights can change over time, but that these changes can be completely arbitrary and do not have to accord with a particular model of the way in which the shift occurs. So while the approaches can be analyzed in terms of statistical properties, and may correspond to known algorithms for a particular model, the reason they are used is to ensure guarantees for arbitrary sequences, a property which is not shared by general statistical approaches. In fact, a classic result in online model combination (cf. Section 2.2 in Shalev-Shwartz) shows that some approaches with reasonable risk properties, like picking the forecast with the best performance up to the current period, can give unbounded regret for particularly bad sequences. The fact that a combination procedure adapts to these “poorly behaved” sequences is more important than the fact that it gives time-varying weights per se.
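To illustrate, here is a minimal sketch of an exponentially weighted forecast combination in the spirit of the algorithms above, not FHR’s implementation; the learning rate `eta`, the squared-error loss, and all names are assumptions of mine rather than choices from the paper.

```python
import numpy as np

def exponential_weights(y, expert_forecasts, eta=0.5,
                        loss=lambda y, f: (y - f) ** 2):
    """Exponentially weighted forecast combination (a version of Hedge).

    Each round the combined forecast is a convex combination of the K expert
    forecasts; once the outcome is revealed, each expert's weight is multiplied
    by exp(-eta * loss), shifting weight toward experts that have predicted
    well so far, on any sequence and with no model of how the data arise.
    """
    F = np.asarray(expert_forecasts)      # shape (T, K)
    T, K = F.shape
    weights = np.ones(K) / K              # start from equal weights
    combined = np.empty(T)
    for t in range(T):
        combined[t] = weights @ F[t]      # this round's convex combination
        losses = loss(y[t], F[t])         # each expert's realized loss
        weights *= np.exp(-eta * losses)  # multiplicative (exponential) update
        weights /= weights.sum()          # renormalize to a probability vector
    return combined
```

With losses bounded in \([0,1]\) and a learning rate of order \(\sqrt{\ln K/T}\), standard results (see, e.g., Cesa-Bianchi and Lugosi) bound the regret of such a scheme relative to the best single expert by a constant multiple of \(\sqrt{T\ln K}\) on every sequence, whereas a rule that simply puts all weight on the currently best-performing expert admits no such sequence-free bound.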

For these reasons, I think Online Learning approaches at minimum deserve more use in economic forecasting and I am pleased to see the promising results of FHR, as well as the growing application of minimax regret criteria in other areas of economics like inference and policy under partial identification and areas like mechanism design where providing a well-specified distribution over outcomes can be challenging.

There are still many issues that need more exploration, and there are important limitations. One thing existing online methods do not handle well is fully unbounded data; the worst case over all sequences leads to completely uninformative bounds, even for regret. For this reason, forecasting bounded indicators is a good place to start. Whether it is even possible to extend these methods to data with unknown trends is still an open question, which may limit their suitability for many types of economic data. Tuning parameter selection is a topic of active research, with plenty of work on adapting these parameters to the length of the forecasting interval and to features of the data. Methods which perform well by regret criteria but also adapt to the case in which one does have stationary data, and so could do well with a model-based algorithm, are also a potentially promising direction. If one has real confidence in a model, it makes sense to rely on it, in which case statistical approaches are fine. But for many applications where the science is less settled and one might plausibly see data that doesn’t look like any model we have written down, we should keep these methods in our toolbox.


  1. For a better overview of the field than I can provide, see the survey of Shalev-Shwartz, the monograph of Hazan, or courses by Rakhlin or Rakhlin and Sridharan.

  2. Econometricians are used to thinking of this from the perspective of misspecification a la Hal White, in which one compares to the risk under the “pseudo-true” parameter value of the best predictor in a class. An alternative, popular in machine learning, is to use a data-dependent comparator, the empirical risk, and prove bounds on the generalization gap. Here again we are effectively using the performance of a model or algorithm class as a relative measure.
