Outline
- Decision Theory Review
- Regret Minimization
- The Online Learning Setting
- Discrete Case: Prediction with Expert Advice
- Realizable Case
- Nonrealizable Case
- Application: The Mind Reading Machine
- Model Combination and Hedge
Motivation
- Often, forecasting is not a single unique event, but a process that must be repeated over and over
- Forecasts must be updated in light of new data, and business decisions made on an ongoing basis
- May even need to choose a procedure in which the process is entirely automated
- Example use cases
- Web services: may need to update page every day or even every visitor
- Large retail operation: need to predict inventories for each outlet each day
- Want a procedure which is both automated and robust
- Not enough time for practitioner to apply judgment each time model must be changed
- If something goes wrong, might not be able to intervene immediately
- These goals motivate online learning methods
- A class of algorithms for making a sequence of repeated forecasting decisions
- A distinct mode of decision theoretic analysis focused on minimizing worst-case regret
Decision Theory: Review
- Choose a forecasting rule
- \(f_{t}:\otimes_{s=1}^{t}\mathrm{Y}\to\mathrm{Y}\)
- Observe sequence of data
- \(\mathcal{Y}_t=\{y_s\}_{s=1}^{t}\in \otimes_{s=1}^{t} \mathrm{Y}\)
- Apply forecasting rule to obtain predicted value
- \(\widehat{y}_{t+1}=f_{t}(\mathcal{Y}_{t})\)
- Observe realized value \(y_{t+1}\)
- Evaluate by loss function
- \(\ell(.,.): \mathrm{Y}\times\mathrm{Y}\to\mathbb{R}^+\)
- Loss is \(\ell(y_{t+1},\widehat{y}_{t+1})\)
- Sequential setup is exactly this, but repeated for \(t=0\ldots T\)
- Need to compare losses across periods to compare \(T\) period sequence of decisions
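- A minimal sketch of one round of this cycle (all pieces hypothetical: a mean-of-history rule and squared loss):

```python
# One round of the sequential forecasting setup.
def squared_loss(y, yhat):
    return (y - yhat) ** 2

def forecast_rule(history):          # f_t: here, predict the mean of past observations
    return sum(history) / len(history) if history else 0.0

history = [1.2, 0.8, 1.1]            # Y_t: data observed so far
yhat_next = forecast_rule(history)   # yhat_{t+1} = f_t(Y_t)
y_next = 0.9                         # realized y_{t+1}
loss = squared_loss(y_next, yhat_next)
history.append(y_next)               # move on to round t+1
```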
Sequential Loss Evaluation
- Simplest approach: use total loss \(\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\) (or, equivalently, average loss \(\frac{1}{T}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\))
- Sometimes also work with weighted average loss \(\sum_{t=1}^{T}w_t\ell(y_{t+1},\widehat{y}_{t+1})\)
- If \(w_t=\beta^t\), \(\beta\in(0,1)\), this is called exponential discounting: future profits are discounted at the interest rate when maximizing present value
- This is on top of existing issue of needing to compare performance across many sequences \(\mathcal{Y}_t\)
- Previous approaches: \(\mathcal{Y}_t\) comes from a distribution \(p(\mathcal{Y}_t)\), possibly in a family \(\mathcal{P}:=\{p(\mathcal{Y}_t,\theta),\theta\in\Theta\}\)
- Probability approach allows handling fact that multiple outcomes are possible by focusing on average case loss
- Risk \(R_t(f)=E_{p}[\ell(y_{t+1},f_{t}(\mathcal{Y}_t))]\)
- Statistical approach looked for low risk for every \(p\in\mathcal{P}\)
- Bayesian approach looked for low risk on average over \(p\in\mathcal{P}\), where average is with respect to prior \(\pi(\theta)\)
- These approaches work fine if the probability model is a good description of the sequences we are likely to face
- Work best under stationarity or at least good model of changes over time
- Analyst judgment and intuition can and should be used to build model or method
- Extend to sequential setting by considering average risk \(=\) expected average loss
- \(\frac{1}{T}\sum_{t=1}^{T}E_{p}[\ell(y_{t+1},\widehat{y}_{t+1})]=E_{p}[\frac{1}{T}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})]\)
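- As a quick illustration of the aggregation schemes above (made-up per-period losses), the three criteria differ only in the weights applied per period:

```python
import numpy as np

# Hypothetical per-period losses ell(y_{t+1}, yhat_{t+1}) for t = 1..T
rng = np.random.default_rng(0)
losses = rng.random(100)
T = len(losses)

total_loss = losses.sum()              # sum_t ell_t
average_loss = losses.mean()           # (1/T) sum_t ell_t

beta = 0.95                            # discount factor in (0, 1)
w = beta ** np.arange(T)               # w_t = beta^t: exponential discounting
discounted_loss = (w * losses).sum()   # sum_t w_t ell_t
```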
Worst Case Analysis and No Free Lunch Theorems
- It would be even better if we had a method that does well over all relevant sequences, i.e., with small worst-case loss \(\underset{\mathcal{Y}_T}{\max}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\)
- Without additional restrictions, impossible to get nontrivial performance bounds over all sequences
- Consider case where \(\mathrm{Y}=\{0,1\}\) and \(\ell(y_{t+1},\widehat{y}_{t+1})=1\{y_{t+1}\neq \widehat{y}_{t+1}\}\). Total loss is number \(M\) of mistakes made.
- Given sequence \(\widehat{y}_{t+1}=f_{t}(\mathcal{Y}_t)\in\{0,1\}\), consider sequence \(y_{t+1}=(1-f_t(\mathcal{Y}_t))\)
- Worst case total loss \(M=T\) for any method
- Motivates randomizing: data fixed, but choices random
- E.g. flip a coin every period and guess 1 for Heads, 0 for Tails
- Then no matter what the sequence is, expected total loss (with respect to own choices) is \(\frac{T}{2}\): see the simulation sketch at the end of this section
- No Free Lunch theorems give converse relationship
- Any algorithm producing a \(\{0,1\}\) sequence has loss \(\frac{T}{2}\) on average over all sequences
- Corresponds to data playing the same coinflip strategy
- Implication: without some assumptions on set of data faced, only trivial results possible
- Doing better against some sequences must mean doing worse against others
- With a little extra info, can focus on sequences that are relevant
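- A small simulation of the adversary argument above (a sketch using one hypothetical deterministic rule): the adversary forces any deterministic predictor to make \(T\) mistakes, while coin flipping makes about \(\frac{T}{2}\) in expectation on the same sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000

# Any deterministic rule works here; we use "predict the previous outcome".
# The adversary sets y_{t+1} = 1 - f_t(history), so every prediction is wrong.
history = []
mistakes_det = 0
for t in range(T):
    yhat = history[-1] if history else 0
    y = 1 - yhat                       # adversarial outcome
    mistakes_det += int(y != yhat)
    history.append(y)

# Random guessing against the same (now fixed) sequence: ~T/2 expected mistakes.
y_seq = np.array(history)
yhat_rand = rng.integers(0, 2, size=T)
mistakes_rand = int((yhat_rand != y_seq).sum())

print(mistakes_det, mistakes_rand)     # 1000 vs roughly 500
```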
Regret Analysis
- Intermediate between worst and average case analysis, may single out set of sequences which matter more
- Assess relative loss in comparison to some baseline
- Let \(\mathcal{F}_t=\{f_{t,\theta}(): \otimes_{s=1}^{t}\mathrm{Y}\to\mathrm{Y}:\ \theta\in\Theta\}\) be a set of “baseline” predictors against which we want to perform well
- Let \(\mathcal{A}\) be a strategy or algorithm: sequence of decision rules \(\{f_t(\mathcal{Y}_t)\}_{t=1}^{T}\) which may depend on the observations
- The maximum total regret of strategy \(\mathcal{A}\) with respect to comparison class \(\Theta\) is
- \(reg_{T}(\mathcal{A})=\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{t,\theta}(.))\right\}\)
- Average Regret is \(\frac{1}{T}reg_{T}(\mathcal{A})\)
- Because regret is defined over all sequences, no probability distribution needed over sequences
- But, it can be helpful to randomize to compete with possibly adversarial sequences
- A randomized algorithm/strategy \(\mathcal{A}=\{f_t(\mathcal{Y}_t,\omega_t)\}_{t=1}^{T}\) depends on auxiliary random variables \(\omega_t\sim q_t\)
- If adversary can condition on strategy but not on realizations of randomness, measure of loss is expected regret
- \(\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\mathbf{E}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.,\omega_t))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{t,\theta}(.))\right\}\)
- Here, expectation is only over randomness in own decisions
Interpretation of Regret
- An algorithm with low regret performs nearly as well as the best rule in the baseline set in hindsight, after the sequence was seen, for any possible sequence
- Per no free lunch result, total loss on some sequences may be (indeed, has to be) terrible
- But if regret is low, this means that best rule in comparison class also did terribly on that sequence
- Conversely, whenever some rule does well on a sequence, a low regret strategy also does well
- Low expected regret methods perform well relative to the best baseline rule on average, for any possible sequence
- Different draws of randomization may do better or worse
- Regret captures situations where evaluation is against a benchmark
- Either explicitly as it is for hedge fund managers
- Or implicitly if you are competing with a group, and you can blame others for your mistakes
- “Worldly wisdom teaches that it is better for reputation to fail conventionally than to succeed unconventionally.” Keynes (1936)
- Less cynically, useful if you care more about accuracy in some sequences than others
- Different from probability, which captures this by putting “weights” on these sequences
- May care about them more either because they are more likely or because performance in that case is important
Example 1: Prediction with Expert Advice
- Want to predict sequence of binary outcomes \(y_{t+1}\in\{0,1\}\) with 0-1 loss
- User will click/not click on ad, Product will sell or not sell, Recession this quarter or no
- We have access to a set of \(N\) “experts” \(\Theta=\{1,\ldots,N\}\) who each round give us their prediction \(f_{t,i}=\widehat{y}_{i,t+1}\in \{0,1\}\) of \(y_{t+1}\)
- Experts may be real human forecasters, like those surveyed by the Survey of Professional Forecasters or the Wall Street Journal Economic Survey
- Or they may be algorithms: e.g. logistic regression, some bespoke Bayesian model, or some state-of-the-art machine learning thing
- Or just simple rules: Always 0, Always 1, Alternate, 1 on primes, etc
- Goal is to use all this info to make a forecast, recognizing that some experts may be more accurate than others
- Experts are “black boxes”: often impossible or too difficult to figure out where forecast comes from or why
- Only use history of past performance to evaluate
- Best expert makes \(M^*=\underset{i\in 1\ldots N}{\min}\sum_{t=1}^{T}1\{y_{t+1}\neq \widehat{y}_{i,t+1}\}\) mistakes
- Expected regret of an online learning procedure \(\mathcal{A}\) is \(\mathbf{E}\sum_{t=1}^{T}1\{y_{t+1}\neq \widehat{y}_{t+1}\}-M^*\)
- Goal is to make not many more mistakes than best expert
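- A minimal sketch (simulated experts and outcomes, all values hypothetical) of computing the hindsight benchmark \(M^*\):

```python
import numpy as np

# Simulated data: N experts each predict a binary outcome every round.
rng = np.random.default_rng(0)
T, N = 200, 5
y = rng.integers(0, 2, size=T)                         # realized outcomes y_{t+1}
preds = rng.integers(0, 2, size=(T, N))                # preds[t, i] = yhat_{i,t+1}
preds[:, 0] = np.where(rng.random(T) < 0.1, 1 - y, y)  # expert 0: right 90% of the time

mistakes = (preds != y[:, None]).sum(axis=0)           # mistakes per expert
M_star = mistakes.min()                                # best expert in hindsight
print(mistakes.argmin(), M_star)
```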
Example 2: Online Regression
- Let \(y_{t}\in\mathrm{Y}\) and \(\ell(.,.):\ \mathrm{Y}\times\mathrm{Y}\to B\subset\mathbb{R}\), \(B\) a bounded set
- Boundedness may come from \(\mathrm{Y}\) bounded, as in outcomes known to take a limited range
- Needed for regret analysis, because if unbounded, worst case may achieve unbounded regret
- Worst case regret analysis useful if bad outcome is “customer doesn’t click the ad this time”
- Less so if worst outcome is “humanity perishes in nuclear war”
- Even though early theory came from David Blackwell at the RAND Corporation, which supported early game and decision theory research for the purpose of Cold War nuclear strategic planning
- Set \(\Theta=\{\beta\in\mathbb{R}^d: \left\Vert\beta\right\Vert\leq C \}\) and \(\mathcal{F}_t=\{z_t^{\prime}\beta, \beta\in\Theta\}\)
- \(z_t\) may be external predictors, lagged values, etc
- Regret of a procedure \(\mathcal{A}\) given by \(\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\mathbf{E}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.))-\underset{\beta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},z_t^{\prime}\beta)\right\}\)
- Goal is to perform as well as best regression model in hindsight
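- A sketch of this comparison under squared loss (simulated data; for simplicity the norm constraint \(\left\Vert\beta\right\Vert\leq C\) is ignored here, so the hindsight benchmark is unconstrained least squares):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 500, 3
Z = rng.normal(size=(T, d))                   # predictors z_t
y = Z @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=T)

# A (deliberately simple) online procedure: predict the running mean of y.
yhat = np.concatenate([[0.0], np.cumsum(y)[:-1] / np.arange(1, T)])

# Best linear rule in hindsight via least squares.
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
regret = ((y - yhat) ** 2).sum() - ((y - Z @ beta_hat) ** 2).sum()
print(regret)
```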
Algorithms
- What kind of algorithms will produce a low regret in a particular problem?
- Ideally, we would find the method that solves the minimax regret problem \(\underset{\mathcal{A}}{\min}\,reg_{T}(\mathcal{A})\)
- Minimizes, over all possible sequences of prediction rules, the maximum over all data sequences of the loss relative to the best sequence-specific benchmark
- This is insanely hard computationally, so in practice nobody solves it exactly
- Instead, introduce some popular simple methods and show that they perform reasonably
- Method may be proper: choosing an element of the comparison set \(f_t\in\mathcal{F}_t\) each time
- E.g. prediction must be same as that of a particular expert, or must be linear
- If randomized, may choose a random element of \(\mathcal{F}_t\)
- Or it may be improper: \(f_t\notin\mathcal{F}_t\)
- Using advice of all experts, combine in some way which is not the same as any particular expert
- Use possibly nonlinear function, and compare to best linear function
- Will explore examples of both types
- The improper setting, being less restricted, often yields stronger results, but properness is sometimes a genuine requirement of the application
Realizable Case
- Some especially simple problems are realizable: the best possible predictor is in the comparison set \(\mathcal{F}_t\)
- \(\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{t,\theta})=0\)
- Regret minimization then equivalent to minimax case
- Not realistic outside of settings where goal is to ascertain some unambiguous matter of fact
- But useful to show how analysis works
- In expert advice setting, corresponds to \(M^*=0\): at least one expert knows the true sequence
- Goal is to find this expert quickly, making as few mistakes as possible along the way
- A possible algorithm to use here, with good properties, is Majority Vote
- Give each expert an initial weight \(w_{i,1}=1\), then at each t, do as follows:
- Choose 1 if \(\sum_{i=1}^{n}w_{i,t}1\{\widehat{y}_{i,t+1}=1\}>\sum_{i=1}^{n}w_{i,t}1\{\widehat{y}_{i,t+1}=0\}\), 0 otherwise
- Vote with majority of experts
- After \(y_{t+1}\) realized, set weight of any expert \(i\) which made a mistake to 0
- If we know best rule is exact, can ignore any rule as soon as it makes a mistake
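- A minimal implementation sketch of Majority Vote (simulated data, with realizability built in):

```python
import numpy as np

def majority_vote(y, preds):
    """Majority vote with elimination, as described above.

    y: (T,) binary outcomes; preds: (T, n) expert predictions.
    Returns the algorithm's total number of mistakes."""
    T, n = preds.shape
    w = np.ones(n)                          # w_{i,1} = 1 for every expert
    mistakes = 0
    for t in range(T):
        votes_1 = w @ (preds[t] == 1)
        votes_0 = w @ (preds[t] == 0)
        yhat = 1 if votes_1 > votes_0 else 0
        mistakes += int(yhat != y[t])
        w[preds[t] != y[t]] = 0.0           # discard every expert that erred
    return mistakes

# Realizable demo: with a perfect expert among n = 64, at most log2(64) = 6 mistakes.
rng = np.random.default_rng(0)
T, n = 100, 64
y = rng.integers(0, 2, size=T)
preds = rng.integers(0, 2, size=(T, n))
preds[:, 7] = y                             # realizability: expert 7 is always right
print(majority_vote(y, preds))
```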
Analysis of Majority Vote
- Let \(W_t=\sum_{i=1}^{n}w_{i,t}\) be the total weight, i.e. the number of not-yet-eliminated experts, after \(t\) steps
- If in period \(t\) a mistake is made, at least half of the remaining experts must have been wrong, so \(W_{t+1}\leq\frac{W_t}{2}\)
- If \(M\) is total number of mistakes over \(T\) rounds then \(W_{T+1}\leq n 2^{-M}\)
- Further \(W_{T+1}\geq 1\), since by realizability, at least one expert doesn’t make mistakes
- Rearranging \(n 2^{-M}\geq 1\), \(M\leq \log_{2}n\)
- Regret satisfies bound \(reg_{T}(\mathcal{A})\leq\log_{2}n\)
- Average regret \(\leq\frac{\log_{2}n}{T}\) goes to 0 as long as \(n\) not exponential in \(T\)
- Can deal with very large number of wrong predictors and still make few mistakes
- In general, and certainly for a proper algorithm, vanishing average regret is the best that can be hoped for
- Compare method with minority vote: among those that haven’t made a mistake, choose smallest group
- Could remove as few as 1 expert per mistake, leading to up to \(n-1\) mistakes in worst case
Non-realizable Case
- Outside of realizable case, problem much harder
- Suppose expert set contains \(i,j\) s.t. \(f_{t,i}=0\) \(\forall t\) and \(f_{t,j}=1\) \(\forall t\)
- Take any algorithm depending only on past data to pick outcome \(\widehat{y}_{t+1}\)
- Then sequence which always picks \(y_{t+1}=1-\widehat{y}_{t+1}\) is possible, giving \(T\) mistakes
- Further, this sequence either has \(\leq\frac{T}{2}\) 1s or \(\leq\frac{T}{2}\) 0s
- So either \(i\) or \(j\) has loss \(\leq\frac{T}{2}\), respectively
- Regret is \(\geq T-\frac{T}{2}=\frac{T}{2}\): allowing comparators does not make the worst case that much better
- But, randomized algorithms can do fine in terms of expected regret
- Let \(p_{i,t}\) be probability of choosing expert \(i\) in round \(t\)
- Then expected regret is \(\sum_{t=1}^{T}\sum_{i=1}^{n}p_{i,t}|y_{t+1}-\widehat{y}_{i,t+1}|-\underset{i}{\min}\sum_{t=1}^{T}|y_{t+1}-\widehat{y}_{i,t+1}|\)
Randomized Weighted Majority
- Simple changes to Majority Vote lead to low expected regret in non-realizable case
- Sample with probability proportional to weight
- \(p_{i,t}=\frac{w_{i,t}}{\sum_{i=1}^{n}w_{i,t}}\)
- Since even best expert may make mistakes, don’t discard completely
- Instead, if expert \(i\) makes mistake, set \(w_{i,t+1}=(1-\epsilon)w_{i,t}\)
- Analysis (cf. Hazan 2015) similar to majority vote: let \(M(i)\) be the number of mistakes of expert \(i\)
- \(W_T=\sum_{i=1}^{n}w_{i,T}\) has lower bound \((1-\epsilon)^{M(i)}\) for any \(i\), since \(W_T\geq w_{i,T}=(1-\epsilon)^{M(i)}\)
- Can show upper bound of \(n\exp(-\epsilon\mathbf{E}M)\) in terms of the algorithm's expected total mistakes \(\mathbf{E}M\)
- Results imply regret \(\mathbf{E}M-\underset{i}{\min}M(i)\leq \epsilon T +\frac{\log n}{\epsilon}\)
- Choosing \(\epsilon=\sqrt{\frac{\log n}{T}}\) to balance the two terms (it minimizes \(\epsilon T+\frac{\log n}{\epsilon}\)), upper bound is \(2\sqrt{T\log n}\)
- Method has average expected regret \(\leq \frac{2\sqrt{\log n}}{\sqrt{T}}\)
- Goes to 0 with \(T\), and depends only weakly on number of experts
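- A sketch implementation (simulated experts and outcomes), comparing realized regret to the \(2\sqrt{T\log n}\) bound:

```python
import numpy as np

def randomized_weighted_majority(y, preds, eps, rng):
    """Follow a randomly drawn expert, with weights decayed on each mistake."""
    T, n = preds.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        p = w / w.sum()                     # p_{i,t} proportional to weight
        i = rng.choice(n, p=p)              # sample an expert to follow
        mistakes += int(preds[t, i] != y[t])
        w[preds[t] != y[t]] *= (1 - eps)    # downweight every erring expert
    return mistakes

rng = np.random.default_rng(0)
T, n = 2000, 10
y = rng.integers(0, 2, size=T)
preds = rng.integers(0, 2, size=(T, n))
preds[:, 3] = np.where(rng.random(T) < 0.3, 1 - y, y)  # expert 3: 70% accurate

eps = np.sqrt(np.log(n) / T)                # the rate from the analysis above
M = randomized_weighted_majority(y, preds, eps, rng)
M_star = (preds != y[:, None]).sum(axis=0).min()
# Realized regret vs the 2*sqrt(T log n) bound (the bound holds in expectation).
print(M - M_star, 2 * np.sqrt(T * np.log(n)))
```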
Claude Shannon’s Mind-Reading Machine
- A simple example of online binary prediction: a machine that predicts user’s next choice, left or right
- If the machine guesses right, it gets a point; if wrong, the user gets a point
- Goal is to build machine that can win every time
- In principle, the user could guarantee \(\frac{T}{2}\) points in expectation, and hence a tie, just by guessing at random
- In practice, humans are really bad at randomness without tools like coins, dice, computers
- With an online algorithm, can build a machine that usually wins, as Claude Shannon did at Bell Labs in the 1950s
[Photo: Poundstone (2014)]
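- A toy sketch in this spirit (not Shannon's actual design, which used a small state machine): randomized weighted majority over simple pattern “experts” quickly catches a predictable player:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.3

def expert_guesses(history):
    """Each expert guesses the player's next choice (0 = left, 1 = right)."""
    last = history[-1] if history else 0
    return np.array([last,      # "player repeats the last choice"
                     1 - last,  # "player alternates"
                     0, 1])     # constant guessers

# A predictable (hypothetical) human who mostly alternates.
player = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0]

history, machine_points = [], 0
w = np.ones(4)
for y in player:
    guesses = expert_guesses(history)
    yhat = guesses[rng.choice(4, p=w / w.sum())]
    machine_points += int(yhat == y)
    w[guesses != y] *= (1 - eps)            # downweight wrong experts
    history.append(y)
print(machine_points, len(player) - machine_points)  # machine vs player score
```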
Hedge
- Binary case can be generalized to experts making any kind of prediction, with arbitrary non-negative loss function
- Any output: discrete, scalar, vector, etc, all permitted
- Sufficient that \(\ell^2(.,.)\leq B\): bounded loss, e.g. due to bounded outcomes
- Hedge algorithm takes in \(n\) experts, with initial weights \(w_{i,0}=1\)
- Predict \(y_{t+1}\) using expert \(i\) forecast \(\widehat{y}_{i,t+1}\) with probability \(p_{i,t}=\frac{w_{i,t}}{\sum_{i=1}^nw_{i,t}}\)
- Observe losses \(\ell_{i,t}=\ell(y_{t+1},\widehat{y}_{i,t+1})\) for each expert
- Update weights to \(w_{i,t+1}=w_{i,t}\exp(-\epsilon\ell_{i,t})\) and repeat for next forecast
- Analysis shows expected regret bounded by \(\epsilon\sum_{t=1}^{T}\sum_{i=1}^{n}p_{i,t}\ell^2_{i,t} +\frac{\log n}{\epsilon}\)
- Choosing \(\epsilon=\sqrt{\frac{\log n}{BT}}\), upper bound is \(2\sqrt{B T\log n}\)
- Regret grows sublinearly in \(T\), depends weakly on number of experts
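- A sketch implementation of Hedge (simulated losses; in practice the loss matrix would hold each forecaster's realized losses):

```python
import numpy as np

def hedge(loss_matrix, eps, rng):
    """Hedge: follow a randomly drawn expert, exponentially downweight losses.

    loss_matrix: (T, n), loss_matrix[t, i] = ell(y_{t+1}, yhat_{i,t+1}).
    Returns the algorithm's realized per-round losses."""
    T, n = loss_matrix.shape
    w = np.ones(n)                          # w_{i,0} = 1
    alg_losses = np.empty(T)
    for t in range(T):
        p = w / w.sum()                     # p_{i,t}
        i = rng.choice(n, p=p)              # follow expert i this round
        alg_losses[t] = loss_matrix[t, i]
        w *= np.exp(-eps * loss_matrix[t])  # multiplicative weight update
    return alg_losses

rng = np.random.default_rng(0)
T, n, B = 1000, 5, 1.0                      # losses in [0, 1], so ell^2 <= B = 1
losses = rng.random((T, n))
losses[:, 2] *= 0.3                         # expert 2 is systematically better
eps = np.sqrt(np.log(n) / (B * T))          # rate from the bound above
alg = hedge(losses, eps, rng)
# Realized regret; the bound 2*sqrt(B T log n) holds in expectation.
print(alg.sum() - losses.sum(axis=0).min())
```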
Hedge: Applications
- Hedge provides alternative approach to model selection
- Can use for model and hyperparameter choice, where one might otherwise use AIC/BIC/cross-validation, etc.
- Allows combining input of large set of models of diverse types
- Choice depends on past performance only, not model structure
- Inputs can be simple or complicated, or even based on human judgment
- Can’t use AIC to compare VAR, random forest, and a CNBC pundit
- Does not require stationarity or correct specification or existence of a probability distribution over outcomes
- Does require boundedness: useful for outcomes with known range
- Guarantees performance near best in class on realized outcomes
- Contrast oracle guarantees: performance near best in class for risk
- Rapidly (exponentially fast) downweights bad models, maintains weights on those with good performance
- Essentially random near the start of a run: good performance comes from the middle to the end
- This is typical feature of online algorithms
- Choice of update speed \(\epsilon\) depends on horizon \(T\)
General Purpose Methods: Online Convex Optimization
- Can we move beyond case of finite comparison class of “experts” to methods that apply to more general variable structures and comparison classes?
- One large class of problems where this is feasible are online convex optimization problems
- If realized loss functions \(\ell(y_{t+1},.)\) are convex in \(f_t(.)\), there are general purpose methods with strong regret guarantees
- Convexity means \(\ell(y_{t+1},ax+(1-a)z)\leq a\ell(y_{t+1},x)+(1-a)\ell(y_{t+1},z)\) for any \(a\in[0,1]\)
- Bowl shaped functions
- Contains quadratic loss, absolute value loss, linear loss, etc (worked check at the end of this section)
- Does not contain discrete case: randomization needed to make loss linear
- Hedge and online regression with convex \(\ell\) are special cases of this setup
- Next class: general algorithms and conditions
- Preview: Look a lot like (regularized) ERM
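- Worked check of the convexity definition above: quadratic loss \(\ell(y,f)=(y-f)^2\) has \(\frac{\partial^2}{\partial f^2}(y-f)^2=2\geq 0\), so \(\ell(y,ax+(1-a)z)\leq a\ell(y,x)+(1-a)\ell(y,z)\) for all \(a\in[0,1]\)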
Conclusions
- Online learning methods provide principled way to evaluate and implement stream of iteratively updated predictions
- Regret based analyses ensure robustness against arbitrary sequences, relative to a set of comparison models
- An algorithm that produces low total expected regret over time can be trusted to have adequate performance regardless of what kind of new data comes in
- For finite model sets, a useful class of procedures puts weights on the models based on past performance, then chooses randomly based on weights
- Hedge algorithm takes many models or experts as inputs, outputs low expected regret sequence of predictions
- Randomization critical, as without it worst case regret is linear in \(T\)
References
- Elad Hazan, Introduction to Online Convex Optimization, Foundations and Trends in Optimization, vol. 2, no. 3-4 (2015), 157–325
- Textbook on online learning, from an optimization theory perspective
- John Maynard Keynes, The General Theory of Employment, Interest, and Money (1936)
- Aside from introducing the whole framework of 20th century macroeconomics, contains insightful chapters on investor psychology
- Shai Shalev-Shwartz, Online Learning and Online Convex Optimization, Foundations and Trends in Machine Learning, vol. 4, no. 2 (2011), 107–194
- Shorter introduction to online learning