Online Learning and Regret Minimization

Outline

Decision Theory Review
Regret Minimization
The Online Learning Setting
Discrete Case: Prediction with Expert Advice
- Realizable Case
- Nonrealizable Case
- Application: The Mind Reading Machine
Model Combination and Hedge

Motivation

Often, forecasting is not a single unique event, but a process that must be repeated over and over
Forecasts must be updated in light of new data, and business decisions made on an ongoing basis
May even need to choose a procedure in which process is entirely automated
Example use cases
- Web services: may need to update page every day or even every visitor
- Large retail operation: need to predict inventories for each outlet each day
Want a procedure which is both automated and robust
- Not enough time for practitioner to apply judgment each time model must be changed
- If something goes wrong, might not be able to intervene immediately
These goals motivate online learning methods
- A class of algorithms for making a sequence of repeated forecasting decisions
- A distinct mode of decision theoretic analysis focused on minimizing worst-case regret

Decision Theory: Review

Choose a forecasting rule
- \(f_{t}:\otimes_{s=1}^{t}\mathrm{Y}\to\mathrm{Y}\)
Observe sequence of data
- \(\mathcal{Y}_t=\{y_s\}_{s=1}^{t}\in \otimes_{s=1}^{t} \mathrm{Y}\)
Apply forecasting rule to obtain predicted value
- \(\widehat{y}_{t+1}=f_{t}(\mathcal{Y}_{t})\)
Observe realized value \(y_{t+1}\)
Evaluate by loss function
- \(\ell(.,.): \mathrm{Y}\times\mathrm{Y}\to\mathbb{R}^+\)
Loss is \(\ell(y_{t+1},\widehat{y}_{t+1})\)
Sequential setup is exactly this, but repeated for \(t=0\ldots T\)
- Need to compare losses across periods to compare \(T\) period sequence of decisions

Sequential Loss Evaluation

Simplest approach: use total loss \(\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\) (or, equivalently, average loss \(\frac{1}{T}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\))
Sometimes also work with weighted average loss \(\sum_{t=1}^{T}w_t\ell(y_{t+1},\widehat{y}_{t+1})\)
- If \(w_t=\beta^t\), \(\beta\in(0,1)\), called exponential discounting: future profits discounted at interest rate if maximizing present value
This is on top of existing issue of needing to compare performance across many sequences \(\mathcal{Y}_t\)
Previous approaches: \(\mathcal{Y}_t\) comes from a distribution \(p(\mathcal{Y}_t)\), possibly in a family \(\mathcal{P}:=\{p(\mathcal{Y}_t,\theta),\theta\in\Theta\}\)
Probability approach allows handling fact that multiple outcomes are possible by focusing on average case loss
- Risk \(R_t(f)=E_{p}[\ell(y_{t+1},f_{t}(\mathcal{Y}_t))]\)
- Statistical approach looked for low risk for every \(p\in\mathcal{P}\)
- Bayesian approach looked for low risk on average over \(p\in{P}\), where average is with respect to prior \(\pi(\theta)\)
These approaches work fine if probability model good description of possible sequences we are likely to face
- Work best under stationarity or at least good model of changes over time
- Analyst judgment and intuition can and should be used to build model or method
Extend to sequential setting by considering average risk \(=\) expected average loss
- \(\frac{1}{T}\sum_{t=1}^{T}E_{p}[\ell(y_{t+1},\widehat{y}_{t+1})]=E_{p}[\frac{1}{T}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})]\)

Worst Case Analysis and No Free Lunch Theorems

It would be even better if we had a method that can do well over all relevant sequences \(\underset{\mathcal{Y}_T}{\max}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\)
Without additional restrictions, impossible to get nontrivial performance bounds over all sequences
Consider case where \(\mathrm{Y}=\{0,1\}\) and \(\ell(y_{t+1},\widehat{y}_{t+1})=1\{y_{t+1}\neq \widehat{y}_{t+1}\}\). Total loss is number \(M\) of mistakes made.
Given sequence \(\widehat{y}_{t+1}=f_{t}(\mathcal{Y}_t)\in\{0,1\}\), consider sequence \(y_{t+1}=(1-f_t(\mathcal{Y}_t))\)
- Worst case total loss \(M=T\) for any method
Motivates randomizing: data fixed, but choices random
- Eg. flip a coin every period and guess 1 for Heads, 0 for Tails
- Then no matter what sequence is, expected total loss (with respect to own choices) is \(\frac{T}{2}\)
No Free Lunch theorems give converse relationship
- Any algorithm producing a \(\{0,1\}\) sequence has loss \(\frac{T}{2}\) on average over all sequences
- Corresponds to data playing the same coinflip strategy
Implication: without some assumptions on set of data faced, only trivial results possible
- Doing better against some sequences must mean doing worse against others
- With a little extra info, can focus on sequences that are relevant

Regret Analysis

Intermediate between worst and average case analysis, may single out set of sequences which matter more
Assess relative loss in comparison to some baseline
Let \(\mathcal{F}_t=\{f_{t,\theta}(): \otimes_{s=1}^{t}\mathrm{Y}\to\mathrm{Y}:\ \theta\in\Theta\}\) be a set of “baseline” predictors against which we want to perform well
Let \(\mathcal{A}\) be a strategy or algorithm: sequence of decision rules \(\{f_t(\mathcal{Y}_t)\}_{t=1}^{T}\) which may depend on the observations
The maximum total regret of strategy \(\mathcal{A}\) with respect to comparison class \(\mathcal{\Theta}\) is
- \(reg_{T}(\mathcal{A})=\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{\theta,t}(.))\right\}\)
Average Regret is \(\frac{1}{T}reg_{T}(\mathcal{A})\)
Because regret is defined over all sequences, no probability distribution needed over sequences
- But, it can be helpful to randomize to compete with possibly adversarial sequences
A randomized algorithm/strategy \(\mathcal{A}=\{f_t(\mathcal{Y}_t,\omega_t)\}_{t=1}^{T}\) depends on auxiliary random variables \(\omega_t\sim q_t\)
If adversary can condition on strategy but not on realizations of randomness, measure of loss is expected regret
- \(\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\mathbf{E}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.,\omega_t))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{\theta,t}(.))\right\}\)
- Here, expectation is only over randomness in own decisions

Interpretation of Regret

An algorithm with low regret performs nearly as well as the best rule in the baseline set in hindsight, after the sequence was seen, for any possible sequence
Per no free lunch result, total loss on some sequences may be (indeed, has to be) terrible
But if regret is low, this means that best rule in comparison class also did terribly on that sequence
- Conversely, whenever some rule does well on a sequence, a low regret strategy also does well
Low expected regret methods perform well relative to the best baseline rule on average, for any possible sequence
- Different draws of randomization may do better or worse
Regret captures situations where evaluation is against a benchmark
- Either explicitly as it is for hedge fund managers
- Or implicitly if you are competing with a group, and you can blame others for your mistakes
“Worldly wisdom teaches that it is better for reputation to fail conventionally than to succeed unconventionally.” Keynes (1936)
Less cynically, useful if you care more about accuracy in some sequences than others
- Different from probability, which captures this by “weights” these sequences
- May care about them more either because they are more likely or because performance in that case is important

Example 1: Prediction with Expert Advice

Want to predict sequence of binary outcomes \(y_{t+1}\in\{0,1\}\) with 0-1 loss
- User will click/not click on ad, Product will sell or not sell, Recession this quarter or no
We have access to a set of \(N\) “experts” \(\Theta=\{i\in 1\ldots N\}\) who each round give us their prediction \(f_{\theta,t}=\widehat{y}_{i,t+1}\in \{0,1\}\) of \(y_{t+1}\)
- Experts may be real human forecasters, like those surveyed by the Survey of Professional Forecasters or the Wall Street Journal Economic Survey
- Or they may be algorithms: eg. logistic regression, some bespoke Bayesian model, or some state-of-the-art machine learning thing
- Or just simple rules: Always 0, Always 1, Alternate, 1 on primes, etc
Goal is to use all this info to make a forecast, recognizing that some experts may be more accurate than others
Experts are “black boxes”: often impossible or too difficult to figure out where forecast comes from or why
- Only use history of past performance to evaluate
Best expert makes \(M^*=\underset{i\in1\ldots N}{min}\sum_{t=1}^{T}1\{y_{t+1}\neq \widehat{y}_{i,t+1}\}\) mistakes
Expected regret of an online learning procedure \(\mathcal{A}\) is \(\mathbf{E}\sum_{t=1}^{T}1\{y_{t+1}\neq \widehat{y}_{t+1}\}-M^*\)
Goal is to make not many more mistakes than best expert

Example 2: Online Regression

Let \(y_{t}\in\mathrm{Y}\) and \(\ell(.,.):\ \mathrm{Y}\times\mathrm{Y}\to B\subset\mathbb{R}\), \(B\) a bounded set
Boundedness may come from \(\mathbf{Y}\) bounded, as in outcomes known to take limited range
Needed for regret analysis, because if unbounded, worst case may achieve unbounded regret
- Worst case regret analysis useful if bad outcome is “customer doesn’t click the ad this time”
- Less so if worst outcome is “humanity perishes in nuclear war”
- Even though early theory came from David Blackwell at the RAND corporation, which supported early game and decision theory research for purpose of Cold War nuclear strategic planning
Set \(\Theta=\{\beta\in\mathbb{R}^d: \left\Vert\beta\right\Vert\leq C \}\) and \(\mathcal{F}_t=\{z_t^{\prime}\beta, \beta\in\Theta\}\)
- \(z_t\) may be external predictors, lagged values, etc
Regret of a procedure \(\mathcal{A}\) given by \(\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\mathbf{E}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},z_t^\prime\beta)\right\}\)
Goal is to perform as well as best regression model in hindsight

Algorithms

What kind of algorithms will produce a low regret in a particular problem?
Ideally, we would find method that solves minimax regret problem \(\underset{\mathcal{A}}{\min}Reg_{T}(\mathcal{A})\)
Minimizes over all possible sequences of prediction rules the maximum over all sequences the loss relative to the best sequence-specific benchmark
- This is insanely hard computationally and nobody ever does it
Instead, introduce some popular simple methods and show that they perform reasonably
Method may be proper: choosing an element of the comparison set \(f_t\in\mathcal{F}_t\) each time
- E.g. prediction must be same as that of a particular expert, or must be linear
- If randomized, may choose a random element of \(\mathcal{F}_t\)
Or it may be improper: \(f_t\notin\mathcal{F}_t\)
- Using advice of all experts, combine in some way which is not the same as any particular expert
- Use possibly nonlinear function, and compare to best linear function
Will explore examples of both types
- Improper setting, which is less restricted, often yields stronger results, but properness sometimes a genuine restriction

Realizable Case

Some especially simple problems are realizable: the best possible predictor is in the comparison set \(\mathcal{F}_t\)
- \(\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{t,\theta})=0\)
- Regret minimization then equivalent to minimax case
Not realistic outside of settings where goal is to ascertain some unambiguous matter of fact
- But useful to show how analysis works
In expert advice setting, corresponds to \(M^*=0\): at least one expert knows the true sequence
- Goal is to find this expert as quickly with as few mistakes as possible
A possible algorithm to use here, with good properties, is Majority Vote
- Give each expert an initial weight \(w_{i,1}=1\), then at each t, do as follows:
- Choose 1 if \(\sum_{i=1}^{n}w_{i,t}1\{\widehat{y}_{i,t+1}=1\}>\sum_{i=1}^{n}w_{i,t}1\{\widehat{y}_{i,t+1}=0\}\), 0 otherwise
  - Vote with majority of experts
- After \(y_{t+1}\) realized, set weight of any expert \(i\) which made a mistake to 0
If we know best rule is exact, can ignore any rule as soon as it makes a mistake

Analysis of Majority Vote

Let \(W_t=\sum_{i=1}^{n}w_{i,t}\) be the number of remaining experts after \(t\) steps
If in period \(t\) a mistake is made, more than half of experts must have been wrong, so \(W_{t+1}\leq\frac{W_t}{2}\)
If \(M\) is total number of mistakes over \(T\) rounds then \(W_{T+1}\leq n 2^{-M}\)
Further \(W_{T+1}\geq 1\), since by realizability, at least one expert doesn’t make mistakes
Rearranging \(n 2^{-M}\geq 1\), \(M\leq \log_{2}n\)
- Regret satisfies bound \(reg_{T}(\mathcal{A})\leq\log_{2}n\)
Average regret \(\leq\frac{\log_{2}n}{T}\) goes to 0 as long as \(n\) not exponential in \(T\)
- Can deal with very large number of wrong predictors and still make few mistakes
In general, and certainly for a proper algorithm, vanishing average regret is best that can be hoped for
Compare method with minority vote: among those that haven’t made a mistake, choose smallest group
- Could remove as few as 1 expert per mistake, leading to up to \(n-1\) mistakes in worst case

Non-realizable Case

Outside of realizable case, problem much harder
Suppose expert set contains i,j s.t. \(f_{i,t}=0\) \(\forall t\) and \(f_{j,t}=1\) \(\forall t\)
Take any algorithm depending only on past data to pick outcome \(\widehat{y}_{t+1}\)
- Then sequence which always picks \(y_{t+1}=1-\widehat{y}_{t+1}\) is possible, giving \(T\) mistakes
Further, this sequence either has \(\leq\frac{1}{2}\) 1s or \(\leq\frac{1}{2}\) 0s
- So either \(i\) or \(j\) has loss \(\leq\frac{T}{2}\), respectively
Regret is \(\geq T-\frac{T}{2}=\frac{T}{2}\): allowing comparators does not make the worst case that much better
But, randomized algorithms can do fine in terms of expected regret
- Let \(p_{i,t}\) be probability of choosing expert \(i\) in round \(t\)
- Then Expected regret is \(\sum_{t=1}^{T}\sum_{i=1}^{n}p_{i,t}|y_{t+1}-\widehat{y}_{i,t+1}|-\underset{i}{\min}\sum_{t=1}^{T}|y_{t+1}-\widehat{y}_{i,t+1}|\)

Randomized Weighted Majority

Simple changes to Majority Vote lead to low expected regret in non-realizable case
Sample with probability proportional to weight
- \(p_{i,t}=\frac{w_{i,t}}{\sum_{i=1}^{n}w_{i,t}}\)
Since even best expert may make mistakes, don’t discard completely
- Instead, if expert \(i\) makes mistake, set \(w_{i,t+1}=(1-\epsilon)w_{i,t}\)
Analysis (cf Hazan 2015) similar to majority vote: let \(M(i)\) be # of mistakes of expert \(i\)
- \(W_T=\sum_{i=1}^{n}w_{i,T}\) has lower bound of \((1-\epsilon)^{M(i)}\)
- Can show upper bound of \(n\exp(-\epsilon\mathbf{E}M)\) in terms of total mistakes
Results imply regret \(\mathbf{E}M-\underset{i}{\min}M(i)\leq \epsilon T +\frac{\log n}{\epsilon}\)
Choosing \(\epsilon=\sqrt{\frac{\log n}{T}}\), upper bound is \(2\sqrt{T\log n}\)
Method has average expected regret \(\leq \frac{2\sqrt{\log N}}{\sqrt{T}}\)
- Goes to 0 with \(T\), and depends only weakly on number of experts

Claude Shannon’s Mind-Reading Machine

A simple example of online binary prediction: a machine that predicts user’s next choice, left or right
- If machine guesses right, it gets a point, if wrong, user gets a point
- Goal is to build machine that can win every time
In principle, user could guarantee \(\frac{T}{2}\) points and a tie just by random guessing
In practice, humans are really bad at randomness without tools like coins, dice, computers
With online algorithm, can build machine that usually wins, as Claude Shannon did at Bell labs in 1950s

Mind-reading machine, MIT museum, Photo: Poundstone (2014)

Hedge

Binary case can be generalized to experts making any kind of prediction, with arbitrary non-negative loss function
- Any output: discrete, scalar, vector, etc, all permitted
- Sufficient that \(\ell^2(.,.)\leq B\): bounded loss, eg due to bounded outcomes
Hedge algorithm takes in \(n\) experts, with initial weights \(w_{i,0}=1\)
Predict \(y_{t+1}\) using expert \(i\) forecast \(\widehat{y}_{i,t+1}\) with probability \(p_{i,t}=\frac{w_{i,t}}{\sum_{i=1}^nw_{i,t}}\)
Observe losses \(\ell_{i,t}=\ell(y_{t+1},\widehat{y}_{i,t+1})\) for each expert
Update weights to \(w_{i,t+1}=w_{i,t}\exp(-\epsilon\ell_{i,t})\) and repeat for next forecast
Analysis shows expected regret bounded by \(\epsilon\sum_{t=1}^{T}\sum_{i=1}^{n}p_{i,t}\ell^2_{i,t} +\frac{\log n}{\epsilon}\)
Choosing \(\epsilon=\sqrt{\frac{\log n}{BT}}\), upper bound is \(2\sqrt{B T\log n}\)
Regret grows sublinearly in \(T\), depends weakly on number of experts

Hedge: Applications

Hedge provides alternative approach to model selection
Can use for hyperparameter choice like AIC/BIC/Cross-validation etc
Allows combining input of large set of models of diverse types
- Choice depends on past performance only, not model structure
- Inputs can be simple or complicated, or even based on human judgment
- Can’t use AIC to compare VAR, random forest, and a CNBC pundit
Does not require stationarity or correct specification or existence of a probability distribution over outcomes
- Does require boundedness: useful for outcomes with known range
Guarantees performance near best in class on realized outcomes
- Contrast oracle guarantees: performance near best in class for risk
Rapidly (exponentially fast) downweights bad models, maintains weights on those with good performance
Essentially random near start of run: good performance coming from middle-to-end
- This is typical feature of online algorithms
- Choice of update speed \(\epsilon\) depends on horizon \(T\)

General Purpose Methods: Online Convex Optimization

Can we move beyond case of finite comparison class of “experts” to methods that apply to more general variable structures and comparison classes?
One large class of problems where this is feasible are online convex optimization problems
If realized loss functions \(\ell(y_{t+1},.)\) are convex in \(f_t(.)\), there are general purpose methods with strong regret guarantees
Convexity means \(\ell(y_{t+1},ax+(1-a)z)\leq a\ell(y_{t+1},x)+(1-a)\ell(y_{t+1},z)\) for any \(a\in[0,1]\)
- Bowl shaped functions
- Contains quadratic loss, absolute value loss, linear loss, etc
- Does not contain discrete case: randomization needed to make loss linear
Hedge and online regression with convex \(\ell\) are special cases of this setup
Next class: general algorithms and conditions
- Preview: Look a lot like (regularized) ERM

Conclusions

Online learning methods provide principled way to evaluate and implement stream of iteratively updated predictions
Regret based analyses ensure robustness against arbitrary sequences, relative to a set of comparison models
- An algorithm that produces low total expected regret over time can be trusted to have adequate performance regardless of what kind of new data comes in
For finite model sets, a useful class of procedures puts weights on the models based on past performance, then chooses randomly based on weights
- Hedge algorithm takes many models or experts as inputs, outputs low expected regret sequence of predictions
- Randomization critical, as without it worst case regret is linear in \(T\)

References

Elad Hazan Introduction to Online Convex Optimization Foundations and Trends in Optimization, vol. 2, no. 3-4 (2015) 157–325.
- Textbook on online learning, from an optimization theory perspective
John Maynard Keynes The General Theory of Employment, Interest, and Money (1936)
- Aside from introducing whole framework of 20th century macroeconomics, insightful chapters on investor psychology
Shai Shalev-Shwartz Online Learning and Online Convex Optimization Foundations and Trends in Machine Learning Vol. 4, No. 2 (2011) 107–194
- Shorter intro to online learning

Online Learning and Regret Minimization

73-423 Forecasting for Economics and Business

Outline

Motivation

Decision Theory: Review

Sequential Loss Evaluation

Worst Case Analysis and No Free Lunch Theorems

Regret Analysis

Interpretation of Regret

Example 1: Prediction with Expert Advice

Example 2: Online Regression

Algorithms

Realizable Case

Analysis of Majority Vote

Non-realizable Case

Randomized Weighted Majority

Claude Shannon’s Mind-Reading Machine

Hedge

Hedge: Applications

General Purpose Methods: Online Convex Optimization

Conclusions

References