Outline
- Decision Theory Review
- Regret Minimization
- The Online Learning Setting
- Discrete Case: Prediction with Expert Advice
- Realizable Case
- Nonrealizable Case
- Application: The Mind Reading Machine
- Model Combination and Hedge
Motivation
- Often, forecasting is not a single unique event, but a process that must be repeated over and over
- Forecasts must be updated in light of new data, and business decisions made on an ongoing basis
- May even need to choose a procedure in which the process is entirely automated
- Example use cases
- Web services: may need to update page every day or even every visitor
- Large retail operation: need to predict inventories for each outlet each day
- Want a procedure which is both automated and robust
- Not enough time for practitioner to apply judgment each time model must be changed
- If something goes wrong, might not be able to intervene immediately
- These goals motivate online learning methods
- A class of algorithms for making a sequence of repeated forecasting decisions
- A distinct mode of decision theoretic analysis focused on minimizing worst-case regret
Decision Theory: Review
- Choose a forecasting rule
- Observe sequence of data
- Apply forecasting rule to obtain predicted value
- Observe realized value y_{t+1}
- Evaluate by loss function
- Loss is \ell(y_{t+1},\widehat{y}_{t+1})
- Sequential setup is exactly this, but repeated for t=0,\ldots,T
- Need to compare losses across periods to compare T period sequence of decisions
Sequential Loss Evaluation
- Simplest approach: use total loss \sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1}) (or, equivalently, average loss \frac{1}{T}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1}))
- Sometimes also work with weighted average loss \sum_{t=1}^{T}w_t\ell(y_{t+1},\widehat{y}_{t+1})
- If w_t=\beta^t, \beta\in(0,1), called exponential discounting: future profits discounted at the interest rate when maximizing present value (see the code sketch at the end of this section)
- This is on top of the existing issue of needing to compare performance across the many possible sequences \mathcal{Y}_t
- Previous approaches: \mathcal{Y}_t comes from a distribution p(\mathcal{Y}_t), possibly in a family \mathcal{P}:=\{p(\mathcal{Y}_t,\theta):\ \theta\in\Theta\}
- Probability approach allows handling the fact that multiple outcomes are possible by focusing on average case loss
- Risk R_t(f)=\mathbf{E}_p[\ell(y_{t+1},f_t(\mathcal{Y}_t))]
- Statistical approach looked for low risk for every p\in\mathcal{P}
- Bayesian approach looked for low risk on average over p\in\mathcal{P}, where the average is with respect to a prior \pi(\theta)
- These approaches work fine if probability model good description of possible sequences we are likely to face
- Work best under stationarity or at least good model of changes over time
- Analyst judgment and intuition can and should be used to build model or method
- Extend to sequential setting by considering average risk = expected average loss
- \frac{1}{T}\sum_{t=1}^{T}\mathbf{E}_p[\ell(y_{t+1},\widehat{y}_{t+1})]=\mathbf{E}_p\left[\frac{1}{T}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})\right]
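As a concrete illustration of these aggregation schemes, a minimal sketch in Python (the squared-error loss and all function names here are our illustrative choices, not part of the setup above):

```python
import numpy as np

def squared_loss(y, yhat):
    """Illustrative loss function: squared error."""
    return (np.asarray(y) - np.asarray(yhat)) ** 2

def evaluate_sequence(y, yhat, beta=None):
    """Total, average, and (optionally) exponentially discounted loss
    for aligned sequences of realized values y and forecasts yhat."""
    losses = squared_loss(y, yhat)
    T = len(losses)
    total = losses.sum()
    average = total / T
    discounted = None
    if beta is not None:
        weights = beta ** np.arange(1, T + 1)  # w_t = beta^t
        discounted = (weights * losses).sum()
    return total, average, discounted

total, avg, disc = evaluate_sequence([1.0, 2.0, 0.5], [0.8, 1.5, 1.0], beta=0.95)
```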
Worst Case Analysis and No Free Lunch Theorems
- It would be even better if we had a method that does well over all relevant sequences, i.e. with small \underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\sum_{t=1}^{T}\ell(y_{t+1},\widehat{y}_{t+1})
- Without additional restrictions, impossible to get nontrivial performance bounds over all sequences
- Consider case where \mathrm{Y}=\{0,1\} and \ell(y_{t+1},\widehat{y}_{t+1})=1\{y_{t+1}\neq \widehat{y}_{t+1}\}. Total loss is number M of mistakes made.
- Given sequence \widehat{y}_{t+1}=f_{t}(\mathcal{Y}_t)\in\{0,1\}, consider sequence y_{t+1}=(1-f_t(\mathcal{Y}_t))
- Worst case total loss M=T for any method
- Motivates randomizing: data fixed, but choices random
- E.g., flip a coin every period and guess 1 for Heads, 0 for Tails (simulated at the end of this section)
- Then no matter what the sequence is, expected total loss (with respect to own choices) is \frac{T}{2}
- No Free Lunch theorems give converse relationship
- Any algorithm producing a \{0,1\} sequence has total loss \frac{T}{2} on average over all possible sequences
- Corresponds to data playing the same coinflip strategy
- Implication: without some assumptions on set of data faced, only trivial results possible
- Doing better against some sequences must mean doing worse against others
- With a little extra info, can focus on sequences that are relevant
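A quick simulation of the coin-flip argument above; the particular sequence below is arbitrary and stands in for any fixed sequence chosen in advance (a sketch under our own naming, not a prescribed method):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000

# Any fixed binary sequence, however adversarially chosen in advance
y = (np.arange(T) % 3 == 0).astype(int)

# Coin-flip forecaster: guess 1 on Heads, 0 on Tails
guesses = rng.integers(0, 2, size=T)
mistakes = np.sum(guesses != y)

print(mistakes / T)  # close to 0.5: expected total loss is T/2
```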
Regret Analysis
- Intermediate between worst and average case analysis, may single out set of sequences which matter more
- Assess relative loss in comparison to some baseline
- Let \mathcal{F}_t=\{f_{t,\theta}:\ \otimes_{s=1}^{t}\mathrm{Y}\to\mathrm{Y},\ \theta\in\Theta\} be a set of “baseline” predictors against which we want to perform well
- Let \mathcal{A} be a strategy or algorithm: sequence of decision rules \{f_t(\mathcal{Y}_t)\}_{t=1}^{T} which may depend on the observations
- The maximum total regret of strategy \mathcal{A} with respect to comparison class \Theta is
- reg_{T}(\mathcal{A})=\underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{\theta,t}(.))\right\}
- Average Regret is \frac{1}{T}reg_{T}(\mathcal{A}) (the inner quantity is computed empirically in the sketch at the end of this section)
- Because regret is defined over all sequences, no probability distribution needed over sequences
- But, it can be helpful to randomize to compete with possibly adversarial sequences
- A randomized algorithm/strategy \mathcal{A}=\{f_t(\mathcal{Y}_t,\omega_t)\}_{t=1}^{T} depends on auxiliary random variables \omega_t\sim q_t
- If adversary can condition on strategy but not on realizations of randomness, measure of loss is expected regret
- \underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\mathbf{E}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.,\omega_t))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{\theta,t}(.))\right\}
- Here, expectation is only over randomness in own decisions
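For a single realized sequence, the quantity inside the max above is straightforward to compute; a minimal sketch (the array layout and names are ours):

```python
import numpy as np

def total_regret(alg_losses, baseline_losses):
    """Realized regret on one observed sequence.

    alg_losses:      length-T array of the algorithm's per-period losses
    baseline_losses: (n_baselines, T) array; row i holds the losses of
                     baseline rule i on the same sequence

    Returns the algorithm's total loss minus that of the best baseline
    in hindsight. Maximum regret takes the max of this over sequences.
    """
    alg_total = np.sum(alg_losses)
    best_baseline_total = np.min(np.sum(baseline_losses, axis=1))
    return alg_total - best_baseline_total
```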
Interpretation of Regret
- An algorithm with low regret performs nearly as well as the best rule in the baseline set in hindsight, after the sequence was seen, for any possible sequence
- Per no free lunch result, total loss on some sequences may be (indeed, has to be) terrible
- But if regret is low, this means that best rule in comparison class also did terribly on that sequence
- Conversely, whenever some rule does well on a sequence, a low regret strategy also does well
- Low expected regret methods perform well relative to the best baseline rule on average, for any possible sequence
- Different draws of randomization may do better or worse
- Regret captures situations where evaluation is against a benchmark
- Either explicitly as it is for hedge fund managers
- Or implicitly if you are competing with a group, and you can blame others for your mistakes
- “Worldly wisdom teaches that it is better for reputation to fail conventionally than to succeed unconventionally.” Keynes (1936)
- Less cynically, useful if you care more about accuracy in some sequences than others
- Different from probability, which captures this by putting more weight (probability) on these sequences
- May care about them more either because they are more likely or because performance in that case is important
Example 1: Prediction with Expert Advice
- Want to predict sequence of binary outcomes y_{t+1}\in\{0,1\} with 0-1 loss
- User will click or not click on an ad, product will sell or not sell, recession this quarter or not
- We have access to a set of N “experts” \Theta=\{1,\ldots,N\} who each round give us their prediction f_{i,t}=\widehat{y}_{i,t+1}\in \{0,1\} of y_{t+1}
- Experts may be real human forecasters, like those surveyed by the Survey of Professional Forecasters or the Wall Street Journal Economic Survey
- Or they may be algorithms: e.g., logistic regression, some bespoke Bayesian model, or some state-of-the-art machine learning method
- Or just simple rules: Always 0, Always 1, Alternate, 1 on primes, etc
- Goal is to use all this info to make a forecast, recognizing that some experts may be more accurate than others
- Experts are “black boxes”: often impossible or too difficult to figure out where forecast comes from or why
- Only use history of past performance to evaluate
- Best expert makes M^*=\underset{i\in 1\ldots N}{\min}\sum_{t=1}^{T}1\{y_{t+1}\neq \widehat{y}_{i,t+1}\} mistakes
- Expected regret of an online learning procedure \mathcal{A} is \mathbf{E}\sum_{t=1}^{T}1\{y_{t+1}\neq \widehat{y}_{t+1}\}-M^*
- Goal is to make not many more mistakes than best expert
Example 2: Online Regression
- Let y_{t}\in\mathrm{Y} and \ell(.,.):\ \mathrm{Y}\times\mathrm{Y}\to B\subset\mathbb{R}, B a bounded set
- Boundedness may come from \mathrm{Y} bounded, as in outcomes known to take a limited range
- Needed for regret analysis, because if unbounded, worst case may achieve unbounded regret
- Worst case regret analysis useful if bad outcome is “customer doesn’t click the ad this time”
- Less so if worst outcome is “humanity perishes in nuclear war”
- Even though early theory came from David Blackwell at the RAND corporation, which supported early game and decision theory research for purpose of Cold War nuclear strategic planning
- Set \Theta=\{\beta\in\mathbb{R}^d: \left\Vert\beta\right\Vert\leq C \} and \mathcal{F}_t=\{z_t^{\prime}\beta, \beta\in\Theta\}
- z_t may be external predictors, lagged values, etc
- Regret of a procedure \mathcal{A} given by \underset{\{y_{t+1}\}_{t=1}^{T}\in\otimes_{t=1}^{T}\mathrm{Y}}{\max}\mathbf{E}\left\{\sum_{t=1}^{T}\ell(y_{t+1},f_t(.))-\underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},z_t^\prime\beta)\right\}
- Goal is to perform as well as best regression model in hindsight
Algorithms
- What kind of algorithms will produce a low regret in a particular problem?
- Ideally, we would find a method that solves the minimax regret problem \underset{\mathcal{A}}{\min}\ reg_{T}(\mathcal{A})
- Minimizes, over all possible sequences of prediction rules, the maximum over all data sequences of the loss relative to the best sequence-specific benchmark
- This is extremely hard computationally, and essentially never done directly
- Instead, introduce some popular simple methods and show that they perform reasonably
- Method may be proper: choosing an element of the comparison set f_t\in\mathcal{F}_t each time
- E.g. prediction must be same as that of a particular expert, or must be linear
- If randomized, may choose a random element of \mathcal{F}_t
- Or it may be improper: f_t\notin\mathcal{F}_t
- Using advice of all experts, combine in some way which is not the same as any particular expert
- Use possibly nonlinear function, and compare to best linear function
- Will explore examples of both types
- The improper setting, being less restricted, often yields stronger results, but properness is sometimes a genuine restriction
Realizable Case
- Some especially simple problems are realizable: the best possible predictor is in the comparison set \mathcal{F}_t
- \underset{\theta\in\Theta}{\min}\sum_{t=1}^{T}\ell(y_{t+1},f_{t,\theta})=0
- Regret minimization then equivalent to minimizing worst-case total loss, since the benchmark loss is 0
- Not realistic outside of settings where goal is to ascertain some unambiguous matter of fact
- But useful to show how analysis works
- In expert advice setting, corresponds to M^*=0: at least one expert knows the true sequence
- Goal is to find this expert quickly, making as few mistakes as possible along the way
- A possible algorithm to use here, with good properties, is Majority Vote
- Give each expert an initial weight w_{i,1}=1, then at each t, do as follows:
- Choose 1 if \sum_{i=1}^{n}w_{i,t}1\{\widehat{y}_{i,t+1}=1\}>\sum_{i=1}^{n}w_{i,t}1\{\widehat{y}_{i,t+1}=0\}, 0 otherwise
- Vote with majority of experts
- After y_{t+1} realized, set weight of any expert i which made a mistake to 0
- If we know the best rule is exact, can ignore any rule as soon as it makes a mistake (implementation sketch below)
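A minimal implementation sketch of Majority Vote under the conventions above (the array layout is our choice):

```python
import numpy as np

def majority_vote(expert_preds, outcomes):
    """Majority Vote for the realizable binary case.

    expert_preds: (T, n) array of expert predictions in {0, 1}
    outcomes:     length-T array of realized outcomes in {0, 1}

    Returns the number of mistakes made; at most log2(n) if at
    least one expert is always correct (see the analysis below).
    """
    T, n = expert_preds.shape
    w = np.ones(n)  # w_{i,1} = 1 for every expert
    mistakes = 0
    for t in range(T):
        votes_for_1 = np.sum(w * (expert_preds[t] == 1))
        votes_for_0 = np.sum(w * (expert_preds[t] == 0))
        pred = 1 if votes_for_1 > votes_for_0 else 0
        if pred != outcomes[t]:
            mistakes += 1
        # Zero out the weight of any expert that erred this round
        w[expert_preds[t] != outcomes[t]] = 0.0
    return mistakes
```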
Analysis of Majority Vote
- Let W_t=\sum_{i=1}^{n}w_{i,t} be the number of remaining experts after t steps
- If in period t a mistake is made, at least half of the remaining weight must have been on wrong experts, so W_{t+1}\leq\frac{W_t}{2}
- If M is total number of mistakes over T rounds then W_{T+1}\leq n 2^{-M}
- Further W_{T+1}\geq 1, since by realizability, at least one expert doesn’t make mistakes
- Rearranging n 2^{-M}\geq 1, M\leq \log_{2}n
- Regret satisfies bound reg_{T}(\mathcal{A})\leq\log_{2}n
- Average regret \leq\frac{\log_{2}n}{T} goes to 0 as long as n not exponential in T
- Can deal with very large number of wrong predictors and still make few mistakes
- In general, and certainly for a proper algorithm, vanishing average regret is the best that can be hoped for
- Compare with minority vote: among experts that haven’t yet made a mistake, side with the smaller group
- Each mistake could then remove as few as 1 expert, leading to up to n-1 mistakes in the worst case
Non-realizable Case
- Outside of realizable case, problem much harder
- Suppose expert set contains i,j s.t. f_{i,t}=0 \forall t and f_{j,t}=1 \forall t
- Take any algorithm depending only on past data to pick outcome \widehat{y}_{t+1}
- Then sequence which always picks y_{t+1}=1-\widehat{y}_{t+1} is possible, giving T mistakes
- Further, this sequence has either \leq\frac{T}{2} ones or \leq\frac{T}{2} zeros
- So either i or j, respectively, has loss \leq\frac{T}{2}
- Regret is \geq T-\frac{T}{2}=\frac{T}{2}: allowing comparators does not make the worst case that much better
- But, randomized algorithms can do fine in terms of expected regret
- Let p_{i,t} be probability of choosing expert i in round t
- Then Expected regret is \sum_{t=1}^{T}\sum_{i=1}^{n}p_{i,t}|y_{t+1}-\widehat{y}_{i,t+1}|-\underset{i}{\min}\sum_{t=1}^{T}|y_{t+1}-\widehat{y}_{i,t+1}|
Randomized Weighted Majority
- Simple changes to Majority Vote lead to low expected regret in non-realizable case
- Sample with probability proportional to weight
- p_{i,t}=\frac{w_{i,t}}{\sum_{i=1}^{n}w_{i,t}}
- Since even best expert may make mistakes, don’t discard completely
- Instead, if expert i makes a mistake, set w_{i,t+1}=(1-\epsilon)w_{i,t} (see the sketch at the end of this section)
- Analysis (cf. Hazan 2015) similar to majority vote: let M(i) be the number of mistakes of expert i
- W_T=\sum_{i=1}^{n}w_{i,T} has lower bound (1-\epsilon)^{M(i)} for each i, since W_T\geq w_{i,T}=(1-\epsilon)^{M(i)}
- Can show upper bound of n\exp(-\epsilon\mathbf{E}M) in terms of total mistakes
- Results imply regret \mathbf{E}M-\underset{i}{\min}M(i)\leq \epsilon T +\frac{\log n}{\epsilon}
- Choosing \epsilon=\sqrt{\frac{\log n}{T}}, upper bound is 2\sqrt{T\log n}
- Method has average expected regret \leq 2\sqrt{\frac{\log n}{T}}
- Goes to 0 with T, and depends only weakly on number of experts
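A sketch of Randomized Weighted Majority as described above (the seed argument and array layout are our choices):

```python
import numpy as np

def randomized_weighted_majority(expert_preds, outcomes, epsilon, seed=0):
    """Randomized Weighted Majority for binary prediction with 0-1 loss.

    expert_preds: (T, n) array of expert predictions in {0, 1}
    outcomes:     length-T array of realized outcomes in {0, 1}
    epsilon:      downweighting rate, e.g. sqrt(log(n) / T)

    Returns the algorithm's mistakes and the best expert's mistakes.
    """
    rng = np.random.default_rng(seed)
    T, n = expert_preds.shape
    w = np.ones(n)
    alg_mistakes = 0
    expert_mistakes = np.zeros(n)
    for t in range(T):
        p = w / w.sum()         # p_{i,t} proportional to current weights
        i = rng.choice(n, p=p)  # follow one randomly drawn expert
        alg_mistakes += int(expert_preds[t, i] != outcomes[t])
        wrong = expert_preds[t] != outcomes[t]
        expert_mistakes += wrong
        w[wrong] *= 1.0 - epsilon  # downweight erring experts, never discard
    return alg_mistakes, expert_mistakes.min()
```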
Claude Shannon’s Mind-Reading Machine
- A simple example of online binary prediction: a machine that predicts user’s next choice, left or right
- If machine guesses right, it gets a point, if wrong, user gets a point
- Goal is to build machine that can win every time
- In principle, user could guarantee \frac{T}{2} points in expectation, and hence a tie on average, just by random guessing
- In practice, humans are really bad at generating randomness without tools like coins, dice, or computers
- With an online algorithm, can build a machine that usually wins, as Claude Shannon did at Bell Labs in the 1950s (a simplified sketch follows)
Photo: Poundstone (2014)
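Shannon's actual machine tracked how the user behaved after particular win/loss patterns; the sketch below is a simplified stand-in that just counts the user's moves conditional on their last two moves (the class design is entirely ours):

```python
from collections import defaultdict
import numpy as np

class MindReader:
    """Simplified stand-in for Shannon's machine: predict the user's
    next move (0 = left, 1 = right) from counts of their past moves,
    conditional on their last two moves."""

    def __init__(self, seed=0):
        self.counts = defaultdict(lambda: [0, 0])  # context -> [#0s, #1s]
        self.context = ()
        self.rng = np.random.default_rng(seed)

    def predict(self):
        zeros, ones = self.counts[self.context]
        if zeros == ones:                  # no evidence yet: guess at random
            return int(self.rng.integers(0, 2))
        return 0 if zeros > ones else 1

    def update(self, move):
        """Record the user's realized move and advance the context."""
        self.counts[self.context][move] += 1
        self.context = (self.context + (move,))[-2:]
```

Against a human who falls into patterns, the counts become lopsided and the machine wins more than half the time; against true coin flips it only ties in expectation, consistent with the argument above.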
Hedge
- Binary case can be generalized to experts making any kind of prediction, with arbitrary non-negative loss function
- Any output: discrete, scalar, vector, etc, all permitted
- Sufficient that \ell^2(.,.)\leq B: bounded loss, e.g. due to bounded outcomes
- Hedge algorithm takes in n experts, with initial weights w_{i,0}=1
- Predict y_{t+1} using expert i forecast \widehat{y}_{i,t+1} with probability p_{i,t}=\frac{w_{i,t}}{\sum_{i=1}^nw_{i,t}}
- Observe losses \ell_{i,t}=\ell(y_{t+1},\widehat{y}_{i,t+1}) for each expert
- Update weights to w_{i,t+1}=w_{i,t}\exp(-\epsilon\ell_{i,t}) and repeat for the next forecast (implementation sketch at the end of this section)
- Analysis shows expected regret bounded by \epsilon\sum_{t=1}^{T}\sum_{i=1}^{n}p_{i,t}\ell^2_{i,t} +\frac{\log n}{\epsilon}
- Choosing \epsilon=\sqrt{\frac{\log n}{BT}}, upper bound is 2\sqrt{B T\log n}
- Regret grows sublinearly in T, depends weakly on number of experts
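A sketch of Hedge for general bounded losses, following the update rule above (the loss matrix layout and return values are our choices):

```python
import numpy as np

def hedge(loss_matrix, epsilon):
    """Hedge with n experts and arbitrary nonnegative bounded losses.

    loss_matrix: (T, n) array; entry [t, i] is the loss expert i's
                 forecast incurred in round t
    epsilon:     learning rate, e.g. sqrt(log(n) / (B * T)) when
                 squared losses are bounded by B

    Returns per-round expected losses (over the random expert draw)
    and the final weights.
    """
    T, n = loss_matrix.shape
    w = np.ones(n)
    expected_losses = np.empty(T)
    for t in range(T):
        p = w / w.sum()                            # p_{i,t}
        expected_losses[t] = p @ loss_matrix[t]    # E[loss of drawn expert]
        w = w * np.exp(-epsilon * loss_matrix[t])  # exponential downweighting
    return expected_losses, w

# Expected regret relative to the best expert in hindsight:
# hedge(L, eps)[0].sum() - L.sum(axis=0).min()
```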
Hedge: Applications
- Hedge provides alternative approach to model selection
- Can use for hyperparameter choice, as with AIC/BIC/cross-validation, etc.
- Allows combining input of large set of models of diverse types
- Choice depends on past performance only, not model structure
- Inputs can be simple or complicated, or even based on human judgment
- Can’t use AIC to compare VAR, random forest, and a CNBC pundit
- Does not require stationarity or correct specification or existence of a probability distribution over outcomes
- Does require boundedness: useful for outcomes with known range
- Guarantees performance near best in class on realized outcomes
- Contrast oracle guarantees: performance near best in class for risk
- Rapidly (exponentially fast) downweights bad models, maintains weights on those with good performance
- Essentially random near the start of a run: good performance comes from the middle to the end
- This is typical feature of online algorithms
- Choice of update speed \epsilon depends on horizon T
General Purpose Methods: Online Convex Optimization
- Can we move beyond case of finite comparison class of “experts” to methods that apply to more general variable structures and comparison classes?
- One large class of problems where this is feasible are online convex optimization problems
- If realized loss functions \ell(y_{t+1},.) are convex in f_t(.), there are general purpose methods with strong regret guarantees
- Convexity means \ell(y_{t+1},ax+(1-a)z)\leq a\ell(y_{t+1},x)+(1-a)\ell(y_{t+1},z) for any a\in[0,1]
- Bowl shaped functions
- Contains quadratic loss, absolute value loss, linear loss, etc
- Does not contain discrete case: randomization needed to make loss linear
- Hedge and online regression with convex \ell are special cases of this setup
- Next class: general algorithms and conditions
- Preview: Look a lot like (regularized) ERM
Conclusions
- Online learning methods provide principled way to evaluate and implement stream of iteratively updated predictions
- Regret based analyses ensure robustness against arbitrary sequences, relative to a set of comparison models
- An algorithm that produces low total expected regret over time can be trusted to have adequate performance regardless of what kind of new data comes in
- For finite model sets, a useful class of procedures puts weights on the models based on past performance, then chooses randomly based on weights
- Hedge algorithm takes many models or experts as inputs, outputs low expected regret sequence of predictions
- Randomization critical, as without it worst case regret is linear in T
References
- Elad Hazan, Introduction to Online Convex Optimization, Foundations and Trends in Optimization, vol. 2, no. 3-4 (2015), 157–325
- Textbook on online learning, from an optimization theory perspective
- John Maynard Keynes, The General Theory of Employment, Interest, and Money (1936)
- Aside from introducing the whole framework of 20th century macroeconomics, insightful chapters on investor psychology
- Shai Shalev-Shwartz, Online Learning and Online Convex Optimization, Foundations and Trends in Machine Learning, vol. 4, no. 2 (2011), 107–194
- Shorter introduction to online learning