Plans
- Formalize problem of forecasting using decision theory
 
- A way to analyze decision-making processes
- Tells us what a good forecasting rule looks like
 
- Not just quality of result, but quality of method producing results
 
- May or may not tell us how to find such a good rule
 
 
- Decision theory is not just one theory
- Many value judgments required
 
- Application depends on context
 
- Gives us a formal language to describe problem features that should affect decision
 
 
- Today
- Outline the setup
 
- Talk about a few ways to think about what is a good decision
 
 
- Future
- Apply these criteria to introduce different types of “good” forecasting rules, depending on criterion and context
 
 
 
Decision Theory: Setup
- Choose a forecasting rule \[f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}\]
 
- Observe sequence of data \[\mathcal{Y}_T=\{y_t\}_{t=1}^{T}\in \otimes_{t=1}^{T} \mathrm{Y}\]
 
- Apply forecasting rule to obtain predicted value \[\widehat{y}_{T+h}=f(\mathcal{Y}_T)\]
 
- Observe realized value \(y_{T+h}\)
 
- Evaluate by loss function \[\ell(.,.): \mathrm{Y}\times\mathrm{Y}\to\mathbb{R}^+\]
 
- Loss is \[\ell(y_{T+h},\widehat{y}_{T+h})\]
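This setup can be sketched in R; the squared-error loss and sample-mean rule below are illustrative choices, not prescribed by the setup:

```r
# Forecasting rule f: here, the sample mean of the observed data (illustrative)
f <- function(y_obs) mean(y_obs)

# Loss function ell(y, yhat): squared error, one common choice
loss <- function(y, yhat) (y - yhat)^2

y_obs  <- c(1.0, 1.2, 0.9, 1.1)  # observed sequence Y_T
y_next <- 1.3                    # realized value y_{T+h}

yhat <- f(y_obs)                 # predicted value: 1.05
loss(y_next, yhat)               # realized loss: (1.3 - 1.05)^2 = 0.0625
```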
 
 
Decision Theory: Goals
- Goal of decision theory is to choose a decision rule \(f\) that leads to small loss
 
- “Solution”: Pick the rule that gives the correct answer
- \(f^{*}(\mathcal{Y}_T)=y_{T+h}\) solves \(\underset{f}{\arg\min\ } \ell(y_{T+h},f(\mathcal{Y}_T))\)
 
 
- If such a rule exists and you know what it is, great
- Usually, neither is true!
 
 
 
Sequences
- Consider space of possible sequences \(\mathcal{Y}_{T+h}=\{y_t\}_{t=1}^{T+h}\in \otimes_{t=1}^{T+h} \mathrm{Y}\)
 
- \(\otimes_{t=1}^{T+h} \mathrm{Y}\) contains every possible outcome sequence
 
- Example: forecast if stock market will go up or down tomorrow, based on past moves up or down
 
- Then \(\mathrm{Y}=\{\text{up, down}\}\), and if \(T=3\), \(h=1\), the possibilities are
- {up, up, up, up}
 
- {down, down, down, down}
 
- {up, down, up, down}
 
- {up, up, up, down}
 
- etc (12 more)
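The full set of \(2^4=16\) possibilities can be enumerated in R (coding up as 1 and down as 0):

```r
# All length-4 up/down sequences, coded 1 = up, 0 = down
seqs <- expand.grid(t1 = 0:1, t2 = 0:1, t3 = 0:1, t4 = 0:1)
nrow(seqs)  # 16 possible sequences
```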
 
 
- Considering all possibilities, try to find a rule that has small loss over the different sequences
 
- Issue: with many possible sequences, different rules will have different loss on each
 
- Possible solutions: one or more of
- Decide which sequences are “more important”
 
- Change the goal of the problem
 
 
 
Sequences
# Plot the four example sequences: each row of points is one sequence,
# colored by its up (1) / down (0) values
t<-c(1,2,3,4)
y1<-c(0.75,0.75,0.75,0.75)
y2<-c(0.5,0.5,0.5,0.5)
y3<-c(0.25,0.25,0.25,0.25)
y4<-c(0,0,0,0)
# y5<-c(-0.25,-0.25,-0.25,-0.25)
# y6<-c(-0.5,-0.5,-0.5,-0.5)
# y7<-c(-0.75,-0.75,-0.75,-0.75)
# y8<-c(-1,-1,-1,-1)
seq1<-c(1,1,1,1)
seq2<-c(0,0,0,0)
seq3<-c(1,0,1,0)
seq4<-c(1,1,1,0)
# seq5<-c(0,1,1,1)
# seq6<-c(0,1,1,0)
# seq7<-c(0,1,0,1)
# seq8<-c(0,1,0,0)
dframe<-data.frame(t,y1,y2,y3,y4,seq1,seq2,seq3,seq4)
library(ggplot2)
ggplot(data=dframe,mapping=aes(x=t))+
  geom_point(aes(y=y1,color=seq1),size=10)+
  geom_point(aes(y=y2,color=seq2),size=10)+
  geom_point(aes(y=y3,color=seq3),size=10)+
  geom_point(aes(y=y4,color=seq4),size=10)+
  theme(panel.background = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(),legend.position = "none")+
  labs(x=" ", y=" ",title="Example Sequences")

 
Preview of approaches
- Different decision theoretic philosophies approach tradeoffs in different ways
 
- Average vs worst case
- Can try to do reasonably well in any situation, even the worst
 
- Can try to do well over many situations, with different priority given to each
- With different degrees of prioritization
 
 
 
- Absolute vs relative
- Can use a criterion which judges forecast only by its own loss
 
- Can instead try to do well relative to some benchmark (which may do better or worse depending on the situation)
 
 
- Average case approaches can fail badly in cases they assign low weight to
 
- Worst case methods will never do terribly, but may never do particularly well either
 
- Absolute criteria try to do okay in any situation
 
- Relative criteria are allowed to perform badly in “hard” cases, but must do well in easy cases
 
 
A Crude Classification of Philosophies
library(knitr)
library(kableExtra)
text_tbl <- data.frame(
  Criteria = c("Absolute Criterion", "Relative Criterion"),
  Worst.Case = c(
    "Adversarial Prediction", 
    "Online Learning "
    ),
  Average.Average.Case = c(
    "Bayesianism",
    " Bayesian Regret* " 
    ),
  Worst.Average.Case = c(
    "Minimax Theory",
    "Statistical Learning" 
    )  
)
kable(text_tbl) %>%
  kable_styling(full_width = F) %>%
  column_spec(1,border_right = T) %>%
  column_spec(2) %>%
  add_header_above(c(" " = 2, "Average Case" = 2))
| Criteria           | Worst Case             | Average-Average Case | Worst-Average Case   |
|--------------------|------------------------|----------------------|----------------------|
| Absolute Criterion | Adversarial Prediction | Bayesianism          | Minimax Theory       |
| Relative Criterion | Online Learning        | Bayesian Regret*     | Statistical Learning |
*rarely used, so will not be covered
 
Average case analysis: probability
- One way to quantify “more important” is to give a weighting to each sequence
 
- Most commonly, this is done by means of probability
- Roughly, a set of nonnegative weights that sum to 1
 
 
- Probability assigns to each subset \(A\subseteq\Omega\) of a (discrete) set \(\Omega\) a number \(p(A)\) such that
- \(0\leq p(A) \leq 1\)
 
- \(p(\Omega)=1\)
 
- If \(A_1,A_2,\ldots\) are disjoint sets in \(\Omega\), \(p(\cup_{j}A_j)=\sum_j p(A_j)\)
 
 
- For continuous case, probability defined by a density \(p()\) which integrates to 1
 
- For forecasting, the set over which we define a probability is the set of sequences \(\otimes_{t=1}^{T+h}\mathrm{Y}\)
 
- Once a probability is defined, a sequence is a random variable
- A function of the event \(\omega\)
 
 
- E.g. might give weights \((1/4,1/4,1/4,1/4)\) to sequences
- \(\{1,1,1,1\}\), \(\{0,0,0,0\}\), \(\{1,0,1,0\}\), \(\{1,1,1,0\}\)
 
- Any two of these sequences have combined probability \(1/2\), any three \(3/4\), etc
 
 
 
Expected values and Risk
- We consider sequences with higher weight \(p\) more important to worry about
 
- Typical application is to calculate expected value (with respect to \(p\)) of some function of the sequence \[E_p[g(\mathcal{Y}_{T+h})]:=\sum_{\omega\in\Omega} p(\omega)g(\mathcal{Y}_{T+h}(\omega))\]
 
- For continuous case, use integral \[E_p[g(\mathcal{Y}_{T+h})]:=\int_{\omega\in\Omega} g(\mathcal{Y}_{T+h}(\omega))p(\omega)d\omega\]
 
- Most important expectation is expectation of loss function, called the risk \[R_p(f):=E_p[\ell(y_{T+h},f(\mathcal{Y}_T))]\]
 
- Given a distribution, we can now compare forecast methods based on a single number
- Smaller risk means smaller loss, on average
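As a sketch, risk can be computed directly for the four-sequence example above (each sequence with probability 1/4, 0-1 loss); the two rules compared are illustrative:

```r
# The four sequences from the example, each with probability 1/4
seqs <- list(c(1,1,1,1), c(0,0,0,0), c(1,0,1,0), c(1,1,1,0))
p <- rep(1/4, 4)
loss01 <- function(y, yhat) as.numeric(y != yhat)

# Risk: expected 0-1 loss of rule f over the distribution on sequences
risk <- function(f) {
  sum(p * sapply(seqs, function(s) loss01(s[4], f(s[1:3]))))
}

risk(function(y_obs) tail(y_obs, 1))  # "predict the last value": risk 1/2
risk(function(y_obs) 1)               # "always predict up": risk 3/4
```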
 
 
 
Average-Case decision-making
- Choosing a rule which is low risk tells us that our guess will do well across possible sequences
 
- Weights tell us about subjective importance of different sequences
 
- High probability sequences considered “more likely”
- One interpretation: given a large number of observed sequences, fraction \(p(A)\) will be in set \(A\)
 
- In general, do not actually observe many sequences, so this is purely theoretical
 
 
- You may not do well in every case, but you do relatively well in cases that are “more common”
 
- Risk, with distribution \(p\), allows us to define a best possible outcome, called the Bayes Risk \[R_p^{*}:=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\inf}R_p(f)\]
 
- If a minimizer exists, it is called the Bayes Forecast
 
 
Bayes forecasts
- When Bayes forecast exists, defined as average-case optimal forecast
- \(f_p^*(.):=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}E_p\ell(y_{T+h},f(\mathcal{Y}_T))\)
 
 
- Analysis can be simplified using conditional probability
- For any two events \(A,B\), \(p(B|A):=\frac{p(B\cap A)}{p(A)}\)
 
- Conditional densities \(p(y|x):=\frac{p(y,x)}{p(x)}\)
 
 
- Can decompose multivariate expectation as average of conditional expectation
- \(E_p[g(x,y)]=\int\int g(x,y)p(y|x)p(x)dydx=E_{p(x)}[E_{p(y|x)}[g(x,y)|x]]\)
 
 
- Applying to risk, and considering \(h=1\) for simplicity
- \(f_p^*(.):=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}\int\int\ell(y_{T+1},f(\mathcal{Y}_T))p(y_{T+1}|\mathcal{Y}_T)p(\mathcal{Y}_T)dy_{T+1}d\mathcal{Y}_T\)
 
- \(=\int\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}[\int\ell(y_{T+1},f(\mathcal{Y}_T))p(y_{T+1}|\mathcal{Y}_T)dy_{T+1}]p(\mathcal{Y}_T)d\mathcal{Y}_T\)
 
 
- So \(f_p^*(\mathcal{Y}_T)=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}[\int\ell(y_{T+1},f(\mathcal{Y}_T))p(y_{T+1}|\mathcal{Y}_T)dy_{T+1}]\)
- Bayes forecast minimizes conditional risk: conditional expectation of loss given observed data
 
 
 
Application: Sequences
- Go back to the \(\{\text{up},\text{down}\}\) prediction problem
 
- Sequences \(\{1,1,1,1\}\), \(\{0,0,0,0\}\), \(\{1,0,1,0\}\),\(\{1,1,1,0\}\) w/ prob \((1/4,1/4,1/4,1/4)\)
 
- Let loss function be 0-1 loss: \(\ell(y,\widehat{y})=1\{y\neq\widehat{y}\}\)
 
- Suppose we observe \(\mathcal{Y}_T=\{1,0,1\}\)
- Then \(p(y_{T+1}=0|\mathcal{Y}_T)=\frac{p(1,0,1,0)}{p(1,0,1,0)+p(1,0,1,1)}=\frac{1/4}{1/4+0}=1\)
 
- \(p(y_{T+1}=1|\mathcal{Y}_T)=\frac{p(1,0,1,1)}{p(1,0,1,0)+p(1,0,1,1)}=\frac{0}{1/4+0}=0\)
 
 
# After conditioning on observing {1,0,1}: only seq3 remains; others dimmed
ggplot(data=dframe,mapping=aes(x=t))+
  geom_point(aes(y=y1,color=seq1),size=10,alpha=0.1)+
  geom_point(aes(y=y2,color=seq2),size=10,alpha=0.1)+
  geom_point(aes(y=y3,color=seq3),size=10)+
  geom_point(aes(y=y4,color=seq4),size=10,alpha=0.1)+
  theme(panel.background = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(),legend.position = "none")+
  labs(x=" ", y=" ",title="Sequences After Conditioning") 
 
 
Sequence Example, ctd
- Conditional Risk is \(1\times \ell(0,f(\mathcal{Y}_T))+0\times \ell(1,f(\mathcal{Y}_T))=\ell(0,f(\mathcal{Y}_T))=1\{0\neq f(\mathcal{Y}_T)\}\)
 
- Minimize by setting \(f_p^*(\mathcal{Y}_T)=0\)
 
- Similarly, for \(\mathcal{Y}_T=\{0,0,0\}\), \(f_p^*(\mathcal{Y}_T)=0\)
 
- Observing \(\mathcal{Y}_T=\{1,1,1\}\), conditional risk is \(\frac{1}{2}\times 1\{0\neq f(\mathcal{Y}_T)\}+\frac{1}{2}\times 1\{1\neq f(\mathcal{Y}_T)\}\)
- Since multiple outcomes given equal weight, any prediction \(f_p^*(\mathcal{Y}_T)\in\{0,1\}\) is optimal
 
 
# After conditioning on observing {1,1,1}: seq1 and seq4 remain, equally likely
ggplot(data=dframe,mapping=aes(x=t))+
  geom_point(aes(y=y1,color=seq1),size=10,alpha=0.7)+
  geom_point(aes(y=y2,color=seq2),size=10,alpha=0.1)+
  geom_point(aes(y=y3,color=seq3),size=10,alpha=0.1)+
  geom_point(aes(y=y4,color=seq4),size=10,alpha=0.7)+
  theme(panel.background = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(),legend.position = "none")+
  labs(x=" ", y=" ",title="Sequences After Conditioning") 
 
- For outcomes with probability 0, like \(\{0,1,0\}\), again any prediction is optimal: changing it has no effect on risk
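The conditional probabilities driving the Bayes forecast in this example can be computed directly; a minimal sketch:

```r
# Distribution over the four example sequences, each with probability 1/4
seqs <- list(c(1,1,1,1), c(0,0,0,0), c(1,0,1,0), c(1,1,1,0))
p <- rep(1/4, 4)

# p(y_{T+1} = y_next | observed first three values y_obs)
cond_prob <- function(y_obs, y_next) {
  match_obs  <- sapply(seqs, function(s) all(s[1:3] == y_obs))
  match_next <- sapply(seqs, function(s) s[4] == y_next)
  sum(p[match_obs & match_next]) / sum(p[match_obs])
}

cond_prob(c(1,0,1), 0)  # 1: after {1,0,1} the Bayes forecast is 0
cond_prob(c(1,1,1), 0)  # 0.5: after {1,1,1} either prediction is optimal
```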
 
 
Worst case analysis of sequences
- To express extreme agnosticism, can try to find methods that work for all sequences, rather than on average over a distribution on sequences
 
- One potentially desirable goal would be to find a rule which performs well for all sequences in \(\otimes_{t=1}^{T+h} \mathrm{Y}\)
 
- That is, consider worst case loss over all sequences of decision rule \(f\) \[L^{\max}(f)=\underset{\mathcal{Y}_{T+h}\in\otimes_{t=1}^{T+h} \mathrm{Y}}{\sup}\ell(y_{T+h},f(\mathcal{Y}_T))\]
 
- This criterion then tells you loss of a “best” rule \[L^{\min\max}=\underset{f}{\inf}\underset{\mathcal{Y}_{T+h}\in\otimes_{t=1}^{T+h} \mathrm{Y}}{\sup}\ell(y_{T+h},f(\mathcal{Y}_T))\]
 
- In words, this says: pick a rule that has the best worst case outcome over any sequence
 
- This can be wildly pessimistic
- Does well in sequences where loss might be high, but may do quite poorly in sequences where loss may be low
 
 
- Also may not define rule uniquely
 
 
Adversarial Decision-Making
- There are ways to consider worse cases than even the worst sequence
 
- May have an outcome that responds to decision rule chosen, in a way which is adversarial
- The sequence adjusts to make outcome as bad as possible.
 
 
- This kind of analysis might make sense if outcome decided by malicious adversary
- A competitor (in business, politics, or games) wants you to do poorly
 
- Indeed, it comes from game theory, where this is the situation
 
 
- Standard in game theory to work with slightly modified spaces of rules: randomized strategies
- Choose a random rule, and allow the adversary to respond only to its distribution, not the realized draw
 
- Goal is to minimize expected loss, over own distribution choice \(\widehat{y}_{T+h}\sim \widehat{p}_{T+h}()\)
 
- Adversary also allowed to randomize \(y_{T+h}\sim q_{T+h}\) \[L^{Adversarial}= \underset{\widehat{p}_{T+h}()}{\inf}\underset{q_{T+h}}{\sup}E_{\widehat{p}_{T+h}(),q_{T+h}}\ell(y_{T+h},\widehat{y}_{T+h})\]
 
 
- In this setting, randomized algorithms can help by restricting the adversary’s ability to respond
 
 
Example: Sequences
- Let loss function be 0-1 loss: \(\ell(y,\widehat{y})=1\{y\neq\widehat{y}\}\)
 
- If all possible length-4 \(\{0,1\}\) sequences are in the set considered, then for any rule it is always possible that the next outcome is the opposite of the prediction
 
- \(L^{\min\max}=1\)
- Cannot prevent mistakes in at least some cases
 
 
- \(L^{Adversarial}=\frac{1}{2}\)
- By guessing at random, any strategy of adversary will give wrong guess 1/2 the time
 
- But cannot do better than guessing at random
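A quick check of the adversarial value: if the forecaster guesses 1 with probability 1/2, the expected 0-1 loss is 1/2 no matter what distribution \(q\) the adversary picks:

```r
# Expected 0-1 loss of a uniform random guess when the adversary plays
# y = 1 with probability q: q * P(guess 0) + (1 - q) * P(guess 1)
expected_loss <- function(q) q * 0.5 + (1 - q) * 0.5

sapply(c(0, 0.3, 0.5, 1), expected_loss)  # 0.5 for every q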
 
 
 
Online Learning and Regret Minimization
- Worst case and adversarial analysis do not let you take advantage of particular features of sequence
- Usually obtain trivial results
 
 
- Want good relative performance over sequences
- Attempt to do better on some sequences while allowing doing worse on others
 
 
- Asks for regret of learning rule
- Given set \(\mathcal{F}\) of comparison rules \(f():\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}\)
 
- Want rule which does not much worse than the rule in that set which is best ex post, after sequence is observed
 
- Usually compare over full sequence, where rule is updated each period \[Regret(\{f_t\}_{t=1}^{T})=\sum_{t=1}^{T}\ell(y_{t},f_t(\mathcal{Y}_{t-1}))-\underset{f\in\mathcal{F}}{\inf}\sum_{t=1}^{T}\ell(y_{t},f(\mathcal{Y}_{t-1}))\]
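A small illustration of this definition, comparing a “predict the previous value” rule against the class of constant rules on one made-up sequence (all choices here are hypothetical):

```r
y <- c(1, 0, 1, 1, 1, 0, 1, 1)  # an arbitrary observed sequence
loss01 <- function(y, yhat) as.numeric(y != yhat)

# Cumulative loss of "predict the previous value" (predicting 1 at t = 1)
pred <- c(1, head(y, -1))
L_rule <- sum(loss01(y, pred))

# Ex post best rule in F = {always predict 0, always predict 1}
L_best <- min(sum(loss01(y, 0)), sum(loss01(y, 1)))

L_rule - L_best  # regret: 4 - 2 = 2
```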
 
 
 
Applying Online Learning
- Usual goal is to minimize maximum regret over any possible sequence \(\{y_t\}_{t=1}^{T}\)
- Rules \(f_t\) are allowed to adapt to the sequence, and in some cases, be randomized
 
 
- Sequence may be allowed to adapt to the (distribution of) rules
 
- Best possible result is
- \(\underset{f_1\sim p_1}{\inf}\underset{y_1\sim q_1}{\sup}\ldots\underset{f_T\sim p_T}{\inf}\underset{y_T\sim q_T}{\sup}Regret(\{f_t\}_{t=1}^{T})\)
 
 
- Smaller comparator class makes low regret easier, but allows worse absolute performance
 
- Could find good rule by solving long sequence of optimization problems
 
- In practice, some very simple rules are used, and shown to work well
 
 
Refining the average case approach: Probability Models
- Given a distribution \(p\), Bayes forecast tells us exactly what to do to get low risk
 
- Problem: Where does \(p\) that we are averaging over come from?
- Need background knowledge on which outcomes are likely
 
 
- Can we relax this requirement?
 
- One way: allow the probability distribution to be unknown
 
- Allow a family of probability distributions
- \(\{p(\mathcal{Y}_{T+h},\theta)\}_{\theta\in\Theta}\)
 
 
- Call such a family a (probability) model
 
- Different values of \(\theta\) can correspond to different features of sequence
- E.g., higher or lower values, trend, seasonality, variance and autocovariance, etc., can be more or less likely
 
 
- With many probability distributions \(p_{\theta}\), have not one but many risk functions
- \(R_{p_{\theta}}(f):=\int \ell(y_{T+h},f(\mathcal{Y}_T))p(\mathcal{Y}_{T+h},\theta)d\mathcal{Y}_{T+h}\)
 
 
- Now different forecast rules will have risk which depends on \(\theta\)
 
 
Example Probability Models
- Any distribution over sequences can be part of a probability model
 
- Most general model: Set of all probability distributions
- Seems very flexible, but will be too large for most things we want to do with a model
 
 
- Alternative: all probability distributions that factorize
- \(\Theta=\{p: p(\mathcal{Y}_{T+h})=\Pi_{t=1}^{T+h}p(y_t)\}\)
 
- Such distributions are called iid - independent and identically distributed
 
- Almost every stats and ML class you take looks only at these
 
- In time series, called “White Noise”: uninteresting model
 
 
- Simple non-iid model: Normal-AR(1) Model
- Let \(\phi(x)=\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2)\) be standard normal density
 
- \(p(\mathcal{Y}_{T+h},\theta)=\phi(y_1)\cdot\Pi_{t=2}^{T+h}\phi(y_t-\theta y_{t-1})\)
 
- Recursively, \(y_{t}=\theta y_{t-1}+e_t\), \(e_t\overset{iid}{\sim} N(0,1)\)
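The recursion makes the model easy to simulate; a short sketch:

```r
# Simulate the Normal-AR(1) model: y_t = theta * y_{t-1} + e_t, e_t ~ N(0,1)
set.seed(42)
theta <- 0.8
T_len <- 200
y <- numeric(T_len)
y[1] <- rnorm(1)                 # y_1 ~ N(0,1), as in the density above
for (t in 2:T_len) y[t] <- theta * y[t - 1] + rnorm(1)

cor(y[-T_len], y[-1])  # sample lag-1 autocorrelation, roughly theta
```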
 
 
- Will introduce many more models later in class
- Many related to simple forecasting rules we have covered
 
 
 
Working with probability models
- How can we assess forecasts with many distributions?
- No longer a single fixed standard
 
 
- Need some way of evaluation across distributions
 
- Things to consider
- Set of rules
 
- Set of risk functions
 
- Combination of different risk functions
 
 
- Different choices generate different approaches
- Minimax Theory
 
- Statistical Learning Theory
 
- Bayesianism
 
 
 
Minimax Theory
- Suppose we have a model class we think is useful
 
- But want to remain completely agnostic on which model in class is worthwhile
 
- Can evaluate forecast rule \(f\) by Maximum Risk \[\underset{\theta\in\Theta}{\sup}\ R_{p_{\theta}}(f)\]
 
- A decision rule which performs well according to this criterion gives good performance on average for any distribution in a class
 
- Gives a single measure of quality, which averages over sequences according to a distribution
- But considers the worst case distribution when doing so
 
 
 
Applying Minimax Theory
- With a single measure of quality, easy to define a “best” result: the Minimax risk \[R^{\text{Minimax}}=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\inf}\underset{\theta\in\Theta}{\sup} R_{p_{\theta}}(f)\]
 
- Size of set \(\{p_{\theta}\}_{\theta\in\Theta}\) determines tradeoff
- Large set of models implies good performance in more situations
 
- Small set of models improves performance within that set, since don’t have to worry about as many models
 
 
- In principle, could perform optimization and find minimizer
- In practice, hard to find, but common to come up with reasonable looking rule and show it comes close to minimizing
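As an illustration (not from the slides): under the Normal-AR(1) model with squared loss and rules of the form \(f(\mathcal{Y}_T)=a\,y_T\), the risk is \((\theta-a)^2\mathrm{Var}(y_T)+1\), and an approximate minimax choice of \(a\) can be found on a grid:

```r
# Risk of rule f(Y_T) = a * y_T under Normal-AR(1) with squared loss:
# E[(y_{T+1} - a * y_T)^2] = (theta - a)^2 * Var(y_T) + 1
T_len <- 10
var_yT <- function(theta) sum(theta^(2 * (0:(T_len - 1))))  # since y_1 ~ N(0,1)
risk <- function(a, theta) (theta - a)^2 * var_yT(theta) + 1

thetas <- seq(-0.9, 0.9, by = 0.01)  # the model class Theta
max_risk <- function(a) max(sapply(thetas, function(th) risk(a, th)))

a_grid <- seq(-1, 1, by = 0.05)
a_star <- a_grid[which.min(sapply(a_grid, max_risk))]
a_star  # approximately 0: hedges against both signs of theta
```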
 
 
 
Statistical Learning
- Typical frequentist estimation tries to find methods which work well for all distributions in a class
 
- Sometimes helpful to compare predictions to best possible in some class \(\mathcal{F}\)
- So-called “oracle predictor”
 
- \(R_{p}^{\text{Oracle}}=\underset{f\in \mathcal{F}}{\inf}E_{p}\ell(y_{T+h},f(\mathcal{Y}_T))\) is best risk achievable with p known
 
 
- Goal is to find a procedure which does nearly as well as the oracle, even when p isn’t known
 
- Worst case relative risk of \(f\) is \[\underset{{\theta}\in\Theta}{\sup}(E_{p_{\theta}}\ell(y_{T+h},f(\mathcal{Y}_T))-R_{p_{\theta}}^{\text{Oracle}})\]
 
 
Applying Statistical Learning
- Relative risk criterion allows for bad performance in hard cases where oracle also fails
 
- Contrast to minimax risk
- Asks for good performance in these cases, which may degrade performance in easier cases
 
 
- Smaller set \(\mathcal{F}\) allows for better relative performance
- But allows for worse absolute performance
 
 
- Larger set \(\Theta\) makes method more robust to different model features
- In ML, often allow all iid distributions
 
- Can extend result to larger classes to accommodate time series
- But smaller than all distributions over sequences
 
 
 
- Again, could simply optimize directly to find a good rule
- In practice, pick simple method and prove it comes close
 
 
 
Average case over distributions
- Again consider case where we have a probability model \(\{p(\mathcal{Y}_{T+h},\theta)\}_{\theta\in\Theta}\)
 
- Minimax risk may be too pessimistic: assigns a lot of weight to “bad” distributions
 
- Instead may prefer to average over distributions in model
 
- Average over distributions must be with respect to some distribution \(\pi(\theta)\)
- \(\pi(\theta)\) is called a prior distribution
 
 
- Average risk of a rule \(f\) is then \(R_{\pi}(f)=E_{\pi}[R_{p_{\theta}}(f)]\)
- \(=\int[\int \ell(y_{T+h},f(\mathcal{Y}_T))p(\mathcal{Y}_{T+h},\theta)d\mathcal{Y}_{T+h}]\pi(\theta)d\theta\)
 
 
- Applying Fubini’s theorem (which says you can change the order of integration)
- \(=\int \ell(y_{T+h},f(\mathcal{Y}_T))[\int p(\mathcal{Y}_{T+h},\theta)\pi(\theta)d\theta]d\mathcal{Y}_{T+h}\)
 
 
- In other words, averaging over models just gives you back the average case over sequences
- Distribution now \(\int p(\mathcal{Y}_{T+h},\theta)\pi(\theta)d\theta\)
 
- Optimal risk is still the Bayes Risk, optimal forecast still the Bayes forecast
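A discrete numeric check of the Fubini step, using the four-sequence example and two hypothetical model distributions \(p(\cdot,\theta_1)\), \(p(\cdot,\theta_2)\) (the numbers are made up for illustration):

```r
# Averaging risks over a prior equals the risk under the mixture distribution
seqs <- list(c(1,1,1,1), c(0,0,0,0), c(1,0,1,0), c(1,1,1,0))
p1 <- c(0.7, 0.1, 0.1, 0.1)  # p(., theta_1), hypothetical
p2 <- c(0.1, 0.1, 0.7, 0.1)  # p(., theta_2), hypothetical
prior <- c(0.5, 0.5)         # prior pi over {theta_1, theta_2}

loss01 <- function(y, yhat) as.numeric(y != yhat)
f <- function(y_obs) tail(y_obs, 1)  # "predict the last value" rule
losses <- sapply(seqs, function(s) loss01(s[4], f(s[1:3])))

avg_of_risks <- prior[1] * sum(p1 * losses) + prior[2] * sum(p2 * losses)
risk_of_mix  <- sum((prior[1] * p1 + prior[2] * p2) * losses)
c(avg_of_risks, risk_of_mix)  # equal: the order of averaging is irrelevant
```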
 
 
- Choosing a prior and computing the Bayes forecast will be subject of future classes on Bayesian methods
- Sneak Preview: It will involve calculating a whole lot of integrals
 
 
 
Criteria vs method
- All of above are ways of evaluating what a “good” forecast is
 
- Any forecasting method can be evaluated in these ways
- Once choices are made: loss function, models/hypotheses, distributions
 
 
- Sometimes, criterion tells you enough to find a “best” method
- Once chosen, just apply the best method
 
 
- Other times, apply method which is “good enough”
 
 
Future Classes: Forecasting theory
- Given a method + a criterion, show method does well by criterion
 
- Huge complicated area with a lot of math
 
- Class will contain many examples of methods + results
 
- Not systematic
- Grab bag of things that seem to work well
 
- Plus some justifications of when, why, how
 
 
- Start with “classical” statistical approach