Plans
- Formalize problem of forecasting using decision theory
 
- A way to analyze decision-making processes
- Tells us what a good forecasting rule looks like
 
- Not just quality of result, but quality of method producing results
 
- May or may not tell us how to find such a good rule
 
 
- Decision theory is not just one theory
- Many value judgments required
 
- Application depends on context
 
- Gives us a formal language to describe problem features that should affect decision
 
 
- Today
- Outline the setup
 
- Talk about a few ways to think about what is a good decision
 
 
- Future
- Apply these criteria to introduce different types of “good” forecasting rules, depending on criterion and context
 
 
 
Decision Theory: Setup
- Choose a forecasting rule \[f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}\]
 
- Observe sequence of data \[\mathcal{Y}_T=\{y_t\}_{t=1}^{T}\in \otimes_{t=1}^{T} \mathrm{Y}\]
 
- Apply forecasting rule to obtain predicted value \[\widehat{y}_{T+h}=f(\mathcal{Y}_T)\]
 
- Observe realized value \(y_{T+h}\)
 
- Evaluate by loss function \[\ell(.,.): \mathrm{Y}\times\mathrm{Y}\to\mathbb{R}^+\]
 
- Loss is \[\ell(y_{T+h},\widehat{y}_{T+h})\]
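This setup can be sketched in R; the squared-error loss and sample-mean rule below are illustrative choices, not prescribed by the setup:

```r
# Forecasting rule f: here, the sample mean of the observed data (illustrative)
f <- function(y_obs) mean(y_obs)

# Loss function ell(y, yhat): squared error, one common choice
loss <- function(y, yhat) (y - yhat)^2

y_obs  <- c(1.0, 1.2, 0.9, 1.1)  # observed sequence Y_T
y_next <- 1.3                    # realized value y_{T+h}

yhat <- f(y_obs)                 # predicted value: 1.05
loss(y_next, yhat)               # realized loss: (1.3 - 1.05)^2 = 0.0625
```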
 
 
Decision Theory: Goals
- Goal of decision theory is to choose a decision rule \(f\) that leads to small loss
 
- “Solution”: Pick the rule that gives the correct answer
- \(f^{*}(\mathcal{Y}_T)=y_{T+h}\) solves \(\underset{f}{\arg\min\ } \ell(y_{T+h},f(\mathcal{Y}_T))\)
 
 
- If such a rule exists and you know what it is, great
- Usually, neither is true!
 
 
 
Sequences
- Consider space of possible sequences \(\mathcal{Y}_{T+h}=\{y_t\}_{t=1}^{T+h}\in \otimes_{t=1}^{T+h} \mathrm{Y}\)
 
- \(\otimes_{t=1}^{T+h} \mathrm{Y}\) contains every possible outcome sequence
 
- Example: forecast if stock market will go up or down tomorrow, based on past moves up or down
 
- Then \(\mathrm{Y}=\{\text{up, down}\}\), and if \(T=3\), \(h=1\), the possibilities are
- {up, up, up, up}
 
- {down, down, down, down}
 
- {up, down, up, down}
 
- {up, up, up, down}
 
- etc (12 more)
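The full set of \(2^4=16\) possibilities can be enumerated in R (coding up as 1 and down as 0):

```r
# All length-4 up/down sequences, coded 1 = up, 0 = down
seqs <- expand.grid(t1 = 0:1, t2 = 0:1, t3 = 0:1, t4 = 0:1)
nrow(seqs)  # 16 possible sequences
```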
 
 
- Considering all possibilities, try to find a rule that has small loss over the different sequences
 
- Issue: with many possible sequences, different rules will have different loss on each
 
- Possible solutions: one or more of
- Decide which sequences are “more important”
 
- Change the goal of the problem
 
 
 
Sequences
# Plot the four example sequences: each row of points is one sequence,
# colored by its up (1) / down (0) values
t<-c(1,2,3,4)
y1<-c(0.75,0.75,0.75,0.75)
y2<-c(0.5,0.5,0.5,0.5)
y3<-c(0.25,0.25,0.25,0.25)
y4<-c(0,0,0,0)
# y5<-c(-0.25,-0.25,-0.25,-0.25)
# y6<-c(-0.5,-0.5,-0.5,-0.5)
# y7<-c(-0.75,-0.75,-0.75,-0.75)
# y8<-c(-1,-1,-1,-1)
seq1<-c(1,1,1,1)
seq2<-c(0,0,0,0)
seq3<-c(1,0,1,0)
seq4<-c(1,1,1,0)
# seq5<-c(0,1,1,1)
# seq6<-c(0,1,1,0)
# seq7<-c(0,1,0,1)
# seq8<-c(0,1,0,0)
dframe<-data.frame(t,y1,y2,y3,y4,seq1,seq2,seq3,seq4)
library(ggplot2)
ggplot(data=dframe,mapping=aes(x=t))+
  geom_point(aes(y=y1,color=seq1),size=10)+
  geom_point(aes(y=y2,color=seq2),size=10)+
  geom_point(aes(y=y3,color=seq3),size=10)+
  geom_point(aes(y=y4,color=seq4),size=10)+
  theme(panel.background = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(),legend.position = "none")+
  labs(x=" ", y=" ",title="Example Sequences")

 
Preview of approaches
- Different decision theoretic philosophies approach tradeoffs in different ways
 
- Average vs worst case
- Can try to do reasonably well in any situation, even the worst
 
- Can try to do well over many situations, with different priority given to each
- With different degrees of prioritization
 
 
 
- Absolute vs relative
- Can use a criterion which judges forecast only by its own loss
 
- Can instead try to do well relative to some benchmark (which may do better or worse depending on the situation)
 
 
- Average case approaches can fail badly in cases they assign low weight to
 
- Worst case methods will never do terribly, but may never do particularly well either
 
- Absolute criteria try to do okay in any situation
 
- Relative criteria are allowed to perform badly in “hard” cases, but must do well in easy cases
 
 
A Crude Classification of Philosophies
library(knitr)
library(kableExtra)
text_tbl <- data.frame(
  Criteria = c("Absolute Criterion", "Relative Criterion"),
  Worst.Case = c(
    "Adversarial Prediction", 
    "Online Learning "
    ),
  Average.Average.Case = c(
    "Bayesianism",
    " Bayesian Regret* " 
    ),
  Worst.Average.Case = c(
    "Minimax Theory",
    "Statistical Learning" 
    )  
)
kable(text_tbl) %>%
  kable_styling(full_width = F) %>%
  column_spec(1,border_right = T) %>%
  column_spec(2) %>%
  add_header_above(c(" " = 2, "Average Case" = 2))
| Criteria           | Worst Case             | Average-Average Case | Worst-Average Case   |
|--------------------|------------------------|----------------------|----------------------|
| Absolute Criterion | Adversarial Prediction | Bayesianism          | Minimax Theory       |
| Relative Criterion | Online Learning        | Bayesian Regret*     | Statistical Learning |
*rarely used, so will not be covered
 
Average case analysis: probability
- One way to quantify “more important” is to give a weighting to each sequence
 
- Most commonly, this is done by means of probability
- Roughly, a set of nonnegative weights that sum to 1
 
 
- Probability assigns to each subset \(A\subseteq\Omega\) of a (discrete) set \(\Omega\) a number \(p(A)\) such that
- \(0\leq p(A) \leq 1\)
 
- \(p(\Omega)=1\)
 
- If \(A_1,A_2,\ldots\) are disjoint sets in \(\Omega\), \(p(\cup_{j}A_j)=\sum_j p(A_j)\)
 
 
- For continuous case, probability defined by a density \(p()\) which integrates to 1
 
- For forecasting, the set over which we define a probability is the set of sequences \(\otimes_{t=1}^{T+h}\mathrm{Y}\)
 
- Once a probability is defined, a sequence is a random variable
- A function of the event \(\omega\)
 
 
- E.g. might give weights \((1/4,1/4,1/4,1/4)\) to sequences
- \(\{1,1,1,1\}\), \(\{0,0,0,0\}\), \(\{1,0,1,0\}\), \(\{1,1,1,0\}\)
 
- Any two of these sequences have combined probability \(1/2\), any three \(3/4\), etc
 
 
 
Expected values and Risk
- We consider sequences with higher weight \(p\) more important to worry about
 
- Typical application is to calculate expected value (with respect to \(p\)) of some function of the sequence \[E_p[g(\mathcal{Y}_{T+h})]:=\sum_{\omega\in\Omega} p(\omega)g(\mathcal{Y}_{T+h}(\omega))\]
 
- For continuous case, use integral \[E_p[g(\mathcal{Y}_{T+h})]:=\int_{\omega\in\Omega} g(\mathcal{Y}_{T+h}(\omega))p(\omega)d\omega\]
 
- Most important expectation is expectation of loss function, called the risk \[R_p(f):=E_p[\ell(y_{T+h},f(\mathcal{Y}_T))]\]
 
- Given a distribution, we can now compare forecast methods based on a single number
- Smaller risk means smaller loss, on average
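As a sketch, risk can be computed directly for the four-sequence example above (each sequence with probability 1/4, 0-1 loss); the two rules compared are illustrative:

```r
# The four sequences from the example, each with probability 1/4
seqs <- list(c(1,1,1,1), c(0,0,0,0), c(1,0,1,0), c(1,1,1,0))
p <- rep(1/4, 4)
loss01 <- function(y, yhat) as.numeric(y != yhat)

# Risk: expected 0-1 loss of rule f over the distribution on sequences
risk <- function(f) {
  sum(p * sapply(seqs, function(s) loss01(s[4], f(s[1:3]))))
}

risk(function(y_obs) tail(y_obs, 1))  # "predict the last value": risk 1/2
risk(function(y_obs) 1)               # "always predict up": risk 3/4
```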
 
 
 
Average-Case decision-making
- Choosing a rule which is low risk tells us that our guess will do well across possible sequences
 
- Weights tell us about subjective importance of different sequences
 
- High probability sequences considered “more likely”
- One interpretation: given a large number of observed sequences, fraction \(p(A)\) will be in set \(A\)
 
- In general, do not actually observe many sequences, so this is purely theoretical
 
 
- You may not do well in every case, but you do relatively well in cases that are “more common”
 
- Risk, with distribution \(p\), allows us to define a best possible outcome, called the Bayes Risk \[R_p^{*}:=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\inf}R_p(f)\]
 
- If a minimizer exists, it is called the Bayes Forecast
 
 
Bayes forecasts
- When Bayes forecast exists, defined as average-case optimal forecast
- \(f_p^*(.):=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}E_p\ell(y_{T+h},f(\mathcal{Y}_T))\)
 
 
- Analysis can be simplified using conditional probability
- For any two events \(A,B\), \(p(B|A):=\frac{p(B\cap A)}{p(A)}\)
 
- Conditional densities \(p(y|x):=\frac{p(y,x)}{p(x)}\)
 
 
- Can decompose multivariate expectation as average of conditional expectation
- \(E_p[g(x,y)]=\int\int g(x,y)p(y|x)p(x)dydx=E_{p(x)}[E_{p(y|x)}[g(x,y)|x]]\)
 
 
- Applying to risk, and considering \(h=1\) for simplicity
- \(f_p^*(.):=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}\int\int\ell(y_{T+1},f(\mathcal{Y}_T))p(y_{T+1}|\mathcal{Y}_T)p(\mathcal{Y}_T)dy_{T+1}d\mathcal{Y}_T\)
 
- \(=\int\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}[\int\ell(y_{T+1},f(\mathcal{Y}_T))p(y_{T+1}|\mathcal{Y}_T)dy_{T+1}]p(\mathcal{Y}_T)d\mathcal{Y}_T\)
 
 
- So \(f_p^*(\mathcal{Y}_T)=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\arg\min}[\int\ell(y_{T+1},f(\mathcal{Y}_T))p(y_{T+1}|\mathcal{Y}_T)dy_{T+1}]\)
- Bayes forecast minimizes conditional risk: conditional expectation of loss given observed data
 
 
 
Application: Sequences
- Go back to the \(\{\text{up},\text{down}\}\) prediction problem
 
- Sequences \(\{1,1,1,1\}\), \(\{0,0,0,0\}\), \(\{1,0,1,0\}\),\(\{1,1,1,0\}\) w/ prob \((1/4,1/4,1/4,1/4)\)
 
- Let loss function be 0-1 loss: \(\ell(y,\widehat{y})=1\{y\neq\widehat{y}\}\)
 
- Suppose we observe \(\mathcal{Y}_T=\{1,0,1\}\)
- Then \(p(y_{T+1}=0|\mathcal{Y}_T)=\frac{p(1,0,1,0)}{p(1,0,1,0)+p(1,0,1,1)}=\frac{1/4}{1/4+0}=1\)
 
- \(p(y_{T+1}=1|\mathcal{Y}_T)=\frac{p(1,0,1,1)}{p(1,0,1,0)+p(1,0,1,1)}=\frac{0}{1/4+0}=0\)
 
 
# After conditioning on observing {1,0,1}: only seq3 remains; others dimmed
ggplot(data=dframe,mapping=aes(x=t))+
  geom_point(aes(y=y1,color=seq1),size=10,alpha=0.1)+
  geom_point(aes(y=y2,color=seq2),size=10,alpha=0.1)+
  geom_point(aes(y=y3,color=seq3),size=10)+
  geom_point(aes(y=y4,color=seq4),size=10,alpha=0.1)+
  theme(panel.background = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(),legend.position = "none")+
  labs(x=" ", y=" ",title="Sequences After Conditioning") 
 
 
Sequence Example, ctd
- Conditional Risk is \(1\times \ell(0,f(\mathcal{Y}_T))+0\times \ell(1,f(\mathcal{Y}_T))=\ell(0,f(\mathcal{Y}_T))=1\{0\neq f(\mathcal{Y}_T)\}\)
 
- Minimize by setting \(f_p^*(\mathcal{Y}_T)=0\)
 
- Similarly, for \(\mathcal{Y}_T=\{0,0,0\}\), \(f_p^*(\mathcal{Y}_T)=0\)
 
- Observing \(\mathcal{Y}_T=\{1,1,1\}\), conditional risk is \(\frac{1}{2}\times 1\{0\neq f(\mathcal{Y}_T)\}+\frac{1}{2}\times 1\{1\neq f(\mathcal{Y}_T)\}\)
- Since multiple outcomes given equal weight, any prediction \(f_p^*(\mathcal{Y}_T)\in\{0,1\}\) is optimal
 
 
# After conditioning on observing {1,1,1}: seq1 and seq4 remain, equally likely
ggplot(data=dframe,mapping=aes(x=t))+
  geom_point(aes(y=y1,color=seq1),size=10,alpha=0.7)+
  geom_point(aes(y=y2,color=seq2),size=10,alpha=0.1)+
  geom_point(aes(y=y3,color=seq3),size=10,alpha=0.1)+
  geom_point(aes(y=y4,color=seq4),size=10,alpha=0.7)+
  theme(panel.background = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(),legend.position = "none")+
  labs(x=" ", y=" ",title="Sequences After Conditioning") 
 
- For outcomes with probability 0, like \(\{0,1,0\}\), again any prediction is optimal: changing it has no effect on risk
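The conditional probabilities driving the Bayes forecast in this example can be computed directly; a minimal sketch:

```r
# Distribution over the four example sequences, each with probability 1/4
seqs <- list(c(1,1,1,1), c(0,0,0,0), c(1,0,1,0), c(1,1,1,0))
p <- rep(1/4, 4)

# p(y_{T+1} = y_next | observed first three values y_obs)
cond_prob <- function(y_obs, y_next) {
  match_obs  <- sapply(seqs, function(s) all(s[1:3] == y_obs))
  match_next <- sapply(seqs, function(s) s[4] == y_next)
  sum(p[match_obs & match_next]) / sum(p[match_obs])
}

cond_prob(c(1,0,1), 0)  # 1: after {1,0,1} the Bayes forecast is 0
cond_prob(c(1,1,1), 0)  # 0.5: after {1,1,1} either prediction is optimal
```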
 
 
Worst case analysis of sequences
- To express extreme agnosticism, can try to find methods that work for all sequences, rather than on average over a distribution on sequences
 
- One potentially desirable goal would be to find a rule which performs well for all sequences in \(\otimes_{t=1}^{T+h} \mathrm{Y}\)
 
- That is, consider worst case loss over all sequences of decision rule \(f\) \[L^{\max}(f)=\underset{\mathcal{Y}_{T+h}\in\otimes_{t=1}^{T+h} \mathrm{Y}}{\sup}\ell(y_{T+h},f(\mathcal{Y}_T))\]
 
- This criterion then tells you loss of a “best” rule \[L^{\min\max}=\underset{f}{\inf}\underset{\mathcal{Y}_{T+h}\in\otimes_{t=1}^{T+h} \mathrm{Y}}{\sup}\ell(y_{T+h},f(\mathcal{Y}_T))\]
 
- In words, this says: pick a rule that has the best worst case outcome over any sequence
 
- This can be wildly pessimistic
- Does well in sequences where loss might be high, but may do quite poorly in sequences where loss may be low
 
 
- Also may not define rule uniquely
 
 
Adversarial Decision-Making
- There are ways to consider worse cases than even the worst sequence
 
- May have an outcome that responds to decision rule chosen, in a way which is adversarial
- The sequence adjusts to make outcome as bad as possible.
 
 
- This kind of analysis might make sense if outcome decided by malicious adversary
- A competitor (in business, politics, or games) wants you to do poorly
 
- Indeed, it comes from game theory, where this is the situation
 
 
- Standard in game theory to work with slightly modified spaces of rules: randomized strategies
- Choose a random rule, and allow the adversary to respond only to its distribution, not the realized draw
 
- Goal is to minimize expected loss, over own distribution choice \(\widehat{y}_{T+h}\sim \widehat{p}_{T+h}()\)
 
- Adversary also allowed to randomize \(y_{T+h}\sim q_{T+h}\) \[L^{Adversarial}= \underset{\widehat{p}_{T+h}()}{\inf}\underset{q_{T+h}}{\sup}E_{\widehat{p}_{T+h}(),q_{T+h}}\ell(y_{T+h},\widehat{y}_{T+h})\]
 
 
- In this setting, randomized algorithms can help by restricting the adversary’s ability to respond
 
 
Example: Sequences
- Let loss function be 0-1 loss: \(\ell(y,\widehat{y})=1\{y\neq\widehat{y}\}\)
 
- If all possible length-4 \(\{0,1\}\) sequences are in the set considered, then for any rule it is always possible that the next outcome is the opposite of the prediction
 
- \(L^{\min\max}=1\)
- Cannot prevent mistakes in at least some cases
 
 
- \(L^{Adversarial}=\frac{1}{2}\)
- By guessing at random, any strategy of adversary will give wrong guess 1/2 the time
 
- But cannot do better than guessing at random
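A quick check of the adversarial value: if the forecaster guesses 1 with probability 1/2, the expected 0-1 loss is 1/2 no matter what distribution \(q\) the adversary picks:

```r
# Expected 0-1 loss of a uniform random guess when the adversary plays
# y = 1 with probability q: q * P(guess 0) + (1 - q) * P(guess 1)
expected_loss <- function(q) q * 0.5 + (1 - q) * 0.5

sapply(c(0, 0.3, 0.5, 1), expected_loss)  # 0.5 for every q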
 
 
 
Online Learning and Regret Minimization
- Worst case and adversarial analysis do not let you take advantage of particular features of sequence
- Usually obtain trivial results
 
 
- Want good relative performance over sequences
- Attempt to do better on some sequences while allowing doing worse on others
 
 
- Asks for regret of learning rule
- Given set \(\mathcal{F}\) of comparison rules \(f():\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}\)
 
- Want rule which does not much worse than the rule in that set which is best ex post, after sequence is observed
 
- Usually compare over full sequence, where rule is updated each period \[Regret(\{f_t\}_{t=1}^{T})=\sum_{t=1}^{T}\ell(y_{t},f_t(\mathcal{Y}_{t-1}))-\underset{f\in\mathcal{F}}{\inf}\sum_{t=1}^{T}\ell(y_{t},f(\mathcal{Y}_{t-1}))\]
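A small illustration of this definition, comparing a “predict the previous value” rule against the class of constant rules on one made-up sequence (all choices here are hypothetical):

```r
y <- c(1, 0, 1, 1, 1, 0, 1, 1)  # an arbitrary observed sequence
loss01 <- function(y, yhat) as.numeric(y != yhat)

# Cumulative loss of "predict the previous value" (predicting 1 at t = 1)
pred <- c(1, head(y, -1))
L_rule <- sum(loss01(y, pred))

# Ex post best rule in F = {always predict 0, always predict 1}
L_best <- min(sum(loss01(y, 0)), sum(loss01(y, 1)))

L_rule - L_best  # regret: 4 - 2 = 2
```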
 
 
 
Applying Online Learning
- Usual goal is to minimize maximum regret over any possible sequence \(\{y_t\}_{t=1}^{T}\)
- Rules \(f_t\) are allowed to adapt to the sequence, and in some cases, be randomized
 
 
- Sequence may be allowed to adapt to the (distribution of) rules
 
- Best possible result is
- \(\underset{f_1\sim p_1}{\inf}\underset{y_1\sim q_1}{\sup}\ldots\underset{f_T\sim p_T}{\inf}\underset{y_T\sim q_T}{\sup}Regret(\{f_t\}_{t=1}^{T})\)
 
 
- Smaller comparator class makes low regret easier, but allows worse absolute performance
 
- Could find good rule by solving long sequence of optimization problems
 
- In practice, some very simple rules are used, and shown to work well
 
 
Refining the average case approach: Probability Models
- Given a distribution \(p\), Bayes forecast tells us exactly what to do to get low risk
 
- Problem: Where does \(p\) that we are averaging over come from?
- Need background knowledge on which outcomes are likely
 
 
- Can we relax this requirement?
 
- One way: allow the probability distribution to be unknown
 
- Allow a family of probability distributions
- \(\{p(\mathcal{Y}_{T+h},\theta)\}_{\theta\in\Theta}\)
 
 
- Call such a family a (probability) model
 
- Different values of \(\theta\) can correspond to different features of sequence
- E.g., higher or lower values, trend, seasonality, variance and autocovariance, etc., can be more or less likely
 
 
- With many probability distributions \(p_{\theta}\), have not one but many risk functions
- \(R_{p_{\theta}}(f):=\int \ell(y_{T+h},f(\mathcal{Y}_T))p(\mathcal{Y}_{T+h},\theta)d\mathcal{Y}_{T+h}\)
 
 
- Now different forecast rules will have risk which depends on \(\theta\)
 
 
Example Probability Models
- Any distribution over sequences can be part of a probability model
 
- Most general model: Set of all probability distributions
- Seems very flexible, but will be too large for most things we want to do with a model
 
 
- Alternative: all probability distributions that factorize
- \(\Theta=\{p: p(\mathcal{Y}_{T+h})=\Pi_{t=1}^{T+h}p(y_t)\}\)
 
- Such distributions are called iid - independent and identically distributed
 
- Almost every stats and ML class you take looks only at these
 
- In time series, called “White Noise”: uninteresting model
 
 
- Simple non-iid model: Normal-AR(1) Model
- Let \(\phi(x)=\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2)\) be standard normal density
 
- \(p(\mathcal{Y}_{T+h},\theta)=\phi(y_1)\cdot\Pi_{t=2}^{T+h}\phi(y_t-\theta y_{t-1})\)
 
- Recursively, \(y_{t}=\theta y_{t-1}+e_t\), \(e_t\overset{iid}{\sim} N(0,1)\)
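The recursion makes the model easy to simulate; a short sketch:

```r
# Simulate the Normal-AR(1) model: y_t = theta * y_{t-1} + e_t, e_t ~ N(0,1)
set.seed(42)
theta <- 0.8
T_len <- 200
y <- numeric(T_len)
y[1] <- rnorm(1)                 # y_1 ~ N(0,1), as in the density above
for (t in 2:T_len) y[t] <- theta * y[t - 1] + rnorm(1)

cor(y[-T_len], y[-1])  # sample lag-1 autocorrelation, roughly theta
```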
 
 
- Will introduce many more models later in class
- Many related to simple forecasting rules we have covered
 
 
 
Working with probability models
- How can we assess forecasts with many distributions?
- No longer a single fixed standard
 
 
- Need some way of evaluation across distributions
 
- Things to consider
- Set of rules
 
- Set of risk functions
 
- Combination of different risk functions
 
 
- Different choices generate different approaches
- Minimax Theory
 
- Statistical Learning Theory
 
- Bayesianism
 
 
 
Minimax Theory
- Suppose we have a model class we think is useful
 
- But want to remain completely agnostic on which model in class is worthwhile
 
- Can evaluate forecast rule \(f\) by Maximum Risk \[\underset{\theta\in\Theta}{\sup}\ R_{p_{\theta}}(f)\]
 
- A decision rule which performs well according to this criterion gives good performance on average for any distribution in a class
 
- Gives a single measure of quality, which averages over sequences according to a distribution
- But considers the worst case distribution when doing so
 
 
 
Applying Minimax Theory
- With a single measure of quality, easy to define a “best” result: the Minimax risk \[R^{\text{Minimax}}=\underset{f:\otimes_{t=1}^{T}\mathrm{Y}\to\mathrm{Y}}{\inf}\underset{\theta\in\Theta}{\sup} R_{p_{\theta}}(f)\]
 
- Size of set \(\{p_{\theta}\}_{\theta\in\Theta}\) determines tradeoff
- Large set of models implies good performance in more situations
 
- Small set of models improves performance within that set, since don’t have to worry about as many models
 
 
- In principle, could perform optimization and find minimizer
- In practice, hard to find, but common to come up with reasonable looking rule and show it comes close to minimizing
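As an illustration (not from the slides): under the Normal-AR(1) model with squared loss and rules of the form \(f(\mathcal{Y}_T)=a\,y_T\), the risk is \((\theta-a)^2\mathrm{Var}(y_T)+1\), and an approximate minimax choice of \(a\) can be found on a grid:

```r
# Risk of rule f(Y_T) = a * y_T under Normal-AR(1) with squared loss:
# E[(y_{T+1} - a * y_T)^2] = (theta - a)^2 * Var(y_T) + 1
T_len <- 10
var_yT <- function(theta) sum(theta^(2 * (0:(T_len - 1))))  # since y_1 ~ N(0,1)
risk <- function(a, theta) (theta - a)^2 * var_yT(theta) + 1

thetas <- seq(-0.9, 0.9, by = 0.01)  # the model class Theta
max_risk <- function(a) max(sapply(thetas, function(th) risk(a, th)))

a_grid <- seq(-1, 1, by = 0.05)
a_star <- a_grid[which.min(sapply(a_grid, max_risk))]
a_star  # approximately 0: hedges against both signs of theta
```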
 
 
 
Statistical Learning
- Typical frequentist estimation tries to find methods which work well for all distributions in a class
 
- Sometimes helpful to compare predictions to best possible in some class \(\mathcal{F}\)
- So-called “oracle predictor”
 
- \(R_{p}^{\text{Oracle}}=\underset{f\in \mathcal{F}}{\inf}E_{p}\ell(y_{T+h},f(\mathcal{Y}_T))\) is best risk achievable with p known
 
 
- Goal is to find a procedure which does nearly as well as the oracle, even when p isn’t known
 
- Worst case relative risk of \(f\) is \[\underset{{\theta}\in\Theta}{\sup}(E_{p_{\theta}}\ell(y_{T+h},f(\mathcal{Y}_T))-R_{p_{\theta}}^{\text{Oracle}})\]
 
 
Applying Statistical Learning
- Relative risk criterion allows for bad performance in hard cases where oracle also fails
 
- Contrast to minimax risk
- Asks for good performance in these cases, which may degrade performance in easier cases
 
 
- Smaller set \(\mathcal{F}\) allows for better relative performance
- But allows for worse absolute performance
 
 
- Larger set \(\Theta\) makes method more robust to different model features
- In ML, often allow all iid distributions
 
- Can extend result to larger classes to accommodate time series
- But smaller than all distributions over sequences
 
 
 
- Again, could simply optimize directly to find a good rule
- In practice, pick simple method and prove it comes close
 
 
 
Average case over distributions
- Again consider case where we have a probability model \(\{p(\mathcal{Y}_{T+h},\theta)\}_{\theta\in\Theta}\)
 
- Minimax risk may be too pessimistic: assigns a lot of weight to “bad” distributions
 
- Instead may prefer to average over distributions in model
 
- Average over distributions must be with respect to some distribution \(\pi(\theta)\)
- \(\pi(\theta)\) is called a prior distribution
 
 
- Average risk of a rule \(f\) is then \(R_{\pi}(f)=E_{\pi}[R_{p_{\theta}}(f)]\)
- \(=\int[\int \ell(y_{T+h},f(\mathcal{Y}_T))p(\mathcal{Y}_{T+h},\theta)d\mathcal{Y}_{T+h}]\pi(\theta)d\theta\)
 
 
- Applying Fubini’s theorem (which says you can change the order of integration)
- \(=\int \ell(y_{T+h},f(\mathcal{Y}_T))[\int p(\mathcal{Y}_{T+h},\theta)\pi(\theta)d\theta]d\mathcal{Y}_{T+h}\)
 
 
- In other words, averaging over models just gives you back the average case over sequences
- Distribution now \(\int p(\mathcal{Y}_{T+h},\theta)\pi(\theta)d\theta\)
 
- Optimal risk is still the Bayes Risk, optimal forecast still the Bayes forecast
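A discrete numeric check of the Fubini step, using the four-sequence example and two hypothetical model distributions \(p(\cdot,\theta_1)\), \(p(\cdot,\theta_2)\) (the numbers are made up for illustration):

```r
# Averaging risks over a prior equals the risk under the mixture distribution
seqs <- list(c(1,1,1,1), c(0,0,0,0), c(1,0,1,0), c(1,1,1,0))
p1 <- c(0.7, 0.1, 0.1, 0.1)  # p(., theta_1), hypothetical
p2 <- c(0.1, 0.1, 0.7, 0.1)  # p(., theta_2), hypothetical
prior <- c(0.5, 0.5)         # prior pi over {theta_1, theta_2}

loss01 <- function(y, yhat) as.numeric(y != yhat)
f <- function(y_obs) tail(y_obs, 1)  # "predict the last value" rule
losses <- sapply(seqs, function(s) loss01(s[4], f(s[1:3])))

avg_of_risks <- prior[1] * sum(p1 * losses) + prior[2] * sum(p2 * losses)
risk_of_mix  <- sum((prior[1] * p1 + prior[2] * p2) * losses)
c(avg_of_risks, risk_of_mix)  # equal: the order of averaging is irrelevant
```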
 
 
- Choosing a prior and computing the Bayes forecast will be subject of future classes on Bayesian methods
- Sneak Preview: It will involve calculating a whole lot of integrals
 
 
 
Criteria vs method
- All of above are ways of evaluating what a “good” forecast is
 
- Any forecasting method can be evaluated in these ways
- Once choices are made: loss function, models/hypotheses, distributions
 
 
- Sometimes, criterion tells you enough to find a “best” method
- Once chosen, just apply the best method
 
 
- Other times, apply method which is “good enough”
 
 
Future Classes: Forecasting theory
- Given a method + a criterion, show method does well by criterion
 
- Huge complicated area with a lot of math
 
- Class will contain many examples of methods + results
 
- Not systematic
- Grab bag of things that seem to work well
 
- Plus some justifications of when, why, how
 
 
- Start with “classical” statistical approach