Exchange Rate Forecasting
- Influential early forecasting study: Meese and Rogoff (1983)
- Context: huge change in US exchange rate policy from fixed to floating rates in 1970s
- Theory not well developed at the time
- Economic and business planning require good guesses of future exchange rate
- Future appreciation raises value of foreign assets and sales, depreciation lowers them
- Meese and Rogoff contribution: a forecasting “horse race”
- Take one problem (exchange rates), apply many methods
- Compare performance (MSE, MAE)
- Learn about what works
Exchange Rate Series
suppressPackageStartupMessages(library(fpp2))
library(fredr) #Access FRED, the Federal Reserve Economic Data
#To access data by API, using R rather than going to the website and downloading
#You need to request a key: you can do so at https://research.stlouisfed.org/docs/api/api_key.html
fredr_set_key("8782f247febb41f291821950cf9118b6") #Key I obtained for this class
#Get 1970s data
#Let's get the (main) series for the Yen-Dollar Exchange rate
EXJPUS<-fredr(series_id = "EXJPUS",observation_end=as.Date("1982-01-04"))
#You can also download this series at https://fred.stlouisfed.org/series/DEXJPUS
#Let's get the (main) series for the Dollar-Pound Exchange rate
EXUSUK<-fredr(series_id = "EXUSUK",observation_end=as.Date("1982-01-04"))
#Let's get the (main) series for the German Deutschmark-Dollar Exchange rate
EXGEUS<-fredr(series_id = "EXGEUS",observation_end=as.Date("1982-01-04"))
#convert to time series and normalize so 1971-1 value is 1
ukus<-ts(EXUSUK$value[1]/EXUSUK$value,start=c(1971,1),frequency=12)
jpus<-ts(EXJPUS$value/EXJPUS$value[1],start=c(1971,1),frequency=12)
geus<-ts(EXGEUS$value/EXGEUS$value[1],start=c(1971,1),frequency=12)
autoplot(ukus, series="Pound (U.K.)")+autolayer(jpus,series="Yen (Japan)")+autolayer(geus,series="Deutschmark (Germany)")+
labs(x="Date",y="Units/Dollar", title="Dollar Exchange Rates in the 1970s",
subtitle="Monthly From 1971 to 1982, Index Starting at 1 in January 1971")+
guides(colour=guide_legend(title="Currencies"))
The Statistical Approach
- Build a probability model of the time series
- Find a way to estimate the “true” parameters
- Construct forecast from estimated model
The Statistical Machine Learning Approach
- Choose a class of forecast methods \(\mathcal{F}\)
- Called “hypotheses” in Machine Learning
- Find a way to pick the best performing method
- Construct a forecast by applying the chosen method
- Looks extremely similar to above
- Can be thought of as a generalization
- Difference 1: Data assumed to come from wider class of models
- Difference 2: Hypothesis class not restricted to same parameterization as models
- Today: Focus on shared features, using machine learning notation
Statistical Learning
- How does one pick the “best” method \(f\in\mathcal{F}\)
- Goal is to minimize risk \[R_{p}(f)=E_{p}[\ell(y_{t+h},f(\mathcal{Y}_t))]\]
- Question is, risk with respect to which distribution \(p\)?
- Assumption: there exists a “true” distribution \(p\in\mathcal{P}\) from which sequence is drawn
- Statistical Learning goal: For any distribution \(p\) in some class \(\mathcal{P}\)
- Find rule \(\widehat{f}\) that gives small Excess risk over Oracle Risk \(\underset{f\in\mathcal{F}}{\min}R_p(f)\)
- Letting \(f^{*}=\underset{f\in\mathcal{F}}{\arg\min}R_p(f)\), we want a rule that gives small value of \[R_p(\widehat{f})-R_p(f^{*})\]
- This says, for any distribution, want to perform well on average, relative to the best-in-class
Empirical Risk
- Problem: We don’t know the “true” risk because we don’t know \(p\)
- Attempted solution: approximate it using data
- Standard approximation: Empirical Risk \[\widehat{R}(f)=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))\]
- This is now something we can calculate (a short sketch follows)
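A minimal sketch (not part of the original notes): the empirical risk formula above written as an R function under square loss, for an arbitrary forecast rule frule (a hypothetical helper name), evaluated here for the mean forecast on the normalized Deutschmark series constructed earlier.
#Hedged sketch: empirical risk of a forecast rule under square loss
empirical_risk <- function(y, frule, h = 1){
  y <- as.numeric(y)
  Tn <- length(y)
  losses <- sapply(1:(Tn - h), function(t) (y[t + h] - frule(y[1:t]))^2) #loss of h-step forecast made at each t
  mean(losses)
}
#Example: the mean forecast rule applied to the normalized Deutschmark series
empirical_risk(geus, function(past) mean(past), h = 1)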
Empirical Risk Minimization
- If we knew the true risk, could find best procedure easily
- Minimize \(R_{p}(f)\) over class \(\mathcal{F}\)
- So, why not try to minimize empirical risk \[\widehat{f}^{ERM}=\underset{f\in\mathcal{F}}{\arg\min}\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))\]
- Forecasting rule: apply the Empirical Risk Minimizer (ERM) to the data \[f(\mathcal{Y}_T)=\widehat{f}^{ERM}(\mathcal{Y}_T)\]
- Is this a good idea?
- Can evaluate using decision theoretic criteria
- In particular, based on excess risk over oracle \(R_p(\widehat{f})-R_p(f^{*})\) \(\forall p\in\mathcal{P}\)
- Answer depends on \(\mathcal{P},\mathcal{F}\)
- Can provide conditions under which answer is: yes, it’s pretty good
Applying Empirical Risk Minimization
- ERM principle is standard way to turn parameterized family of forecasting rules into a single rule: e.g.
- Let \(\ell(y,\widehat{y})=(y-\hat{y})^2\) square loss
- Let \(\mathcal{F}=\Theta:=\{\theta:\ \theta\in\mathbb{R}\}\) the constant forecast rules
- ERM principle says to choose \[\widehat{\theta}^{ERM}=\underset{\theta\in\Theta}{\arg\min}\frac{1}{T-h}\sum_{t=1}^{T-h}(y_{t+h}-\theta)^2\]
- Solving optimization problem, obtain result \[\widehat{\theta}^{ERM}=\frac{1}{T-h}\sum_{t=1}^{T-h}y_{t+h}\]
- This gives (up to a sample shift) the mean forecast rule!
- Went from an infinite class of forecast rules to a single, intuitive one (a numerical check follows this list)
- Many forecast functions implement ERM for parameter selection
- The Arima command implements ERM for the autoregression (AR(p)) forecast rule with h=1 and square loss (exactly so with method="CSS"; the default Gaussian maximum likelihood fit is closely related)
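A quick numerical check (not from the original notes; the data here are simulated purely for illustration): minimizing the empirical square loss over constant forecast rules returns the (shifted) sample mean, as derived above.
#Hedged sketch: ERM over constant forecasts equals the sample mean
set.seed(123)
y <- rnorm(100, mean = 2)  #any series would do; simulated for illustration
h <- 1
erm_objective <- function(theta) mean((y[(1 + h):length(y)] - theta)^2) #empirical risk of constant forecast theta
theta_erm <- optimize(erm_objective, interval = c(-10, 10))$minimum
c(theta_erm, mean(y[(1 + h):length(y)]))  #numerically (nearly) identical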
Applications to exchange rate data
- Forecast exchange rates with AR(p) and mean forecasts, selecting parameters by minimizing empirical square loss
#Mean forecasts
gemeanf<-forecast(meanf(geus))
jpmeanf<-forecast(meanf(jpus))
ukmeanf<-forecast(meanf(ukus))
#AR(2) forecasts
geARf<-forecast(Arima(geus,order=c(2,0,0)))
jpARf<-forecast(Arima(jpus,order=c(2,0,0)))
ukARf<-forecast(Arima(ukus,order=c(2,0,0)))
#Plot results
autoplot(ukus, series="Pound (U.K.)")+autolayer(jpmeanf,PI=FALSE,series="Japan Mean")+autolayer(gemeanf,PI=FALSE,series="Germany Mean")+
autolayer(ukARf, PI=FALSE, series="U.K. AR(2)")+autolayer(jpARf,PI=FALSE,series="Japan AR(2)")+autolayer(geARf,PI=FALSE,series="Germany AR(2)")+
autolayer(ukmeanf,PI=FALSE,series="U.K. Mean")+autolayer(jpus,series="Yen (Japan)")+autolayer(geus,series="Deutschmark (Germany)")+
labs(x="Date",y="Units/Dollar", title="Dollar Exchange Rates in the 1970s and ERM forecasts",
subtitle="Monthly From 1971 to 1982, Index Starting at 1 in January 1971, with Forecasts")+
guides(colour=guide_legend(title="Currency & Forecast Method"))
In-sample and out-of-sample error
- For any procedure \(\widehat{f}\) we apply, we care about the true risk \(R_p(\widehat{f})=E[\ell(y_{t+h},\widehat{f}(\mathcal{Y}_t))]\)
- This is never observable when distribution \(p\) not known
- Instead, only have the empirical risk \(\widehat{R}(\widehat{f})=\frac{1}{T-h}\sum_{t=1}^{T-h}[\ell(y_{t+h},\widehat{f}(\mathcal{Y}_t))]\)
- How good a measure is this? A trivial decomposition \[R_p(\widehat{f})= \stackrel{\text{Empirical Risk}}{\widehat{R}(\widehat{f})}+\stackrel{\text{Generalization Error}}{(R_p(\widehat{f})-\widehat{R}(\widehat{f}))}\]
- Quality of forecast is sum of two components
- How well that forecasting method performed in the past
- How much past performance differs from future results
- First component is minimized by ERM
- Do as well as possible in class \(\mathcal{F}\) on this component
- A bigger set \(\mathcal{F}\) always (weakly) improves this
- Second component not knowable in practice
- But if we can somehow ensure it is small, procedure will do nearly as well in the future
- Conversely, if it is large, good observed performance is not meaningful: the method is said to overfit (the hold-out sketch below makes the gap concrete)
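A minimal hold-out sketch (not part of the original notes): fit an AR(2) on the pound series through 1979 and compare the training-set and test-set rows reported by accuracy(); the gap between them is one rough, data-dependent picture of the generalization error.
#Hedged sketch: in-sample vs out-of-sample error via a hold-out test set
train <- window(ukus, end = c(1979, 12))   #training sample
test  <- window(ukus, start = c(1980, 1))  #held-out sample
fit   <- Arima(train, order = c(2, 0, 0))
accuracy(forecast(fit, h = length(test)), test)  #compare "Training set" vs "Test set" rows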
Controlling overfitting
- How do we get small generalization error?
- Ex ante bounds: prove that procedure \(\widehat{f}\) has small generalization error
- This will depend on class \(\mathcal{P}\) of distributions and details of procedure: for ERM, class \(\mathcal{F}\) of hypotheses
- Simple upper bound on generalization error \(R_p(\widehat{f})-\widehat{R}(\widehat{f})\):
- Error of \(\widehat{f}\) can be no worse than Uniform Error: maximum error of any \(f\in\mathcal{F}\) \[(R_p(\widehat{f})-\widehat{R}(\widehat{f}))\leq \underset{f\in\mathcal{F}}{\sup}\left|\widehat{R}(f)-R_p(f)\right|\]
- Overfitting bounded when Empirical Risk is good approximation of true risk uniformly over \(\mathcal{F}\) for any \(p\in\mathcal{P}\)
- Easy to see that uniform error is increasing in size of \(\mathcal{F}\)
- More overfitting possible when choosing from many models
- ERM principle has tradeoff when choosing model \(\mathcal{F}\)
- Larger or more complicated class improves in sample fit, but increases overfitting
- To quantify this, need bounds on uniform error
- Can be shown for certain classes of distributions and models
Alternate viewpoint: Excess Risk
- Decomposition of risk into empirical risk and generalization error provides bounds which depend on data observed
- May be good or bad for different realizations: can’t tell us what class to choose ex ante
- Quantity we can control with ERM is excess risk relative to an oracle
- Perform (nearly) as well as best model in class, for any distribution \(p\in \mathcal{P}\) \[R_p(\widehat{f}^{ERM})-R_p(f^{*})\]
- Decomposes into \[\stackrel{\text{Generalization Error}}{(R_p(\widehat{f}^{ERM})-\widehat{R}(\widehat{f}^{ERM}))}+\stackrel{\text{Oracle Approximation Error}}{(\widehat{R}(\widehat{f}^{ERM})-R_p(f^{*}))}\] \[\leq(R_p(\widehat{f}^{ERM})-\widehat{R}(\widehat{f}^{ERM}))+(\widehat{R}(f^{*})-R_p(f^{*}))\]
- Upper bound because \(\widehat{f}^{ERM}\) gives smaller empirical risk than any other rule, even the one giving smallest true risk \[\leq2\underset{f\in\mathcal{F}}{\sup}\left|\widehat{R}(f)-R_p(f)\right|\]
- Relative risk of ERM is guaranteed to be small whenever uniform error is small for any \(p\in\mathcal{P}\)
Stationarity
- Stationarity is a feature of probability distributions over sequences \(\{y_t\}_{t=-\infty}^{\infty}\)
- It gives one formalization of idea that “future is like the past”
- Need something like this idea to claim that past data tells us about future data
- Take any set of time periods \(\mathcal{T}\subset \mathbb{N}\), e.g. \(\mathcal{T}=\{3,5,6,300,301,10000,\ldots\}\)
- Then the joint distribution of \(\{y_t\}_{t\in\mathcal{T}}\) is the same as the joint distribution of \(\{y_{t+h}\}_{t\in\mathcal{T}}\) for any \(h\in\mathbb{N}\)
- A shift forward or backwards in time leaves the distribution unchanged
- In distribution, parts of sequence in future will look like parts of sequence in past
- Implies mean, variance, covariance, etc unaffected by time shift
- Distributions over sequences which are stationary have similar patterns regardless of current date
- Need not have similar patterns conditional on data
- \(y_{t}\) may be correlated with \(y_{t+2}\), so conditional on \(y_t\), mean of \(y_{t+2}\) different
- But correlation should be same as between \(y_{t+60}\) and \(y_{t+62}\)
- Rules out many commonly observed patterns
- Seasonality, trends, and many other cases to be discussed later
Nonstationary Series: Simulations
library(knitr) #Tables
library(fpp2) #Forecast commands
library(kableExtra) #More tables
library(tidyverse) #Data manipulation
library(gridExtra) #Graph Display
tl<-240 #Length of simulated series
set.seed(55) #Initialize random number generator for reproducibility
e1<-ts(rnorm(tl)) #standard normal shocks
e2<-ts(rnorm(tl)) #standard normal shocks
e3<-ts(rnorm(tl)) #standard normal shocks
e4<-ts(rnorm(tl)) #standard normal shocks
rho<-0.7 #AR1 parameter
#AR2 parameters
b0<-3
b1<-0.4
b2<-0.3
ss<-0.98 #Large AR parameter
yAR1<-numeric(tl)
yAR1[1]<-(1/(1-rho))*rnorm(1) #Initialize at random draw
yRW<-numeric(tl)
yslowAR<-numeric(tl)
for(s in 1:(tl-1)){
#Stationary AR(1)
yAR1[s+1]<-rho*yAR1[s]+e1[s]
#Stationary but close to nonstationary AR(1)
yslowAR[s+1]<-ss*yslowAR[s]+e4[s]
#Random walk
yRW[s+1]<-yRW[s]+e3[s]
}
yAR2<-numeric(tl)
yAR2[1]<-3/0.3 #Initialize at mean
yAR2[2]<-3/0.3 #Initialize at mean
for(s in 1:(tl-2)){
#Stationary AR(2)
yAR2[s+2]=b0+b1*yAR2[s+1]+b2*yAR2[s]+e2[s]
}
trend<-0.05*seq(1,tl) #Increasing trend
season<-cos((2*pi/12)*seq(1,tl)) #Seasonal pattern
#Jump at particular time
breakseries<-ts(numeric(tl),start=c(1990,1),frequency=12)
for(s in 1:(tl)){
if(s>180){
breakseries[s]<-6
}
}
#Represent as time series
yAR1<-ts(yAR1, start=c(1990,1),frequency=12)
yRW<-ts(yRW, start=c(1990,1),frequency=12)
yAR2<-ts(yAR2,start=c(1990,1),frequency=12)
yslowAR<-ts(yslowAR,start=c(1990,1),frequency=12)
#Construct nonstationary series
ytrend<-yAR2+trend
yseason<-yAR1+3*season
ybreak<-yAR1+breakseries
tsgraphs<-list()
tsgraphs[[1]]<-autoplot(ytrend)+
labs(y="Value",title="Series with Trend")
tsgraphs[[2]]<-autoplot(yseason)+
labs(y="Value",title="Series with Seasonal Pattern")
tsgraphs[[3]]<-autoplot(ybreak)+
labs(y="Value",title="Series with Structural Break")
tsgraphs[[4]]<-autoplot(yRW)+
labs(y="Value",title="Random Walk",subtitle="Nonstationary because variance grows")
grid.arrange(grobs=tsgraphs,nrow=2,ncol=2) #Arrange In 2x2 grid
Stationary Series: Simulations
autoplot(yAR1,series="AR(1), b=0.7")+autolayer(yAR2,series="AR(2)")+autolayer(yslowAR,series="AR(1), b=0.98")+
labs(y="Value",title="Stationary Series")+
guides(colour=guide_legend(title="Series"))
Approximating the Mean
- Sample average is used to approximate mean of (functions of) time series
- Want \(\frac{1}{T}\sum_{t=1}^{T}g(y_t)-E_p g(y_t)\) “small”, in some sense, when \(T\) is big
- For this to be reasonable, need stationarity (and finite mean)
- Mean \(E_p g(y_t)\) exists, independently of time \(t\)
- Also need that, in a reasonably sized finite sample, the observed values do not tend to stay too far above or below the mean
- Know one class of distributions where this is true: \(p\) is independent and identically distributed
- Then have Law of Large Numbers
- \(\forall p\) iid with finite mean, for all \(c>0\) \(p(\left|\frac{1}{T}\sum_{t=1}^{T}g(y_t)-E_p g(y_t)\right|>c)\to 0\) as \(T\to\infty\)
Weak Dependence and Ergodicity
- Don’t need actual independence for law of large numbers
- Just that sequence of large values likely to be balanced out by subsequent series of small values
- Sufficient condition: weak dependence
- As time between observations gets longer, relationship gets weaker
- Today and tomorrow may be highly correlated, but today and 10 years from now “close to” independent
- A distribution \(p\) over sequences is said to be ergodic if law of large numbers holds: for all \(c>0\) \[p(\left|\frac{1}{T}\sum_{t=1}^{T}g(y_t)-E_p g(y_t)\right|>c)\to 0\text{ as }T\to\infty\]
- For any \(p\) in class \(\mathcal{P}\) of stationary and ergodic distributions, sample average is eventually close to mean
- For long enough series of observations
- Sufficient conditions for ergodicity are fast enough decay of certain measures of dependence
- Autocorrelation function decreasing rapidly to 0 is characteristic of distributions where approximation error is small
Evaluating Stationarity and Weak Dependence
- How far sample mean is from population mean controlled by dependence
- Variance: \(E_p[y_t^2]-E_p[y_t]^2\) measures how far a variable deviates from mean
- Variance of sample mean \(Var(\frac{1}{T}\sum_{t=1}^{T}z_t)=\frac{1}{T^2}\sum_{s=1}^{T}\sum_{t=1}^{T}Cov(z_t,z_{s})\)
- With stationarity, \(T\cdot Var(\frac{1}{T}\sum_{t=1}^{T}z_t)=Cov(z_1,z_1)+2\sum_{h=1}^{T-1}\frac{T-h}{T}Cov(z_1,z_{1+h})\overset{T\to\infty}{\to}\sum_{h=-\infty}^{+\infty}Cov(z_1,z_{1+h})\), the sum of the autocovariances
- If small, sample mean and population mean usually close
- If large or unbounded, the sample mean may be arbitrarily far away (see the simulation sketch after this list)
- Rules of thumb: autocovariances approximated by ACF, so
- If ACF falls to 0 exponentially fast, approximation is close
- If ACF goes to 0 but (polynomially) slowly, approximation works but need much more data
- If ACF decay looks like straight line, stationarity likely fails: approximation bad no matter how much data you have
- Consider distributions \(\mathcal{P}\) where for any bounded \(f,g\), \(\left|Corr(f(y_{t}),g(y_{t+h}))\right|\leq\rho(h)\) for some function \(\rho(h)\to0\)
- These are called \(\rho\)-mixing: Autocorrelation Function goes to 0
- Can replace correlation with other measures to get other types of mixing
- Ensures sufficient closeness for uniform bounds to work
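A simulation sketch (not from the original notes) of the point above: the variance of the sample mean shrinks with \(T\) for a weakly dependent stationary AR(1), but not for a random walk. The helper name one_sample_mean is illustrative; arima.sim from base R (stats) generates the AR(1) draws.
#Hedged sketch: sample-mean variance under weak vs. strong dependence
set.seed(2024)
one_sample_mean <- function(Tlen, rho = 0.7, rw = FALSE){
  if(rw){ y <- cumsum(rnorm(Tlen)) } else { y <- arima.sim(list(ar = rho), n = Tlen) } #random walk or stationary AR(1)
  mean(y)
}
for(Tlen in c(100, 1000)){
  var_ar <- var(replicate(2000, one_sample_mean(Tlen, rho = 0.7)))
  var_rw <- var(replicate(2000, one_sample_mean(Tlen, rw = TRUE)))
  cat("T =", Tlen, " Var of sample mean: AR(1)", round(var_ar, 4),
      " Random walk", round(var_rw, 1), "\n") #AR(1) variance shrinks with T; random walk variance does not
}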
ACFs of Example Series
ACFplots<-list()
ACFplots[[1]]<-ggAcf(yAR1)+
labs(title="Stationary AR(1)",subtitle="Exponential decay")
ACFplots[[2]]<-ggAcf(yAR2)+
labs(title="Stationary AR(2)",subtitle="Exponential decay")
ACFplots[[3]]<-ggAcf(yslowAR)+
labs(title="Barely Stationary AR(1)",subtitle="Slow but eventual decay")
ACFplots[[4]]<-ggAcf(yRW)+
labs(title="Random Walk",subtitle="Linear decay")
ACFplots[[5]]<-ggAcf(ybreak)+
labs(title="Series with Break",subtitle="Linear decay")
ACFplots[[6]]<-ggAcf(yseason)+
labs(title="Deterministic Seasonal Series",subtitle="Spikes do not decay")
ACFplots[[7]]<-ggAcf(ytrend)+
labs(title="Trending Series",subtitle="Linear Decay")
grid.arrange(grobs=ACFplots,nrow=4,ncol=2)
ACF of US Dollar Exchange Rates in 1970s
#Series ukus, jpus, geus were downloaded from FRED and normalized in the earlier chunk
currACFplots<-list()
currACFplots[[1]]<-ggAcf(jpus)+
labs(title="Autocorrelation Function of Dollar-Yen Exchange Rate",subtitle = "Linear rate of decline is characteristic of nonstationary series")
currACFplots[[2]]<-ggAcf(ukus)+
labs(title="Autocorrelation Function of Dollar-Pound Exchange Rate",subtitle = "Linear rate of decline is characteristic of nonstationary series")
currACFplots[[3]]<-ggAcf(geus)+
labs(title="Autocorrelation Function of Dollar-Deutschmark Exchange Rate",subtitle = "Linear rate of decline is characteristic of nonstationary series")
grid.arrange(grobs=currACFplots)
Conditions on Hypothesis Class
- For uniform error control need \(\mathcal{F}\) to be “not too complicated”
- Many measures of complexity used
- \(\left|\mathcal{F}\right|\) Number of hypotheses
- Compare method 1, method 2, method 3, etc
- If \(\log\left|\mathcal{F}\right|\approx T\), uniformity fails
- Note, hypothesis should not itself be based on choosing from many hypotheses
- Dimension of parameter space
- E.g., the AR(p) class \(\mathcal{F}=\{f(\mathcal{Y}_T)=b_0+b_1y_{T-1}+b_2y_{T-2}+\ldots+b_py_{T-p}:\ b\in\mathbb{R}^{p+1}\}\) has dimension \(p+1\)
- If dimension of parameter space \(\approx T\), uniformity fails
- Maximum correlation of predictor with random noise
- \(E_{\sigma}\underset{f\in\mathcal{F}}{\max}\frac{1}{T}\sum_{t=1}^T\sigma_tf(\mathcal{Y}_t)\), \(\sigma_t\overset{iid}{\sim}\{-1,+1\}\) w/ prob (1/2,1/2)
- Called Rademacher Complexity
- If some \(f\in\mathcal{F}\) can closely match iid random noise, expect low empirical risk even with no true forecasting ability (a Monte Carlo sketch follows this list)
- Choosing from too complicated a set of forecast rules means in sample fit can be good but out of sample fit terrible
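A Monte Carlo sketch (not from the original notes) of the last complexity measure: for AR(d) rules with coefficients bounded by 1, the inner maximum over coefficients has a simple closed form, and the estimated Rademacher complexity grows with d. The helper name rademacher_d and the bounded class are assumptions made for illustration.
#Hedged sketch: estimated Rademacher complexity of bounded AR(d) rules
#F_d = {b_0 + b_1*y_{T-1} + ... + b_d*y_{T-d} : all |b_j| <= 1}
#For this class, max over b of (1/T)*sum_t sigma_t*f(Y_t) equals
#|mean(sigma)| + sum_j |mean(sigma * j-th lag of y)|
set.seed(7)
Tlen <- 100; dmax <- 10; nsim <- 2000
y <- rnorm(Tlen + dmax)                  #iid data: no true forecastability
rademacher_d <- function(d){
  sigma <- sample(c(-1, 1), Tlen, replace = TRUE)      #Rademacher signs
  lags  <- sapply(1:d, function(j) y[(dmax + 1 - j):(dmax + Tlen - j)]) #lagged predictors
  abs(mean(sigma)) + sum(abs(colMeans(sigma * lags)))
}
round(sapply(1:dmax, function(d) mean(replicate(nsim, rademacher_d(d)))), 3) #grows with d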
A cautionary tale for overfitting
- Let \(\mathcal{F}\) be all polynomials of order \(T-h\) in \(y_{t}\)
- Can make forecast rule hit every point exactly in sample
- In sample empirical risk \(\underset{f\in\mathcal{F}}{\min}\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(y_{t}))\) is 0
- Now suppose true distribution is white noise and loss is quadratic
- IID Mean 0
- Risk minimizing forecast is trivial forecast
- \(\widehat{f}^{ERM}\) will be a degree-\(T-h\) polynomial with wild coefficients and 0 in-sample risk
- On any new sample, performance will typically be far worse than the trivial forecast (see the simulation sketch below)
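A simulation sketch (not part of the original notes) of this cautionary tale, using a degree-20 polynomial rather than a full interpolating one to keep the fit numerically stable: in-sample risk lands well below the oracle risk of 1, while the risk on a fresh draw is typically far above it.
#Hedged sketch: a high-order polynomial overfits white noise
set.seed(42)
Tlen <- 30; h <- 1
y <- rnorm(Tlen)                              #white noise: optimal forecast is 0, oracle risk 1
x <- y[1:(Tlen - h)]; target <- y[(1 + h):Tlen]
fit <- lm(target ~ poly(x, 20))               #degree-20 polynomial forecast rule
mean(residuals(fit)^2)                        #in-sample empirical risk: well below 1
ynew <- rnorm(Tlen)                           #a fresh sample from the same distribution
xnew <- ynew[1:(Tlen - h)]; targetnew <- ynew[(1 + h):Tlen]
pred <- predict(fit, newdata = data.frame(x = xnew))
mean((targetnew - pred)^2)                    #out-of-sample risk: typically far above 1
mean(targetnew^2)                             #the trivial forecast's loss, close to 1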
Controlling Generalization Error: Practical Advice
- I did not provide precise statement or proof of generalization error bound theorems
- Usually statement is very complicated, and provable bounds very loose
- Takeaway: overfitting of ERM is small when
- Distribution is stationary and correlations over time go to 0 quickly
- Model size not too large: small number of choices, or few parameters
- Sample size is large: for small model classes and fast enough decay, generalization error bounded by constant times \(\frac{1}{\sqrt{T}}\)
- ERM does fine in these situations
- Excess risk \(R_p(\widehat{f}^{ERM})-R_p(f^{*})\) and generalization error \(R_p(\widehat{f})-\widehat{R}(\widehat{f})\) are small
- Can do poorly if the model class is very complicated, the data is strongly dependent or nonstationary, or the sample is small
- In those cases, good in sample performance may still give bad out of sample performance in future predictions
What happens if these conditions fail?
- Generalization error bounds are upper bounds
- Sometimes error is smaller than predicted
- Generalization error is sometimes very small even for models with very high complexity
- When and why this happens is an active research area on “benign overfitting” and “double descent” (Hardt & Recht 2021, Ch. 6)
- But if no guarantee or large upper bound, generalization error can be very bad
- With nonstationarity, it may not become smaller as time grows
- Without weak dependence, the permitted error may be huge until \(T\) is very large
- ERM minimizes one component of risk, but total risk might not be minimized
- There are some cases where better options than ERM exist
- With complicated models, total risk may be worse for ERM than some other method
Simulations Showing Overfitting as Function of Complexity
- Consider AR(d) classes \(\mathcal{F}_d=\{f(\mathcal{Y}_T)=b_0+b_1y_{T-1}+b_2y_{T-2}+\ldots+b_dy_{T-d}:\ b\in\mathbb{R}^{d+1}\}\)
- Use square loss \(\ell(y,\widehat{y})=(y-\widehat{y})^2\)
- Empirical risk is Mean Squared Error (MSE)
- Optimal forecast is the conditional mean
- Simulate series \(\mathcal{Y}_T\) of length \(T\) from a distribution \(p\)
- For simplicity of exposition and analysis, let \(p\) be IID N(0,1)
- Stationary and as weakly dependent as possible, so we can focus on complexity only
- Best possible forecast is \(\widehat{y}=0\) and oracle risk is 1
- Run ERM over data for each \(\mathcal{F}_d\)
- Simulate many times to calculate average empirical risk and average population risk
- Average population risk \(E_p[(y_{T+1}-\widehat{f}^{ERM}(\mathcal{Y}_T))^2]\)
- \(=E_{\mathcal{Y}_T}[E_{y_{T+1}}[(y_{T+1}-\widehat{f}^{ERM}(\mathcal{Y}_T))^2|\mathcal{Y}_T]]=E_{\mathcal{Y}_T}[1+(\widehat{f}^{ERM}(\mathcal{Y}_T))^2]\approx 1+\frac{1}{N_{sims}}\sum_{n=1}^{N_{sims}}(\widehat{f}^{ERM}(\mathcal{Y}^{n}_T))^2\) by iid law of large numbers
- Average empirical risk approximated by \(\frac{1}{N_{sims}}\sum_{n=1}^{N_{sims}}MSE(\widehat{f}^{ERM}_n)\)
- Compute average generalization error by average difference of empirical and population risk
- Compute quantile (object bounded by theory) by quantile of difference over simulations
Simulation Results
#Tlist<-c(50,100,300,500,1000)
Tlist<-c(80)
plist<-c(1,2,3,5,8,12,18,25,30,35,40,45)
nsims<-500
generalizationerror<-matrix(nrow=length(Tlist),ncol=length(plist))
for(T in 1:length(Tlist)){
empiricalrisk<-matrix(nrow=nsims,ncol=length(plist))
fcast<-matrix(nrow=nsims,ncol=length(plist))
for(n in 1:nsims){
datasim<-ts(rnorm(Tlist[T]))
for(p in 1:length(plist)){
arfc<-ar(datasim,aic = FALSE,order.max = plist[p]) #Fit AR of fixed order p (ar() uses Yule-Walker by default, close to least-squares ERM here)
fcast[n,p]<-forecast(arfc,1)$mean
empiricalrisk[n,p]<-mean((arfc$resid)^2,na.rm =TRUE)
}
}
approxrisk<-1+colMeans(fcast^2)
#Population risk is average over data sets of conditional risk given data
#Conditional Risk given data is just mean squared error over y of fixed forecast yhat
#E_p(y-yhat)^2=1+yhat^2 using fact that true y is iid N(0,1)
averageemprisk<-colMeans(empiricalrisk)
generalizationerror[T,]<-approxrisk-averageemprisk
}
generrordist<-matrix(approxrisk,nrow=nsims,ncol=length(plist),byrow=TRUE)-empiricalrisk #Subtract per column; plain vector recycling would misalign entries
generrorq90<-numeric(length(plist))
for(p in 1:length(plist)){
generrorq90[p]<-quantile(generrordist[,p],0.9) #90% quantile of generalization error for model p
}
oraclerisk<-numeric(length(plist))+1 #Oracle risk=1 for all models, because true data iid N(0,1), so optimal forecast is 0, with MSE 1
#Collect into dataset format for plotting
generalizationgap<-as.vector(generalizationerror[1,])
errormeasures<-data.frame(plist,approxrisk,averageemprisk,generalizationgap,generrorq90,oraclerisk)
errortable<-errormeasures %>%
gather(approxrisk,averageemprisk,generalizationgap,generrorq90,oraclerisk,key="RiskMeasure",value="Risk")
#Plot Results
ggplot(errortable,aes(x=plist,y=Risk,color=RiskMeasure))+geom_point(size=2)+
scale_color_discrete(labels =
c("(Approximate) Population Risk", "Average Empirical Risk",
"Average Generalization Error","90% Quantile of Generalization Error", "Oracle Risk")) +
labs(title="Empirical vs Population Risk for AR(d) Models",
subtitle="T=80, True Data IID N(0,1), 500 draws of dataset simulated",
x="Order of Autoregressive Process",
y="Average Square Loss",
color="Risk Measure")
Interpretation
- Average empirical risk goes down with complexity measure \(d\)
- Complexity gives better past performance because we are looking at best past performance of any rule in \(\mathcal{F}\)
- A fixed rule may perform better or worse in future, with no reason to think it will be either
- But selecting based on past performance means finding rules that fit the past well, with no corresponding improvement in future performance
- Generalization error and excess risk relative to oracle rise with \(d\)
- More complicated model gives more opportunity for overfitting
- If we care about population risk, must trade off empirical and generalization errors
- First goes down, second goes up with complexity
- For this simulation, optimum is reached at 0 complexity
- This is because the optimal forecast for this distribution (the constant 0) already lies in the smallest class
- Not general: if more complexity improves fit, empirical risk can go down faster and optimum reached at some intermediate \(d\)
- In general, nothing forces tradeoff to be optimized at a “true” class \(\mathcal{F}\)
- Generalization error depends basically just on size of \(\mathcal{F}\)
- Adding lots of terrible rules to \(\mathcal{F}\) just makes generalization error worse
- Adding low-risk rules to \(\mathcal{F}\) makes generalization error worse, but can make total error better
Generalization Error Control: Alternatives
- One more way to learn about generalization error \(R_p(\widehat{f})-\widehat{R}(\widehat{f})\)
- Instead of an upper bound, produce an estimate
- \(\widehat{R}(\widehat{f})\) is known, so just need to estimate \(R_p(\widehat{f})\)
- If we have a good estimate of the generalization error, we can be confident that a rule with a small sum of empirical risk and estimated generalization error will perform well
- How to estimate \(R_p(\widehat{f})\)?
- \(\widehat{R}(\widehat{f})\) is an okay estimate if you already know generalization error is negligible
- But, especially for ERM, is extremely likely to be an underestimate
- When the same data is used both to estimate risk and to choose the model, the chosen model depends on the error in the risk estimate
- The resulting estimate of the generalization error is essentially zero, regardless of its true size
Validation in practice
- Better choice: Use independent sequence \(\{y_s\}_{s=1}^{S}\) with same distribution as \(\{y_t\}_{t=1}^T\), called a test set
- Estimate risk by the empirical test set risk \(\frac{1}{S-h}\sum_{s=1}^{S-h}\ell(y_{s+h},\widehat{f}(\mathcal{Y}_s))\)
- Since \(\widehat{f}\) independent of test set, if distribution stationary and ergodic, can bound approximation error by LLN
- Don’t need uniform bounds since only have one function
- Problem: Unless data known to be iid, can’t obtain such a test set
- Or if distribution known, as in simulations above
- Alternative 1: Blocking
- Split sample into 3 time ordered subsets: Training set, middle, and Test set
- Run ERM on training set, estimate risk on test set, discard the middle
- If data stationary and weakly dependent and middle set large, test set close to independent of training
- Requires discarding data, making sample sizes small
- Alternative 2: forecasting from a rolling origin (time series Cross Validation)
- Perform ERM on the first \(t\) observations and evaluate the loss at point \(t+h\), for each \(t\) (see the sketch after this list)
- Doesn’t use test data for optimization at any given time point
- Many ERM iterations required, and not independent since overlapping, so analysis difficult
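A minimal sketch of the rolling-origin alternative (not from the original notes), using forecast::tsCV on the normalized Deutschmark series; far2 is an assumed helper name that refits an AR(2) at each origin.
#Hedged sketch: rolling-origin evaluation with tsCV
far2 <- function(x, h){ forecast(Arima(x, order = c(2, 0, 0)), h = h) }  #refit AR(2) on data up to each origin
e_ar2  <- tsCV(geus, far2, h = 1)    #one-step-ahead rolling-origin errors, AR(2)
e_mean <- tsCV(geus, meanf, h = 1)   #same for the mean forecast
mean(e_ar2^2, na.rm = TRUE)          #estimated out-of-sample square loss (NAs where fit fails or no target)
mean(e_mean^2, na.rm = TRUE)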
Meese and Rogoff, Results
- Meese and Rogoff looked at large collection of hypothesis classes for forecasting 1970s exchange rate data
- Mostly multivariate models with form and variables inspired by 1970s economic theory
- Variety of parameter selection procedures, but mostly Empirical Risk Minimization
- Training set is 1971-1978, test set is 1978-81 (no middle block)
- Loss functions are MAE, RMSE
- Conclusions
- Big complicated models produced by economists do very badly
- Best performer is naive (random walk) forecast: \(\widehat{y}_{t+h}=y_t\)
- The random walk forecast even beats the AR forecast, despite being a special case of the AR forecast class (a toy replication sketch follows)
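A toy sketch (not the original study, which used multivariate structural models): compare the naive random walk forecast with an ERM-fit AR(2) on the normalized yen series, using a 1971-78 training window and the remainder as the test set, loosely mirroring the Meese-Rogoff split; which method wins here is left to the data.
#Hedged sketch: random walk vs. AR(2) on a 1971-78 / 1979-81 split
trainJP <- window(jpus, end = c(1978, 12))
testJP  <- window(jpus, start = c(1979, 1))
hJP <- length(testJP)
acc_rw <- accuracy(rwf(trainJP, h = hJP), testJP)                               #naive forecast: yhat_{t+h} = y_t
acc_ar <- accuracy(forecast(Arima(trainJP, order = c(2, 0, 0)), h = hJP), testJP)
acc_rw[, c("RMSE", "MAE")]  #training and test rows, random walk
acc_ar[, c("RMSE", "MAE")]  #training and test rows, AR(2)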
What does this tell us?
- What do ERM principles say about good classes of forecasting methods?
- Model should be simple, to reduce overfitting error
- Economic models have more parameters, whose values may fit to noise and reduce accuracy
- The fact that the AR model (with parameters to choose) does worse than one of its own special cases can only be attributed to overfitting
- For ERM to produce a good estimate, the data should be stationary and weakly dependent
- Does this apply to exchange rates? We can look at their autocorrelation functions
- Linear or slower decay occurs when series is trending or otherwise nonstationary
- See this in all series, suggesting caution about this assumption
- Approximation error should decrease in length \(T\) of the sample time series
- 8 years of data is moderate amount for monthly series
- But how quickly the error decreases with the sample size depends on the strength of dependence
- If dependence is strong, as the ACFs suggest, a much longer series is needed to reduce the error by the same amount
Interpretation of Results
- Many economists interpreted study as saying that random walk is best economic model of exchange rates
- Models based on other variables do not help forecast movements
- High persistence of series is what you would expect to see from a random walk
- More on this in a later class
- But the other models also have many features that lead their ERM forecasts to display large error
- This is true regardless of whether models are “true” in sense of providing better probability model of the data
- Forecasting and estimation are separate tasks
- If goal is to forecast, simplicity is a virtue!
Next topics
- More hypothesis classes
- Finagling stationarity and weak dependence out of clearly nonstationary or highly dependent data
References
- Moritz Hardt and Benjamin Recht. Ch. 6 “Generalization” in Patterns, predictions, and actions: A story about machine learning (2021) https://mlstory.org/
- Richard Meese and Kenneth Rogoff. “Empirical Exchange Rate Models of the Seventies: Do They Fit Out of Sample?” Journal of International Economics 14 (1983) pp. 3-24
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. “Foundations of Machine Learning” 2e MIT Press 2018
- Textbook treatment of statistical learning
- Mehryar Mohri and Afshin Rostamizadeh “Rademacher Complexity Bounds for Non-I.I.D. Processes” Advances in Neural Information Processing Systems (NIPS 2008). pp 1097-1104, Vancouver, Canada, 2009. MIT Press.
- Extension of textbook statistical learning results to time series data
- Bin Yu. “Rates of Convergence for Empirical Processes of Stationary Mixing Sequences” The Annals of Probability, Vol. 22, No. 1 (Jan., 1994), pp. 94-116
- Foundational paper on when ERM works for time series data