Statistical Learning: Review
- Usual Setup
- Data \(\mathcal{Y}_{T+h}=\{y_{t}\}_{t=1}^{T+h}\)
- Goal is to produce forecast \(\widehat{y}_{T+h}\) of \(y_{T+h}\) using forecasting rule \(\widehat{y}_{T+h}=f(\mathcal{Y}_{T})\)
- Statistical machine learning approach builds on probability model
- Series \(\mathcal{Y}_{T+h}\) comes from a true but unknown probability distribution \(p\in\mathcal{P}\)
- Want to obtain small risk \(R_{p}(f)=E_{p}\ell(y_{T+h},\widehat{y}_{T+h})\) for any true \(p\in\mathcal{P}\)
- Goal is to do well relative to one of two possible comparisons
- Oracle risk of class \(\mathcal{F}\) of forecasting rules: \(R^{*}_{p}(\mathcal{F})=\underset{f\in\mathcal{F}}{\inf}R_{p}(f)\)
- Empirical Risk \(\widehat{R}(f)=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))\)
- A forecasting rule \(\widehat{f}\) should, for any \(p\in\mathcal{P}\), have small values of
- Excess risk \(R_p(\widehat{f})-R^{*}_{p}(\mathcal{F})\)
- Generalization Error \(R_p(\widehat{f})-\widehat{R}(\widehat{f})\)
Empirical Risk Minimization
- Given loss function \(\ell(\cdot,\cdot)\) and class of forecasting rules \(\mathcal{F}\), a useful method is the Empirical Risk Minimizer \[\widehat{f}^{ERM}=\underset{f\in\mathcal{F}}{\arg\min}\ \widehat{R}(f)\]
- Uses the rule in the class with the best past performance (a minimal sketch follows this list)
- This is dangerous! In general, this need not give good results.
- When is this a good idea?
- Only if \(\mathcal{P}\) and \(\mathcal{F}\) satisfy very particular conditions
- Previous class: both excess risk and generalization error of \(\widehat{f}^{ERM}\) are bounded above by (constant times)
- Uniform Error \(\underset{f\in\mathcal{F}}{\sup}\left|\widehat{R}(f)-R_p(f)\right|\)
- Look for conditions on \(\mathcal{P}\) and \(\mathcal{F}\) so that \(\forall p\in\mathcal{P}\) Uniform Error is small
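As a minimal illustration of ERM over a finite class, the sketch below simulates data, computes the empirical risk of three simple forecasting rules under square loss at horizon \(h=1\), and picks the best past performer. This sketch is not from the original notes: the simulated AR(1) data, the particular rules, and the starting index of 12 are arbitrary illustrative choices.
#Minimal ERM sketch: finite class of three fixed forecasting rules, square loss, h=1
set.seed(123)
y<-as.numeric(arima.sim(model=list(ar=0.7),n=200)) #Simulated stationary AR(1) data
h<-1
Tn<-length(y)
rules<-list(
  naive=function(past) tail(past,1),         #yhat = last observed value
  mean12=function(past) mean(tail(past,12)), #yhat = mean of last 12 observations
  zero=function(past) 0                      #yhat = 0, the unconditional mean of the simulated process
)
emprisk<-sapply(rules,function(f){
  errs<-sapply(12:(Tn-h),function(t) y[t+h]-f(y[1:t])) #Start at t=12 so every rule is defined
  mean(errs^2)                                         #Empirical risk of rule f
})
emprisk
names(which.min(emprisk)) #The ERM choice: the rule with the best past performance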
Distribution conditions
- Stationarity means that shifting backwards or forwards in time doesn’t change distribution
- \(\forall\mathcal{T}\subset\mathbb{N}\), \(\forall{h}\), for any event \(A\), \(p(\{y_t\}_{t\in\mathcal{T}}\in A)=p(\{y_{t+h}\}_{t\in\mathcal{T}}\in A)\)
- Rules out deterministic seasonality, trends, breaks or shifts
- If you see these, good chance that it fails!
- Ergodicity means the sample average is usually close to the population mean in large samples (illustrated in the sketch after this list)
- \(p\left(\left|\frac{1}{T}\sum_{t=1}^{T}g(y_t)-E_p g(y_t)\right|>c\right)\to 0\text{ as }T\to\infty\)
- Weak dependence strengthens this: it gives an error bound on the sample mean that holds at finite \(T\)
- Rules out series where cyclical fluctuations last “too long” relative to length T of sample
- Requires values far apart in time to be close to independent
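The sketch below, which is not part of the original notes (the AR(1) parameter and the series length are arbitrary choices), plots running sample means: for a stationary, ergodic AR(1) they settle near the population mean of zero, while for a random walk they keep drifting.
library(ggplot2) #Plotting (loaded here because the main library calls come later)
set.seed(99)
Tl<-500
ar1<-as.numeric(arima.sim(model=list(ar=0.7),n=Tl)) #Stationary, ergodic series with mean 0
rw<-cumsum(rnorm(Tl))                               #Random walk: nonstationary, sample mean keeps drifting
runmean<-function(x) cumsum(x)/seq_along(x)         #Sample mean of the first t observations
means<-data.frame(t=rep(1:Tl,2),
  mean=c(runmean(ar1),runmean(rw)),
  series=rep(c("Stationary AR(1)","Random Walk"),each=Tl))
ggplot(means,aes(x=t,y=mean,color=series))+geom_line()+
  labs(title="Running Sample Means",y="Sample mean of first t observations")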
Nonstationary Series: Simulations
library(knitr) #Tables
library(fpp2) #Forecast commands
library(kableExtra) #More tables
library(tidyverse) #Data manipulation
library(gridExtra) #Graph Display
tl<-240 #Length of simulated series
set.seed(55) #Initialize random number generator for reproducibility
e1<-ts(rnorm(tl)) #standard normal shocks
e2<-ts(rnorm(tl)) #standard normal shocks
e3<-ts(rnorm(tl)) #standard normal shocks
e4<-ts(rnorm(tl)) #standard normal shocks
rho<-0.7 #AR1 parameter
#AR2 parameters
b0<-3
b1<-0.4
b2<-0.3
ss<-0.98 #Large AR parameter
yAR1<-numeric(tl)
yAR1[1]<-rnorm(1)/sqrt(1-rho^2) #Initialize with a draw from the stationary distribution (sd=1/sqrt(1-rho^2))
yRW<-numeric(tl)
yslowAR<-numeric(tl)
for(s in 1:(tl-1)){
#Stationary AR(1)
yAR1[s+1]<-rho*yAR1[s]+e1[s]
#Stationary but close to nonstationary AR(1)
yslowAR[s+1]<-ss*yslowAR[s]+e4[s]
#Random walk
yRW[s+1]<-yRW[s]+e3[s]
}
yAR2<-numeric(tl)
yAR2[1]<-3/0.3 #Initialize at unconditional mean b0/(1-b1-b2)=10
yAR2[2]<-3/0.3 #Initialize at unconditional mean
for(s in 1:(tl-2)){
#Stationary AR(2)
yAR2[s+2]=b0+b1*yAR2[s+1]+b2*yAR2[s]+e2[s]
}
trend<-0.05*seq(1,tl) #Increasing trend
season<-cos((2*pi/12)*seq(1,tl)) #Seasonal pattern
#Jump at particular time
breakseries<-ts(numeric(tl),start=c(1990,1),frequency=12)
for(s in 1:(tl)){
if(s>180){
breakseries[s]<-6
}
}
#Represent as time series
yAR1<-ts(yAR1, start=c(1990,1),frequency=12)
yRW<-ts(yRW, start=c(1990,1),frequency=12)
yAR2<-ts(yAR2,start=c(1990,1),frequency=12)
yslowAR<-ts(yslowAR,start=c(1990,1),frequency=12)
#Construct nonstationary series
ytrend<-yAR2+trend
yseason<-yAR1+3*season
ybreak<-yAR1+breakseries
tsgraphs<-list()
tsgraphs[[1]]<-autoplot(ytrend)+
labs(y="Value",title="Series with Trend")
tsgraphs[[2]]<-autoplot(yseason)+
labs(y="Value",title="Series with Seasonal Pattern")
tsgraphs[[3]]<-autoplot(ybreak)+
labs(y="Value",title="Series with Structural Break")
tsgraphs[[4]]<-autoplot(yRW)+
labs(y="Value",title="Random Walk",subtitle="Nonstationary because variance grows")
grid.arrange(grobs=tsgraphs,nrow=2,ncol=2) #Arrange in 2x2 grid
Stationary Series: Simulations
autoplot(yAR1,series="AR(1), b=0.7")+autolayer(yAR2,series="AR(2)")+autolayer(yslowAR,series="AR(1), b=0.98")+
labs(y="Value",title="Stationary Series")+
guides(colour=guide_legend(title="Series"))
Evaluating Stationarity and Weak Dependence
- How far sample mean is from population mean controlled by dependence
- Variance: \(E_p[y_t^2]-E_p[y_t]^2\) measures how far a variable deviates from mean
- Variance of sample mean \(Var(\frac{1}{T}\sum_{t=1}^{T}z_t)=\frac{1}{T^2}\sum_{s=1}^{T}\sum_{t=1}^{T}Cov(z_t,z_{s})\)
- With stationarity, \(T\cdot Var\left(\frac{1}{T}\sum_{t=1}^{T}z_t\right)=Cov(z_1,z_1)+2\sum_{h=1}^{T-1}\frac{T-h}{T}Cov(z_1,z_{1+h})\overset{T\to\infty}{\to}\sum_{h=-\infty}^{+\infty}Cov(z_1,z_{1+h})\), the sum of all autocovariances (see the sketch after this list)
- If small, sample mean and population mean usually close
- If large or unbounded, sample mean may be arbitrarily far away
- Rules of thumb: autocovariances approximated by ACF, so
- If ACF falls to 0 exponentially fast, approximation is close
- If ACF goes to 0 but (polynomially) slowly, approximation works but need much more data
- If ACF decay looks like straight line, stationarity likely fails: approximation bad no matter how much data you have
- Consider distributions \(\mathcal{P}\) where for any bounded \(f,g\), \(Corr(f(y_{t}),g(y_{t+h}))<\rho(h)\) for some function \(\rho(h)\to0\)
- These are called \(\rho\)-mixing: Autocorrelation Function goes to 0
- Can replace correlation with other measures to get other types of mixing
- Ensures sufficient closeness for uniform bounds to work
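As a numerical check on the variance formula above, the sketch below (not in the original notes; the AR(1) parameter \(\rho=0.7\), the sample length, and the truncation lag of 30 are arbitrary choices) compares \(T\cdot Var\) of the sample mean across many simulated AR(1) series with the theoretical long-run variance \(1/(1-\rho)^2\) and with a truncated sum of sample autocovariances from a single series.
#Long-run variance of an AR(1): simulation vs theory vs sum of sample autocovariances
set.seed(101)
rho<-0.7
Tl<-240
nsim<-2000
samplemeans<-replicate(nsim,mean(arima.sim(model=list(ar=rho),n=Tl)))
Tl*var(samplemeans) #Simulated T*Var(sample mean)
1/(1-rho)^2         #Theoretical long-run variance with unit-variance shocks
y<-arima.sim(model=list(ar=rho),n=Tl)
gammahat<-acf(y,lag.max=30,type="covariance",plot=FALSE)$acf #Sample autocovariances, lags 0 to 30
gammahat[1]+2*sum(gammahat[-1]) #Truncated sum of autocovariances from one series (noisy estimate)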
ACFs of Example Series
ACFplots<-list()
ACFplots[[1]]<-ggAcf(yAR1)+
labs(title="Stationary AR(1)",subtitle="Exponential decay")
ACFplots[[2]]<-ggAcf(yAR2)+
labs(title="Stationary AR(2)",subtitle="Exponential decay")
ACFplots[[3]]<-ggAcf(yslowAR)+
labs(title="Barely Stationary AR(1)",subtitle="Slow but eventual decay")
ACFplots[[4]]<-ggAcf(yRW)+
labs(title="Random Walk",subtitle="Linear decay")
ACFplots[[5]]<-ggAcf(ybreak)+
labs(title="Series with Break",subtitle="Linear decay")
ACFplots[[6]]<-ggAcf(yseason)+
labs(title="Deterministic Seasonal Series",subtitle="Spikes do not decay")
ACFplots[[7]]<-ggAcf(ytrend)+
labs(title="Trending Series",subtitle="Linear Decay")
grid.arrange(grobs=ACFplots,nrow=4,ncol=2)
ACF of US Dollar Exchange Rates in 1970s
library(fpp2)
library(fredr) #Access FRED, the Federal Reserve Economic Data
#To access data by API, using R rather than going to the website and downloading
#You need to request a key: you can do so at https://research.stlouisfed.org/docs/api/api_key.html
fredr_set_key("8782f247febb41f291821950cf9118b6") #Key I obtained for this class
#Get 1970s data
#Let's get the (main) series for the Yen-Dollar Exchange rate
EXJPUS<-fredr(series_id = "EXJPUS",observation_end=as.Date("1982-01-04"))
#You can also download this series at https://fred.stlouisfed.org/series/DEXJPUS
#Let's get the (main) series for the Dollar-Pound Exchange rate
EXUSUK<-fredr(series_id = "EXUSUK",observation_end=as.Date("1982-01-04"))
#Let's get the (main) series for the German Deutschmark-Dollar Exchange rate
EXGEUS<-fredr(series_id = "EXGEUS",observation_end=as.Date("1982-01-04"))
#Convert to time series and normalize so the 1971-1 value is 1; the pound rate is quoted in dollars per pound, so it is inverted to put all three series in foreign currency per dollar
ukus<-ts(EXUSUK$value[1]/EXUSUK$value,start=c(1971,1),frequency=12)
jpus<-ts(EXJPUS$value/EXJPUS$value[1],start=c(1971,1),frequency=12)
geus<-ts(EXGEUS$value/EXGEUS$value[1],start=c(1971,1),frequency=12)
currACFplots<-list()
currACFplots[[1]]<-ggAcf(jpus)+
labs(title="Autocorrelation Function of Dollar-Yen Exchange Rate",subtitle = "Linear rate of decline is characteristic of nonstationary series")
currACFplots[[2]]<-ggAcf(ukus)+
labs(title="Autocorrelation Function of Dollar-Pound Exchange Rate",subtitle = "Linear rate of decline is characteristic of nonstationary series")
currACFplots[[3]]<-ggAcf(geus)+
labs(title="Autocorrelation Function of Dollar-Deutschmark Exchange Rate",subtitle = "Linear rate of decline is characteristic of nonstationary series")
grid.arrange(grobs=currACFplots)
Conditions on Hypothesis Class
- For uniform error control need \(\mathcal{F}\) to be “not too complicated”
- Many measures of complexity used
- \(\left|\mathcal{F}\right|\) Number of hypotheses
- Compare method 1, method 2, method 3, etc
- If \(\log\left|\mathcal{F}\right|\approx T\), uniformity fails
- Note, hypothesis should not itself be based on choosing from many hypotheses
- Dimension of parameter space
- Eg, AR(p) class \(\mathcal{F}=\{f(\mathcal{Y}_T)=b_0+b_1y_{T-1}+b_2y_{T-2}+\ldots+b_py_{T-p}:\ b\in\mathbb{R}^{p+1}\}\) has dimension \(p+1\)
- If dimension of parameter space \(\approx T\), uniformity fails
- Expected maximum correlation of a predictor with random noise
- \(E_{\sigma}\left[\underset{f\in\mathcal{F}}{\sup}\frac{1}{T}\sum_{t=1}^T\sigma_tf(\mathcal{Y}_t)\right]\), where \(\sigma_t\overset{iid}{\sim}\) uniform on \(\{-1,+1\}\)
- Called Rademacher Complexity
- If some \(f\in\mathcal{F}\) can match iid random noise well, expect good in-sample fit even with no true forecasting ability
- Choosing from too complicated a set of forecast rules means in-sample fit can be good but out-of-sample fit terrible (a Monte Carlo sketch follows below)
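The Monte Carlo sketch below is not from the original notes: it approximates the Rademacher complexity of a finite class of simple lag rules \(f_j(\mathcal{Y}_t)=y_{t-j}\), \(j=1,\ldots,d\), on one simulated series, and shows the complexity growing with the size of the class; the lag-rule class, the zero-padding of initial values, and the simulated series itself are illustrative assumptions.
#Monte Carlo estimate of Rademacher complexity for finite classes of lag rules
set.seed(42)
Tl<-80
y<-as.numeric(arima.sim(model=list(ar=0.7),n=Tl)) #One simulated data set
radcomplexity<-function(d,nsim=2000){
  preds<-sapply(1:d,function(j) c(rep(NA,j),y)[1:Tl]) #Column j holds the lag-j predictor y_{t-j}
  preds<-cbind(preds)                                 #Ensure matrix form even when d=1
  preds[is.na(preds)]<-0                              #Pad initial values with 0 for simplicity
  mean(replicate(nsim,{
    sigma<-sample(c(-1,1),Tl,replace=TRUE)            #iid Rademacher signs
    max(colMeans(sigma*preds))                        #sup over the d rules of (1/T) sum sigma_t f(Y_t)
  }))
}
sapply(c(1,2,4,8,16),radcomplexity) #Complexity grows with the number of rules in the class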
Controlling Generalization Error: Practical Advice
- I did not provide precise statement or proof of generalization error bound theorems
- Usually statement is very complicated, and provable bounds very loose
- Takeaway: overfitting of ERM is small when
- Distribution is stationary and correlations over time go to 0 quickly
- Model size not too large: small number of choices, or few parameters
- Sample size is large: for small model classes and fast enough decay, generalization error is bounded by a constant times \(\frac{1}{\sqrt{T}}\) (see the sketch after this list)
- ERM does fine in these situations
- Excess risk \(R_p(\widehat{f}^{ERM})-R^{*}_{p}(\mathcal{F})\) and generalization error \(R_p(\widehat{f})-\widehat{R}(\widehat{f})\) are small
- Can do poorly if the model class is very complicated, the data are strongly dependent or nonstationary, or the sample is small
- In those cases, good in sample performance may still give bad out of sample performance in future predictions
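To illustrate the \(1/\sqrt{T}\) point, the sketch below (not part of the original simulations; the grid of sample sizes and the AR(2) class are arbitrary choices) repeats the style of the experiment later in this section but varies \(T\) for a fixed small class: the average generalization error of ERM shrinks as the sample grows.
#Average generalization error of ERM over a fixed AR(2) class, as the sample size T grows
set.seed(321)
Ts<-c(50,100,200,400,800) #Grid of sample sizes
nsimsT<-300 #Simulations per sample size
avggenerror<-numeric(length(Ts))
for(i in 1:length(Ts)){
  generr<-numeric(nsimsT)
  for(n in 1:nsimsT){
    ysim<-ts(rnorm(Ts[i])) #True distribution: iid N(0,1)
    fit<-ar(ysim,aic=FALSE,order.max=2) #Fit AR(2), as in the simulation later in this section
    fc<-as.numeric(forecast(fit,1)$mean) #One-step forecast from the fitted rule
    generr[n]<-(1+fc^2)-mean(fit$resid^2,na.rm=TRUE) #Population risk given the data minus empirical risk
  }
  avggenerror[i]<-mean(generr)
}
data.frame(T=Ts,avg_generalization_error=avggenerror)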
What happens if these conditions fail?
- Generalization error bounds are upper bounds
- Sometimes error is smaller than predicted
- But if no guarantee or large upper bound, generalization error can be very bad
- With nonstationarity, it may not become smaller as time grows
- Without weak dependence, permitted error may be huge until T huge
- ERM minimizes one component of risk, but total risk might not be minimized
- There are some cases where better options than ERM exist
- With complicated models, total risk may be worse for ERM than some other method
Simulations Showing Overfitting as Function of Complexity
- Consider AR(d) classes \(\mathcal{F}_d=\{f(\mathcal{Y}_T)=b_0+b_1y_{T-1}+b_2y_{T-2}+\ldots+b_dy_{T-d}:\ b\in\mathbb{R}^{d+1}\}\)
- Use square loss \(\ell(y,\widehat{y})=(y-\widehat{y})^2\)
- Empirical risk is Mean Squared Error (MSE)
- Optimal forecast is the conditional mean
- Simulate series \(\mathcal{Y}_T\) of length \(T\) from a distribution \(p\)
- For simplicity of exposition and analysis, let \(p\) be IID N(0,1)
- Stationary and as weakly dependent as possible, so we can focus on complexity only
- Best possible forecast is \(\widehat{y}=0\) and oracle risk is 1
- Run ERM over the data for each \(\mathcal{F}_d\)
- Simulate many times to calculate average empirical risk and average population risk
- Average population risk \(E_p[(y_{T+1}-\widehat{f}^{ERM}(\mathcal{Y}_T))^2]\)
- \(=E_{\mathcal{Y}_T}[E_{y_{T+1}}[(y_{T+1}-\widehat{f}^{ERM}(\mathcal{Y}_T))^2|\mathcal{Y}_T]]=E_{\mathcal{Y}_T}[1+(\widehat{f}^{ERM}(\mathcal{Y}_T))^2]\approx 1+\frac{1}{N_{sims}}\sum_{n=1}^{N_{sims}}(\widehat{f}^{ERM}(\mathcal{Y}^{n}_T))^2\) by iid law of large numbers
- Average empirical risk approximated by \(\frac{1}{N_{sims}}\sum_{n=1}^{N_{sims}}MSE(\widehat{f}^{ERM}_n)\)
- Compute average generalization error by average difference of empirical and population risk
- Compute quantile (object bounded by theory) by quantile of difference over simulations
Simulation Results
#Tlist<-c(50,100,300,500,1000)
Tlist<-c(80)
plist<-c(1,2,3,5,8,12,18,25,30,35,40,45)
nsims<-500
generalizationerror<-matrix(nrow=length(Tlist),ncol=length(plist))
for(T in 1:length(Tlist)){
empiricalrisk<-matrix(nrow=nsims,ncol=length(plist))
fcast<-matrix(nrow=nsims,ncol=length(plist))
for(n in 1:nsims){
datasim<-ts(rnorm(Tlist[T]))
for(p in 1:length(plist)){
arfc<-ar(datasim,aic = FALSE,order.max = plist[p])
fcast[n,p]<-forecast(arfc,1)$mean
empiricalrisk[n,p]<-mean((arfc$resid)^2,na.rm =TRUE)
}
}
approxrisk<-1+colMeans(fcast^2)
#Population risk is average over data sets of conditional risk given data
#Conditional Risk given data is just mean squared error over y of fixed forecast yhat
#E_p(y-yhat)^2=1+yhat^2 using fact that true y is iid N(0,1)
averageemprisk<-colMeans(empiricalrisk)
generalizationerror[T,]<-approxrisk-averageemprisk
}
#Per-simulation generalization error: conditional population risk (1+fcast^2) minus empirical risk
generrordist<-(1+fcast^2)-empiricalrisk
generrorq90<-numeric(length(plist))
for(p in 1:length(plist)){
generrorq90[p]<-quantile(generrordist[,p],0.9) #90% quantile of generalization error for model p
}
oraclerisk<-numeric(length(plist))+1 #Oracle risk=1 for all models, because true data iid N(0,1), so optimal forecast is 0, with MSE 1
#Collect into dataset format for plotting
generalizationgap<-as.vector(generalizationerror[1,])
errormeasures<-data.frame(plist,approxrisk,averageemprisk,generalizationgap,generrorq90,oraclerisk)
errortable<-errormeasures %>%
gather(approxrisk,averageemprisk,generalizationgap,generrorq90,oraclerisk,key="RiskMeasure",value="Risk")
#Plot Results
ggplot(errortable,aes(x=plist,y=Risk,color=RiskMeasure))+geom_point(size=2)+
scale_color_discrete(labels =
c("(Approximate) Population Risk", "Average Empirical Risk",
"Average Generalization Error","90% Quantile of Generalization Error", "Oracle Risk")) +
labs(title="Empirical vs Population Risk for AR(d) Models",
subtitle="T=80, True Data IID N(0,1), 500 draws of dataset simulated",
x="Order of Autoregressive Process",
y="Average Square Loss",
color="Risk Measure")
Interpretation
- Average empirical risk goes down with complexity measure \(d\)
- Complexity gives better past performance because we are looking at best past performance of any rule in \(\mathcal{F}\)
- A fixed rule may perform better or worse in future, with no reason to think it will be either
- But selecting based on past performance picks out rules that did well in sample, while their future performance is unchanged, so the selected rule tends to look better in sample than it will out of sample
- Generalization error and excess risk relative to oracle rise with \(d\)
- More complicated model gives more opportunity for overfitting
- If we care about population risk, must tradeoff empirical and generalization errors
- First goes down, second goes up with complexity
- For this simulation, optimum is reached at 0 complexity
- This is because data distribution has minimal risk in 0 parameter class
- Not general: if more complexity improves fit, empirical risk can go down faster and optimum reached at some intermediate \(d\)
- In general, nothing forces tradeoff to be optimized at a “true” class \(\mathcal{F}\)
- Generalization error depends basically just on size of \(\mathcal{F}\)
- Adding lots of terrible rules to \(\mathcal{F}\) just makes generalization error worse
- Adding low risk rules to \(\mathcal{F}\) makes generalization error worse, but total error better
Generalization Error Control: Alternatives
- One more way to learn about generalization error \(R_p(\widehat{f})-\widehat{R}(\widehat{f})\)
- Instead of an upper bound, produce an estimate
- \(\widehat{R}(\widehat{f})\) is known, so just need to estimate \(R_p(\widehat{f})\)
- If we have good estimate of generalization error, can be confident that an estimator with low empirical risk + estimated generalization error will perform well
- How to estimate \(R_p(\widehat{f})\)?
- \(\widehat{R}(\widehat{f})\) is an okay estimate if you already know generalization error is negligible
- But, especially for ERM, it is extremely likely to be an underestimate
- When the same data are used for estimating risk and choosing the model, the chosen model depends on the error in the risk estimate
- Plugging in \(\widehat{R}(\widehat{f})\) then gives a generalization error estimate of exactly zero, which is uninformative
Validation in practice
- Better choice: Use independent sequence \(\{y_s\}_{s=1}^{S}\) with same distribution as \(\{y_t\}_{t=1}^T\), called a test set
- Estimate risk by the empirical test set risk \(\frac{1}{S-h}\sum_{s=1}^{S-h}\ell(y_{s+h},\widehat{f}(\mathcal{Y}_s))\)
- Since \(\widehat{f}\) independent of test set, if distribution stationary and ergodic, can bound approximation error by LLN
- Don’t need uniform bounds since only have one function
- Problem: Unless data known to be iid, can’t obtain such a test set
- Or if distribution known, as in simulations above
- Alternative 1: Blocking
- Split sample into 3 time ordered subsets: Training set, middle, and Test set
- Run ERM on training set, estimate risk on test set, discard the middle
- If data stationary and weakly dependent and middle set large, test set close to independent of training
- Requires discarding data, making sample sizes small
- Alternative 2: forecasting from a rolling origin (time series cross validation); see the sketch after this list
- Perform ERM on the first \(t\) observations and evaluate the loss at point \(t+h\), for each \(t\)
- Doesn’t use test data for optimization at any given time point
- Many ERM iterations required, and not independent since overlapping, so analysis difficult
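Here is a sketch of the rolling-origin approach using tsCV from the forecast package (loaded above via fpp2). The choices of series (the simulated yAR1 from earlier), the AR(2) specification, and the one-step horizon are assumptions for illustration, not a prescribed setup; tsCV refits the forecast function on the data up to each origin and records the h-step-ahead error.
far2<-function(x,h){forecast(Arima(x,order=c(2,0,0)),h=h)} #AR(2) forecast function for tsCV
e_naive<-tsCV(yAR1,naive,h=1) #Rolling one-step errors for the naive rule
e_ar2<-tsCV(yAR1,far2,h=1)    #Rolling one-step errors for the AR(2) rule
mean(e_naive^2,na.rm=TRUE)    #Estimated risk of the naive rule
mean(e_ar2^2,na.rm=TRUE)      #Estimated risk of the AR(2) rule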
Meese and Rogoff, Revisited
- Meese and Rogoff looked at large collection of hypothesis classes for forecasting 1970s exchange rate data
- Mostly multivariate models with form and variables inspired by 1970s economic theory
- Variety of parameter selection procedures, but mostly Empirical Risk Minimization
- Training set is 1971-1978, test set is 1978-81 (no middle block)
- Loss functions are MAE, RMSE
- Conclusions
- Big complicated models produced by economists do very badly
- Best performer is naive (random walk) forecast: \(\widehat{y}_{t+h}=y_t\)
- Random walk forecast performs even better than AR forecast, even though the random walk is a special case of the AR class (a stylized version of this comparison is sketched below)
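A stylized version of this training/test comparison, using the Yen series loaded earlier, is sketched below. The split dates and the AR(2) specification are simplifications for illustration, not the paper's exact models, and the resulting numbers need not match its findings.
train<-window(jpus,end=c(1978,12)) #Training sample, through 1978
test<-window(jpus,start=c(1979,1)) #Test sample
h<-length(test)
naivefc<-naive(train,h=h)              #Random walk forecast: yhat_{T+h}=y_T
arfit<-ar(train,aic=FALSE,order.max=2) #AR(2) fit on the training sample
arfc<-forecast(arfit,h=h)
c(NaiveRMSE=sqrt(mean((as.numeric(test)-as.numeric(naivefc$mean))^2)), #Test-set RMSE of the naive rule
  ARRMSE=sqrt(mean((as.numeric(test)-as.numeric(arfc$mean))^2)))       #Test-set RMSE of the AR(2) rule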
What does this tell us?
- What do ERM principles say about good classes of forecasting methods?
- Model should be simple, to reduce overfitting error
- Economic models have more parameters, whose values may fit to noise and reduce accuracy
- Fact that AR model (with parameter to choose) does worse than special case of AR model can only be due to overfitting
- For ERM to produce a good estimate, data should be stationary and weakly dependent
- Does this apply to exchange rates? We can look at their autocorrelation functions
- Linear or slower decay occurs when series is trending or otherwise nonstationary
- See this in all series, suggesting caution about this assumption
- Approximation error should decrease in length \(T\) of the sample time series
- 8 years of data is moderate amount for monthly series
- But how fast the error decreases with \(T\) depends on the strength of dependence
- If dependence is strong, as the ACFs suggest, a much longer series is needed to reduce the error by the same amount
Interpretation of Results
- Many economists interpreted study as saying that random walk is best economic model of exchange rates
- Models based on other variables do not help forecast movements
- High persistence of series is what you would expect to see from a random walk
- More on this in a later class
- But other models also have many features that make ERM forecast display large error
- This is true regardless of whether models are “true” in sense of providing better probability model of the data
- Forecasting and estimation are separate tasks
- If goal is to forecast, simplicity is a virtue!
Next class
- More hypothesis classes
- Finagling stationarity and weak dependence out of clearly nonstationary or highly dependent data
References
- Richard Meese and Kenneth Rogoff. “Empirical Exchange Rate Models of the Seventies: Do They Fit Out of Sample?” Journal of International Economics 14 (1983), pp. 3-24
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. “Foundations of Machine Learning.” MIT Press, 2012
- Textbook treatment of statistical learning
- Mehryar Mohri and Afshin Rostamizadeh. “Rademacher Complexity Bounds for Non-I.I.D. Processes.” Advances in Neural Information Processing Systems (NIPS 2008), pp. 1097-1104, Vancouver, Canada, 2009. MIT Press
- Extension of textbook statistical learning results to time series data
- Bin Yu. “Rates of Convergence for Empirical Processes of Stationary Mixing Sequences.” The Annals of Probability, Vol. 22, No. 1 (Jan. 1994), pp. 94-116
- Foundational paper on when ERM works for time series data