Regularization
- Review of Generalization Problem and ERM
- Structural Risk Minimization Principle
- Cross Validation
- Penalized Estimation
- Penalization vs testing/model evaluation
Risk Minimization
- In statistical learning approach, goal is to produce rule \(f\in\mathcal{F}\) with low risk for true distribution
- \(R_p(f)=E_p\ell(y_{T+h},f(\mathcal{Y}_{T}))\)
- Empirical Risk Minimizer based on minimizing risk for empirical distribution \(\widehat{R}(f)=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))\)
- The difference between what we want and what we achieve is the generalization error
- \(R_p(\widehat{f})=\widehat{R}(\widehat{f})+(R_p(\widehat{f})-\widehat{R}(\widehat{f}))\)
- Data can tell you whether first term is small: minimized by ERM
- Theory required to ensure second term also small
- Can we do better by being more explicit about generalization error?
- Sometimes not much: “Fundamental Theorem of Machine Learning” (Mohri et al 2012)
- Under same conditions that ERM has strong performance bound, no method can do “substantially better”
- In particular, if \(\mathcal{F}\) has “low complexity”, excess risk, which is already small, cannot be made much smaller
- Does this mean we may as well always go with ERM?
- Not really! Especially for small samples \(T\) or classes with “high complexity”
Structural Risk Minimization
- Ideal method would take into account both empirical risk and generalization error
- Impossible to do exactly since generalization error part not known
- But it can be approximated or bounded!
- Idea: Find a term related to generalization error, called a penalty \(\mathcal{J}(f)\)
- Use a forecasting rule, called a Structural Risk Minimizer which minimizes empirical risk plus penalty \[\widehat{f}^{SRM}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))+\mathcal{J}(f))\]
- If penalty exactly equals generalization error, this is exactly the oracle estimator
- If penalized risk is a better approximation than the empirical risk, may still outperform ERM
- So, how do we find a good penalty?
- Many proposed methods
- Today: A selection of popular penalties and discussion of their properties (a minimal code sketch of the penalized objective follows below)
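- As a rough illustration (not from the lecture code), the SRM objective for a linear one-step-ahead rule \(f(\mathcal{Y}_t)=b_0+b_1y_t\) with square loss and an L2 penalty could be minimized numerically as follows; the function name srm_ar1 and the penalty form are illustrative assumptions
#Minimal SRM sketch: empirical risk plus an L2 penalty for a linear rule b0+b1*y_t
srm_ar1 <- function(y, lambda, h = 1) {
  n <- length(y)
  penalized_risk <- function(b) {
    preds <- b[1] + b[2] * y[1:(n - h)]     #forecasts of y_{t+h} from y_t
    mean((y[(1 + h):n] - preds)^2) +        #empirical risk under square loss
      lambda * b[2]^2                       #penalty J(f)
  }
  optim(c(0, 0), penalized_risk)$par        #numerically minimize empirical risk + penalty
}
#Hypothetical example call: srm_ar1(diff(log(gdp)), lambda = 0.1)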
Cross Validation
- We have already seen cross validation as a way to assess forecast quality of a forecasting method \(\widehat{f}(\mathcal{Y}_{t})\)
- Construct a forecasting rule on a training sample and estimate its risk on a different test sample
- May do this for many test and training samples, then average the results
- E.g. risk estimated from a rolling forecast origin (cf tsCV) \[\widehat{R}^{tsCV}(\widehat{f})=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},\widehat{f}(\mathcal{Y}_{t}))\]
- Many versions exist: test and training samples can be blocks, weighted in different ways, etc
- Eg: K-fold CV for ERM over independent data \(\{(y_t,z_t)\}_{t=1}^{T}\)
- Split sample into K sets \(\mathcal{T}_k\) of size \(T/K\)
- For \(k=1\ldots K\) estimate \(\widehat{f}^{(-k)}=\underset{f\in\mathcal{F}}{\arg\min}\frac{1}{T-T/K}\sum_{t\notin\mathcal{T}_k}\ell(y_t,f(z_t))\)
- \(\widehat{R}^{kfoldCV}=\frac{1}{T}\sum_{k=1}^{K}\sum_{t\in\mathcal{T}_k}\ell(y_t,\widehat{f}^{(-k)}(z_t))\)
- For \(K=T\), called “Leave One Out” Cross Validation
- CV reduces bias from overfitting by using different samples for estimation and evaluation (a minimal sketch follows below)
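- A minimal sketch of the K-fold estimate above, assuming iid data in a data frame dat with columns y and z and a linear rule fit by lm (all names illustrative); with equal fold sizes the average below matches the formula above
#Minimal K-fold CV sketch for ERM with square loss and a linear rule
kfoldCVrisk <- function(dat, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(dat)))  #randomly assign observations to folds
  fold_errs <- sapply(1:K, function(k) {
    fit <- lm(y ~ z, data = dat[folds != k, ])       #ERM on all folds except k
    mean((dat$y[folds == k] - predict(fit, newdata = dat[folds == k, ]))^2) #loss on held-out fold k
  })
  mean(fold_errs)                                    #average held-out loss across folds
}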
Cross validation over multiple hypothesis classes
- Cross validation can also be used as a way to choose between forecasting methods
- Simply choose the method that minimizes the cross-validation estimate of risk
- If methods \(\widehat{f}\) are just fixed functions, for tsCV this is just ERM again
- Case where CV is useful: when we have multiple hypothesis classes of rules \(\{\mathcal{F}_c:c\in\mathcal{C}\}\)
- \(c\) is a tuning parameter affecting some aspect of the procedure, often a measure of complexity
- Eg for AR(p) classes, \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^{p}\beta_{j}y_{t-j}:\ \beta\in\mathbb{R}^{p+1}\}\), tuning parameter is \(p\)
- For each \(c\in\mathcal{C}\), run ERM over \(\mathcal{F}_c\)
\[\widehat{f}^{c}(\mathcal{Y}_t)=\underset{f\in\mathcal{F}_c}{\arg\min}\frac{1}{t-h}\sum_{s=1}^{t-h}\ell(y_{s+h},f(\mathcal{Y}_{s}))\]
- Choose \(\widehat{c}^{CV}\) by minimizing Cross-Validation risk of \(\widehat{f}^{c}()\) \[\widehat{c}^{CV}=\underset{c\in\mathcal{C}}{\arg\min}\widehat{R}^{CV}(\widehat{f}^{c})\]
- Forecast using \(\widehat{f}^{\widehat{c}^{CV}}\) (a tsCV-based sketch of this procedure follows below)
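- A rough sketch of this procedure for the AR(p) example, using tsCV from the forecast package (the series y and the candidate orders 1 to 8 are placeholders, not the lecture’s code)
#Choose AR order p by rolling forecast origin (tsCV) risk estimates
library(forecast)
arforecast <- function(x, h, p) {
  forecast(Arima(x, order = c(p, 0, 0)), h = h)  #ERM over the AR(p) class, then forecast
}
cvrisk <- sapply(1:8, function(p) {
  e <- tsCV(y, arforecast, h = 1, p = p)         #one-step-ahead rolling-origin errors
  mean(e^2, na.rm = TRUE)                        #tsCV estimate of risk for this order
})
p.cv <- which.min(cvrisk)                        #tuning parameter chosen by cross validation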
Evaluating Cross Validation
- Cross validation over multiple classes not the same as ERM because \(\widehat{f}^{c}\) is not a fixed function but depends on the data
- Often, classes are ordered: \(\mathcal{F}_{c_1}\subset \mathcal{F}_{c_2}\) if \(c_1<c_2\)
- Compare small classes of low complexity to big classes of high complexity: More lags, more covariates, etc
- Low complexity classes may have high empirical risk but low error on test set, vice versa for high complexity classes
- Cross validation accounts for these differences: won’t always pick the highest complexity class
- Still have some approximation error, because test set distribution \(\neq\) future population distribution
- Error increases in complexity of \(\mathcal{C}\), which is usually small: one or two parameters
- Worse when training and test samples are dependent: helped slightly by using tsCV over k-fold
- In general, provides fairly good approximation to generalization error but can be extremely costly to compute
- Solve an optimization problem for each \(c\in\mathcal{C}\) for each \(t\) in sample
- Can be hundreds or thousands of times as many computations as a single ERM forecast
- Variants or special cases which are less costly but provide similar or not much worse performance may be preferred
Evaluation and implementation
- AIC and BIC designed to pick a rule with low risk
- BIC has slightly larger penalty: downgrades more complex models more
- Under some very strong conditions, AIC “close to” true risk, at least when T is large
- BIC instead designed to pick “true model” when T is large
- For forecasting we usually care about the former rather than the latter goal
- For this reason, the Hyndman & Athanasopoulos text recommends AIC
- But be cautious: result requires large samples, and in small samples AIC can be bad measure of risk
- For this reason, Diebold text recommends BIC, as slightly larger penalty accounts for chance of overfitting from misestimation of risk
- Both estimates usually provide worse measure of risk than cross validation
- Sometimes faster: \(\approx T\) times fewer optimizations
- Exception is linear models, where known formula simplifies CV calculations
- Automatically reported with Arima, VAR, ets, etc, or computed ex post by CV
- AIC also part of auto.arima, to choose between orders of AR (usage sketch below)
- VARselect computes AIC/BIC for VAR over loss \(\log\det(\sum_{t=1}^{T-h}(y_{t+h}-\widehat{y}_{t})(y_{t+h}-\widehat{y}_{t})^\prime)\)
Parameter size based penalties
- Penalization based on the number of parameters requires solving the model for each parameter set \(c\in\mathcal{C}\)
- Fine if there are only a small number: once per lag order of AR/VAR up to the max lag; hard if the sets \(\mathcal{F}_c\) are every possible subset of p regressors: \(2^p\) regressions
- Works if best predictions come from ignoring all but a subset of predictors, and leaving rest unrestricted
- Alternative: allow all predictors to matter, but reduce total size of parameters by penalizing their norm
- Linear hypothesis classes \(\mathcal{F}=\{\beta_0+\sum_{j=1}^m\beta_j z_j:\ \beta\in\mathbb{R}^{m+1}\}\), penalty is \(\lambda\Vert\beta\Vert^p_p\)
- Norm \(\Vert\beta\Vert_p\) is a measure of “size” of coefficients, e.g.
- L1 (LASSO) Penalty \(\Vert\beta\Vert_1=\sum_{j=0}^{m}|\beta_j|\)
- L2 (Ridge) Penalty \(\Vert\beta\Vert^2_2=\sum_{j=0}^{m}\beta_j^2\)
- Mixed L1/L2 (Elastic Net) Penalty \(\Vert\beta\Vert_{1/2,a}=a\Vert\beta\Vert^1_1+(1-a)\Vert\beta\Vert^2_2\) for \(a\in[0,1]\)
- \(\lambda\geq 0\) decides size of penalty term
- Large value means strong restrictions: coefficients shrink a lot and class much simpler
- Small value allows more complexity, at cost of greater overfitting risk. 0 means no restrictions: estimator is just ERM
- Penalized estimator solves \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert\beta\Vert_p^p)\) (a closed form for the ridge case is given below)
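- For intuition, in the special case of square loss and the L2 (ridge) penalty, with predictors stacked in a matrix \(Z\) and the intercept ignored for simplicity, the penalized minimizer has the standard closed form \[\widehat{\beta}^{\lambda}=\left(\frac{1}{T-h}Z^{\prime}Z+\lambda I\right)^{-1}\frac{1}{T-h}Z^{\prime}Y\] so larger \(\lambda\) pulls all coefficients toward 0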
Evaluation and Implementation
- Norm penalized estimators address overfitting by preventing large fluctuations in values of parameters
- Type of norm controls which kinds of fluctuations are more or less restricted
- L2 penalty more strongly affected by very large values, less by small values
- All values partially shrink
- L1 penalty equally strongly affected by small and large values
- Allows some larger coefficients, but shrinks smaller ones a lot, often to exactly 0: sparsity
- Acts like penalties on # of parameters, but non-0 coefficients also slightly shrunk
- Elastic net in between, by amount controlled by \(a\): small terms go to 0, but big terms also shrunk
- For a fixed penalty size \(\lambda\) computation is easy: one instead of many optimizations
- Scale problem: how to choose \(\lambda\)?
- Typically, find \(\widehat{f}^\lambda\) for many \(\lambda\) and select \(\widehat{\lambda}^{CV}\) by cross validation
- Efficient LARS algorithm makes computation for all \(\lambda\) fast in linear model
- Implemented for linear hypotheses and square loss, cross-entropy, multivariate square loss, etc in glmnet (usage sketch below)
- cv.glmnet implements cross validation parameter selection, but not a time series version
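- Hedged sketch of the glmnet interface (the predictor matrix x and response y are placeholders): the argument alpha plays the role of the mixing weight \(a\) above, up to glmnet’s exact parameterization
library(glmnet)
lassofit <- glmnet(x, y, alpha = 1)    #L1 (LASSO) penalty, fit over a path of lambda values
ridgefit <- glmnet(x, y, alpha = 0)    #L2 (ridge) penalty path
enetfit <- glmnet(x, y, alpha = 0.5)   #elastic net mixture
cvfit <- cv.glmnet(x, y, alpha = 1)    #k-fold CV over the lambda path (not a time series version)
coef(cvfit, s = "lambda.min")          #coefficients at the CV-minimizing lambda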
What kinds of model classes?
- Penalizing size of parameters can substitute for using smaller number of parameters
- Can include more predictors for same level of generalization error if penalized
- Can use many predictors, or nonlinear functions of predictors: polynomials, splines, etc
- Benefit is better ability to predict and get close to oracle predictor
- Cost is that values of parameters get distorted
- How complicated can model class be?
- Depends on the number of possible relevant predictors and on how big and complicated a model within that class is needed
- LASSO can handle more parameters than data points, if most are (close to) 0
- Up to exponentially many if very sparse and large \(\lambda\) chosen
- Ridge can handle arbitrary number of predictors
- But effects of each become progressively more muted
- In practice, use LASSO with many weakly related predictors, of which only some might matter
- Use ridge with many, possibly related predictors, each of which contributes a little bit to the forecast
- Often believable for lags, so common to use in (V)AR estimates
Function based penalties
- Rather than penalizing parameters of predictor \(f()\), you can penalize \(f\) directly
- Penalized estimator solves, for \(\Vert f\Vert\) some norm on \(f\) and \(\lambda\geq0\),
- \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert f\Vert)\)
- For classes \(\mathcal{F}\) defined by parameters, like linear hypothesis class, little essential difference
- (Usually) equivalent to some penalty on parameters
- But if class \(\mathcal{F}\) isn’t dependent on a (finite) set of parameters, eg \(\mathcal{F}=\{\text{all functions of Z}\}\), can be very different
- Most prominent example: integrated squared second derivative penalty \(\Vert f\Vert=\int f^{\prime\prime}(z)^2dz\)
- Penalizes “wiggliness” regardless of particular shape: minimal for straight line
- Amazing fact (cf Wahba 1990): penalized minimizer over class of all functions has known, computable form
- Called a “Smoothing Spline”: piecewise cubic polynomial between data points
- Equivalent to ridge penalty over (T-dimensional) class of piecewise cubic polynomials
- Implemented as smooth.spline, as well as in specialized functions in the gam and mgcv libraries
- Choose \(\lambda\) by CV again: fast algorithm known for non-time series case
Smoothing Splines: GDP growth from Lagged GDP growth
library(fpp2) #Forecasting
library(glmnet) #Lasso and Ridge
library(vars) #Vector Autoregressions
library(fredr) #Access FRED Data
##Obtain and transform NIPA Data (cf Lecture 08)
fredr_set_key("8782f247febb41f291821950cf9118b6") #Key I obtained for this class
#US GDP components
GDP<-fredr(series_id = "GDP",
observation_start = as.Date("1947-01-01")) #Gross Domestic Product
#US GDP components from NIPA tables (cf http://www.bea.gov/national/pdf/nipaguid.pdf)
PCEC<-fredr(series_id = "PCEC",
observation_start = as.Date("1947-01-01")) #Personal consumption expenditures
FPI<-fredr(series_id = "FPI",
observation_start = as.Date("1947-01-01")) #Fixed Private Investment
CBI<-fredr(series_id = "CBI",
observation_start = as.Date("1947-01-01")) #Change in Private Inventories
NETEXP<-fredr(series_id = "NETEXP",
observation_start = as.Date("1947-01-01")) #Net Exports of Goods and Services
GCE<-fredr(series_id = "GCE",
observation_start = as.Date("1947-01-01")) #Government Consumption Expenditures and Gross Investment
#Format the series as quarterly time series objects, starting at the first date
gdp<-ts(GDP$value,frequency = 4,start=c(1947,1),names="Gross Domestic Product")
pcec<-ts(PCEC$value,frequency = 4,start=c(1947,1),names="Consumption")
fpi<-ts(FPI$value,frequency = 4,start=c(1947,1),names="Fixed Investment")
cbi<-ts(CBI$value,frequency = 4,start=c(1947,1),names="Inventory Growth")
invest<-fpi+cbi #Private Investment
netexp<-ts(NETEXP$value,frequency = 4,start=c(1947,1),names="Net Exports")
gce<-ts(GCE$value,frequency = 4,start=c(1947,1),names="Government Spending")
#Convert to log differences to ensure stationarity and collect in frame
NIPAdata<-ts(data.frame(diff(log(gdp)),diff(log(pcec)),diff(log(invest)),diff(log(gce))),frequency = 4,start=c(1947,2))
# Construct lags
l1NIPA<-window(stats::lag(NIPAdata,-1),start=c(1949,2),end=c(2018,3))
l2NIPA<-window(stats::lag(NIPAdata,-2),start=c(1949,2),end=c(2018,3))
l3NIPA<-window(stats::lag(NIPAdata,-3),start=c(1949,2),end=c(2018,3))
l4NIPA<-window(stats::lag(NIPAdata,-4),start=c(1949,2),end=c(2018,3))
l5NIPA<-window(stats::lag(NIPAdata,-5),start=c(1949,2),end=c(2018,3))
l6NIPA<-window(stats::lag(NIPAdata,-6),start=c(1949,2),end=c(2018,3))
l7NIPA<-window(stats::lag(NIPAdata,-7),start=c(1949,2),end=c(2018,3))
l8NIPA<-window(stats::lag(NIPAdata,-8),start=c(1949,2),end=c(2018,3))
#convert to matrices
xvarsdf<-data.frame(l1NIPA,l2NIPA,l3NIPA,l4NIPA,l5NIPA,l6NIPA,l7NIPA,l8NIPA)
xvars<-as.matrix(xvarsdf)
NIPAwindow<-matrix(window(NIPAdata,start=c(1949,2),end=c(2018,3)),length(l1NIPA[,1]),4)
#Spline models, yeah?
library(gam) #loads smoothing spline package
lagDlogGDP<-xvars[,1]
DlogGDP<-NIPAwindow[,1]
#Smoothing spline Regression of GDP on lagged GDP
# Default behavior fits penalty parameter by generalized Cross Validation
# Predict GDP growth using smoothing spline in lagged GDP growth
gdp.spl<-smooth.spline(lagDlogGDP,DlogGDP)
splinepredictions<-predict(gdp.spl)
- \(\lambda=\) 0.573289 chosen by (non-time series) cross validation gives almost exactly linear fit
#Build Plot
gdpdata<-data.frame(xvars[,1],NIPAwindow[,1],splinepredictions$x,splinepredictions$y)
colnames(gdpdata)<-c("LDlGDP","DlGDP","xspline","yspline")
ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
geom_line(aes(x=xspline,y=yspline),color="red",size=1)+
ggtitle("Smoothing Spline Fit of This vs Last Quarter's GDP Growth",
subtitle="Almost exactly linear due to large optimal penalty for nonlinearity")+
xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")
# An alternate approach which uses "gam" command to fit spline
# Advantage is that it's built in to ggplot
# Disadvantage is that it uses only default options, so not customizable
# In this case, results are exactly the same
# ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
# ggtitle("Spline and OLS Fit of This vs Last Quarter's GDP Growth",
# subtitle="There are two lines which exactly overlap")+
# xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")+
# geom_smooth(method=gam,se=FALSE,color="blue")+
# geom_smooth(method=lm,se=FALSE,color="red")
Exotic penalties
- The example penalties mentioned above are just the most common ones
- New and different penalties can be used to emphasize different features of the prediction or type of data used
- Thousands in statistics/ML literature, and new ones every day
- Parameter-based penalties can emphasize different sets of parameters or features of size or shape
- Function based penalties can look at fine or large scale features, or multiple variables
- Cases which allow finite computable solution are called (reproducing) kernel methods
- Shared principle: want to trade off flexible form to match patterns in observed data against size and complexity of class
- Different prediction tasks will call for different shapes, sizes of data, so specialized penalties can help
Penalties versus Bounds
- Alternately, can just run ERM over predictor class \(\mathcal{F}\) with restriction that \(\Vert f \Vert \leq c\) for some c
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F}, \Vert f \Vert\leq c}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t)))\)
- E.g. only look at functions with norm of predictors bounded \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^d\beta_jz_j:\ \beta\in\mathbb{R}^{d+1}, \Vert\beta\Vert_p\leq c\}\)
- Can choose \(c\) by cross-validation, just like with penalty
- Claim: ERM over class with bound imposed is numerically identical to some penalized empirical risk minimizer
- More formally, for any \(c>0\), there is some \(\widehat{\lambda}(c)\) such that
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\widehat{\lambda}(c)\Vert f\Vert)\)
- This is an application of the Lagrange multiplier theorem from constrained optimization
- Slight difference is that \(\widehat{\lambda}(c)\) may depend on the data, whereas it is fixed in penalized case
- Though since \(c\) and \(\lambda\) usually chosen based on data anyway, even this usually not a difference
- Thus, penalization inherits properties of ERM, but with a new, restricted class of predictors
- Explains variety of possible penalties: just different functions of the data (a numerical illustration of the correspondence follows below)
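- As a rough numerical illustration of this correspondence (simulated data, not the lecture’s GDP example): along a ridge path the coefficient norm shrinks as \(\lambda\) grows, so each bound \(c\) is attained by some penalty level
#Simulated illustration: coefficient norm along a glmnet ridge path falls as lambda rises
library(glmnet)
set.seed(42)
nsim <- 200; psim <- 10
zsim <- matrix(rnorm(nsim * psim), nsim, psim)           #simulated predictors
ysim <- as.vector(zsim %*% rnorm(psim)) + rnorm(nsim)    #simulated outcome
ridgepath <- glmnet(zsim, ysim, alpha = 0)               #ridge fits over a grid of lambda values
betanorm <- sqrt(colSums(as.matrix(ridgepath$beta)^2))   #L2 norm of beta(lambda) along the path
plot(log(ridgepath$lambda), betanorm, type = "l",
     xlab = "log(lambda)", ylab = "L2 norm of coefficients")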
Regularization for GDP Forecasting
library(dplyr) #Data manipulation
library(knitr) #Use knitr to make tables
library(kableExtra) #Extra options for tables
library(tibble) #More data manipulation
#Calculate information criteria
IC<-VARselect(NIPAdata,lag.max=8)
# Fit Optimal VARs by AIC, BIC
AICVAR<-VAR(NIPAdata,p=IC$selection[1])
BICVAR<-VAR(NIPAdata,p=IC$selection[3])
#Construct 1 period ahead forecasts
AICpred<-forecast(AICVAR,h=1)$forecast
BICpred<-forecast(BICVAR,h=1)$forecast
#Extract point forecasts
AICfcsts<-select(data.frame(AICpred),ends_with("Forecast"))
BICfcsts<-select(data.frame(BICpred),ends_with("Forecast"))
#Fit LASSO VAR in all variables up to 8 lags
lassovar<-glmnet(xvars,NIPAwindow,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#lassovar$lambda
#Show coefficients for large value of lambda (heavy penalization, most coefs gone)
#coef(lassovar,s=0.02)
#Show coefficients for intermediate value of lambda (medium penalization, some coefs gone)
#coef(lassovar,s=0.008)
#Show coefficients for small value of lambda (little penalization, most coefs remain)
#coef(lassovar,s=0.001)
## Fit LASSO VAR with 10-fold cross-validation
lassovarCV<-cv.glmnet(xvars,NIPAwindow,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda<-lassovarCV$lambda.min
#Baroque construction to take last period and format it for predict command
#Somehow need 2 rows in prediction value or it throws a fit
#Presumably because matrix coerced to vector and then it thinks it can't multiply anymore
newX<-t(as.matrix(data.frame(as.vector(t(NIPAwindow[270:277,])), as.vector(t(NIPAwindow[271:278,])))))
#Predict 2018Q3 series
cvlassopredictions<-window(ts(predict(lassovarCV,newx = newX)[,,1],start=c(2018,3),frequency=4),start=c(2018,4))
#Fit L2 Penalized VAR in all variables up to 8 lags
ridgevar<-glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#ridgevar$lambda
#Show coefficients for large value of lambda (heavy penalization, coefs shrink a lot)
#coef(ridgevar,s=10)
#Show coefficients for intermediate value of lambda (medium penalization, coefs shrink a little)
#coef(ridgevar,s=1)
#Show coefficients for small value of lambda (little penalization, coefs close to OLS values)
#coef(ridgevar,s=0.01)
## Fit Ridge VAR with 10-fold cross-validation
ridgevarCV<-cv.glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda2<-ridgevarCV$lambda.min
#Predict 2018Q3 series
cvridgepredictions<-window(ts(predict(ridgevarCV,newx = newX)[,,1],start=c(2018,3),frequency=4),start=c(2018,4))
# #Plot Data Series
# autoplot(window(NIPAdata,start=c(2012,4)))+
# autolayer(cvlassopredictions)+
# autolayer(cvridgepredictions)
# autolayer(AICpred$diff.log.gdp..$mean)+
# autolayer(BICpred$diff.log.gdp..$mean)
- Let’s go back to our GDP forecasting example using history of changes in log of \(Y\), \(C\), \(I\), \(G\)
- Predictor class will be Vector Autoregressions in up to 8 lags of all 4 variables
- 33 parameters per equation, 132 total
- Estimate by AIC, BIC, LASSO, and Ridge
- AIC selects 5 lags, BIC selects 1 lag, reflecting smaller penalty for AIC
- 10-fold cross validation selects \(\lambda=\) 0.0026361 for LASSO, \(\lambda=\) 0.2138204 for ridge
- LASSO keeps 32 nonzero parameters, while ridge chooses non-zero value for all 132
- Compare 8 for BIC, 24 for AIC, but different choices from LASSO
#Collect 2018Q4 Forecasts into a data frame
var1table<-data.frame(
t(cvridgepredictions),
t(cvlassopredictions),
t(AICfcsts),
t(BICfcsts)
)
rownames(var1table)<-c("GDP","Consumption","Investment","Government")
#Make Table of Estimated Coefficients
kable(var1table,
col.names=c("Ridge","LASSO","AIC","BIC"),
caption="2018Q4 Forecasts from Different Methods") %>%
kable_styling(bootstrap_options = "striped", font_size = 16)
2018Q4 Forecasts from Different Methods

|             | Ridge     | LASSO     | AIC       | BIC       |
|-------------|-----------|-----------|-----------|-----------|
| GDP         | 0.0135764 | 0.0136399 | 0.0114672 | 0.0146535 |
| Consumption | 0.0145109 | 0.0150744 | 0.0114084 | 0.0137662 |
| Investment  | 0.0141675 | 0.0096423 | 0.0147074 | 0.0200288 |
| Government  | 0.0112001 | 0.0129239 | 0.0109967 | 0.0142961 |
Ridge Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(ridgevarCV)$y1),
as.matrix(coef(ridgevarCV)$y2),
as.matrix(coef(ridgevarCV)$y3),
as.matrix(coef(ridgevarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(ridgevarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = spec_color(x))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in Ridge Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in Ridge Regression (columns GDP, C, I, G): full table of estimates omitted; all 132 lag coefficients are nonzero but shrunk toward zero
LASSO Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(lassovarCV)$y1),
as.matrix(coef(lassovarCV)$y2),
as.matrix(coef(lassovarCV)$y3),
as.matrix(coef(lassovarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(lassovarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in LASSO Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in LASSO Regression (columns GDP, C, I, G): full table of estimates omitted; only 32 of the 132 lag coefficients are nonzero, the rest are exactly 0
AIC Coefficients
#Make Coefficient Table
zzz<-data.frame(AICVAR$varresult$diff.log.gdp..$coefficients,
AICVAR$varresult$diff.log.pcec..$coefficients,
AICVAR$varresult$diff.log.invest..$coefficients,
AICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(AICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in AIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in AIC Regression (columns GDP, C, I, G): full table of estimates omitted; the AIC-selected VAR(5) has 5 lags of each variable plus a constant in each equation
BIC Coefficients
#Make Coefficient Table
zzz<-data.frame(BICVAR$varresult$diff.log.gdp..$coefficients,
BICVAR$varresult$diff.log.pcec..$coefficients,
BICVAR$varresult$diff.log.invest..$coefficients,
BICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(BICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in BIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in BIC Regression (columns GDP, C, I, G): full table of estimates omitted; the BIC-selected VAR(1) has 1 lag of each variable plus a constant in each equation
Evaluating Autoregressive Forecasts
- Forecast residuals from optimal forecast \(\widehat{e}_{t+1}=y_{t+1}-\widehat{y}_{t+1}\) are white noise in autoregressive model
- This is in fact true whenever the forecast rule is the Bayes predictor under square loss, for any stationary distribution
- Holds for VARs, regression, etc
- So if you are using a rule which is approximately the Bayes rule, residuals should be close to white noise
- ACF of \(\widehat{e}_{t+1}\) should be close to 0 everywhere
- Use checkresiduals to inspect and compare various measures of distance
- Usually little reason to believe chosen family contains Bayes forecast rule
- May not be exactly linear, or may depend on information further back than number of lags used
- Or, even if it did, little reason to believe that the approximation is close enough that all properties of the Bayes forecast hold for the approximate rule
- These two issues are in conflict
- Can always increase complexity of hypothesis class to try to get closer to Bayes forecaster
- But doing so can worsen overfitting, making rule actually chosen further away
- Residual checks measure closeness of fit, but are unrelated to degree of overfitting
- Generally not a reliable tool for finding a good forecast, as opposed to checking ex post
- Prefer measures which account for all types of error
Application to GDP Series in BIC VAR Prediction
checkresiduals(BICVAR$varresult$diff.log.gdp..) #GDP Growth forecast residual diagnostic
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 36.01, df = 10, p-value = 8.386e-05
- Patterns in the residual plot and ACF, and the formal test of white noise, suggest the VAR(1) residuals are not particularly close to white noise
- Suggests patterns in data may differ somewhat from VAR(1) model: in-sample fit could be improved
- To decide whether this or another forecast should be used, should compare loss measures
Conclusions
- ERM does well for choosing within simple hypothesis classes
- Minimizing penalized criterion is a sensible way to improve accounting for overfitting
- Based on number (AIC/BIC) or size (L1/L2) of parameters or functions
- When choosing between multiple procedures, including ERM over multiple classes, can use measure of risk for each
- Penalization acts like restricting size of parameter class
- Can choose penalty by cross-validating
- Goal of regularization is to trade off in-sample and out-of-sample fit
- Other tools exist for finding the “right” descriptive model
References
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
- Grace Wahba. Spline Models for Observational Data. SIAM, 1990.
- Overview of smoothing splines and penalization; explains how and why one can work with all differentiable functions