Regularization
- Review of Generalization Problem and ERM
- Structural Risk Minimization Principle
- Cross Validation
- Penalized Estimation
Risk Minimization
- In statistical learning approach, goal is to produce rule \(f\in\mathcal{F}\) with low risk for true distribution
- \(R_p(f)=E_p\ell(y_{T+h},f(\mathcal{Y}_{T}))\)
- Empirical Risk Minimizer based on minimizing risk for empirical distribution \(\widehat{R}(f)=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))\)
- The difference between what we want and what we achieve is the generalization error
- \(R_p(\widehat{f})=\widehat{R}(\widehat{f})+(R_p(\widehat{f})-\widehat{R}(\widehat{f}))\)
- Data can tell you whether first term is small: minimized by ERM
- Theory required to ensure second term also small
- Can we do better by being more explicit about generalization error?
- Sometimes not much: “Fundamental Theorem of Machine Learning” (Mohri et al 2018)
- Under same conditions that ERM has strong performance bound, no method can do “substantially better”
- In particular, if \(\mathcal{F}\) has “low complexity”, excess risk, which is already small, cannot be made much smaller
- Does this mean we may as well always go with ERM?
- Not really! Especially for small samples \(T\) or classes with “high complexity”
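- A minimal sketch (my illustration, assuming square loss and \(h=1\)): ERM over the linear AR(1) class is just least squares of \(y_t\) on \(y_{t-1}\)
#ERM with square loss over the AR(1) class reduces to OLS on the first lag
erm.ar1<-function(y){
  T<-length(y)
  fit<-lm(y[2:T]~y[1:(T-1)]) #minimizes empirical squared-error risk over beta
  function(ylast){unname(coef(fit)[1]+coef(fit)[2]*ylast)} #the forecast rule f
}
#fhat<-erm.ar1(as.numeric(lynx)); fhat(tail(lynx,1)) #one-step-ahead forecast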
Structural Risk Minimization
- Ideal method would take into account both empirical risk and generalization error
- Impossible to do exactly since generalization error part not known
- But it can be approximated or bounded!
- Idea: Find a term related to generalization error, called a penalty \(\mathcal{J}(f)\)
- Use a forecasting rule, called a Structural Risk Minimizer which minimizes empirical risk plus penalty \[\widehat{f}^{SRM}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))+\mathcal{J}(f))\]
- If penalty exactly equals generalization error, this is exactly the oracle estimator
- If penalized risk is a better approximation than the empirical risk, may still outperform ERM
- So, how do we find a good penalty?
- Many proposed methods
- Today: A selection of popular penalties and discussion of their properties
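- A minimal numerical sketch of the idea (my own illustration, assuming square loss, an AR(1) class, and L2 penalty \(\mathcal{J}(f)=\lambda\beta_1^2\)):
#Structural risk minimization sketch: empirical risk plus penalty, minimized numerically
srm.ar1<-function(y,lambda){
  T<-length(y)
  penrisk<-function(b){mean((y[2:T]-b[1]-b[2]*y[1:(T-1)])^2)+lambda*b[2]^2}
  optim(c(0,0),penrisk)$par #(beta0,beta1) minimizing the penalized empirical risk
}
#srm.ar1(as.numeric(lynx),lambda=0.1) #larger lambda shrinks beta1 toward 0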
Cross Validation
- We have already seen cross validation as a way to assess forecast quality of a forecasting method \(\widehat{f}(\mathcal{Y}_{t})\)
- Use a forecasting rule constructed on a training sample and estimate risk on a different test sample
- May do this for many test and training samples, then average the results
- E.g. Risk estimated from a rolling forecast origin (cf. tsCV; see the sketch below) \[\widehat{R}^{tsCV}(\widehat{f})=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},\widehat{f}(\mathcal{Y}_{t}))\]
- Many versions exist: test and training samples can be blocks, weighted in different ways, etc
- E.g. K-fold CV for ERM over independent data \(\{(y_t,z_t)\}_{t=1}^{T}\)
- Split sample into K sets \(\mathcal{T}_k\) of size \(T/K\)
- For \(k=1\ldots K\) estimate \(\widehat{f}^{(-k)}=\underset{f\in\mathcal{F}}{\arg\min}\frac{1}{T-T/K}\sum_{t\notin\mathcal{T}_k}\ell(y_t,f(z_t))\)
- \(\widehat{R}^{kfoldCV}=\frac{1}{T}\sum_{k=1}^{K}\sum_{t\in\mathcal{T}_k}\ell(y_t,\widehat{f}^{(-k)}(z_t))\)
- For \(K=T\), called “Leave One Out” Cross Validation
- CV reduces bias from overfitting by using different samples for estimation and evaluation
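- A hedged usage sketch with forecast::tsCV (the AR(2) rule and the built-in lynx series are placeholders):
library(forecast) #provides tsCV and Arima
far2<-function(x,h){forecast(Arima(x,order=c(2,0,0)),h=h)} #forecast rule to evaluate
e<-tsCV(lynx,far2,h=1) #one-step forecast errors from each rolling origin
mean(e^2,na.rm=TRUE) #tsCV estimate of squared-error risk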
Cross validation over multiple hypothesis classes
- Cross validation can also be used as a way to choose between forecasting methods
- Simply choose the method that minimizes the cross-validation estimate of risk
- If methods \(\widehat{f}\) are just fixed functions, for tsCV this is just ERM again
- Case where CV is useful: when we have multiple hypothesis classes of rules \(\{\mathcal{F}_c:c\in\mathcal{C}\}\)
- \(c\) is a tuning parameter affecting some aspect of the procedure, often a measure of complexity
- E.g. for AR(p) classes, \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^{p}\beta_{j}y_{t-j}:\ \beta\in\mathbb{R}^{p+1}\}\), tuning parameter is \(p\)
- For each \(c\in\mathcal{C}\), run ERM over \(\mathcal{F}_c\)
\[\widehat{f}^{c}(\mathcal{Y}_t)=\underset{f\in\mathcal{F}_c}{\arg\min}\frac{1}{t-h}\sum_{s=1}^{t-h}\ell(y_{s+h},f(\mathcal{Y}_{s}))\]
- Choose \(\widehat{c}^{CV}\) by minimizing Cross-Validation risk of \(\widehat{f}^{c}()\) \[\widehat{c}^{CV}=\underset{c\in\mathcal{C}}{\arg\min}\widehat{R}^{CV}(\widehat{f}^{c})\]
- Forecast using \(\widehat{f}^{\widehat{c}^{CV}}\)
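- A hedged sketch of this recipe for the AR(p) example (the maximum order and the series are placeholders):
#Choose AR order p by minimizing rolling-origin CV risk
cvrisk<-sapply(1:4,function(p){
  fc<-function(x,h){forecast(Arima(x,order=c(p,0,0)),h=h)}
  mean(tsCV(lynx,fc,h=1)^2,na.rm=TRUE) #CV risk estimate for class F_p
})
p.cv<-which.min(cvrisk) #the CV-selected complexity c-hat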
Evaluating Cross Validation
- Cross validation over multiple classes not the same as ERM because \(\widehat{f}^{c}\) is not a fixed function but depends on the data
- Often, classes are ordered: \(\mathcal{F}_{c_1}\subset \mathcal{F}_{c_2}\) if \(c_1<c_2\)
- Compare small classes of low complexity to big classes of high complexity: More lags, more covariates, etc
- Low complexity classes may have high empirical risk but low error on test set, vice versa for high complexity classes
- Cross validation accounts for these differences: won’t always pick the highest complexity class
- Still have some approximation error, because test set distribution \(\neq\) future population distribution
- Error increases in complexity of \(\mathcal{C}\), which is usually small: one or two parameters
- Worse when training and test samples are dependent: helped slightly by using tsCV over k-fold
- In general, provides fairly good approximation to generalization error but can be extremely costly to compute
- Solve an optimization problem for each \(c\in\mathcal{C}\) for each \(t\) in sample
- Can be hundreds or thousands of times as many computations as a single ERM forecast
- Variants or special cases which are less costly but provide similar or not much worse performance may be preferred
Evaluation and implementation
- Information criteria like AIC and BIC are designed to pick a rule with low risk (definitions below)
- BIC has slightly larger penalty: downgrades more complex models more
- Under some very strong conditions, AIC “close to” true risk, at least when T is large
- BIC instead designed to pick “true model” when T is large
- For forecasting usually care about former rather than latter goal
- For this reason, Hyndman & Athanasopoulos text recommends AIC/AICc
- But be cautious: result requires large samples, and in small samples AIC can be bad measure of risk
- For this reason, Diebold text recommends BIC, as slightly larger penalty accounts for chance of overfitting from misestimation of risk
- AICc, closer to AIC but in between, is a compromise
- Both estimates usually provide worse measure of risk than cross validation
- Sometimes faster: \(\approx T\) times fewer optimizations
- Exception is linear models, where known formula simplifies CV calculations
- Automatically reported with Arima, VAR, ets, etc, or ex post by CV
- AICc also part of auto.arima, to choose between orders of AR
- VARselect computes AIC/BIC for VAR over loss \(\log\det(\sum_{t=1}^{T-h}(y_{t+h}-\widehat{y}_{t})(y_{t+h}-\widehat{y}_{t})^\prime)\)
Exercise
- Try out Information Criterion based evaluation approaches on Industrial Production data in a Kaggle Notebook.
Parameter size based penalties
- Penalization based on number of parameters requires solving the model for each parameter set \(c\in\mathcal{C}\)
- Fine if only a small number: once per lag order of AR/VAR up to max lag
- Hard if the sets \(\mathcal{F}_c\) are every possible subset of \(p\) regressors: \(2^p\) regressions
- Works if best predictions come from ignoring all but a subset of predictors, and leaving the rest unrestricted
- Alternative: allow all predictors to matter, but reduce total size of parameters by penalizing their norm
- Linear hypothesis classes \(\mathcal{F}=\{\beta_0+\sum_{j=1}^m\beta_j z_j:\ \beta\in\mathbb{R}^{m+1}\}\), penalty is \(\lambda\Vert\beta\Vert^p_p\)
- Norm \(\Vert\beta\Vert_p\) is a measure of “size” of coefficients, e.g.
- L1 (LASSO) Penalty \(\Vert\beta\Vert_1^1=\sum_{j=0}^{m}|\beta_j|\)
- L2 (Ridge) Penalty \(\Vert\beta\Vert_2^2=\sum_{j=0}^{m}\beta_j^2\)
- Mixed L1/L2 (Elastic Net) Penalty \(\Vert\beta\Vert_{1/2,a}=a\Vert\beta\Vert_1^1+(1-a)\Vert\beta\Vert_2^2\) for \(a\in[0,1]\)
- \(\lambda\geq 0\) decides size of penalty term
- Large value means strong restrictions: coefficients shrink a lot and class much simpler
- Small value allows more complexity, at cost of greater overfitting risk. 0 means no restrictions: estimator is just ERM
- Penalized estimator solves \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert\beta\Vert_p^p)\)
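- To fix ideas, a trivial numeric check (my own illustration) of the three penalties for a small coefficient vector:
beta<-c(0.5,-0.2,0) #example coefficient vector
sum(abs(beta)) #L1 penalty: 0.7
sum(beta^2) #squared L2 penalty: 0.29
a<-0.5 #elastic net mixing weight
a*sum(abs(beta))+(1-a)*sum(beta^2) #elastic net penalty: 0.495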
Evaluation and Implementation
- Norm penalized estimators address overfitting by preventing large fluctuations in values of parameters
- Type of norm controls which kinds of fluctuations are more or less restricted
- L2 penalty more strongly affected by very large values, less by small values
- All values partially shrink
- L1 penalty equally strongly affected by small and large values
- Allows some larger coefficients, but shrinks small ones a lot, often to exactly 0: sparsity
- Acts like penalties on # of parameters, but non-0 coefficients also slightly shrunk
- Elastic net in between, by amount controlled by \(a\): small terms go to 0, but big terms also shrunk
- For a fixed penalty size \(\lambda\) computation is easy: one instead of many optimizations
- Scale problem: how to choose \(\lambda\)?
- Typically, find \(\widehat{f}^\lambda\) for many \(\lambda\) and select \(\widehat{\lambda}^{CV}\) by cross validation
- Efficient LARS algorithm makes computation for all \(\lambda\) fast in linear model
- Implemented for linear hypotheses and square loss, cross-entropy, multivariate square loss, etc in glmnet
- cv.glmnet implements cross validation parameter selection, but not a time series version (minimal sketch below)
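- A minimal usage sketch, assuming X is a \(T\times m\) predictor matrix and y the target vector (both placeholders):
library(glmnet) #penalized linear models
fitlasso<-glmnet(X,y,alpha=1) #alpha=1: LASSO; alpha=0: ridge; in between: elastic net
cvfit<-cv.glmnet(X,y,alpha=1) #k-fold CV over the whole lambda path
coef(cvfit,s="lambda.min") #coefficients at the CV-chosen lambda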
What kinds of model classes?
- Penalizing size of parameters can substitute for using smaller number of parameters
- Can include more predictors for same level of generalization error if penalized
- Can use many predictors, or nonlinear functions of predictors: polynomials, splines, etc
- Benefit is better ability to predict and get close to oracle predictor
- Cost is that values of parameters get distorted
- How complicated can model class be?
- Depends on the number of possible relevant predictors and how big and complicated a model within that class can be
- LASSO can handle more parameters than data points, if most are (close to) 0
- Up to exponentially many if very sparse and large \(\lambda\) chosen
- Ridge can handle arbitrary number of predictors
- But effects of each become progressively more muted
- In practice, use LASSO with many weakly related predictors, of which only some might matter
- Use ridge with many, possibly related predictors, each of which contributes a little bit to the forecast
- Often believable for lags, so common to use in (V)AR estimates
- Elastic net is in between: some correlated predictors, some not
Function based penalties
- Rather than penalizing parameters of predictor \(f()\), you can penalize \(f\) directly
- Penalized estimator solves, for \(\Vert f\Vert\) some norm on \(f\),
- \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert f\Vert)\)
- For classes \(\mathcal{F}\) defined by parameters, like linear hypothesis class, little essential difference
- (Usually) equivalent to some penalty on parameters
- But if class \(\mathcal{F}\) isn’t dependent on a (finite) set of parameters, eg \(\mathcal{F}=\{\text{all functions of Z}\}\), can be very different
- Most prominent example: integrated squared second derivative penalty \(\Vert f\Vert=\int f^{\prime\prime}(z)^2dz\)
- Penalizes “wiggliness” regardless of particular shape: minimal for straight line
- Amazing fact (Wahba 1991): penalized minimizer over class of all functions has known, computable form
- Called a “Smoothing Spline”: piecewise cubic polynomial between data points
- Equivalent to ridge penalty over (T-dimensional) class of piecewise cubic polynomials
- Implemented as smooth.spline, as well as in specialized functions in libraries gam or mgcv
- Choose \(\lambda\) by CV again: fast algorithm known for non-time series case
Smoothing Splines: GDP growth from Lagged GDP growth
library(fpp2) #Forecasting
library(glmnet) #Lasso and Ridge
library(vars) #Vector Autoregressions
library(fredr) #Access FRED Data
##Obtain and transform NIPA Data (cf Lecture 07)
fredr_set_key("8782f247febb41f291821950cf9118b6") #Key I obtained for this class
#US GDP components
GDP<-fredr(series_id = "GDP",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Gross Domestic Product
#US GDP components from NIPA tables (cf http://www.bea.gov/national/pdf/nipaguid.pdf)
PCEC<-fredr(series_id = "PCEC",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Personal consumption expenditures
FPI<-fredr(series_id = "FPI",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Fixed Private Investment
CBI<-fredr(series_id = "CBI",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Change in Private Inventories
NETEXP<-fredr(series_id = "NETEXP",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Net Exports of Goods and Services
GCE<-fredr(series_id = "GCE",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Government Consumption Expenditures and Gross Investment
#Format the series as quarterly time series objects, starting at the first date
gdp<-ts(GDP$value,frequency = 4,start=c(1947,1),names="Gross Domestic Product")
pcec<-ts(PCEC$value,frequency = 4,start=c(1947,1),names="Consumption")
fpi<-ts(FPI$value,frequency = 4,start=c(1947,1),names="Fixed Investment")
cbi<-ts(CBI$value,frequency = 4,start=c(1947,1),names="Inventory Growth")
invest<-fpi+cbi #Private Investment
netexp<-ts(NETEXP$value,frequency = 4,start=c(1947,1),names="Net Exports")
gce<-ts(GCE$value,frequency = 4,start=c(1947,1),names="Government Spending")
#Convert to log differences to ensure stationarity and collect in frame
NIPAdata<-ts(data.frame(diff(log(gdp)),diff(log(pcec)),diff(log(invest)),diff(log(gce))),frequency = 4,start=c(1947,2))
# Construct lags
l1NIPA<-window(stats::lag(NIPAdata,-1),start=c(1949,2),end=c(2020,4))
l2NIPA<-window(stats::lag(NIPAdata,-2),start=c(1949,2),end=c(2020,4))
l3NIPA<-window(stats::lag(NIPAdata,-3),start=c(1949,2),end=c(2020,4))
l4NIPA<-window(stats::lag(NIPAdata,-4),start=c(1949,2),end=c(2020,4))
l5NIPA<-window(stats::lag(NIPAdata,-5),start=c(1949,2),end=c(2020,4))
l6NIPA<-window(stats::lag(NIPAdata,-6),start=c(1949,2),end=c(2020,4))
l7NIPA<-window(stats::lag(NIPAdata,-7),start=c(1949,2),end=c(2020,4))
l8NIPA<-window(stats::lag(NIPAdata,-8),start=c(1949,2),end=c(2020,4))
#convert to matrices
xvarsdf<-data.frame(l1NIPA,l2NIPA,l3NIPA,l4NIPA,l5NIPA,l6NIPA,l7NIPA,l8NIPA)
xvars<-as.matrix(xvarsdf)
NIPAwindow<-matrix(window(NIPAdata,start=c(1949,2),end=c(2020,4)),length(l1NIPA[,1]),4)
#Spline models
library(gam) #loads smoothing spline package
library(mgcv) #A slightly different smoothing spline package
lagDlogGDP<-xvars[,1]
DlogGDP<-NIPAwindow[,1]
#Smoothing spline Regression of GDP on lagged GDP
# Default behavior fits penalty parameter by generalized Cross Validation
# Predict GDP growth using smoothing spline in lagged GDP growth
gdp.spl<-smooth.spline(lagDlogGDP,DlogGDP)
splinepredictions<-predict(gdp.spl)
- \(\lambda=7.5098437\times 10^{-4}\) chosen by (non-time series) cross validation gives nonlinear fit
#Build Plot
gdpdata<-data.frame(xvars[,1],NIPAwindow[,1],splinepredictions$x,splinepredictions$y)
colnames(gdpdata)<-c("LDlGDP","DlGDP","xspline","yspline")
ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
geom_line(aes(x=xspline,y=yspline),color="red",size=1)+
geom_smooth(method=lm,se=FALSE,color="blue")+
ggtitle("Smoothing Spline and Linear Fits of This vs Last Quarter's GDP Growth",
subtitle="Spline bends to fit far off points")+
xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")
# An alternate approach which uses "gam" command to fit spline
# Advantage is that it's built in to ggplot
# Disadvantage is that it uses only default options, so not customizable
# Default values here select much higher smoothing, and give close to linear fit
# ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
# ggtitle("Spline and OLS Fit of This vs Last Quarter's GDP Growth",
# subtitle="There are two lines which exactly overlap")+
# xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")+
# geom_smooth(method=gam,se=FALSE,color="blue")+
# geom_smooth(method=lm,se=FALSE,color="red")
Exotic penalties
- Example penalties I mentioned are just most common ones
- New and different penalties can be used to emphasize different features of the prediction or type of data used
- Thousands in statistics/ML literature, and new ones every day
- Parameter-based penalties can emphasize different sets of parameters or features of size or shape
- Function based penalties can look at fine or large scale features, or multiple variables
- Cases which allow finite computable solution are called (reproducing) kernel methods
- Shared principle: want to trade off flexible form to match patterns in observed data against size and complexity of class
- Different prediction tasks will call for different shapes, sizes of data, so specialized penalties can help
Penalties versus Bounds
- Alternately, can just run ERM over predictor class \(\mathcal{F}\) with restriction that \(\Vert f \Vert \leq c\) for some c
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F},\ \Vert f \Vert\leq c}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t)))\)
- E.g. only look at functions with norm of predictors bounded \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^m\beta_jz_j:\ \beta\in\mathbb{R}^{m+1}, \Vert\beta\Vert_p\leq c\}\)
- Can choose \(c\) by cross-validation, just like with penalty
- Claim: ERM over class with bound imposed is numerically identical to some penalized empirical risk minimizer
- More formally, for any \(c>0\), there is some \(\widehat{\lambda}(c)\) such that
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\widehat{\lambda}(c)\Vert f\Vert)\)
- This is application of Lagrange multiplier theorem from constrained optimization
- Slight difference is that \(\widehat{\lambda}(c)\) may depend on the data, whereas it is fixed in penalized case
- Though since \(c\) and \(\lambda\) usually chosen based on data anyway, even this usually not a difference
- Thus, penalization inherits properties of ERM, but with a new, restricted class of predictors
- Explains variety of possible penalties: just different functions of the data
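- E.g., for square loss and the L2 norm (a standard fact, stated here as a concrete instance): for each \(c>0\) \[\underset{\Vert\beta\Vert_2\leq c}{\arg\min}\sum_{t}(y_{t+h}-\beta^\prime z_t)^2=\underset{\beta}{\arg\min}\left(\sum_{t}(y_{t+h}-\beta^\prime z_t)^2+\widehat{\lambda}(c)\Vert\beta\Vert_2^2\right)\] with \(\widehat{\lambda}(c)=0\) whenever the unconstrained (OLS) solution already satisfies the bound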
Regularization for GDP Forecasting
library(dplyr) #Data manipulation
library(knitr) #Use knitr to make tables
library(kableExtra) #Extra options for tables
library(tibble) #More data manipulation
#Calculate information criteria
IC<-VARselect(NIPAdata,lag.max=8)
# Fit Optimal VARs by AIC, BIC
AICVAR<-VAR(NIPAdata,p=IC$selection[1])
BICVAR<-VAR(NIPAdata,p=IC$selection[3])
#Construct 1 period ahead forecasts
AICpred<-forecast(AICVAR,h=1)$forecast
BICpred<-forecast(BICVAR,h=1)$forecast
#Extract point forecasts
AICfcsts<-select(data.frame(AICpred),ends_with("Forecast"))
BICfcsts<-select(data.frame(BICpred),ends_with("Forecast"))
#Fit LASSO VAR in all variables up to 8 lags
lassovar<-glmnet(xvars,NIPAwindow,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#lassovar$lambda
#Show coefficients for large value of lambda (heavy penalization, most coefs gone)
#coef(lassovar,s=0.02)
#Show coefficients for intermediate value of lambda (medium penalization, some coefs gone)
#coef(lassovar,s=0.008)
#Show coefficients for small value of lambda (little penalization, most coefs remain)
#coef(lassovar,s=0.001)
## Fit LASSO VAR with 10-fold cross-validation
lassovarCV<-cv.glmnet(xvars,NIPAwindow,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda<-lassovarCV$lambda.min
#Baroque construction to take last period and format it for predict command
#Somehow need 2 rows in prediction value or it throws a fit
#Presumably because matrix coerced to vector and then it thinks it can't multiply anymore
maxT<-length(NIPAwindow[,1]) #index of last period
newX<-t(as.matrix(data.frame(as.vector(t(NIPAwindow[(maxT-8):(maxT-1),])), as.vector(t(NIPAwindow[(maxT-7):maxT,])))))
#Predict 2021Q1 series
cvlassopredictions<-window(ts(predict(lassovarCV,newx = newX)[,,1],start=c(2020,4),frequency=4),start=c(2021,1))
#Fit L2 Penalized VAR in all variables up to 8 lags
ridgevar<-glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#ridgevar$lambda
#Show coefficients for large value of lambda (heavy penalization, coefs shrink a lot)
#coef(ridgevar,s=10)
#Show coefficients for intermediate value of lambda (medium penalization, coefs shrink a little)
#coef(ridgevar,s=1)
#Show coefficients for small value of lambda (little penalization, coefs close to OLS values)
#coef(ridgevar,s=0.01)
## Fit Ridge VAR with 10-fold cross-validation
ridgevarCV<-cv.glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda2<-ridgevarCV$lambda.min
#Predict 2021Q1 series
cvridgepredictions<-window(ts(predict(ridgevarCV,newx = newX)[,,1],start=c(2020,4),frequency=4),start=c(2021,1))
# #Plot Data Series
# autoplot(window(NIPAdata,start=c(2012,4)))+
# autolayer(cvlassopredictions)+
# autolayer(cvridgepredictions)
# autolayer(AICpred$diff.log.gdp..$mean)+
# autolayer(BICpred$diff.log.gdp..$mean)
- Let’s go back to our GDP forecasting example using history of changes in log of \(Y\), \(C\), \(I\), \(G\)
- Predictor class will be Vector Autoregressions in up to 8 lags of all 4 variables
- 33 parameters per equation, 132 total
- Estimate by AIC, BIC, LASSO, and Ridge
- AIC selects 5 lags, BIC selects 1 lag, reflecting smaller penalty for AIC
- 10-fold cross validation selects \(\lambda=0.0059832\) for LASSO, \(\lambda=2.3599158\) for ridge
- LASSO keeps just 4 nonzero parameters, while ridge chooses non-zero value for all 132
- Compare 20 for BIC, 84 for AIC, both ordered
#Collect 2021Q1 Forecasts into a data frame
var1table<-data.frame(
t(cvridgepredictions),
t(cvlassopredictions),
t(AICfcsts),
t(BICfcsts)
)
rownames(var1table)<-c("GDP","Consumption","Investment","Government")
#Make Table of Estimated Coefficients
kable(var1table,
col.names=c("Ridge","LASSO","AIC","BIC"),
caption="2021Q1 Forecasts from Different Methods") %>%
kable_styling(bootstrap_options = "striped", font_size = 16)
2021Q1 Forecasts from Different Methods

|             | Ridge     | LASSO     | AIC       | BIC       |
|-------------|-----------|-----------|-----------|-----------|
| GDP         | 0.0151859 | 0.0151859 | 0.0061177 | 0.0165646 |
| Consumption | 0.0153646 | 0.0153646 | 0.0092199 | 0.0179683 |
| Investment  | 0.0157096 | 0.0157096 | 0.0118236 | 0.0245480 |
| Government  | 0.0152194 | 0.0152194 | 0.0247882 | 0.0108015 |
Ridge Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(ridgevarCV)$y1),
as.matrix(coef(ridgevarCV)$y2),
as.matrix(coef(ridgevarCV)$y3),
as.matrix(coef(ridgevarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(ridgevarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = spec_color(x))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in Ridge Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in Ridge Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| (Intercept) | 0.0151858897298007 | 0.0153645580354301 | 0.0157096142355653 | 0.0152194105076093 |
| diff.log.gdp.. | 2.30031255499118e-37 | 9.38827596242842e-38 | 5.67710819257864e-37 | 4.1908660740873e-37 |
| diff.log.pcec.. | 2.03414871590585e-37 | -1.42003271917175e-38 | 1.03518480719846e-36 | 2.43101694930497e-37 |
| diff.log.invest.. | 5.81179178431625e-38 | 4.89847015090077e-38 | 1.72474357338571e-37 | 2.89445812951768e-38 |
| diff.log.gce.. | 8.14333315056698e-38 | 4.65662452126473e-38 | -4.97908409989161e-37 | 6.06876408802504e-37 |
| diff.log.gdp...1 | 2.66230340418565e-37 | 2.17899432587549e-37 | 5.89720423452404e-38 | 5.11288435463506e-37 |
| diff.log.pcec...1 | 2.45438578253927e-37 | 2.72574002810306e-37 | -7.46099917642788e-38 | 4.1874800901528e-37 |
| diff.log.invest...1 | 6.16468295468514e-38 | 3.76230895303787e-38 | 1.43513553628022e-37 | 4.89935776717903e-38 |
| diff.log.gce...1 | 2.88850420892817e-38 | -2.0343767595241e-39 | -4.35758744418575e-37 | 4.39598403688305e-37 |
| diff.log.gdp...2 | 1.69764387502009e-37 | 1.38669330320158e-37 | -5.53541951770884e-37 | 6.43937167156774e-37 |
| diff.log.pcec...2 | 2.57387614260426e-37 | 2.31769546123258e-37 | -1.50074670774808e-37 | 4.96432317583687e-37 |
| diff.log.invest...2 | 2.05248586117648e-38 | 8.68966908404487e-40 | -4.0244191573353e-38 | 8.19254014008568e-38 |
| diff.log.gce...2 | -1.52460714936195e-38 | 5.71964171443974e-38 | -6.40050604726444e-37 | 3.00746325218341e-37 |
| diff.log.gdp...3 | 1.02868823860545e-37 | 1.79075449325731e-37 | -8.72166554492289e-37 | 6.23993172561969e-37 |
| diff.log.pcec...3 | 1.91306415158673e-37 | 1.56530919140519e-37 | -1.39775744944728e-37 | 4.57958482381477e-37 |
| diff.log.invest...3 | 3.99606222132951e-39 | 2.86479073484879e-38 | -2.46853969613774e-37 | 1.07679973010426e-37 |
| diff.log.gce...3 | -1.70497629422387e-38 | 3.24506547360457e-38 | -2.91745675031562e-37 | 1.4731800906771e-37 |
| diff.log.gdp...4 | -9.47944652278077e-39 | 8.20701352695042e-38 | -9.8328564921561e-37 | 3.94870645320733e-37 |
| diff.log.pcec...4 | 1.41634175467069e-37 | 3.09685052861973e-37 | -9.95791656490379e-37 | 4.61782046707173e-37 |
| diff.log.invest...4 | -2.58637064333019e-38 | -3.58456162420379e-38 | -1.53548027819855e-37 | 6.26430258320072e-38 |
| diff.log.gce...4 | 4.46754017745955e-38 | 9.68030481339711e-38 | -1.69478106557769e-38 | 3.76103120629892e-38 |
| diff.log.gdp...5 | 6.29617484125634e-38 | 1.51638436741035e-37 | -5.19448139677564e-37 | 2.21421305380236e-37 |
| diff.log.pcec...5 | 1.29010668326766e-37 | 1.52005958450989e-37 | -1.5993318690095e-37 | 2.33780106386364e-37 |
| diff.log.invest...5 | -1.59227211270025e-38 | 4.86621283586907e-39 | -1.84306909715284e-37 | 4.02670363270379e-38 |
| diff.log.gce...5 | 1.06431563250505e-37 | 1.13658269140945e-37 | 3.1415103584003e-37 | 5.11064147737253e-38 |
| diff.log.gdp...6 | 9.21687780778919e-38 | 1.81069175743602e-37 | -2.22856911137888e-37 | 1.45859585321776e-37 |
| diff.log.pcec...6 | 1.92215622922207e-37 | 3.056783686707e-37 | -2.1135329401155e-37 | 2.35697554061445e-37 |
| diff.log.invest...6 | -1.47522987657778e-38 | -6.03934804215541e-39 | -5.93676171946803e-38 | 3.79610675225019e-39 |
| diff.log.gce...6 | 1.01173813993219e-37 | 8.62521387934146e-38 | 3.50630401000713e-37 | 3.64942205833305e-38 |
| diff.log.gdp...7 | 1.01868469749412e-37 | 1.87365890427143e-37 | -1.60968447243311e-37 | 8.74068306105839e-38 |
| diff.log.pcec...7 | 1.30416714458953e-37 | 1.9113863912783e-37 | -7.5849159796442e-38 | 1.84329801657064e-37 |
| diff.log.invest...7 | 1.09898866000851e-38 | 2.19118970648559e-38 | 4.91704674317236e-39 | -1.70444094452624e-38 |
| diff.log.gce...7 | 6.045842474227e-38 | 6.63072766928136e-38 | 2.19700121521296e-38 | 7.83684941744905e-38 |
LASSO Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(lassovarCV)$y1),
as.matrix(coef(lassovarCV)$y2),
as.matrix(coef(lassovarCV)$y3),
as.matrix(coef(lassovarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(lassovarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in LASSO Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in LASSO Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| (Intercept) | 0.0151858897298007 | 0.0153645580354301 | 0.0157096142355653 | 0.0152194105076093 |
| diff.log.gdp.. | 0 | 0 | 0 | 0 |
| diff.log.pcec.. | 0 | 0 | 0 | 0 |
| diff.log.invest.. | 0 | 0 | 0 | 0 |
| diff.log.gce.. | 0 | 0 | 0 | 0 |
| diff.log.gdp...1 | 0 | 0 | 0 | 0 |
| diff.log.pcec...1 | 0 | 0 | 0 | 0 |
| diff.log.invest...1 | 0 | 0 | 0 | 0 |
| diff.log.gce...1 | 0 | 0 | 0 | 0 |
| diff.log.gdp...2 | 0 | 0 | 0 | 0 |
| diff.log.pcec...2 | 0 | 0 | 0 | 0 |
| diff.log.invest...2 | 0 | 0 | 0 | 0 |
| diff.log.gce...2 | 0 | 0 | 0 | 0 |
| diff.log.gdp...3 | 0 | 0 | 0 | 0 |
| diff.log.pcec...3 | 0 | 0 | 0 | 0 |
| diff.log.invest...3 | 0 | 0 | 0 | 0 |
| diff.log.gce...3 | 0 | 0 | 0 | 0 |
| diff.log.gdp...4 | 0 | 0 | 0 | 0 |
| diff.log.pcec...4 | 0 | 0 | 0 | 0 |
| diff.log.invest...4 | 0 | 0 | 0 | 0 |
| diff.log.gce...4 | 0 | 0 | 0 | 0 |
| diff.log.gdp...5 | 0 | 0 | 0 | 0 |
| diff.log.pcec...5 | 0 | 0 | 0 | 0 |
| diff.log.invest...5 | 0 | 0 | 0 | 0 |
| diff.log.gce...5 | 0 | 0 | 0 | 0 |
| diff.log.gdp...6 | 0 | 0 | 0 | 0 |
| diff.log.pcec...6 | 0 | 0 | 0 | 0 |
| diff.log.invest...6 | 0 | 0 | 0 | 0 |
| diff.log.gce...6 | 0 | 0 | 0 | 0 |
| diff.log.gdp...7 | 0 | 0 | 0 | 0 |
| diff.log.pcec...7 | 0 | 0 | 0 | 0 |
| diff.log.invest...7 | 0 | 0 | 0 | 0 |
| diff.log.gce...7 | 0 | 0 | 0 | 0 |
AIC Coefficients
#Make Coefficient Table
zzz<-data.frame(AICVAR$varresult$diff.log.gdp..$coefficients,
AICVAR$varresult$diff.log.pcec..$coefficients,
AICVAR$varresult$diff.log.invest..$coefficients,
AICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(AICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in AIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in AIC Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| diff.log.gdp...l1 | -0.746943763057611 | -0.339280717024153 | -4.06998274098274 | -0.115524482447805 |
| diff.log.pcec...l1 | 0.542887399420515 | 0.0603927127377116 | 3.52746849769727 | 0.172366474214833 |
| diff.log.invest...l1 | 0.147665750520901 | 0.0996000486935814 | 0.628142284010407 | 0.0593200663481155 |
| diff.log.gce...l1 | 0.232979968469864 | 0.127013734113583 | 0.546793026132765 | 0.490074188116684 |
| diff.log.gdp...l2 | 0.148276807314082 | 0.0030153149531751 | 0.700719436813809 | 0.296184284489155 |
| diff.log.pcec...l2 | -0.0468729316703431 | 0.0689250929776537 | -0.53805307285928 | -0.0349076307322074 |
| diff.log.invest...l2 | 0.0142184321886127 | 0.0236461267511497 | -0.0136170202386893 | -0.0316570502447788 |
| diff.log.gce...l2 | -0.025225073658209 | -0.102578974706382 | -0.00283444415118627 | 0.0579397109188032 |
| diff.log.gdp...l3 | 0.233299129276807 | 0.193302574074163 | 0.303322310559933 | 0.0908340876606647 |
| diff.log.pcec...l3 | -0.0516608151840398 | -0.007015396497794 | -0.1401673205575 | -0.160195069905048 |
| diff.log.invest...l3 | -0.0585268233889642 | -0.0584938941415848 | -0.155597845827494 | 0.0250097255157367 |
| diff.log.gce...l3 | -0.0913173183824062 | -0.0280589876912641 | -0.420294633255246 | 0.063747863705081 |
| diff.log.gdp...l4 | 0.301371080744814 | 0.120097659148702 | 1.60230119894812 | 0.327298748721874 |
| diff.log.pcec...l4 | 0.130764968897992 | 0.185885995556202 | 0.0787419661514829 | -0.248639740861761 |
| diff.log.invest...l4 | -0.0700006765330574 | -0.0166441973406796 | -0.417800332681899 | -0.00209708631479729 |
| diff.log.gce...l4 | -0.146455512729859 | -0.0406473284896966 | -0.67189744156677 | -0.0470418491889946 |
| diff.log.gdp...l5 | -0.610591245995868 | -0.169926856449345 | -2.29212310983113 | -0.447928309402065 |
| diff.log.pcec...l5 | 0.468231178107616 | 0.357972406028735 | 0.894382816968065 | 0.306927940613055 |
| diff.log.invest...l5 | 0.0432689297333812 | -0.0198516764329586 | 0.192061121492337 | 0.0570281247813538 |
| diff.log.gce...l5 | 0.186550699795729 | 0.125842023517093 | 0.595685589600826 | 0.0211641039656517 |
| const | 0.00567042047478156 | 0.00620177914475714 | 0.00935819315081179 | 0.00184789666280374 |
BIC Coefficients
#Make Coefficient Table
zzz<-data.frame(BICVAR$varresult$diff.log.gdp..$coefficients,
BICVAR$varresult$diff.log.pcec..$coefficients,
BICVAR$varresult$diff.log.invest..$coefficients,
BICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(BICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in BIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in BIC Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| diff.log.gdp...l1 | -0.616078505592147 | -0.216952693351494 | -4.11590175514894 | 0.0806110991176697 |
| diff.log.pcec...l1 | 0.51185181145174 | 0.0369735342765242 | 3.65466265933299 | -0.00896787215583153 |
| diff.log.invest...l1 | 0.138957227319627 | 0.0887645774343753 | 0.661995796320753 | 0.0499547661798207 |
| diff.log.gce...l1 | 0.186098309220188 | 0.10353264493876 | 0.136023745308116 | 0.614133345716592 |
| const | 0.0116164872700755 | 0.0150636667863203 | 0.00984807889662198 | 0.00412528923826631 |
Conclusions
- ERM does well for choosing within a simple hypothesis class
- Minimizing penalized criterion is a sensible way to improve accounting for overfitting
- Based on number (AIC/BIC) or size (L1/L2) of parameters or functions
- When choosing between multiple procedures, including ERM over multiple classes, can use measure of risk for each
- Penalization acts like restricting size of parameter class
- Can choose penalty by cross-validating
- Goal of regularization is to trade off in-sample and out-of-sample fit
- Other tools exist for finding the “right” descriptive model
References
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning, 2nd ed. MIT Press, 2018.
- Grace Wahba. Spline Models for Observational Data. 1991.
  - Overview of smoothing splines and penalization; explains how and why one can work with all differentiable functions