Regularization
- Review of Generalization Problem and ERM
- Structural Risk Minimization Principle
- Cross Validation
- Penalized Estimation
Risk Minimization
- In statistical learning approach, goal is to produce rule \(f\in\mathcal{F}\) with low risk for true distribution
- \(R_p(f)=E_p\ell(y_{T+h},f(\mathcal{Y}_{T}))\)
- Empirical Risk Minimizer based on minimizing risk for empirical distribution \(\widehat{R}(f)=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))\)
- The difference between what we want and what we achieve is the generalization error
- \(R_p(\widehat{f})=\widehat{R}(\widehat{f})+(R_p(\widehat{f})-\widehat{R}(\widehat{f}))\)
- Data can tell you whether first term is small: minimized by ERM
- Theory required to ensure second term also small
- Can we do better by being more explicit about generalization error?
- Sometimes not much: “Fundamental Theorem of Machine Learning” (Mohri et al 2018)
- Under same conditions that ERM has strong performance bound, no method can do “substantially better”
- In particular, if \(\mathcal{F}\) has “low complexity”, excess risk, which is already small, cannot be made much smaller
- Does this mean we may as well always go with ERM?
- Not really! Especially for small samples \(T\) or classes with “high complexity”
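- A minimal sketch (my illustration, assuming square loss and \(h=1\)): ERM over the linear AR(1) class is just least squares of \(y_t\) on \(y_{t-1}\)
#ERM with square loss over the AR(1) class reduces to OLS on the first lag
erm.ar1<-function(y){
  T<-length(y)
  fit<-lm(y[2:T]~y[1:(T-1)]) #minimizes empirical squared-error risk over beta
  function(ylast){unname(coef(fit)[1]+coef(fit)[2]*ylast)} #the forecast rule f
}
#fhat<-erm.ar1(as.numeric(lynx)); fhat(tail(lynx,1)) #one-step-ahead forecast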
Structural Risk Minimization
- Ideal method would take into account both empirical risk and generalization error
- Impossible to do exactly since generalization error part not known
- But it can be approximated or bounded!
- Idea: Find a term related to generalization error, called a penalty \(\mathcal{J}(f)\)
- Use a forecasting rule, called a Structural Risk Minimizer which minimizes empirical risk plus penalty \[\widehat{f}^{SRM}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))+\mathcal{J}(f))\]
- If penalty exactly equals generalization error, this is exactly the oracle estimator
- If penalized risk is a better approximation than the empirical risk, may still outperform ERM
- So, how do we find a good penalty?
- Many proposed methods
- Today: A selection of popular penalties and discussion of their properties
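- A minimal numerical sketch of the idea (my own illustration, assuming square loss, an AR(1) class, and L2 penalty \(\mathcal{J}(f)=\lambda\beta_1^2\)):
#Structural risk minimization sketch: empirical risk plus penalty, minimized numerically
srm.ar1<-function(y,lambda){
  T<-length(y)
  penrisk<-function(b){mean((y[2:T]-b[1]-b[2]*y[1:(T-1)])^2)+lambda*b[2]^2}
  optim(c(0,0),penrisk)$par #(beta0,beta1) minimizing the penalized empirical risk
}
#srm.ar1(as.numeric(lynx),lambda=0.1) #larger lambda shrinks beta1 toward 0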
Cross Validation
- We have already seen cross validation as a way to assess forecast quality of a forecasting method \(\widehat{f}(\mathcal{Y}_{t})\)
- Use a forecasting rule constructed on a training sample and estimate risk on a different test sample
- May do this for many test and training samples, then average the results
- E.g. Risk estimated from a rolling forecast origin (cf. tsCV; see the sketch below) \[\widehat{R}^{tsCV}(\widehat{f})=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},\widehat{f}(\mathcal{Y}_{t}))\]
- Many versions exist: test and training samples can be blocks, weighted in different ways, etc
- E.g. K-fold CV for ERM over independent data \(\{(y_t,z_t)\}_{t=1}^{T}\)
- Split sample into K sets \(\mathcal{T}_k\) of size \(T/K\)
- For \(k=1\ldots K\) estimate \(\widehat{f}^{(-k)}=\underset{f\in\mathcal{F}}{\arg\min}\frac{1}{T-T/K}\sum_{t\notin\mathcal{T}_k}\ell(y_t,f(z_t))\)
- \(\widehat{R}^{kfoldCV}=\frac{1}{T}\sum_{k=1}^{K}\sum_{t\in\mathcal{T}_k}\ell(y_t,\widehat{f}^{(-k)}(z_t))\)
- For \(K=T\), called “Leave One Out” Cross Validation
- CV reduces bias from overfitting by using different samples for estimation and evaluation
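- A hedged usage sketch with forecast::tsCV (the AR(2) rule and the built-in lynx series are placeholders):
library(forecast) #provides tsCV and Arima
far2<-function(x,h){forecast(Arima(x,order=c(2,0,0)),h=h)} #forecast rule to evaluate
e<-tsCV(lynx,far2,h=1) #one-step forecast errors from each rolling origin
mean(e^2,na.rm=TRUE) #tsCV estimate of squared-error risk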
Cross validation over multiple hypothesis classes
- Cross validation can also be used as a way to choose between forecasting methods
- Simply choose the method that minimizes the cross-validation estimate of risk
- If methods \(\widehat{f}\) are just fixed functions, for tsCV this is just ERM again
- Case where CV is useful: when we have multiple hypothesis classes of rules \(\{\mathcal{F}_c:c\in\mathcal{C}\}\)
- \(c\) is a tuning parameter affecting some aspect of the procedure, often a measure of complexity
- E.g. for AR(p) classes, \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^{p}\beta_{j}y_{t-j}:\ \beta\in\mathbb{R}^{p+1}\}\), tuning parameter is \(p\)
- For each \(c\in\mathcal{C}\), run ERM over \(\mathcal{F}_c\)
\[\widehat{f}^{c}(\mathcal{Y}_t)=\underset{f\in\mathcal{F}_c}{\arg\min}\frac{1}{t-h}\sum_{s=1}^{t-h}\ell(y_{s+h},f(\mathcal{Y}_{s}))\]
- Choose \(\widehat{c}^{CV}\) by minimizing Cross-Validation risk of \(\widehat{f}^{c}()\) \[\widehat{c}^{CV}=\underset{c\in\mathcal{C}}{\arg\min}\widehat{R}^{CV}(\widehat{f}^{c})\]
- Forecast using \(\widehat{f}^{\widehat{c}^{CV}}\)
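- A hedged sketch of this recipe for the AR(p) example (the maximum order and the series are placeholders):
#Choose AR order p by minimizing rolling-origin CV risk
cvrisk<-sapply(1:4,function(p){
  fc<-function(x,h){forecast(Arima(x,order=c(p,0,0)),h=h)}
  mean(tsCV(lynx,fc,h=1)^2,na.rm=TRUE) #CV risk estimate for class F_p
})
p.cv<-which.min(cvrisk) #the CV-selected complexity c-hat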
Evaluating Cross Validation
- Cross validation over multiple classes not the same as ERM because \(\widehat{f}^{c}\) is not a fixed function but depends on the data
- Often, classes are ordered: \(\mathcal{F}_{c_1}\subset \mathcal{F}_{c_2}\) if \(c_1<c_2\)
- Compare small classes of low complexity to big classes of high complexity: More lags, more covariates, etc
- Low complexity classes may have high empirical risk but low error on test set, vice versa for high complexity classes
- Cross validation accounts for these differences: won’t always pick the highest complexity class
- Still have some approximation error, because test set distribution \(\neq\) future population distribution
- Error increases in complexity of \(\mathcal{C}\), which is usually small: one or two parameters
- Worse when training and test samples are dependent: helped slightly by using tsCV over k-fold
- In general, provides fairly good approximation to generalization error but can be extremely costly to compute
- Solve an optimization problem for each \(c\in\mathcal{C}\) for each \(t\) in sample
- Can be hundreds or thousands of times as many computations as a single ERM forecast
- Variants or special cases which are less costly but provide similar or not much worse performance may be preferred
Evaluation and implementation
- Information criteria like AIC and BIC are designed to pick a rule with low risk (definitions below)
- BIC has slightly larger penalty: downgrades more complex models more
- Under some very strong conditions, AIC “close to” true risk, at least when T is large
- BIC instead designed to pick “true model” when T is large
- For forecasting usually care about former rather than latter goal
- For this reason, Hyndman & Athanasopoulos text recommends AIC/AICc
- But be cautious: result requires large samples, and in small samples AIC can be bad measure of risk
- For this reason, Diebold text recommends BIC, as slightly larger penalty accounts for chance of overfitting from misestimation of risk
- AICc, closer to AIC but in between, is a compromise
- Both estimates usually provide worse measure of risk than cross validation
- Sometimes faster: \(\approx T\) times fewer optimizations
- Exception is linear models, where known formula simplifies CV calculations
- Automatically reported with Arima, VAR, ets, etc, or ex post by CV
- AICc also part of auto.arima, to choose between orders of AR
- VARselect computes AIC/BIC for VAR over loss \(\log\det(\sum_{t=1}^{T-h}(y_{t+h}-\widehat{y}_{t})(y_{t+h}-\widehat{y}_{t})^\prime)\)
Exercise
- Try out Information Criterion based evaluation approaches on Industrial Production data in a Kaggle Notebook.
Parameter size based penalties
- Penalization based on number of parameters requires solving the model for each parameter set \(c\in\mathcal{C}\)
- Fine if only a small number: once per lag order of AR/VAR up to max lag
- Hard if the sets \(\mathcal{F}_c\) are every possible subset of \(p\) regressors: \(2^p\) regressions
- Works if best predictions come from ignoring all but a subset of predictors, and leaving the rest unrestricted
- Alternative: allow all predictors to matter, but reduce total size of parameters by penalizing their norm
- Linear hypothesis classes \(\mathcal{F}=\{\beta_0+\sum_{j=1}^m\beta_j z_j:\ \beta\in\mathbb{R}^{m+1}\}\), penalty is \(\lambda\Vert\beta\Vert^p_p\)
- Norm \(\Vert\beta\Vert_p\) is a measure of “size” of coefficients, e.g.
- L1 (LASSO) Penalty \(\Vert\beta\Vert_1^1=\sum_{j=0}^{m}|\beta_j|\)
- L2 (Ridge) Penalty \(\Vert\beta\Vert_2^2=\sum_{j=0}^{m}\beta_j^2\)
- Mixed L1/L2 (Elastic Net) Penalty \(\Vert\beta\Vert_{1/2,a}=a\Vert\beta\Vert_1^1+(1-a)\Vert\beta\Vert_2^2\) for \(a\in[0,1]\)
- \(\lambda\geq 0\) decides size of penalty term
- Large value means strong restrictions: coefficients shrink a lot and class much simpler
- Small value allows more complexity, at cost of greater overfitting risk. 0 means no restrictions: estimator is just ERM
- Penalized estimator solves \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert\beta\Vert_p^p)\)
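- To fix ideas, a trivial numeric check (my own illustration) of the three penalties for a small coefficient vector:
beta<-c(0.5,-0.2,0) #example coefficient vector
sum(abs(beta)) #L1 penalty: 0.7
sum(beta^2) #squared L2 penalty: 0.29
a<-0.5 #elastic net mixing weight
a*sum(abs(beta))+(1-a)*sum(beta^2) #elastic net penalty: 0.495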
Evaluation and Implementation
- Norm penalized estimators address overfitting by preventing large fluctuations in values of parameters
- Type of norm controls which kinds of fluctuations are more or less restricted
- L2 penalty more strongly affected by very large values, less by small values
- All values partially shrink
- L1 penalty equally strongly affected by small and large values
- Allows some larger coefficients, but shrinks small ones a lot, often to exactly 0: sparsity
- Acts like penalties on # of parameters, but non-0 coefficients also slightly shrunk
- Elastic net in between, by amount controlled by \(a\): small terms go to 0, but big terms also shrunk
- For a fixed penalty size \(\lambda\) computation is easy: one instead of many optimizations
- Scale problem: how to choose \(\lambda\)?
- Typically, find \(\widehat{f}^\lambda\) for many \(\lambda\) and select \(\widehat{\lambda}^{CV}\) by cross validation
- Efficient LARS algorithm makes computation for all \(\lambda\) fast in linear model
- Implemented for linear hypotheses and square loss, cross-entropy, multivariate square loss, etc in glmnet
- cv.glmnet implements cross validation parameter selection, but not a time series version (minimal sketch below)
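- A minimal usage sketch, assuming X is a \(T\times m\) predictor matrix and y the target vector (both placeholders):
library(glmnet) #penalized linear models
fitlasso<-glmnet(X,y,alpha=1) #alpha=1: LASSO; alpha=0: ridge; in between: elastic net
cvfit<-cv.glmnet(X,y,alpha=1) #k-fold CV over the whole lambda path
coef(cvfit,s="lambda.min") #coefficients at the CV-chosen lambda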
What kinds of model classes?
- Penalizing size of parameters can substitute for using smaller number of parameters
- Can include more predictors for same level of generalization error if penalized
- Can use many predictors, or nonlinear functions of predictors: polynomials, splines, etc
- Benefit is better ability to predict and get close to oracle predictor
- Cost is that values of parameters get distorted
- How complicated can model class be?
- Depends on the number of possible relevant predictors and how big and complicated a model within that class can be
- LASSO can handle more parameters than data points, if most are (close to) 0
- Up to exponentially many if very sparse and large \(\lambda\) chosen
- Ridge can handle arbitrary number of predictors
- But effects of each become progressively more muted
- In practice, use LASSO with many weakly related predictors, of which only some might matter
- Use ridge with many, possibly related predictors, each of which contributes a little bit to the forecast
- Often believable for lags, so common to use in (V)AR estimates
- Elastic net is in between: some correlated predictors, some not
Function based penalties
- Rather than penalizing parameters of predictor \(f()\), you can penalize \(f\) directly
- Penalized estimator solves, for \(\Vert f\Vert\) some norm on \(f\),
- \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert f\Vert)\)
- For classes \(\mathcal{F}\) defined by parameters, like linear hypothesis class, little essential difference
- (Usually) equivalent to some penalty on parameters
- But if class \(\mathcal{F}\) isn’t dependent on a (finite) set of parameters, eg \(\mathcal{F}=\{\text{all functions of Z}\}\), can be very different
- Most prominent example: integrated squared second derivative penalty \(\Vert f\Vert=\int f^{\prime\prime}(z)^2dz\)
- Penalizes “wiggliness” regardless of particular shape: minimal for straight line
- Amazing fact (Wahba 1991): penalized minimizer over class of all functions has known, computable form
- Called a “Smoothing Spline”: piecewise cubic polynomial between data points
- Equivalent to ridge penalty over (T-dimensional) class of piecewise cubic polynomials
- Implemented as smooth.spline, as well as in specialized functions in libraries gam or mgcv
- Choose \(\lambda\) by CV again: fast algorithm known for non-time series case
Smoothing Splines: GDP growth from Lagged GDP growth
library(fpp2) #Forecasting
library(glmnet) #Lasso and Ridge
library(vars) #Vector Autoregressions
library(fredr) #Access FRED Data
##Obtain and transform NIPA Data (cf Lecture 07)
fredr_set_key("8782f247febb41f291821950cf9118b6") #Key I obtained for this class
#US GDP components
GDP<-fredr(series_id = "GDP",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Gross Domestic Product
#US GDP components from NIPA tables (cf http://www.bea.gov/national/pdf/nipaguid.pdf)
PCEC<-fredr(series_id = "PCEC",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Personal consumption expenditures
FPI<-fredr(series_id = "FPI",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Fixed Private Investment
CBI<-fredr(series_id = "CBI",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Change in Private Inventories
NETEXP<-fredr(series_id = "NETEXP",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Net Exports of Goods and Services
GCE<-fredr(series_id = "GCE",
observation_start = as.Date("1947-01-01"),
observation_end=as.Date("2021-03-07"),
vintage_dates = as.Date("2021-03-07")) #Government Consumption Expenditures and Gross Investment
#Format the series as quarterly time series objects, starting at the first date
gdp<-ts(GDP$value,frequency = 4,start=c(1947,1),names="Gross Domestic Product")
pcec<-ts(PCEC$value,frequency = 4,start=c(1947,1),names="Consumption")
fpi<-ts(FPI$value,frequency = 4,start=c(1947,1),names="Fixed Investment")
cbi<-ts(CBI$value,frequency = 4,start=c(1947,1),names="Inventory Growth")
invest<-fpi+cbi #Private Investment
netexp<-ts(NETEXP$value,frequency = 4,start=c(1947,1),names="Net Exports")
gce<-ts(GCE$value,frequency = 4,start=c(1947,1),names="Government Spending")
#Convert to log differences to ensure stationarity and collect in frame
NIPAdata<-ts(data.frame(diff(log(gdp)),diff(log(pcec)),diff(log(invest)),diff(log(gce))),frequency = 4,start=c(1947,2))
# Construct lags
l1NIPA<-window(stats::lag(NIPAdata,-1),start=c(1949,2),end=c(2020,4))
l2NIPA<-window(stats::lag(NIPAdata,-2),start=c(1949,2),end=c(2020,4))
l3NIPA<-window(stats::lag(NIPAdata,-3),start=c(1949,2),end=c(2020,4))
l4NIPA<-window(stats::lag(NIPAdata,-4),start=c(1949,2),end=c(2020,4))
l5NIPA<-window(stats::lag(NIPAdata,-5),start=c(1949,2),end=c(2020,4))
l6NIPA<-window(stats::lag(NIPAdata,-6),start=c(1949,2),end=c(2020,4))
l7NIPA<-window(stats::lag(NIPAdata,-7),start=c(1949,2),end=c(2020,4))
l8NIPA<-window(stats::lag(NIPAdata,-8),start=c(1949,2),end=c(2020,4))
#convert to matrices
xvarsdf<-data.frame(l1NIPA,l2NIPA,l3NIPA,l4NIPA,l5NIPA,l6NIPA,l7NIPA,l8NIPA)
xvars<-as.matrix(xvarsdf)
NIPAwindow<-matrix(window(NIPAdata,start=c(1949,2),end=c(2020,4)),length(l1NIPA[,1]),4)
#Spline models
library(gam) #loads smoothing spline package
library(mgcv) #A slightly different smoothing spline package
lagDlogGDP<-xvars[,1]
DlogGDP<-NIPAwindow[,1]
#Smoothing spline Regression of GDP on lagged GDP
# Default behavior fits penalty parameter by generalized Cross Validation
# Predict GDP growth using smoothing spline in lagged GDP growth
gdp.spl<-smooth.spline(lagDlogGDP,DlogGDP)
splinepredictions<-predict(gdp.spl)
- \(\lambda=7.5098437\times 10^{-4}\) chosen by (non-time series) cross validation gives nonlinear fit
#Build Plot
gdpdata<-data.frame(xvars[,1],NIPAwindow[,1],splinepredictions$x,splinepredictions$y)
colnames(gdpdata)<-c("LDlGDP","DlGDP","xspline","yspline")
ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
geom_line(aes(x=xspline,y=yspline),color="red",size=1)+
geom_smooth(method=lm,se=FALSE,color="blue")+
ggtitle("Smoothing Spline and Linear Fits of This vs Last Quarter's GDP Growth",
subtitle="Spline bends to fit far off points")+
xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")
# An alternate approach which uses "gam" command to fit spline
# Advantage is that it's built in to ggplot
# Disadvantage is that it uses only default options, so not customizable
# Default values here select much higher smoothing, and give close to linear fit
# ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
# ggtitle("Spline and OLS Fit of This vs Last Quarter's GDP Growth",
# subtitle="There are two lines which exactly overlap")+
# xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")+
# geom_smooth(method=gam,se=FALSE,color="blue")+
# geom_smooth(method=lm,se=FALSE,color="red")
Exotic penalties
- Example penalties I mentioned are just most common ones
- New and different penalties can be used to emphasize different features of the prediction or type of data used
- Thousands in statistics/ML literature, and new ones every day
- Parameter-based penalties can emphasize different sets of parameters or features of size or shape
- Function based penalties can look at fine or large scale features, or multiple variables
- Cases which allow finite computable solution are called (reproducing) kernel methods
- Shared principle: want to trade off flexible form to match patterns in observed data against size and complexity of class
- Different prediction tasks will call for different shapes, sizes of data, so specialized penalties can help
Penalties versus Bounds
- Alternately, can just run ERM over predictor class \(\mathcal{F}\) with restriction that \(\Vert f \Vert \leq c\) for some c
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F},\ \Vert f \Vert\leq c}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t)))\)
- E.g. only look at functions with norm of predictors bounded \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^m\beta_jz_j:\ \beta\in\mathbb{R}^{m+1}, \Vert\beta\Vert_p\leq c\}\)
- Can choose \(c\) by cross-validation, just like with penalty
- Claim: ERM over class with bound imposed is numerically identical to some penalized empirical risk minimizer
- More formally, for any \(c>0\), there is some \(\widehat{\lambda}(c)\) such that
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\widehat{\lambda}(c)\Vert f\Vert)\)
- This is application of Lagrange multiplier theorem from constrained optimization
- Slight difference is that \(\widehat{\lambda}(c)\) may depend on the data, whereas it is fixed in penalized case
- Though since \(c\) and \(\lambda\) usually chosen based on data anyway, even this usually not a difference
- Thus, penalization inherits properties of ERM, but with a new, restricted class of predictors
- Explains variety of possible penalties: just different functions of the data
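- E.g., for square loss and the L2 norm (a standard fact, stated here as a concrete instance): for each \(c>0\) \[\underset{\Vert\beta\Vert_2\leq c}{\arg\min}\sum_{t}(y_{t+h}-\beta^\prime z_t)^2=\underset{\beta}{\arg\min}\left(\sum_{t}(y_{t+h}-\beta^\prime z_t)^2+\widehat{\lambda}(c)\Vert\beta\Vert_2^2\right)\] with \(\widehat{\lambda}(c)=0\) whenever the unconstrained (OLS) solution already satisfies the bound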
Regularization for GDP Forecasting
library(dplyr) #Data manipulation
library(knitr) #Use knitr to make tables
library(kableExtra) #Extra options for tables
library(tibble) #More data manipulation
#Calculate information criteria
IC<-VARselect(NIPAdata,lag.max=8)
# Fit Optimal VARs by AIC, BIC
AICVAR<-VAR(NIPAdata,p=IC$selection[1])
BICVAR<-VAR(NIPAdata,p=IC$selection[3])
#Construct 1 period ahead forecasts
AICpred<-forecast(AICVAR,h=1)$forecast
BICpred<-forecast(BICVAR,h=1)$forecast
#Extract point forecasts
AICfcsts<-select(data.frame(AICpred),ends_with("Forecast"))
BICfcsts<-select(data.frame(BICpred),ends_with("Forecast"))
#Fit LASSO VAR in all variables up to 8 lags
lassovar<-glmnet(xvars,NIPAwindow,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#lassovar$lambda
#Show coefficients for large value of lambda (heavy penalization, most coefs gone)
#coef(lassovar,s=0.02)
#Show coefficients for intermediate value of lambda (medium penalization, some coefs gone)
#coef(lassovar,s=0.008)
#Show coefficients for small value of lambda (little penalization, most coefs remain)
#coef(lassovar,s=0.001)
## Fit LASSO VAR with 10-fold cross-validation
lassovarCV<-cv.glmnet(xvars,NIPAwindow,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda<-lassovarCV$lambda.min
#Baroque construction to take last period and format it for predict command
#Somehow need 2 rows in prediction value or it throws a fit
#Presumably because matrix coerced to vector and then it thinks it can't multiply anymore
maxT<-length(NIPAwindow[,1]) #index of last period
newX<-t(as.matrix(data.frame(as.vector(t(NIPAwindow[(maxT-8):(maxT-1),])), as.vector(t(NIPAwindow[(maxT-7):maxT,])))))
#Predict 2021Q1 series
cvlassopredictions<-window(ts(predict(lassovarCV,newx = newX)[,,1],start=c(2020,4),frequency=4),start=c(2021,1))
#Fit L2 Penalized VAR in all variables up to 8 lags
ridgevar<-glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#ridgevar$lambda
#Show coefficients for large value of lambda (heavy penalization, coefs shrink a lot)
#coef(ridgevar,s=10)
#Show coefficients for intermediate value of lambda (medium penalization, coefs shrink a little)
#coef(ridgevar,s=1)
#Show coefficients for small value of lambda (little penalization, coefs close to OLS values)
#coef(ridgevar,s=0.01)
## Fit Ridge VAR with 10-fold cross-validation
ridgevarCV<-cv.glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda2<-ridgevarCV$lambda.min
#Predict 2021Q1 series
cvridgepredictions<-window(ts(predict(ridgevarCV,newx = newX)[,,1],start=c(2020,4),frequency=4),start=c(2021,1))
# #Plot Data Series
# autoplot(window(NIPAdata,start=c(2012,4)))+
# autolayer(cvlassopredictions)+
# autolayer(cvridgepredictions)
# autolayer(AICpred$diff.log.gdp..$mean)+
# autolayer(BICpred$diff.log.gdp..$mean)
- Let’s go back to our GDP forecasting example using history of changes in log of \(Y\), \(C\), \(I\), \(G\)
- Predictor class will be Vector Autoregressions in up to 8 lags of all 4 variables
- 33 parameters per equation, 132 total
- Estimate by AIC, BIC, LASSO, and Ridge
- AIC selects 5 lags, BIC selects 1 lag, reflecting smaller penalty for AIC
- 10-fold cross validation selects \(\lambda=0.0059832\) for LASSO, \(\lambda=2.3599158\) for ridge
- LASSO keeps just 4 nonzero parameters, while ridge chooses non-zero value for all 132
- Compare 20 for BIC, 84 for AIC, both ordered
#Collect 2021Q1 Forecasts into a data frame
var1table<-data.frame(
t(cvridgepredictions),
t(cvlassopredictions),
t(AICfcsts),
t(BICfcsts)
)
rownames(var1table)<-c("GDP","Consumption","Investment","Government")
#Make Table of Estimated Coefficients
kable(var1table,
col.names=c("Ridge","LASSO","AIC","BIC"),
caption="2021Q1 Forecasts from Different Methods") %>%
kable_styling(bootstrap_options = "striped", font_size = 16)
2021Q1 Forecasts from Different Methods

|             | Ridge     | LASSO     | AIC       | BIC       |
|-------------|-----------|-----------|-----------|-----------|
| GDP         | 0.0151859 | 0.0151859 | 0.0061177 | 0.0165646 |
| Consumption | 0.0153646 | 0.0153646 | 0.0092199 | 0.0179683 |
| Investment  | 0.0157096 | 0.0157096 | 0.0118236 | 0.0245480 |
| Government  | 0.0152194 | 0.0152194 | 0.0247882 | 0.0108015 |
Ridge Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(ridgevarCV)$y1),
as.matrix(coef(ridgevarCV)$y2),
as.matrix(coef(ridgevarCV)$y3),
as.matrix(coef(ridgevarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(ridgevarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = spec_color(x))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in Ridge Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in Ridge Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| (Intercept) | 0.0151858897298007 | 0.0153645580354301 | 0.0157096142355653 | 0.0152194105076093 |
| diff.log.gdp.. | 2.30031255499118e-37 | 9.38827596242842e-38 | 5.67710819257864e-37 | 4.1908660740873e-37 |
| diff.log.pcec.. | 2.03414871590585e-37 | -1.42003271917175e-38 | 1.03518480719846e-36 | 2.43101694930497e-37 |
| diff.log.invest.. | 5.81179178431625e-38 | 4.89847015090077e-38 | 1.72474357338571e-37 | 2.89445812951768e-38 |
| diff.log.gce.. | 8.14333315056698e-38 | 4.65662452126473e-38 | -4.97908409989161e-37 | 6.06876408802504e-37 |
| diff.log.gdp...1 | 2.66230340418565e-37 | 2.17899432587549e-37 | 5.89720423452404e-38 | 5.11288435463506e-37 |
| diff.log.pcec...1 | 2.45438578253927e-37 | 2.72574002810306e-37 | -7.46099917642788e-38 | 4.1874800901528e-37 |
| diff.log.invest...1 | 6.16468295468514e-38 | 3.76230895303787e-38 | 1.43513553628022e-37 | 4.89935776717903e-38 |
| diff.log.gce...1 | 2.88850420892817e-38 | -2.0343767595241e-39 | -4.35758744418575e-37 | 4.39598403688305e-37 |
| diff.log.gdp...2 | 1.69764387502009e-37 | 1.38669330320158e-37 | -5.53541951770884e-37 | 6.43937167156774e-37 |
| diff.log.pcec...2 | 2.57387614260426e-37 | 2.31769546123258e-37 | -1.50074670774808e-37 | 4.96432317583687e-37 |
| diff.log.invest...2 | 2.05248586117648e-38 | 8.68966908404487e-40 | -4.0244191573353e-38 | 8.19254014008568e-38 |
| diff.log.gce...2 | -1.52460714936195e-38 | 5.71964171443974e-38 | -6.40050604726444e-37 | 3.00746325218341e-37 |
| diff.log.gdp...3 | 1.02868823860545e-37 | 1.79075449325731e-37 | -8.72166554492289e-37 | 6.23993172561969e-37 |
| diff.log.pcec...3 | 1.91306415158673e-37 | 1.56530919140519e-37 | -1.39775744944728e-37 | 4.57958482381477e-37 |
| diff.log.invest...3 | 3.99606222132951e-39 | 2.86479073484879e-38 | -2.46853969613774e-37 | 1.07679973010426e-37 |
| diff.log.gce...3 | -1.70497629422387e-38 | 3.24506547360457e-38 | -2.91745675031562e-37 | 1.4731800906771e-37 |
| diff.log.gdp...4 | -9.47944652278077e-39 | 8.20701352695042e-38 | -9.8328564921561e-37 | 3.94870645320733e-37 |
| diff.log.pcec...4 | 1.41634175467069e-37 | 3.09685052861973e-37 | -9.95791656490379e-37 | 4.61782046707173e-37 |
| diff.log.invest...4 | -2.58637064333019e-38 | -3.58456162420379e-38 | -1.53548027819855e-37 | 6.26430258320072e-38 |
| diff.log.gce...4 | 4.46754017745955e-38 | 9.68030481339711e-38 | -1.69478106557769e-38 | 3.76103120629892e-38 |
| diff.log.gdp...5 | 6.29617484125634e-38 | 1.51638436741035e-37 | -5.19448139677564e-37 | 2.21421305380236e-37 |
| diff.log.pcec...5 | 1.29010668326766e-37 | 1.52005958450989e-37 | -1.5993318690095e-37 | 2.33780106386364e-37 |
| diff.log.invest...5 | -1.59227211270025e-38 | 4.86621283586907e-39 | -1.84306909715284e-37 | 4.02670363270379e-38 |
| diff.log.gce...5 | 1.06431563250505e-37 | 1.13658269140945e-37 | 3.1415103584003e-37 | 5.11064147737253e-38 |
| diff.log.gdp...6 | 9.21687780778919e-38 | 1.81069175743602e-37 | -2.22856911137888e-37 | 1.45859585321776e-37 |
| diff.log.pcec...6 | 1.92215622922207e-37 | 3.056783686707e-37 | -2.1135329401155e-37 | 2.35697554061445e-37 |
| diff.log.invest...6 | -1.47522987657778e-38 | -6.03934804215541e-39 | -5.93676171946803e-38 | 3.79610675225019e-39 |
| diff.log.gce...6 | 1.01173813993219e-37 | 8.62521387934146e-38 | 3.50630401000713e-37 | 3.64942205833305e-38 |
| diff.log.gdp...7 | 1.01868469749412e-37 | 1.87365890427143e-37 | -1.60968447243311e-37 | 8.74068306105839e-38 |
| diff.log.pcec...7 | 1.30416714458953e-37 | 1.9113863912783e-37 | -7.5849159796442e-38 | 1.84329801657064e-37 |
| diff.log.invest...7 | 1.09898866000851e-38 | 2.19118970648559e-38 | 4.91704674317236e-39 | -1.70444094452624e-38 |
| diff.log.gce...7 | 6.045842474227e-38 | 6.63072766928136e-38 | 2.19700121521296e-38 | 7.83684941744905e-38 |
LASSO Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(lassovarCV)$y1),
as.matrix(coef(lassovarCV)$y2),
as.matrix(coef(lassovarCV)$y3),
as.matrix(coef(lassovarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(lassovarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in LASSO Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in LASSO Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| (Intercept) | 0.0151858897298007 | 0.0153645580354301 | 0.0157096142355653 | 0.0152194105076093 |
| diff.log.gdp.. | 0 | 0 | 0 | 0 |
| diff.log.pcec.. | 0 | 0 | 0 | 0 |
| diff.log.invest.. | 0 | 0 | 0 | 0 |
| diff.log.gce.. | 0 | 0 | 0 | 0 |
| diff.log.gdp...1 | 0 | 0 | 0 | 0 |
| diff.log.pcec...1 | 0 | 0 | 0 | 0 |
| diff.log.invest...1 | 0 | 0 | 0 | 0 |
| diff.log.gce...1 | 0 | 0 | 0 | 0 |
| diff.log.gdp...2 | 0 | 0 | 0 | 0 |
| diff.log.pcec...2 | 0 | 0 | 0 | 0 |
| diff.log.invest...2 | 0 | 0 | 0 | 0 |
| diff.log.gce...2 | 0 | 0 | 0 | 0 |
| diff.log.gdp...3 | 0 | 0 | 0 | 0 |
| diff.log.pcec...3 | 0 | 0 | 0 | 0 |
| diff.log.invest...3 | 0 | 0 | 0 | 0 |
| diff.log.gce...3 | 0 | 0 | 0 | 0 |
| diff.log.gdp...4 | 0 | 0 | 0 | 0 |
| diff.log.pcec...4 | 0 | 0 | 0 | 0 |
| diff.log.invest...4 | 0 | 0 | 0 | 0 |
| diff.log.gce...4 | 0 | 0 | 0 | 0 |
| diff.log.gdp...5 | 0 | 0 | 0 | 0 |
| diff.log.pcec...5 | 0 | 0 | 0 | 0 |
| diff.log.invest...5 | 0 | 0 | 0 | 0 |
| diff.log.gce...5 | 0 | 0 | 0 | 0 |
| diff.log.gdp...6 | 0 | 0 | 0 | 0 |
| diff.log.pcec...6 | 0 | 0 | 0 | 0 |
| diff.log.invest...6 | 0 | 0 | 0 | 0 |
| diff.log.gce...6 | 0 | 0 | 0 | 0 |
| diff.log.gdp...7 | 0 | 0 | 0 | 0 |
| diff.log.pcec...7 | 0 | 0 | 0 | 0 |
| diff.log.invest...7 | 0 | 0 | 0 | 0 |
| diff.log.gce...7 | 0 | 0 | 0 | 0 |
AIC Coefficients
#Make Coefficient Table
zzz<-data.frame(AICVAR$varresult$diff.log.gdp..$coefficients,
AICVAR$varresult$diff.log.pcec..$coefficients,
AICVAR$varresult$diff.log.invest..$coefficients,
AICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(AICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in AIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in AIC Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| diff.log.gdp...l1 | -0.746943763057611 | -0.339280717024153 | -4.06998274098274 | -0.115524482447805 |
| diff.log.pcec...l1 | 0.542887399420515 | 0.0603927127377116 | 3.52746849769727 | 0.172366474214833 |
| diff.log.invest...l1 | 0.147665750520901 | 0.0996000486935814 | 0.628142284010407 | 0.0593200663481155 |
| diff.log.gce...l1 | 0.232979968469864 | 0.127013734113583 | 0.546793026132765 | 0.490074188116684 |
| diff.log.gdp...l2 | 0.148276807314082 | 0.0030153149531751 | 0.700719436813809 | 0.296184284489155 |
| diff.log.pcec...l2 | -0.0468729316703431 | 0.0689250929776537 | -0.53805307285928 | -0.0349076307322074 |
| diff.log.invest...l2 | 0.0142184321886127 | 0.0236461267511497 | -0.0136170202386893 | -0.0316570502447788 |
| diff.log.gce...l2 | -0.025225073658209 | -0.102578974706382 | -0.00283444415118627 | 0.0579397109188032 |
| diff.log.gdp...l3 | 0.233299129276807 | 0.193302574074163 | 0.303322310559933 | 0.0908340876606647 |
| diff.log.pcec...l3 | -0.0516608151840398 | -0.007015396497794 | -0.1401673205575 | -0.160195069905048 |
| diff.log.invest...l3 | -0.0585268233889642 | -0.0584938941415848 | -0.155597845827494 | 0.0250097255157367 |
| diff.log.gce...l3 | -0.0913173183824062 | -0.0280589876912641 | -0.420294633255246 | 0.063747863705081 |
| diff.log.gdp...l4 | 0.301371080744814 | 0.120097659148702 | 1.60230119894812 | 0.327298748721874 |
| diff.log.pcec...l4 | 0.130764968897992 | 0.185885995556202 | 0.0787419661514829 | -0.248639740861761 |
| diff.log.invest...l4 | -0.0700006765330574 | -0.0166441973406796 | -0.417800332681899 | -0.00209708631479729 |
| diff.log.gce...l4 | -0.146455512729859 | -0.0406473284896966 | -0.67189744156677 | -0.0470418491889946 |
| diff.log.gdp...l5 | -0.610591245995868 | -0.169926856449345 | -2.29212310983113 | -0.447928309402065 |
| diff.log.pcec...l5 | 0.468231178107616 | 0.357972406028735 | 0.894382816968065 | 0.306927940613055 |
| diff.log.invest...l5 | 0.0432689297333812 | -0.0198516764329586 | 0.192061121492337 | 0.0570281247813538 |
| diff.log.gce...l5 | 0.186550699795729 | 0.125842023517093 | 0.595685589600826 | 0.0211641039656517 |
| const | 0.00567042047478156 | 0.00620177914475714 | 0.00935819315081179 | 0.00184789666280374 |
BIC Coefficients
#Make Coefficient Table
zzz<-data.frame(BICVAR$varresult$diff.log.gdp..$coefficients,
BICVAR$varresult$diff.log.pcec..$coefficients,
BICVAR$varresult$diff.log.invest..$coefficients,
BICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(BICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in BIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in BIC Regression

| | GDP | C | I | G |
|---|---|---|---|---|
| diff.log.gdp...l1 | -0.616078505592147 | -0.216952693351494 | -4.11590175514894 | 0.0806110991176697 |
| diff.log.pcec...l1 | 0.51185181145174 | 0.0369735342765242 | 3.65466265933299 | -0.00896787215583153 |
| diff.log.invest...l1 | 0.138957227319627 | 0.0887645774343753 | 0.661995796320753 | 0.0499547661798207 |
| diff.log.gce...l1 | 0.186098309220188 | 0.10353264493876 | 0.136023745308116 | 0.614133345716592 |
| const | 0.0116164872700755 | 0.0150636667863203 | 0.00984807889662198 | 0.00412528923826631 |
Conclusions
- ERM does well for choosing within a simple hypothesis class
- Minimizing penalized criterion is a sensible way to improve accounting for overfitting
- Based on number (AIC/BIC) or size (L1/L2) of parameters or functions
- When choosing between multiple procedures, including ERM over multiple classes, can use measure of risk for each
- Penalization acts like restricting size of parameter class
- Can choose penalty by cross-validating
- Goal of regularization is to trade off in-sample and out-of-sample fit
- Other tools exist for finding the “right” descriptive model
References
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning, 2nd ed. MIT Press, 2018.
- Grace Wahba. Spline Models for Observational Data. 1991.
  - Overview of smoothing splines and penalization; explains how and why one can work with all differentiable functions