Regularization
- Review of Generalization Problem and ERM
- Structural Risk Minimization Principle
- Cross Validation
- Penalized Estimation
- Penalization vs testing/model evaluation
Risk Minimization
- In statistical learning approach, goal is to produce rule \(f\in\mathcal{F}\) with low risk for true distribution
- \(R_p(f)=E_p\ell(y_{T+h},f(\mathcal{Y}_{T}))\)
- Empirical Risk Minimizer based on minimizing risk for empirical distribution \(\widehat{R}(f)=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))\)
- The difference between what we want and what we achieve is the generalization error
- \(R_p(\widehat{f})=\widehat{R}(\widehat{f})+(R_p(\widehat{f})-\widehat{R}(\widehat{f}))\)
- Data can tell you whether first term is small: minimized by ERM
- Theory required to ensure second term also small
- Can we do better by being more explicit about generalization error?
- Sometimes not much: “Fundamental Theorem of Machine Learning” (Mohri et al 2012)
- Under same conditions that ERM has strong performance bound, no method can do “substantially better”
- In particular, if \(\mathcal{F}\) has “low complexity”, excess risk, which is already small, cannot be made much smaller
- Does this mean we may as well always go with ERM?
- Not really! Especially for small samples \(T\) or classes with “high complexity”
Structural Risk Minimization
- Ideal method would take into account both empirical risk and generalization error
- Impossible to do exactly since generalization error part not known
- But it can be approximated or bounded!
- Idea: Find a term related to generalization error, called a penalty \(\mathcal{J}(f)\)
- Use a forecasting rule, called a Structural Risk Minimizer which minimizes empirical risk plus penalty \[\widehat{f}^{SRM}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_{t}))+\mathcal{J}(f))\]
- If penalty exactly equals generalization error, this is exactly the oracle estimator
- If penalized risk is a better approximation than the empirical risk, may still outperform ERM
- So, how do we find a good penalty?
- Many proposed methods
- Today: A selection of popular penalties and discussion of their properties (a minimal code sketch of the penalized objective follows below)
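- As a rough illustration (not from the lecture code), the SRM objective for a linear one-step-ahead rule \(f(\mathcal{Y}_t)=b_0+b_1y_t\) with square loss and an L2 penalty could be minimized numerically as follows; the function name srm_ar1 and the penalty form are illustrative assumptions
#Minimal SRM sketch: empirical risk plus an L2 penalty for a linear rule b0+b1*y_t
srm_ar1 <- function(y, lambda, h = 1) {
  n <- length(y)
  penalized_risk <- function(b) {
    preds <- b[1] + b[2] * y[1:(n - h)]     #forecasts of y_{t+h} from y_t
    mean((y[(1 + h):n] - preds)^2) +        #empirical risk under square loss
      lambda * b[2]^2                       #penalty J(f)
  }
  optim(c(0, 0), penalized_risk)$par        #numerically minimize empirical risk + penalty
}
#Hypothetical example call: srm_ar1(diff(log(gdp)), lambda = 0.1)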
Cross Validation
- We have already seen cross validation as a way to assess forecast quality of a forecasting method \(\widehat{f}(\mathcal{Y}_{t})\)
- Construct a forecasting rule on a training sample and estimate its risk on a different test sample
- May do this for many test and training samples, then average the results
- E.g. risk estimated from a rolling forecast origin (cf tsCV) \[\widehat{R}^{tsCV}(\widehat{f})=\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},\widehat{f}(\mathcal{Y}_{t}))\]
- Many versions exist: test and training samples can be blocks, weighted in different ways, etc
- Eg: K-fold CV for ERM over independent data \(\{(y_t,z_t)\}_{t=1}^{T}\)
- Split sample into K sets \(\mathcal{T}_k\) of size \(T/K\)
- For \(k=1\ldots K\) estimate \(\widehat{f}^{(-k)}=\underset{f\in\mathcal{F}}{\arg\min}\frac{1}{T-T/K}\sum_{t\notin\mathcal{T}_k}\ell(y_t,f(z_t))\)
- \(\widehat{R}^{kfoldCV}=\frac{1}{T}\sum_{k=1}^{K}\sum_{t\in\mathcal{T}_k}\ell(y_t,\widehat{f}^{(-k)}(z_t))\)
- For \(K=T\), called “Leave One Out” Cross Validation
- CV reduces bias from overfitting by using different samples for estimation and evaluation (a minimal sketch follows below)
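- A minimal sketch of the K-fold estimate above, assuming iid data in a data frame dat with columns y and z and a linear rule fit by lm (all names illustrative); with equal fold sizes the average below matches the formula above
#Minimal K-fold CV sketch for ERM with square loss and a linear rule
kfoldCVrisk <- function(dat, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(dat)))  #randomly assign observations to folds
  fold_errs <- sapply(1:K, function(k) {
    fit <- lm(y ~ z, data = dat[folds != k, ])       #ERM on all folds except k
    mean((dat$y[folds == k] - predict(fit, newdata = dat[folds == k, ]))^2) #loss on held-out fold k
  })
  mean(fold_errs)                                    #average held-out loss across folds
}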
Cross validation over multiple hypothesis classes
- Cross validation can also be used as a way to choose between forecasting methods
- Simply choose the method that minimizes the cross-validation estimate of risk
- If methods \(\widehat{f}\) are just fixed functions, for tsCV this is just ERM again
- Case where CV is useful: when we have multiple hypothesis classes of rules \(\{\mathcal{F}_c:c\in\mathcal{C}\}\)
- \(c\) is a tuning parameter affecting some aspect of the procedure, often a measure of complexity
- Eg for AR(p) classes, \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^{p}\beta_{j}y_{t-j}:\ \beta\in\mathbb{R}^{p+1}\}\), tuning parameter is \(p\)
- For each \(c\in\mathcal{C}\), run ERM over \(\mathcal{F}_c\)
\[\widehat{f}^{c}(\mathcal{Y}_t)=\underset{f\in\mathcal{F}_c}{\arg\min}\frac{1}{t-h}\sum_{s=1}^{t-h}\ell(y_{s+h},f(\mathcal{Y}_{s}))\]
- Choose \(\widehat{c}^{CV}\) by minimizing Cross-Validation risk of \(\widehat{f}^{c}()\) \[\widehat{c}^{CV}=\underset{c\in\mathcal{C}}{\arg\min}\widehat{R}^{CV}(\widehat{f}^{c})\]
- Forecast using \(\widehat{f}^{\widehat{c}^{CV}}\) (a tsCV-based sketch of this procedure follows below)
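- A rough sketch of this procedure for the AR(p) example, using tsCV from the forecast package (the series y and the candidate orders 1 to 8 are placeholders, not the lecture’s code)
#Choose AR order p by rolling forecast origin (tsCV) risk estimates
library(forecast)
arforecast <- function(x, h, p) {
  forecast(Arima(x, order = c(p, 0, 0)), h = h)  #ERM over the AR(p) class, then forecast
}
cvrisk <- sapply(1:8, function(p) {
  e <- tsCV(y, arforecast, h = 1, p = p)         #one-step-ahead rolling-origin errors
  mean(e^2, na.rm = TRUE)                        #tsCV estimate of risk for this order
})
p.cv <- which.min(cvrisk)                        #tuning parameter chosen by cross validation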
Evaluating Cross Validation
- Cross validation over multiple classes not the same as ERM because \(\widehat{f}^{c}\) is not a fixed function but depends on the data
- Often, classes are ordered: \(\mathcal{F}_{c_1}\subset \mathcal{F}_{c_2}\) if \(c_1<c_2\)
- Compare small classes of low complexity to big classes of high complexity: More lags, more covariates, etc
- Low complexity classes may have high empirical risk but low error on test set, vice versa for high complexity classes
- Cross validation accounts for these differences: won’t always pick the highest complexity class
- Still have some approximation error, because test set distribution \(\neq\) future population distribution
- Error increases in complexity of \(\mathcal{C}\), which is usually small: one or two parameters
- Worse when training and test samples are dependent: helped slightly by using tsCV over k-fold
- In general, provides fairly good approximation to generalization error but can be extremely costly to compute
- Solve an optimization problem for each \(c\in\mathcal{C}\) for each \(t\) in sample
- Can be hundreds or thousands of times as many computations as a single ERM forecast
- Variants or special cases which are less costly but provide similar or not much worse performance may be preferred
Evaluation and implementation
- AIC and BIC designed to pick a rule with low risk
- BIC has slightly larger penalty: downgrades more complex models more
- Under some very strong conditions, AIC “close to” true risk, at least when T is large
- BIC instead designed to pick “true model” when T is large
- For forecasting we usually care about the former rather than the latter goal
- For this reason, the Hyndman & Athanasopoulos text recommends AIC
- But be cautious: result requires large samples, and in small samples AIC can be bad measure of risk
- For this reason, Diebold text recommends BIC, as slightly larger penalty accounts for chance of overfitting from misestimation of risk
- Both estimates usually provide worse measure of risk than cross validation
- Sometimes faster: \(\approx T\) times fewer optimizations
- Exception is linear models, where known formula simplifies CV calculations
- Automatically reported with Arima, VAR, ets, etc, or computed ex post by CV
- AIC also part of auto.arima, to choose between orders of AR (usage sketch below)
- VARselect computes AIC/BIC for VAR over loss \(\log\det(\sum_{t=1}^{T-h}(y_{t+h}-\widehat{y}_{t})(y_{t+h}-\widehat{y}_{t})^\prime)\)
Parameter size based penalties
- Penalization based on the number of parameters requires solving the model for each parameter set \(c\in\mathcal{C}\)
- Fine if there are only a small number: once per lag order of AR/VAR up to the max lag; hard if the sets \(\mathcal{F}_c\) are every possible subset of p regressors: \(2^p\) regressions
- Works if best predictions come from ignoring all but a subset of predictors, and leaving rest unrestricted
- Alternative: allow all predictors to matter, but reduce total size of parameters by penalizing their norm
- Linear hypothesis classes \(\mathcal{F}=\{\beta_0+\sum_{j=1}^m\beta_j z_j:\ \beta\in\mathbb{R}^{m+1}\}\), penalty is \(\lambda\Vert\beta\Vert^p_p\)
- Norm \(\Vert\beta\Vert_p\) is a measure of “size” of coefficients, e.g.
- L1 (LASSO) Penalty \(\Vert\beta\Vert_1=\sum_{j=0}^{m}|\beta_j|\)
- L2 (Ridge) Penalty \(\Vert\beta\Vert^2_2=\sum_{j=0}^{m}\beta_j^2\)
- Mixed L1/L2 (Elastic Net) Penalty \(\Vert\beta\Vert_{1/2,a}=a\Vert\beta\Vert^1_1+(1-a)\Vert\beta\Vert^2_2\) for \(a\in[0,1]\)
- \(\lambda\geq 0\) decides size of penalty term
- Large value means strong restrictions: coefficients shrink a lot and class much simpler
- Small value allows more complexity, at cost of greater overfitting risk. 0 means no restrictions: estimator is just ERM
- Penalized estimator solves \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert\beta\Vert_p^p)\) (a closed form for the ridge case is given below)
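- For intuition, in the special case of square loss and the L2 (ridge) penalty, with predictors stacked in a matrix \(Z\) and the intercept ignored for simplicity, the penalized minimizer has the standard closed form \[\widehat{\beta}^{\lambda}=\left(\frac{1}{T-h}Z^{\prime}Z+\lambda I\right)^{-1}\frac{1}{T-h}Z^{\prime}Y\] so larger \(\lambda\) pulls all coefficients toward 0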
Evaluation and Implementation
- Norm penalized estimators address overfitting by preventing large fluctuations in values of parameters
- Type of norm controls which kinds of fluctuations are more or less restricted
- L2 penalty more strongly affected by very large values, less by small values
- All values partially shrink
- L1 penalty equally strongly affected by small and large values
- Allows some larger coefficients, but shrinks smaller ones a lot, often to exactly 0: sparsity
- Acts like penalties on # of parameters, but non-0 coefficients also slightly shrunk
- Elastic net in between, by amount controlled by \(a\): small terms go to 0, but big terms also shrunk
- For a fixed penalty size \(\lambda\) computation is easy: one instead of many optimizations
- Scale problem: how to choose \(\lambda\)?
- Typically, find \(\widehat{f}^\lambda\) for many \(\lambda\) and select \(\widehat{\lambda}^{CV}\) by cross validation
- Efficient LARS algorithm makes computation for all \(\lambda\) fast in linear model
- Implemented for linear hypotheses and square loss, cross-entropy, multivariate square loss, etc in glmnet (usage sketch below)
- cv.glmnet implements cross validation parameter selection, but not a time series version
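- Hedged sketch of the glmnet interface (the predictor matrix x and response y are placeholders): the argument alpha plays the role of the mixing weight \(a\) above, up to glmnet’s exact parameterization
library(glmnet)
lassofit <- glmnet(x, y, alpha = 1)    #L1 (LASSO) penalty, fit over a path of lambda values
ridgefit <- glmnet(x, y, alpha = 0)    #L2 (ridge) penalty path
enetfit <- glmnet(x, y, alpha = 0.5)   #elastic net mixture
cvfit <- cv.glmnet(x, y, alpha = 1)    #k-fold CV over the lambda path (not a time series version)
coef(cvfit, s = "lambda.min")          #coefficients at the CV-minimizing lambda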
What kinds of model classes?
- Penalizing size of parameters can substitute for using smaller number of parameters
- Can include more predictors for same level of generalization error if penalized
- Can use many predictors, or nonlinear functions of predictors: polynomials, splines, etc
- Benefit is better ability to predict and get close to oracle predictor
- Cost is that values of parameters get distorted
- How complicated can model class be?
- Depends on the number of possible relevant predictors and on how big and complicated a model within that class is needed
- LASSO can handle more parameters than data points, if most are (close to) 0
- Up to exponentially many if very sparse and large \(\lambda\) chosen
- Ridge can handle arbitrary number of predictors
- But effects of each become progressively more muted
- In practice, use LASSO with many weakly related predictors, of which only some might matter
- Use ridge with many, possibly related predictors, each of which contributes a little bit to the forecast
- Often believable for lags, so common to use in (V)AR estimates
Function based penalties
- Rather than penalizing parameters of predictor \(f()\), you can penalize \(f\) directly
- Penalized estimator solves, for \(\Vert f\Vert\) some norm on \(f\) and \(\lambda\geq0\),
- \(\widehat{f}^\lambda=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\lambda\Vert f\Vert)\)
- For classes \(\mathcal{F}\) defined by parameters, like linear hypothesis class, little essential difference
- (Usually) equivalent to some penalty on parameters
- But if class \(\mathcal{F}\) isn’t dependent on a (finite) set of parameters, eg \(\mathcal{F}=\{\text{all functions of Z}\}\), can be very different
- Most prominent example: integrated squared second derivative penalty \(\Vert f\Vert=\int f^{\prime\prime}(z)^2dz\)
- Penalizes “wiggliness” regardless of particular shape: minimal for straight line
- Amazing fact (cf Wahba 1990): penalized minimizer over class of all functions has known, computable form
- Called a “Smoothing Spline”: piecewise cubic polynomial between data points
- Equivalent to ridge penalty over (T-dimensional) class of piecewise cubic polynomials
- Implemented as smooth.spline, as well as in specialized functions in the gam and mgcv libraries
- Choose \(\lambda\) by CV again: fast algorithm known for non-time series case
Smoothing Splines: GDP growth from Lagged GDP growth
library(fpp2) #Forecasting
library(glmnet) #Lasso and Ridge
library(vars) #Vector Autoregressions
library(fredr) #Access FRED Data
##Obtain and transform NIPA Data (cf Lecture 08)
fredr_set_key("8782f247febb41f291821950cf9118b6") #Key I obtained for this class
#US GDP components
GDP<-fredr(series_id = "GDP",
observation_start = as.Date("1947-01-01")) #Gross Domestic Product
#US GDP components from NIPA tables (cf http://www.bea.gov/national/pdf/nipaguid.pdf)
PCEC<-fredr(series_id = "PCEC",
observation_start = as.Date("1947-01-01")) #Personal consumption expenditures
FPI<-fredr(series_id = "FPI",
observation_start = as.Date("1947-01-01")) #Fixed Private Investment
CBI<-fredr(series_id = "CBI",
observation_start = as.Date("1947-01-01")) #Change in Private Inventories
NETEXP<-fredr(series_id = "NETEXP",
observation_start = as.Date("1947-01-01")) #Net Exports of Goods and Services
GCE<-fredr(series_id = "GCE",
observation_start = as.Date("1947-01-01")) #Government Consumption Expenditures and Gross Investment
#Format the series as quarterly time series objects, starting at the first date
gdp<-ts(GDP$value,frequency = 4,start=c(1947,1),names="Gross Domestic Product")
pcec<-ts(PCEC$value,frequency = 4,start=c(1947,1),names="Consumption")
fpi<-ts(FPI$value,frequency = 4,start=c(1947,1),names="Fixed Investment")
cbi<-ts(CBI$value,frequency = 4,start=c(1947,1),names="Inventory Growth")
invest<-fpi+cbi #Private Investment
netexp<-ts(NETEXP$value,frequency = 4,start=c(1947,1),names="Net Exports")
gce<-ts(GCE$value,frequency = 4,start=c(1947,1),names="Government Spending")
#Convert to log differences to ensure stationarity and collect in frame
NIPAdata<-ts(data.frame(diff(log(gdp)),diff(log(pcec)),diff(log(invest)),diff(log(gce))),frequency = 4,start=c(1947,2))
# Construct lags
l1NIPA<-window(stats::lag(NIPAdata,-1),start=c(1949,2),end=c(2018,3))
l2NIPA<-window(stats::lag(NIPAdata,-2),start=c(1949,2),end=c(2018,3))
l3NIPA<-window(stats::lag(NIPAdata,-3),start=c(1949,2),end=c(2018,3))
l4NIPA<-window(stats::lag(NIPAdata,-4),start=c(1949,2),end=c(2018,3))
l5NIPA<-window(stats::lag(NIPAdata,-5),start=c(1949,2),end=c(2018,3))
l6NIPA<-window(stats::lag(NIPAdata,-6),start=c(1949,2),end=c(2018,3))
l7NIPA<-window(stats::lag(NIPAdata,-7),start=c(1949,2),end=c(2018,3))
l8NIPA<-window(stats::lag(NIPAdata,-8),start=c(1949,2),end=c(2018,3))
#convert to matrices
xvarsdf<-data.frame(l1NIPA,l2NIPA,l3NIPA,l4NIPA,l5NIPA,l6NIPA,l7NIPA,l8NIPA)
xvars<-as.matrix(xvarsdf)
NIPAwindow<-matrix(window(NIPAdata,start=c(1949,2),end=c(2018,3)),length(l1NIPA[,1]),4)
#Spline models, yeah?
library(gam) #loads smoothing spline package
lagDlogGDP<-xvars[,1]
DlogGDP<-NIPAwindow[,1]
#Smoothing spline Regression of GDP on lagged GDP
# Default behavior fits penalty parameter by generalized Cross Validation
# Predict GDP growth using smoothing spline in lagged GDP growth
gdp.spl<-smooth.spline(lagDlogGDP,DlogGDP)
splinepredictions<-predict(gdp.spl)
- \(\lambda=\) 0.573289 chosen by (non-time series) cross validation gives almost exactly linear fit
#Build Plot
gdpdata<-data.frame(xvars[,1],NIPAwindow[,1],splinepredictions$x,splinepredictions$y)
colnames(gdpdata)<-c("LDlGDP","DlGDP","xspline","yspline")
ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
geom_line(aes(x=xspline,y=yspline),color="red",size=1)+
ggtitle("Smoothing Spline Fit of This vs Last Quarter's GDP Growth",
subtitle="Almost exactly linear due to large optimal penalty for nonlinearity")+
xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")
# An alternate approach which uses "gam" command to fit spline
# Advantage is that it's built in to ggplot
# Disadvantage is that it uses only default options, so not customizable
# In this case, results are exactly the same
# ggplot(gdpdata,aes(x=LDlGDP,y=DlGDP))+geom_point()+
# ggtitle("Spline and OLS Fit of This vs Last Quarter's GDP Growth",
# subtitle="There are two lines which exactly overlap")+
# xlab("Change in Log GDP, t-1")+ylab("Change in Log GDP, t")+
# geom_smooth(method=gam,se=FALSE,color="blue")+
# geom_smooth(method=lm,se=FALSE,color="red")
Exotic penalties
- The example penalties mentioned above are just the most common ones
- New and different penalties can be used to emphasize different features of the prediction or type of data used
- Thousands in statistics/ML literature, and new ones every day
- Parameter-based penalties can emphasize different sets of parameters or features of size or shape
- Function based penalties can look at fine or large scale features, or multiple variables
- Cases which allow finite computable solution are called (reproducing) kernel methods
- Shared principle: want to trade off flexible form to match patterns in observed data against size and complexity of class
- Different prediction tasks will call for different shapes, sizes of data, so specialized penalties can help
Penalties versus Bounds
- Alternately, can just run ERM over predictor class \(\mathcal{F}\) with restriction that \(\Vert f \Vert \leq c\) for some c
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F}, \Vert f \Vert\leq c}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t)))\)
- E.g. only look at functions with norm of predictors bounded \(\mathcal{F}_c=\{\beta_0+\sum_{j=1}^d\beta_jz_j:\ \beta\in\mathbb{R}^{d+1}, \Vert\beta\Vert_p\leq c\}\)
- Can choose \(c\) by cross-validation, just like with penalty
- Claim: ERM over class with bound imposed is numerically identical to some penalized empirical risk minimizer
- More formally, for any \(c>0\), there is some \(\widehat{\lambda}(c)\) such that
- \(\widehat{f}^{ERM,c}=\underset{f\in\mathcal{F}}{\arg\min}(\frac{1}{T-h}\sum_{t=1}^{T-h}\ell(y_{t+h},f(\mathcal{Y}_t))+\widehat{\lambda}(c)\Vert f\Vert)\)
- This is an application of the Lagrange multiplier theorem from constrained optimization
- Slight difference is that \(\widehat{\lambda}(c)\) may depend on the data, whereas it is fixed in penalized case
- Though since \(c\) and \(\lambda\) usually chosen based on data anyway, even this usually not a difference
- Thus, penalization inherits properties of ERM, but with a new, restricted class of predictors
- Explains variety of possible penalties: just different functions of the data (a numerical illustration of the correspondence follows below)
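- As a rough numerical illustration of this correspondence (simulated data, not the lecture’s GDP example): along a ridge path the coefficient norm shrinks as \(\lambda\) grows, so each bound \(c\) is attained by some penalty level
#Simulated illustration: coefficient norm along a glmnet ridge path falls as lambda rises
library(glmnet)
set.seed(42)
nsim <- 200; psim <- 10
zsim <- matrix(rnorm(nsim * psim), nsim, psim)           #simulated predictors
ysim <- as.vector(zsim %*% rnorm(psim)) + rnorm(nsim)    #simulated outcome
ridgepath <- glmnet(zsim, ysim, alpha = 0)               #ridge fits over a grid of lambda values
betanorm <- sqrt(colSums(as.matrix(ridgepath$beta)^2))   #L2 norm of beta(lambda) along the path
plot(log(ridgepath$lambda), betanorm, type = "l",
     xlab = "log(lambda)", ylab = "L2 norm of coefficients")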
Regularization for GDP Forecasting
library(dplyr) #Data manipulation
library(knitr) #Use knitr to make tables
library(kableExtra) #Extra options for tables
library(tibble) #More data manipulation
#Calculate information criteria
IC<-VARselect(NIPAdata,lag.max=8)
# Fit Optimal VARs by AIC, BIC
AICVAR<-VAR(NIPAdata,p=IC$selection[1])
BICVAR<-VAR(NIPAdata,p=IC$selection[3])
#Construct 1 period ahead forecasts
AICpred<-forecast(AICVAR,h=1)$forecast
BICpred<-forecast(BICVAR,h=1)$forecast
#Extract point forecasts
AICfcsts<-select(data.frame(AICpred),ends_with("Forecast"))
BICfcsts<-select(data.frame(BICpred),ends_with("Forecast"))
#Fit LASSO VAR in all variables up to 8 lags
lassovar<-glmnet(xvars,NIPAwindow,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#lassovar$lambda
#Show coefficients for large value of lambda (heavy penalization, most coefs gone)
#coef(lassovar,s=0.02)
#Show coefficients for intermediate value of lambda (medium penalization, some coefs gone)
#coef(lassovar,s=0.008)
#Show coefficients for small value of lambda (little penalization, most coefs remain)
#coef(lassovar,s=0.001)
## Fit LASSO VAR with 10-fold cross-validation
lassovarCV<-cv.glmnet(xvars,NIPAwindow,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda<-lassovarCV$lambda.min
#Baroque construction to take last period and format it for predict command
#Somehow need 2 rows in prediction value or it throws a fit
#Presumably because matrix coerced to vector and then it thinks it can't multiply anymore
newX<-t(as.matrix(data.frame(as.vector(t(NIPAwindow[270:277,])), as.vector(t(NIPAwindow[271:278,])))))
#Predict 2018Q3 series
cvlassopredictions<-window(ts(predict(lassovarCV,newx = newX)[,,1],start=c(2018,3),frequency=4),start=c(2018,4))
#Fit L2 Penalized VAR in all variables up to 8 lags
ridgevar<-glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Show range of lambda values from unpenalized to where all variables gone
#ridgevar$lambda
#Show coefficients for large value of lambda (heavy penalization, coefs shrink a lot)
#coef(ridgevar,s=10)
#Show coefficients for intermediate value of lambda (medium penalization, coefs shrink a little)
#coef(ridgevar,s=1)
#Show coefficients for small value of lambda (little penalization, coefs close to OLS values)
#coef(ridgevar,s=0.01)
## Fit Ridge VAR with 10-fold cross-validation
ridgevarCV<-cv.glmnet(xvars,NIPAwindow,alpha=0,family="mgaussian")
#Lambda giving minimum 10-fold Cross Validation Error
CVlambda2<-ridgevarCV$lambda.min
#Predict 2018Q3 series
cvridgepredictions<-window(ts(predict(ridgevarCV,newx = newX)[,,1],start=c(2018,3),frequency=4),start=c(2018,4))
# #Plot Data Series
# autoplot(window(NIPAdata,start=c(2012,4)))+
# autolayer(cvlassopredictions)+
# autolayer(cvridgepredictions)
# autolayer(AICpred$diff.log.gdp..$mean)+
# autolayer(BICpred$diff.log.gdp..$mean)
- Let’s go back to our GDP forecasting example using history of changes in log of \(Y\), \(C\), \(I\), \(G\)
- Predictor class will be Vector Autoregressions in up to 8 lags of all 4 variables
- 33 parameters per equation, 132 total
- Estimate by AIC, BIC, LASSO, and Ridge
- AIC selects 5 lags, BIC selects 1 lag, reflecting smaller penalty for AIC
- 10-fold cross validation selects \(\lambda=\) 0.0026361 for LASSO, \(\lambda=\) 0.2138204 for ridge
- LASSO keeps 32 nonzero parameters, while ridge chooses non-zero value for all 132
- Compare 8 for BIC, 24 for AIC, but different choices from LASSO
#Collect 2018Q4 Forecasts into a data frame
var1table<-data.frame(
t(cvridgepredictions),
t(cvlassopredictions),
t(AICfcsts),
t(BICfcsts)
)
rownames(var1table)<-c("GDP","Consumption","Investment","Government")
#Make Table of Estimated Coefficients
kable(var1table,
col.names=c("Ridge","LASSO","AIC","BIC"),
caption="2018Q4 Forecasts from Different Methods") %>%
kable_styling(bootstrap_options = "striped", font_size = 16)
2018Q4 Forecasts from Different Methods

|             | Ridge     | LASSO     | AIC       | BIC       |
|-------------|-----------|-----------|-----------|-----------|
| GDP         | 0.0135764 | 0.0136399 | 0.0114672 | 0.0146535 |
| Consumption | 0.0145109 | 0.0150744 | 0.0114084 | 0.0137662 |
| Investment  | 0.0141675 | 0.0096423 | 0.0147074 | 0.0200288 |
| Government  | 0.0112001 | 0.0129239 | 0.0109967 | 0.0142961 |
Ridge Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(ridgevarCV)$y1),
as.matrix(coef(ridgevarCV)$y2),
as.matrix(coef(ridgevarCV)$y3),
as.matrix(coef(ridgevarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(ridgevarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = spec_color(x))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in Ridge Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in Ridge Regression (columns GDP, C, I, G): full table of estimates omitted; all 132 lag coefficients are nonzero but shrunk toward zero
LASSO Coefficients
#Make Coefficient Table
zzz<-data.frame(as.matrix(coef(lassovarCV)$y1),
as.matrix(coef(lassovarCV)$y2),
as.matrix(coef(lassovarCV)$y3),
as.matrix(coef(lassovarCV)$y4))
rownames(zzz)<-rownames(data.frame(as.matrix(coef(lassovarCV)$y1)))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in LASSO Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in LASSO Regression (columns GDP, C, I, G): full table of estimates omitted; only 32 of the 132 lag coefficients are nonzero, the rest are exactly 0
AIC Coefficients
#Make Coefficient Table
zzz<-data.frame(AICVAR$varresult$diff.log.gdp..$coefficients,
AICVAR$varresult$diff.log.pcec..$coefficients,
AICVAR$varresult$diff.log.invest..$coefficients,
AICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(AICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in AIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in AIC Regression (columns GDP, C, I, G): full table of estimates omitted; the AIC-selected VAR(5) has 5 lags of each variable plus a constant in each equation
BIC Coefficients
#Make Coefficient Table
zzz<-data.frame(BICVAR$varresult$diff.log.gdp..$coefficients,
BICVAR$varresult$diff.log.pcec..$coefficients,
BICVAR$varresult$diff.log.invest..$coefficients,
BICVAR$varresult$diff.log.gce..$coefficients)
rownames(zzz)<-rownames(data.frame(BICVAR$varresult$diff.log.gdp..$coefficients))
zzz %>%
mutate_all(function(x) {
cell_spec(x, bold = T,
color = if_else(x!=0,"red","black"))
}) %>%
kable(escape=F,
col.names=c("GDP","C","I","G"),
caption="Coefficients in BIC Regression") %>%
kable_styling(bootstrap_options = "striped", font_size=10)
Coefficients in BIC Regression (columns GDP, C, I, G): full table of estimates omitted; the BIC-selected VAR(1) has 1 lag of each variable plus a constant in each equation
Evaluating Autoregressive Forecasts
- Forecast residuals from optimal forecast \(\widehat{e}_{t+1}=y_{t+1}-\widehat{y}_{t+1}\) are white noise in autoregressive model
- This is in fact true whenever the forecast rule is the Bayes predictor under square loss, for any stationary distribution
- Holds for VARs, regression, etc
- So if you are using a rule which is approximately the Bayes rule, residuals should be close to white noise
- ACF of \(\widehat{e}_{t+1}\) should be close to 0 everywhere
- Use checkresiduals to inspect and compare various measures of distance
- Usually little reason to believe chosen family contains Bayes forecast rule
- May not be exactly linear, or may depend on information further back than number of lags used
- Or, even if it did, little reason to believe that the approximation is close enough that all properties of the Bayes forecast hold for the approximate rule
- These two issues are in conflict
- Can always increase complexity of hypothesis class to try to get closer to Bayes forecaster
- But doing so can worsen overfitting, making rule actually chosen further away
- Residual checks measure closeness of fit, but are unrelated to degree of overfitting
- Generally not a reliable tool for finding a good forecast, as opposed to checking ex post
- Prefer measures which account for all types of error
Application to GDP Series in BIC VAR Prediction
checkresiduals(BICVAR$varresult$diff.log.gdp..) #GDP Growth forecast residual diagnostic
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 36.01, df = 10, p-value = 8.386e-05
- Patterns in the residual plot and ACF, and the formal test of white noise, suggest the VAR(1) residuals are not particularly close to white noise
- Suggests patterns in data may differ somewhat from VAR(1) model: in-sample fit could be improved
- To decide whether this or another forecast should be used, should compare loss measures
Conclusions
- ERM does well for choosing within simple hypothesis classes
- Minimizing penalized criterion is a sensible way to improve accounting for overfitting
- Based on number (AIC/BIC) or size (L1/L2) of parameters or functions
- When choosing between multiple procedures, including ERM over multiple classes, can use measure of risk for each
- Penalization acts like restricting size of parameter class
- Can choose penalty by cross-validating
- Goal of regularization is to trade off in-sample and out-of-sample fit
- Other tools exist for finding the “right” descriptive model
References
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
- Grace Wahba. Spline Models for Observational Data. SIAM, 1990.
- Overview of smoothing splines and penalization; explains how and why one can work with all differentiable functions