```
library(dagitty) #Library to create and analyze causal graphs
library(ggplot2) #Plotting
suppressWarnings(suppressMessages(library(ggdag))) #library to plot causal graphs
```

- Time series models describe data where samples are ordered in sequence
- Quarterly macroeconomic aggregates like output and prices, monthly unemployment reports, a stock price every second, etc

- The index \(t\) for observations \(\{O_t\}_{t=1}^{T}\) is not arbitrary, but reflects statistical and causal structure
- Time series inference attempts to use historical behavior of data to learn about its properties
- This data type adds additional purely statistical complications
- Observations are typically *dependent*, so one can't rely on results for independent samples
- Dependence structure impacts estimation, inference
- Even in the limit of infinite data, may not be able to learn probability distribution, making “identification” not always relevant

- Causal structure interacts with statistical structure, introducing additional challenges
- Benefit of time ordering is that, barring time travel, causal order is restricted to be in direction of temporal order
- Causal relationships between variables at different times may induce dependence, which may have different properties for observed and counterfactual variables
- Estimation goal will often be to learn about these causal dependencies

- Today’s class will cover limited range of topics
- Will restrict most attention to cases where either finite sample inference is possible or limited dependence allows use of results that are “close” to the iid setting
- Inferential results under strong dependence require a somewhat different statistical machinery

- Focus on acyclic models and causal estimands coming from a change in a single point in time
- Cyclic models, with General Equilibrium interactions, and interventions to policies, require somewhat different modeling structure

- Time series provide main source of data on outcomes where there may be no comparable units
- Macroeconomic aggregates, market-level financial outcomes, geosciences, etc

- Source of information is instead historical behavior of same series
- Comparison of macroeconomic policies in the 2010s to those in the 1930s, 40s, 50s

- Also useful for performing comparisons within a unit, if observed repeatedly
- Causal effects may be across variables at the same time, or across time
- How does government spending affect future path of GDP?
- How does a rate hike by the monetary authority affect output and prices?

- To obtain causal effects, will need some sources of variation in the policies which are random or “quasi-experimental”
- These may be historical accidents or idiosyncrasies not associated with other sources of variation
- As in cross-sectional case, helpful to have institutional and historical understanding of determinants of treatment
- Ramey (2016) applications make use of detailed narrative record of fiscal and monetary policy-makers as well as statistical properties

- \(\{O_t\}_{t=1}^{T}\) is modeled as a *stochastic process*: a sequence of random variables indexed by \(t\)
- For any \(t\), we refer to \(O_{t-1}\) as the *lag* of \(O_t\), and \(O^{t}=(O_t,O_{t-1},O_{t-2},\ldots)\) as the *history* of \(O_t\) up to time \(t\), and let \(O_{s:t}=\{O_j\}_{j=s}^{t}\) denote the subsequence between \(s\) and \(t\)
- Our goal will often be to extrapolate from past to future values; for this we may rely on **stationarity**: a condition where historical patterns are assumed to be identical in distribution to patterns at other times

- \(\{O_t\}_{t=1}^{T}\) is (strictly) stationary if for any subsequence \(\{O_{t_j}\}_{j=1}^{J}\) and any shift in time \(h\), \(P(\{O_{t_j}\}_{j=1}^{J})=P(\{O_{t_j-h}\}_{j=1}^{J})\)
- Only distance in time between two points affects distribution, not identity of time point

- Rules out trends up or down, shifts or breaks after which distribution changes, predictable seasonal changes, systematic changes in variability, etc
- Data may be stationary only after transformations to remove trends, seasons, breaks
- If \(Y_t=ct+\tilde{Y}_t\) for stationary \(\tilde{Y}_t\), subtracting \(ct\) is called “detrending” and may turn growing series into stable one
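A minimal detrending sketch in R (the trend slope and AR parameter are illustrative choices, not from the notes):

```r
# Detrending sketch: simulate Y_t = c*t + Ytilde_t with stationary Ytilde_t,
# then estimate and subtract the linear trend by least squares.
set.seed(123)
T <- 200
t_idx <- 1:T
c_trend <- 0.5                                         # illustrative trend slope
Ytilde <- as.numeric(arima.sim(list(ar = 0.5), n = T)) # stationary AR(1) component
Y <- c_trend * t_idx + Ytilde                          # growing series
trend_fit <- lm(Y ~ t_idx)                             # fit linear trend by OLS
Y_detrended <- residuals(trend_fit)                    # stable, detrended series
```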

- Stationarity will allow us to say something about full series from only part of it
- Can take \(P(O_t)\), \(E[O_t]\), \(P(O_t|O^{t-1})\), etc. as quantities which do not depend on \(t\)

- Often also assume causal structure is time-invariant: DAG relating \(O_t\) and \(O^{t-1}\) does not depend on \(t\)
- Implies time-invariant conditional distributions, which are necessary but not sufficient for statistical stationarity
- For causal estimands, which depend on observed and counterfactual outcomes, will rely on stationarity of both

- In order to estimate distributions, would help to have versions of law of large numbers and central limit theorem
- These require stationarity, but also limits on *dependence*

- To see why, consider a series which is \((0,0,0,\ldots)\) w/ prob. 1/2, and \((1,1,1,\ldots)\) w/ prob. 1/2
- This is stationary, with \(E[O_t]=0.5\), but for all \(T\), \(\frac{1}{T}\sum_{t=1}^{T}O_t\) is \(0\) w/ prob. 1/2, and \(1\) w/ prob. 1/2, and so is bounded away from the mean almost surely
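This pathological example is easy to simulate; a sketch:

```r
# Non-ergodic example: one fair coin flip chooses between the all-0 and
# all-1 paths, so the sample mean never approaches E[O_t] = 0.5.
set.seed(321)
draw_sample_mean <- function(T) {
  path <- rep(rbinom(1, 1, 0.5), T)  # entire path fixed by a single draw
  mean(path)                         # exactly 0 or exactly 1, for every T
}
means <- sapply(c(10, 1000, 100000), draw_sample_mean)
```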

- A stochastic process which satisfies a law of large numbers \(\frac{1}{T}\sum_{t=1}^{T}O_t\overset{p}{\to}E[O_t]\) is said to be **ergodic**
- In order to perform inference one usually asks for a stronger condition, which ensures that dependence over time decays to 0
- Let \(\mathcal{F}_{s}^{t}\) be the space of events (sigma algebra) generated by \(O_{s:t}\), \(s\) or \(t\) possibly infinite
- \(\alpha(j):=\sup\{|P(A\cap B)-P(A)P(B)|:A\in\mathcal{F}_{-\infty}^{t}, B\in\mathcal{F}_{t+j}^{\infty}\}\) is the **\(\alpha\)-mixing coefficient** of order \(j\)
- Value does not depend on \(t\) if the sequence is stationary
- If \(\alpha(j)\to 0\) as \(j\to\infty\), \(O_t\) is said to be **\(\alpha\)-mixing** (or *strong mixing*, though that term also gets used for a slightly different, related concept)

- Idea: as points get further separated in time, they are closer and closer to independent.
- Different dependency measures yield slightly different but interrelated mixing conditions (\(\beta,\rho,\phi\) etc mixing imply \(\alpha\): see Bradley (2005))
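Mixing coefficients are hard to compute directly, but the underlying idea that dependence fades with temporal separation can be illustrated by the autocorrelations of a simulated AR(1) (parameter values are my own choices):

```r
# For a stationary AR(1) with coefficient rho, Cor(Y_t, Y_{t+h}) = rho^h,
# which decays geometrically: distant observations are nearly independent.
set.seed(456)
rho <- 0.7
y <- arima.sim(list(ar = rho), n = 10000)
acf_sample <- acf(y, lag.max = 8, plot = FALSE)$acf[, 1, 1] # lags 0..8
acf_theory <- rho^(0:8)                                     # geometric decay
```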

- For our purposes, useful result from mixing is that coefficients bound covariances across time and ensure validity of a central limit theorem
- Primitive conditions implying ergodicity or mixing can be derived for many models of stochastic processes, but are by no means guaranteed
- For linear process relationships, imposes restrictions on coefficients; for nonlinear case, on functional forms

- Let \(Y_t\) be stationary and ergodic, with \(E[|Y_t|^{2r}]<\infty\) for \(r>1\) and \(\alpha\)-mixing with \(\sum_{k}k^{\frac{1}{r-1}}\alpha(k)<\infty\)
- Then we have a Time Series Central limit theorem: \(\frac{1}{\sqrt{T}}\sum_{t=1}^{T}(Y_t-E[Y_t])\overset{d}{\rightarrow}N(0,\Sigma)\)
- Where \(\Sigma\) is the **Long Run Variance**: the limit of the variance of the scaled sum, equal to \(\Sigma=\sum_{h=-\infty}^{\infty}Cov(Y_t,Y_{t+h})=Var(Y_t)+2\sum_{h=1}^{\infty}Cov(Y_t,Y_{t+h})\)
- Above conditions from Doukhan, Massart, and Rio (1994), though many minor variants exist: need at minimum to ensure the long run variance is finite

- To estimate long run variance, can use (possibly kernel weighted) sum of sample (auto-)covariances \(\widehat{Var}(Y_t)+2\sum_{h=1}^Tk(\frac{h}{S})\widehat{Cov}(Y_t,Y_{t+h})\)
- Newey and West (1987) variance estimates use \(k(u)=(1-|u|)1\{|u|\leq 1\}\) and are implemented in most statistical software
- Rule of thumb choice \(S=0.75 T^{1/3}\) ensures consistency under (stronger) moment and mixing conditions
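A hand-rolled sketch of this estimator (the `sandwich` package's `NeweyWest` provides a production implementation; the simulation parameters here are illustrative):

```r
# Newey-West style long-run variance: Bartlett-kernel-weighted sum of
# sample autocovariances, truncated at bandwidth S.
nw_lrv <- function(y, S) {
  y <- y - mean(y)
  T <- length(y)
  lrv <- mean(y^2)                                  # sample Var(Y_t)
  for (h in 1:S) {
    gamma_h <- sum(y[1:(T - h)] * y[(1 + h):T]) / T # autocovariance at lag h
    lrv <- lrv + 2 * (1 - h / (S + 1)) * gamma_h    # Bartlett weight
  }
  lrv
}
set.seed(789)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 5000))
S <- floor(0.75 * length(y)^(1/3))  # rule-of-thumb bandwidth
Sigma_hat <- nw_lrv(y, S)           # true long-run variance of this AR(1) is 4
```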

- Large but too-rarely-applied literature notes that estimation error in \(\Sigma\) is substantial, which complicates its use, e.g. in t-statistics
- Corrections to critical values based on refined asymptotics and alternate estimators can yield more accurate inference: see Lazarus et al. (2018) for a recent overview

- Alternate approaches to time series inference decompose the series into independent or conditionally independent components
- In the absence of stationarity and mixing, wide variety of limiting behaviors are possible
- Well-studied case is integrated processes, whose *change* \(Y_t-Y_{t-1}\) is stationary
- Ubiquitous when working with asset prices, and anything linked to their levels, due (roughly) to the Fundamental Theorem of Asset Pricing
- Deserves much more attention than I will give it, but a warning to use returns and not levels unless explicitly accounting for possible nonstationarity
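A quick sketch of the warning: a simulated random-walk "price" is integrated, but its first difference is the stationary increment series:

```r
# Random walk (integrated process): levels are nonstationary, but the
# change Y_t - Y_{t-1} recovers the stationary increments.
set.seed(135)
eps <- rnorm(500)            # stationary iid increments ("returns")
price <- cumsum(eps)         # integrated level: Y_t = Y_{t-1} + eps_t
returns <- diff(price)       # differencing recovers eps_2, ..., eps_500
```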


- Goal is to measure the effect of a treatment variable \(X_t\) at a point in time \(t\) on an outcome variable \(Y_{t+h}\) at some point \(h\geq 0\) periods ahead
- Typically, one will want to measure effects as function of \(h\) to trace out time path
- Called the **Impulse Response Function** to the shock

- More than other topics we have studied, time series has focused on parametric, homogeneous, linear models, and much less is known about heterogeneous effects
- Will see that allowing general forms quickly gives exponentially growing number of arguments, among other problems

- Rambachan and Shephard (2021) give definitions and sufficient identification conditions in terms of potential outcomes for some useful cases
- Identification results do not assume stationarity, though it may help for estimation
- Estimation will require simplifications, so I will afterwards describe estimators for these quantities in linear homogeneous case

- Assume: treatment process \(\{X_t\}_{t=1}^{T}\) with \(X_t\in\mathbb{R}^{d_x}\), outcomes \(\{Y_t\}_{t=1}^{T}\) with \(Y_t\in\mathbb{R}^{d_y}\), and potential outcomes \(\{Y_t^{X_{1:T}=x_{1:T}}\}_{t=1}^{T}\) a function of the entire treatment path, satisfying:
- **Non-anticipation**: Let \(Y_t^{x_{1:t}}:=Y_t^{X_{1:T}=x_{1:t},x_{t+1:T}}=Y_t^{X_{1:T}=x_{1:t},x^\prime_{t+1:T}}\) for any \(t\), \(x_{t+1:T}\), \(x^\prime_{t+1:T}\)
- **Causal consistency**: \(\{X_t,Y_t\}_{t=1}^{T}=\{X_t,Y_t^{X_{1:t}}\}_{t=1}^{T}\)

- and **overlap**: \(0<P(X_t=x|X^{t-1},Y^{t-1})<1\) \(\forall x\)

- Define \(Y_{t+h}^{(x_k)}=Y_{t+h}^{X_{1:t-1},(X_{1,t},\ldots,x_k,\ldots,X_{d_x,t}),X_{t+1:t+h}}\): the outcome at \(t+h\) from setting the \(k^{th}\) element of \(X_t\) to \(x_k\) but leaving other elements unchanged
- One possible definition of IRF is \(E[Y_{t+h}^{(x_k)}-Y_{t+h}^{(x_k^\prime)}]\): effect of changing just \(x_{t,k}\)

- A natural guess for an estimator of causal response is the difference in means estimate \(E[Y_{t+h}|X_{k,t}=x_k]-E[Y_{t+h}|X_{k,t}=x_k^\prime]\)

- Rambachan and Shephard (2021) Thm 1 shows it equals \(E[Y_{t+h}^{(x_k)}-Y_{t+h}^{(x_k^\prime)}]+\frac{Cov(Y_{t+h}^{(x_k)},1\{X_{k,t}=x_k\})}{E[1\{X_{k,t}=x_k\}]}-\frac{Cov(Y_{t+h}^{(x_k^\prime)},1\{X_{k,t}=x_k^\prime\})}{E[1\{X_{k,t}=x_k^\prime\}]}\)

- Interpretation: the average difference in means equals the causal effect plus a measure of selection bias
**Proof**

- \(E[Y_{t+h}1\{X_{k,t}=x_k\}]=E[Y^{(x_k)}_{t+h}1\{X_{k,t}=x_k\}]\) (causal consistency and non-anticipation)
- \(=Cov(Y^{(x_k)}_{t+h},1\{X_{k,t}=x_k\})+E[Y^{(x_k)}_{t+h}]E[1\{X_{k,t}=x_k\}]\) (def of covariance)
- Divide both sides by \(E[1\{X_{k,t}=x_k\}]\) and apply inverse propensity lemma
- \(E[Y_{t+h}|X_{k,t}=x_k]= E[Y^{(x_k)}_{t+h}]+\frac{Cov(Y^{(x_k)}_{t+h},1\{X_{k,t}=x_k\})}{E[1\{X_{k,t}=x_k\}]}\) for \(x_k\) or \(x_k^\prime\)

- To estimate IRF, need to account for treatment periods being associated with outcome

```
repexdag<-dagify(Y3~X3+Y2+X2+Y1+X1,Y2~X2+Y1+X1,Y1~X1) #create graph
#Set position of nodes so they lie on a straight line
coords<-list(x=c(X1 = 0, Y1 = 0, X2=1, Y2=1, X3=2, Y3=2),
y=c(X1 = 1, Y1 = 0, X2=1, Y2=0, X3=1, Y3=0))
coords_df<-coords2df(coords)
coordinates(repexdag)<-coords2list(coords_df)
ggdag(repexdag, edge_type = "arc")+theme_dag_blank()+labs(title="X randomized in every time period") #Plot causal graph
```

**Corollary** to decomposition: If \(X_{k,t}\perp(\{Y_{t+h}^{x_{1:t+h}}\ \forall x_{1:t+h}\},X_{1:t-1},(X_{1,t},\ldots,x_k,\ldots,X_{d_x,t}),X_{t+1:t+h})\)

- then \(Cov(Y_{t+h}^{(x_k)},1\{X_{k,t}=x_k\})=Cov(Y_{t+h}^{(x_k^\prime)},1\{X_{k,t}=x_k^\prime\})=0\)
- and \(E[Y_{t+h}|X_{k,t}=x_k]-E[Y_{t+h}|X_{k,t}=x_k^\prime]\) equals the IRF \(E[Y_{t+h}^{(x_k)}-Y_{t+h}^{(x_k^\prime)}]\)

- Independence of the treatment at time \(t\) and the potential outcomes, *as well as* the rest of the treatment process
- Can verify that this is implied by the DAG for the repeated experiment structure above

- Absence of past *or future* interactions makes direct conditional mean estimation possible
- Variables \(X_t\) with this property are sometimes called *structural shocks*
- Rarely observed directly, but sometimes used as building block to construct model of observed variables
- If a shock can be isolated, it can be used to measure sequence of responses
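As an illustration (with an assumed linear DGP and a directly observed shock), regressing \(Y_{t+h}\) on the shock \(X_t\) horizon by horizon traces out the sequence of responses:

```r
# IRF from an observed shock: in this made-up model
# Y_t = 0.6*Y_{t-1} + 0.8*X_t + u_t, so the true response at horizon h
# is 0.8 * 0.6^h.
set.seed(357)
T <- 5000
X <- rnorm(T)                                   # iid observed structural shock
Y <- as.numeric(stats::filter(0.8 * X + rnorm(T, sd = 0.5),
                              filter = 0.6, method = "recursive"))
irf_hat <- sapply(0:8, function(h) {
  coef(lm(Y[(1 + h):T] ~ X[1:(T - h)]))[2]      # slope at horizon h
})
```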

- Consider the **Generalized IRF** \(GIRF_{k,t,h}(x_k,x_k^\prime|X^{t-1},Y^{t-1}):=E[Y_{t+h}|X_{k,t}=x_k,X^{t-1},Y^{t-1}]-E[Y_{t+h}|X_{k,t}=x_k^\prime,X^{t-1},Y^{t-1}]\)
- By conditional version of above proof, it equals (Rambachan and Shephard (2021) Thm 4) \(E[Y_{t+h}^{(x_k)}-Y_{t+h}^{(x_k^\prime)}|X^{t-1},Y^{t-1}]+\frac{Cov(Y_{t+h}^{(x_k)},1\{X_{k,t}=x_k\}|X^{t-1},Y^{t-1})}{E[1\{X_{k,t}=x_k\}|X^{t-1},Y^{t-1}]}-\frac{Cov(Y_{t+h}^{(x_k^\prime)},1\{X_{k,t}=x_k^\prime\}|X^{t-1},Y^{t-1})}{E[1\{X_{k,t}=x_k^\prime\}|X^{t-1},Y^{t-1}]}\)
- **Corollary**: If \(X_{k,t}\perp(\{Y_{t+h}^{x_{1:t}^{observed},x_{t:t+h}}\ \forall x_{t:t+h}\},(X_{1,t},\ldots,x_k,\ldots,X_{d_x,t}),X_{t+1:t+h})|X^{t-1},Y^{t-1}\), the generalized IRF is \(E[Y_{t+h}^{(x_k)}-Y_{t+h}^{(x_k^\prime)}|X^{t-1},Y^{t-1}]\)
- Its average equals the (unconditional) IRF \(E[Y_{t+h}^{(x_k)}-Y_{t+h}^{(x_k^\prime)}]\)
- Interpretation: if, conditional on the past, treatment is independent of potential outcomes and *future assignments*, an adjustment estimator controlling for history measures the causal effect
- While slightly weaker than above, this condition will fail if assignments depend on past outcomes or decisions
- It will hold in the repeated experiment setting, where treatments are shocks

- Generalized IRF is a measure which allows heterogeneous and nonlinear effects of shocks
- Computed in nonlinear structural dynamic models to summarize the effects of shocks

```
seqdag<-dagify(Y4~X4+Y3+X3+Y2+X2+Y1+X1,Y3~X3+Y2+X2+Y1+X1,Y2~X2+Y1+X1,Y1~X1,
X4~Y3+X3+Y2+X2+Y1+X1,X3~Y2+X2+Y1+X1, X2~Y1+X1) #create graph
#Set position of nodes so they lie on a straight line
coords<-list(x=c(X1 = 0, Y1 = 0, X2=1, Y2=1, X3=2, Y3=2, X4=3, Y4=3),
y=c(X1 = 1, Y1 = 0, X2=1, Y2=0, X3=1, Y3=0, X4=1, Y4=0))
coords_df<-coords2df(coords)
coordinates(seqdag)<-coords2list(coords_df)
ggdag(seqdag, edge_type = "arc")+theme_dag_blank()+labs(title="X may depend on all past variables") #Plot causal graph
```