#Libraries for causal graphs
library(dagitty) #Identification
library(ggdag) #plotting
#General Libraries
library(ggplot2) #Graphics
library(gridExtra) #Graph Display
Plans
Structural equation models provide a complete framework for
answering questions about observed, interventional, or counterfactual
quantities
Given a model definition, a characterization of the distribution of the
variables as defined by the “solution” of that model, and a
characterization of causation as that distribution in a modified model,
you can derive formulas for the distribution of any observed or
counterfactual quantity
Given a quantity defined in terms of features of the causal
model, identification corresponds to finding a formula
in terms of observed quantities that equals the causal
quantity, or certifying that no such formula exists
- E.g., causal quantity is ATE \(E[Y^{X=x_1}-Y^{X=x_0}]\), identification
formula is \(\int
(E[Y|X=x_1,Z=z]-E[Y|X=x_0,Z=z])dP(z)\)
Actually deriving the identification result may be challenging:
search space can be large or infinite
- For restricted model classes, relationships can be described by set
of rules, via which search can be automated
- This has been done comprehensively for acyclic Structural Causal
Models with independent errors
- Representable by a causal Directed Acyclic Graph or
DAG
Today: Brief summary of known identification results for DAGs
- Model assumptions let you convert qualitative reasoning about which
variables are related to which others and how into estimation formulas
and observable implications
- Will only reference extensions to models with weaker or stronger
assumptions
Most immediate payoff is a framework for reasoning about
conditional ignorability \(Y^x\perp X|Z\), called the backdoor criterion
- Highlights reasons why a regression may or may not recover a causal
effect
Secondarily, generate alternative estimation formulas and model
tests
Since the methods are automated, the focus will be on interpreting
assumptions and results, less on derivations
Structural Causal Models and DAGs, review
- Endogenous variables \(Y_1,Y_2,\ldots,Y_p\) described by a
Structural Equation Model \[Y_1=f_1(Y_2,\ldots,Y_p,U_1)\] \[Y_2=f_2(Y_1,Y_3,\ldots,Y_p,U_2)\] \[\vdots\] \[Y_p=f_p(Y_1,Y_2,\ldots,Y_{p-1},U_p)\]
- Exogenous \((U_1,\ldots,U_p)\sim\Pi_{j=1}^{p}F_j(u_j)\)
mutually independent
- Variables \(Y_1,\ldots,Y_p\)
encoded as nodes \(V\) in graph \(G=(V,E)\)
- Presence of \(Y_j\) in \(f_i\) indicates \(Y_j\) directly affects \(Y_i\)
- Encoded as edge \(Y_j\to Y_i\) in
\(E\subseteq V\times V\)
- “Acyclic”: no directed path (sequence of connected edges with common
orientation) from a vertex to itself
- “Nonparametric”: graph topology encodes only presence or absence of
connection
- “Solve”: Define \((Y_1,\ldots,Y_p)\) as unique values that
solve system given \(U\)’s
- “\(do(Y_j=x)\)”: Replace \(f_j\) by \(x\), solve. New values are \((Y_1^{Y_j=x},\ldots,
Y_{j-1}^{Y_j=x},x,Y_{j+1}^{Y_j=x},\ldots,Y_{p}^{Y_j=x})\)
Causal Markov Condition
- Solving an acyclic structural model gives joint distribution of
endogenous variables
- What properties does the joint distribution have?
- Causal Markov property: A variable \(Y_j\) is independent of any variable that
is not a descendant, conditional on its parents
- \(Y_k\) is a
parent of \(Y_j\) if
there is a directed edge from \(Y_k\)
to \(Y_j\). \(pa(Y_j)\) is the set of parents of \(Y_j\)
- \(Y_k\) is a
descendant of \(Y_j\)
if there is a path along directed edges from \(Y_j\) to \(Y_k\)
- Property completely defines implications of causal graph
- Absence of an edge means conditional independence
- Implies that joint distribution factorizes according to
topological order of graph
- \(P(Y_1,\ldots,Y_p)=\Pi_jP(Y_j|pa(Y_j))\)
- When intervening causally, \(f_j\)
is deleted but rest of structure, including distributions, remains the
same
- Joint distribution given \(do(Y_j=x)\) is \(P(Y_1,\ldots,Y_p|do(Y_j=x))=(\delta_{Y_j=x})\Pi_{i\neq
j}P(Y_i|pa(Y_i))\)
- \(\delta_{Y_j=x}\) is point mass at
\(x\), every other part is
unchanged
- Ratio of \(P(Y_1,\ldots,Y_p|do(Y_j=x))/P(Y_1,\ldots,Y_p)=\frac{\delta_{Y_j=x}}{P(Y_j|pa(Y_j))}\)
immediately recovers the Inverse Probability Weighting formula
- IPW using the parents of the intervened variable is a valid estimator for any causal effect of that variable
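- A minimal simulation sketch (linear data-generating process and coefficients assumed for illustration) of this IPW idea: weight by the inverse of \(P(Y_j|pa(Y_j))\) for the intervened variable and take weighted means
set.seed(123) #Reproduce same simulation each time
n<-10000
z<-rnorm(n) #Parent of the treatment
x<-rbinom(n,1,plogis(z)) #Treatment: P(X=1|pa(X))=plogis(z)
y<-2*x+z+rnorm(n) #Outcome: true E[Y|do(X=1)]-E[Y|do(X=0)] is 2
pscore<-plogis(z) #P(X=1|pa(X)): known here, estimated in practice
w<-ifelse(x==1,1/pscore,1/(1-pscore)) #IPW weights: point mass over P(X|pa(X))
mean(w*y*(x==1))-mean(w*y*(x==0)) #Close to 2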
Conditioning and d-separation
- When does conditioning on a set of nodes \(Z\) imply, for two disjoint sets of nodes
\(X\) and \(Y\), that \(X\perp
Y|Z\)?
- Depends on the structure of the paths between \(X\) and \(Y\): a path is a sequence of connected edges,
not necessarily consistently oriented
- 3 nodes in a path can be linked in one of 3 ways: Fork, Chain,
Collider
edgetypes<-list()
forkgraph<-dagify(Y~Z,X~Z) #create graph
#Set position of nodes
coords<-list(x=c(X = 0, Z = 1, Y = 2),y=c(X = 0, Z = 0, Y = 0))
coords_df<-coords2df(coords)
coordinates(forkgraph)<-coords2list(coords_df)
edgetypes[[1]]<-ggdag(forkgraph)+theme_dag_blank()+ggplot2::labs(title="Fork Structure") #Plot causal graph
chaingraph<-dagify(Y~Z,Z~X) #create graph
#Set position of nodes
coords<-list(x=c(X = 0, Z = 1, Y = 2),y=c(X = 0, Z = 0, Y = 0))
coords_df<-coords2df(coords)
coordinates(chaingraph)<-coords2list(coords_df)
edgetypes[[2]]<-ggdag(chaingraph)+theme_dag_blank()+ggplot2::labs(title="Chain Structure") #Plot causal graph
collidergraph<-dagify(Z~Y,Z~X) #create graph
#Set position of nodes
coords<-list(x=c(X = 0, Z = 1, Y = 2),y=c(X = 0, Z = 0, Y = 0))
coords_df<-coords2df(coords)
coordinates(collidergraph)<-coords2list(coords_df)
edgetypes[[3]]<-ggdag(collidergraph)+theme_dag_blank()+ggplot2::labs(title="Collider structure") #Plot causal graph
grid.arrange(grobs=edgetypes,nrow=3,ncol=1) #Arrange In 3x1 grid

- We say a (non-directed) path from \(X_i\) to \(Y_j\) is blocked by \(Z\) if it contains either
- A collider that is not in \(Z\) and none of whose descendants are in \(Z\)
- A non-collider that is in \(Z\)
- We say \(X\) and \(Y\) are d-separated (by
\(Z\)) in graph \(G\) (denoted \((X\perp Y |Z)_G\)) if all paths in \(G\) between \(X\) and \(Y\) are blocked
- Theorem (Pearl
(2009)): \(X\perp Y|Z\) (in all
distributions consistent with \(G\)) if
\(X\) and \(Y\) are d-separated by \(Z\)
- Further, if \(X\) and \(Y\) are not d-separated, \(X\) and \(Y\) are dependent in at least one
distribution compatible with the graph
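- In dagitty, d-separation can be checked directly with dseparated(); a small check using the fork, chain, and collider graphs plotted above
dseparated(forkgraph,"X","Y","Z") #TRUE: conditioning on the fork blocks the path
dseparated(chaingraph,"X","Y","Z") #TRUE: conditioning on the chain blocks the path
dseparated(collidergraph,"X","Y","Z") #FALSE: conditioning on the collider opens the path
dseparated(collidergraph,"X","Y") #TRUE: unconditionally, the collider blocks the path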
Colliders and Selection
- Conditioning on a fork or chain breaks association along a path
- Having a common consequence (collider) does not create
correlation between independent events
- But knowing \(Y\) and the common consequence \(Z\), we can infer something about the other cause
\(X\)
- Among college students, those with a rich family did not need high
grades to get in.
- Observing a rich family, we can infer that the likelihood of high grades was
lower
- True even if, in the full population, grades and wealth are independent
set.seed(123) #Reproduce same simulation each time
observations<-2000
grades<-rnorm(observations)
wealth<-rnorm(observations)
#Grades and wealth influence admission score
admitscore<-grades+wealth+0.3*rnorm(observations)
#Admit top 10% of applicants by score
threshold<-quantile(admitscore,0.9)
admission<-(admitscore > threshold)
#Make plot of conditional and unconditional relationship
simdata<-data.frame(grades,wealth,admission)
ggplot(simdata)+geom_point(aes(x=wealth,y=grades,color=admission))+
#Regress y on x with no controls
#lm(grades~wealth)
geom_smooth(aes(x=wealth,y=grades),method="lm",color="black")+
#Regress y on x and w (with interaction)
#lm(grades~wealth+admission+I(wealth*admission))
geom_smooth(aes(x=wealth,y=grades,group=admission),method="lm",color="blue")+
ggplot2::labs(title="Grades vs Wealth, with Admission as Collider",
subtitle="Black Line: Unconditional. Blue Lines: Conditional on Admission")

Causal effects and their identification: do-calculus
- d-separation describes conditional independence within a
single graph
- To relate distributions in the observed and modified graphs, we can
apply a small set of rules, called the do-calculus
- Given a graph \(G\) and disjoint
sets of nodes \(X\), \(Z\) define:
- \(G_{\underline{X}}\): \(G\) with all edges going out of
\(X\) deleted
- \(G_{\bar{X}}\): \(G\) with all edges going into
\(X\) deleted
- \(G_{\underline{X}\bar{Z}}\): \(G\) with all edges going out of \(X\) and all edges going into \(Z\) deleted
- For \(X,Y,Z,W\) disjoint sets of
nodes in DAG \(G\), 3 rules of
do-calculus
- Insertion/deletion of observations:
- If \((Y\perp Z|
X,W)_{G_{\bar{X}}}\), \(P(Y|do(X=x),Z=z,W=w)=P(Y|do(X=x),W=w)\)
- Action/observation exchange
- If \((Y\perp Z|
X,W)_{G_{\bar{X}\underline{Z}}}\), \(P(Y|do(X=x),do(Z=z),W=w)=P(Y|do(X=x),Z=z,W=w)\)
- Insertion/deletion of actions.
- Let \(Z(W)\) be set of \(Z\) that are not ancestors of \(W\) in \(G_{\bar{X}}\).
- If \((Y\perp Z|
X,W)_{G_{\bar{X}\overline{Z(W)}}}\), \(P(Y|do(X=x),do(Z=z),W=w)=P(Y|do(X=x),W=w)\)
- Roughly: (1) says d-separation gives conditional independence, (2)
says doing \(Z\) is same as seeing
\(Z\) if there are no unblocked
“backdoor” paths into \(Z\) connecting it to \(Y\), (3) says \(Z\) has no effect if all paths out of \(Z\) are blocked
- Repeated application of these rules to remove “do” from conditioning
can fully characterize any identification claim based on causal
graph
Backdoor Criterion
- Using d-separation, we can define a graphical criterion for when controlling
recovers \(P(Y|do(X=x))\)
- A set of variables \(Z\) satisfies
Backdoor Criterion between \(X\) and \(Y\) if
- No node in \(Z\) is a descendant of
\(X\)
- \(Z\) blocks every path between
\(X\) and \(Y\) that contains an edge directed into
\(X\) (i.e., \((Y\perp X| Z)_{G_{\underline{X}}}\))
- Theorem (Pearl
(2009)): If \(Z\) satisfies the
backdoor criterion between \(X\) and
\(Y\), the adjustment formula recovers
the causal effect of \(X\) on \(Y\) \[P(Y|do(X=x))=\int
P(Y|X=x,Z=z)P(Z=z)dz\]
- Proof: via do-calculus
- \(P(Y|do(X=x))=\int P(Y|do(X=x),Z=z)P(Z=z|do(X=x))dz\) (law of iterated expectations)
- \(=\int P(Y|do(X=x),Z=z)P(Z=z)dz\) (Rule 3, since \((Z\perp X)_{G_{\bar{X}}}\) because \(Z\) contains no descendants of \(X\))
- \(=\int P(Y|X=x,Z=z)P(Z=z)dz\) (Rule 2, since \((Y\perp X|Z)_{G_{\underline{X}}}\))
- For alternate proofs, see Pearl (2009)
Ch 3.3 or 11.3
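- The two d-separation premises used in the proof can be checked mechanically; a small sketch for the canonical confounding graph (graph and names assumed for illustration), building the mutilated graphs by hand
confoundergraph<-dagitty("dag{Z->X; Z->Y; X->Y}") #Z confounds X and Y
intoXdeleted<-dagitty("dag{Z->Y; X->Y}") #Edges into X deleted: the graph for Rule 3
dseparated(intoXdeleted,"Z","X") #TRUE: justifies dropping do(X) from P(Z|do(X))
outofXdeleted<-dagitty("dag{Z->X; Z->Y}") #Edges out of X deleted: the graph for Rule 2
dseparated(outofXdeleted,"Y","X","Z") #TRUE: justifies exchanging do(X) for conditioning on X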
Backdoor criterion: intuition
- Blocking path component ensures that adjustment variables account
for confounding by causes other than the cause of interest and do not
introduce additional bias by inducing new correlations through
colliders
- Non-descendant component avoids “bad controls” which are themselves
affected by treatment
- d-separation or backdoor criterion hard to check “by hand”
- The point is that, given a causal story, a systematic method can determine
whether a control set is sufficient
- In
dagitty, check backdoor criterion between \(X\) and \(Y\) by
#Check if Z satisfies criterion
isAdjustmentSet(graphname,"Z",exposure="X",outcome="Y")
#Find variables that satisfy criterion, if they exist
adjustmentSets(graphname,exposure="X",outcome="Y")
Example 1: Conditions for finding adjustment sets
examplegraph<-dagify(Y~A+C,B~X+Y,A~X,X~C,D~B) #create graph
#Set position of nodes
coords<-list(x=c(X = 0, A = 1, B = 1, C=1, Y=2, D=2),y=c(X = 0, A = 0.1, B=0, C=-0.1, Y = 0, D=0.1))
coords_df<-coords2df(coords)
coordinates(examplegraph)<-coords2list(coords_df)
ggdag(examplegraph)+theme_dag_blank()+ggplot2::labs(title="Example Graph") #Plot causal graph

- For intuition, let \(X\) be education and \(Y\) wages
- \(A\) caused by \(X\), causes \(Y\): e.g., occupation, experience
- Mediator: a descendant of \(X\), so
do not adjust for it
- \(B\) caused by both \(X\) and \(Y\): e.g., current wealth or lifestyle
- Collider: a descendant of \(X\), so
do not adjust for it
- \(D\) caused by \(B\) only: e.g., consequences of wealth
- Descendant of a collider: still causes bias when adjusted for
- \(C\) causes both \(X\) and \(Y\): e.g., ability, background
- Confounder: must condition on it
- Backdoor criterion calculates this automatically
adjustmentSets(examplegraph,exposure="X",outcome="Y")
## { C }
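- A minimal simulation sketch (linear functional forms and coefficients assumed) confirming these roles: adjusting for the confounder C recovers the effect, adding the collider B or mediator A does not
set.seed(123) #Reproduce same simulation each time
n<-10000
C<-rnorm(n)
X<-0.8*C+rnorm(n)
A<-0.5*X+rnorm(n)
Y<-0.7*A+0.6*C+rnorm(n) #Total effect of X on Y is 0.5*0.7=0.35
B<-X+Y+rnorm(n)
coef(lm(Y~X))["X"] #Biased: confounder C omitted
coef(lm(Y~X+C))["X"] #Close to 0.35: {C} satisfies the backdoor criterion
coef(lm(Y~X+C+B))["X"] #Biased: B is a collider of X and Y
coef(lm(Y~X+C+A))["X"] #Close to 0: A is a mediator, adjusting removes the effect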
Example 2: Controlling for a descendant
mediatorgraph<-dagify(Y~Z,Z~X) #create graph
#Set position of nodes
coords<-list(x=c(X = 0, Z = 1, Y = 2),y=c(X = 0, Z = 0, Y = 0))
coords_df<-coords2df(coords)
coordinates(mediatorgraph)<-coords2list(coords_df)
ggdag(mediatorgraph)+theme_dag_blank()+ggplot2::labs(title="Mediator structure") #Plot causal graph

- The non-descendant part warns us that controlling for
consequences may remove part of the effect we want to measure
- Sometimes we have access to variables \(Z\) called mediators
- Caused by treatment \(X\) and also
affect outcome
- Executioner shoots (\(X\)) \(\to\) bullet hits (\(Z\)) \(\to\) prisoner dies (\(Y\))
- \(P(Y|do(X=x))\) asks what the
outcome is when \(X\) is made to happen: here \(=P(Y|X=x)\)
- \(Z\) is correlated with \(X\) and \(Y\), but don’t want to
control for it
- Adjustment formula gives 0 effect of \(X\) on \(Y\) if \(Z\) controlled for
- \(\int P(Y|X,Z=z)P(z)dz=\int
P(Y|Z=z)P(Z=z)dz=P(Y)\neq P(Y|do(X=x))\)
- Conditional on being hit by a bullet, being shot at has no
relationship with death
- Changing \(X\) does affect \(Y\), but only indirectly through \(Z\) (see the simulation sketch at the end of this list)
- Common example: \(X\) protected
attribute, \(Y\) hiring/promotion/wages
- Common to hear “the wage gap disappears if you control for…” (long
list \(Z_1,\ldots,Z_k\))
- Common for that list to contain possible results of discrimination
due to being in protected group \(X\)
(occupation, rank in company,…)
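- A minimal simulation sketch of the executioner example (probabilities assumed for illustration): the total effect appears without the mediator and vanishes when the mediator is controlled for
set.seed(123) #Reproduce same simulation each time
n<-10000
X<-rbinom(n,1,0.5) #Executioner shoots
Z<-rbinom(n,1,0.95*X) #Bullet hits only if shot (mediator)
Y<-Z #Prisoner dies if and only if hit
coef(lm(Y~X))["X"] #Close to 0.95: total effect of shooting on death
coef(lm(Y~X+Z))["X"] #Close to 0: conditioning on the mediator removes the effect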
Aside: Well-defined or manipulable treatments
- Causal interpretation of discrimination controversial for another
reason, that it’s not always clear what it would mean to manipulate
\(X\)
- Problem is that while recorded as categorical, group status (race,
gender, religion,…) often a proxy for a bundle of attributes
(Sen and Wasow (2016))
- Different manipulations may alter some but not all parts of bundle
- E.g. resume audit studies, per Bertrand and
Mullainathan (2004), measure effect of having name Lakisha instead
of Emily on your resume, but not, say, replacing a person in one group
with a (comparable?) person in another group
- Ambiguity may be reduced by including in model collection of
attributes making up measured variable and relationships between them
- Add nodes and arrows for name, accent, appearance, family history,
etc
- Less fraught example: \(BMI=weight/height^2\), which one can imagine changing by
altering the numerator or the denominator.
- And within weight, altering number by diet, exercise, surgery,
traveling to outer space, etc.
- Each of these parts is a possible node, as are more detailed
processes making them up
- If full structure does not interact with other parts of causal
graph, may be able to summarize by role of proxy node
Example 3: M-bias: controlling for a collider
mgraph<-m_bias() #A common teaching example, so it has a canned command
ggdag(mgraph)+theme_dag_blank()+ggplot2::labs(title="Graph illustrating m-bias")

adjustmentSets(mgraph,exposure="x",outcome="y",type="all")
## {}
## { a }
## { b }
## { a, b }
## { a, m }
## { b, m }
## { a, b, m }
- The “non-collider” qualification for d-separation provides a
counterexample to the claim that controlling for “pre-treatment” variables
is always safe
- Conditioning on \(m\) creates a path
between \(x\) and \(y\), which are not causally
linked otherwise, because their antecedents influence a shared outcome
\(m\)
- \(m\) may come before or after \(x\)
and \(y\) in time
- Ding and Miratrix (2015) simulations
show this bias is often small relative to omitting a confounder, since
it is a product of several effects (see the simulation sketch after this list)
- But useful to clarify that being “after” treatment is not the only way
for a control to be bad.
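- A minimal simulation sketch of m-bias (effect sizes assumed): the true effect of x on y is zero, the unadjusted regression is fine, and conditioning on m induces a modest bias
set.seed(123) #Reproduce same simulation each time
n<-100000
a<-rnorm(n)
b<-rnorm(n)
m<-a+b+rnorm(n) #Collider of a and b
x<-a+rnorm(n)
y<-b+rnorm(n) #No effect of x on y
coef(lm(y~x))["x"] #Close to 0: no adjustment needed
coef(lm(y~x+m))["x"] #Nonzero (negative here): conditioning on m opens x<-a->m<-b->y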
Multiple causal effects: the Table 2 fallacy
- With linear adjustment, the regression coefficient on \(X\) is interpretable as the ATE
- What about the other coefficients in the regression?
- A coefficient is interpretable as a causal effect if the backdoor
criterion holds for that variable given the other included variables
- Often, your treatment will be descendant of controls
- In that case backdoor holds for X given Z but not for Z given X
- \(Y^x\perp X | Z\) but \(Y^z \not \perp Z| X\)
- Traditionally, main regression is Table 2 in paper (Table 1 is
summary stats)
- The Table 2 fallacy is to interpret every coefficient in this
table as causally meaningful
- Try to explain sign/magnitude of every coefficient with causal
story
- Ubiquitous in papers 2+ decades ago, still common today
- Many papers now guard against this by only reporting the main effect
- Maybe a good idea to put the other coefficients in an appendix
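- A small dagitty check of this point (graph assumed for illustration): with an unobserved U confounding the control Z and the outcome, {Z} is a valid adjustment set for the effect of X, but no valid set exists for the effect of Z itself
table2graph<-dagitty("dag{Z->X; Z->Y; X->Y; U->Z; U->Y}") #create graph
latents(table2graph)<-"U" #U is unobserved
adjustmentSets(table2graph,exposure="X",outcome="Y") #Returns { Z }: the X coefficient has a causal reading
adjustmentSets(table2graph,exposure="Z",outcome="Y") #Returns nothing: the Z coefficient does not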
Example 4: Efficiency: Instrumental Variables and
Nonconfounders
irrelevantgraph<-dagitty("dag{Y<-X; Y<-Z1; Y<-Z2; X<-Z2; X<-Z3; Z4}") #create graph
#Set position of nodes
coords<-list(x=c(X = 0, Z3 = 0, Z2 = 0.5, Z4 = 0.5, Z1=1, Y=1),y=c(X = 0, Z1 = 0.5, Z2 = 0.25, Z4=0.5, Z3=0.5, Y=0))
coords_df<-coords2df(coords)
coordinates(irrelevantgraph)<-coords2list(coords_df)
ggdag(irrelevantgraph)+theme_dag_blank()+ggplot2::labs(title="Graph with confounders and more",subtitle = "Z1 affects Y but not X, Z3 affects X not Y, Z2 affects both, Z4 irrelevant") #Plot causal graph

adjustmentSets(irrelevantgraph,exposure="X",outcome="Y",type="all")
## { Z2 }
## { Z1, Z2 }
## { Z2, Z3 }
## { Z1, Z2, Z3 }
## { Z2, Z4 }
## { Z1, Z2, Z4 }
## { Z2, Z3, Z4 }
## { Z1, Z2, Z3, Z4 }
- What do we do if we have options? Here we only need to
condition on \(Z2\)
Efficiency vs robustness
- Efficiency: which controls produce low variance estimates?
- In OLS, controlling for variables related to outcome but not
treatment reduces residual variance
- Always helps to include \(Z1\)
- Controlling for variables related to treatment but not outcome
reduces treatment variation, raises variance
- Generally hurts to include \(Z3\)
- Variables with no relationship (\(Z4\)) don’t matter asymptotically but add
noise in finite samples and raise the dimension in the nonparametric case
- For a general graph, the smallest-variance adjustment set is characterized by Witte et al. (2020) (for regression or
AIPW); the simulation sketch after this list illustrates the \(Z1\) vs. \(Z3\) comparison
- Robustness: what happens if the edge structure is not right?
- In original graph safe to control for everything
- If \(Z1,Z3,Z4\) have edges added to
make them confounders, not just safe but necessary
- Belloni, Chernozhukov, and Hansen
(2014) “Double Selection”: if doing variable selection (e.g., by
Lasso), find the \(Z\)s that predict \(X\) and
those that predict \(Y\), and include the union
in the final regression
- Sparsity: if \(\#\) of irrelevant
\(Z\)s large relative to \(n\), need to get rid of them to have any
hope
- Double selection guards against mistakenly omitting relevant controls (missed edges)
in finite samples, which would invalidate inference
- R library
hdm (Chernozhukov,
Hansen, and Spindler (2016))
- If some confounders are unobserved, no control strategy is
consistent, but some are worse than others
- Controlling for instruments like \(Z3\) may “amplify” the bias
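- A minimal simulation sketch (linear model, coefficients assumed) for the graph above, comparing the sampling variability of the treatment coefficient under different control sets
set.seed(123) #Reproduce same simulation each time
n<-1000
reps<-500
sims<-replicate(reps,{
Z1<-rnorm(n); Z2<-rnorm(n); Z3<-rnorm(n)
X<-Z2+Z3+rnorm(n)
Y<-X+Z1+Z2+rnorm(n)
c(Z2only=coef(lm(Y~X+Z2))["X"], #Valid: adjusts for the confounder Z2
addZ1=coef(lm(Y~X+Z2+Z1))["X"], #Valid, lower variance: Z1 predicts Y only
addZ3=coef(lm(Y~X+Z2+Z3))["X"])}) #Valid, higher variance: Z3 predicts X only
apply(sims,1,sd) #Standard deviations: adding Z1 helps, adding Z3 hurts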
Example 5: Chains
sequencegraph<-dagify(Y~X+Z1, Z1~Z2, Z2~Z3, X~Z3) #create graph
#Set position of nodes
coords<-list(x=c(X = 0, Z3 = 0, Z2 = 0.5, Z1=1, Y=1),y=c(X = 0, Z1 = 0.5, Z2 = 0.5, Z3=0.5, Y=0))
coords_df<-coords2df(coords)
coordinates(sequencegraph)<-coords2list(coords_df)
ggdag(sequencegraph)+theme_dag_blank()+ggplot2::labs(title="Graph with confounders in a chain") #Plot causal graph

adjustmentSets(sequencegraph,exposure="X",outcome="Y",type="all")
## { Z1 }
## { Z2 }
## { Z1, Z2 }
## { Z3 }
## { Z1, Z3 }
## { Z2, Z3 }
## { Z1, Z2, Z3 }
- If a backdoor path runs through a sequence of variables (e.g., due to temporal
ordering), you only need to condition on enough links to break the
chain
- Which to control for?
- If you only observe one link, just use that one: fine even if rest
unobserved
- If you observe many, choice again made based on robustness vs
efficiency considerations
Alternative identification strategies
- The adjustment formula and identification by the backdoor criterion are not the only
ways to estimate effects in a causal model
- Alternative strategies exist and can be found algorithmically
- The do-calculus gives the rules for deriving an identification formula
- The algorithms take your model extremely seriously
- Any missing edge implies a conditional independence relationship
that can be exploited
- Each missing edge deserves serious thought, and maybe a paragraph in
your paper
- But if believed, can use new methods
- Most formulas derived this way are bespoke to your application,
often hard to interpret
- Some have simple enough form to be widely applicable
- Examples of the latter
- Front door adjustment
- Sequential confounding
- Mediation formulas
- We postpone discussing the last two, but illustrate the concept with the front
door
Front Door criterion
Front Door graph
- Most prominent example of effect with identification formula derived
algorithmically from graph
- Target is ATE of X on Y: not identified by adjustment due to latent
confounder
- Restore identification by observing mediator \(Z\) with two properties
- Conditional randomness: \(Z\)
conditionally random given \(X\)
- Exclusion: \(X\) has no direct
effect on \(Y\)
- Justifying these takes work, but similar to IV criteria
- Bellemare, Bloem, and Wexler (2020):
\(X\) opt in to share ride, \(Z\) actually share (random based on Lyft
availability), \(Y\) Tip
- Formula is \(P\left(Y|do(X)\right) =
\sum_{Z}{P\left(Z \middle| X\right)\sum_{X'}{P\left(Y \middle|
X',Z\right)P\left(X'\right)}}\)
- Steps
- Regress \(Y\) on \(X,Z\) to get the \(Z\) effect (identified since \(X\) blocks the
backdoor path)
- Regress \(Z\) on \(X\) to get the effect of \(X\) on \(Z\) (identified since the \(X\to Z\) relationship is unconfounded)
- Average the \(Z\) effect over levels of \(X\) to get the \(X\) effect (if linear, multiply the \(Y\)-on-\(Z\) coefficient
from step 1 by the \(Z\)-on-\(X\) coefficient from step 2)
- Derivation: Press a button at https://causalfusion.net/ and it will even write the
paper for you
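- A minimal simulation sketch of these steps in the linear case (data-generating process and coefficients assumed), with an unobserved confounder U
set.seed(123) #Reproduce same simulation each time
n<-100000
U<-rnorm(n) #Unobserved confounder of X and Y
X<-U+rnorm(n)
Z<-0.5*X+rnorm(n) #Mediator: conditionally random given X
Y<-0.8*Z+U+rnorm(n) #No direct X effect: true effect of X on Y is 0.5*0.8=0.4
coef(lm(Y~X))["X"] #Biased by U
coef(lm(Y~Z+X))["Z"]*coef(lm(Z~X))["X"] #Front door estimate: close to 0.4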
Adding assumptions to Structural Causal Models
- Researchers in this area actively working to characterize
identification results for different settings
- Adding assumptions beyond independence, like on functional forms,
gives additional identifying restrictions
- Linearity: Wright (1934) originally
introduced structural equations with all effects linear as “method of
path analysis”
- Causal effect along a directed path is product of edge weights
- Total effect is sum of path specific effects
- Additional effects become identifiable, e.g., via instrumental
variables, or in cases with cycles
- Discrete: Very recently Duarte et al.
(2021) automate finding bounds on effects
- Multiple data sources, each with partially modified graph or set of
observables
- E.g., some data from experiment, some from surveys, some but not all
edges or observables shared
- Bareinboim and Pearl (2016), Hünermund and Bareinboim (2019) characterize
what’s identified
Removing assumptions from Structural Causal Models
- Work seeks to make setting less restrictive by eliminating
assumptions
- Partial observability, unknown edge orientation, selection
- With unobserved variables, may want identification based on DAG
marginalized to only observables
- “Mixed graph”: Represent a link passing through a latent variable
with bidirected edges \(\leftrightarrow\)
- Without experiments direction of edges (usually) not distinguishable
based on observations alone
- Additional edge types can represent equivalence classes with
multiple possible directions
- We may also want to reason about conditional distributions
- Mainly because selection into our data set means we only
have data conditional on a selection variable
- Medical case: effect of a drug on blood pressure conditional
on not being dead
- Surveys/experiments: “loss to follow-up,” where the outcome measure is missing for
people who can’t/won’t answer the survey
- Economic selection: only see grades for students who go to
institution/take test, etc
- Selection nodes encode variables determining selection into
data
- Identification algorithms exist for these settings and more, with new
settings covered all the time
Estimation
- For adjustment, can use regression/IPW/AIPW estimators from last
class
- Backdoor tells you which variables to include in \(Z\) to get conditional ignorability
- For other formulas generated by a DAG, need to come up with
estimator associated with identification formula
- Linear models: Path coefficients in linear SEM estimable by linear
GMM, often OLS or 2SLS
- IPW principle is general: estimate ratio of causal distribution to
observed distribution and weight the sample average
- “Marginal structural models” minimize prediction error/maximize
likelihood in an inverse-probability-weighted sample
- AIPW-type estimators are possible in many cases
- Bhattacharya, Nabi, and Shpitser
(2020) derive such formulas for a large class of estimands
Building and testing your DAG
- Warning: many economists are wary of or hostile to DAG approaches
- Imbens (2020) provides clear
exposition of this position
- Roughly, concern is that for credible effect estimation, you should
only rely on a minimal set of plausible assumptions, ideally from
(quasi-)experiments
- DAGs make it easy to incorporate many more assumptions than are
plausible in a given setting
- Solution to this: only rely on causal assumptions you believe
- Allow hidden variables, edges of unknown direction, encode just the
“credible” independence conditions in edges
- May require using generalizations devised for exactly this:
MAGs/PAGs/CPDAGs/ADMGs, etc
- A benefit of the model is that you can algorithmically derive testable
implications
- Can and should use these to make sure the proposed model is not
contradicted by data (see the dagitty sketch at the end of this section)
- Same machinery also lets you derive not-yet-testable implications
- What does model say for real or hypothetical data you don’t have?
(e.g., experiments)
- Use this, along with priors, to think about whether the assumptions are
reasonable
- Systematic approach: start with fully solid assumptions, prune down
space of possibilities by testing those testable implications
- Leads to causal discovery algorithms: PC, FCI, LiNGAM, NOTEARS,
etc.
- Warning: many of these start with debatable assumptions
like no hidden variables, linearity, or “faithfulness”
- As a Catholic school graduate, I can point you to theology venues if
you want to try to publish a paper relying on faithfulness
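- As noted above, a DAG’s testable implications can be derived and checked algorithmically; a small dagitty sketch using the example graph from earlier and data simulated to be consistent with it (data-generating process assumed)
impliedConditionalIndependencies(examplegraph) #List implied conditional independencies, e.g. A _||_ C | X
set.seed(123) #Simulate data consistent with the graph
n<-1000
C<-rnorm(n); X<-C+rnorm(n); A<-X+rnorm(n)
Y<-A+C+rnorm(n); B<-X+Y+rnorm(n); D<-B+rnorm(n)
testdata<-data.frame(X,A,B,C,D,Y)
localTests(examplegraph,testdata,type="cis") #Partial-correlation test of each implication: large p-values mean no contradiction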
Software
| Package | Interface | Functionality |
|---|---|---|
| dagitty | R, online | identification |
| ggdag | R | identification, plotting |
| pcalg | R | discovery, estimation |
| dowhy | Python | identification, estimation |
| ananke | Python | identification, estimation |
| causalfusion | online | identification, estimation, derivation |
Conclusion
- Structural causal models encode full descriptions of relationships
between variables
- Represent by graphs, and use graphical criteria to reason about
implications
- Obtain conditional implications by d-separation
- Derive identification formulas by do-calculus
- Determine whether conditioning measures causal effect by using
backdoor criterion
- Illustrates both when to control and when not to
- With full description, implications can be reduced to algorithms, so
use software implementations
- Use model to encode assumptions that are believable, then derive
implications to identify, test, and estimate
References
Bareinboim, Elias, and Judea Pearl. 2016. “Causal Inference and
the Data-Fusion Problem.” Proceedings of the National Academy
of Sciences 113 (27): 7345–52.
Bellemare, Marc F, Jeffrey R Bloem, and Noah Wexler. 2020. “The
Paper of How: Estimating Treatment Effects Using the Front-Door
Criterion.”
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014.
“Inference on Treatment Effects After Selection Among
High-Dimensional Controls.” The Review of Economic
Studies 81 (2): 608–50.
Bertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and
Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor
Market Discrimination.” American Economic Review 94 (4):
991–1013.
Bhattacharya, Rohit, Razieh Nabi, and Ilya Shpitser. 2020.
“Semiparametric Inference for Causal Effects in Graphical Models
with Hidden Variables.” arXiv Preprint arXiv:2003.12659.
Chernozhukov, Victor, Chris Hansen, and Martin Spindler. 2016.
“Hdm: High-Dimensional Metrics.” arXiv Preprint
arXiv:1608.00354.
Ding, Peng, and Luke W Miratrix. 2015. “To Adjust or Not to
Adjust? Sensitivity Analysis of m-Bias and Butterfly-Bias.”
Journal of Causal Inference 3 (1): 41–57.
Duarte, Guilherme, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and
Ilya Shpitser. 2021.
“An Automated Approach to Causal Inference in
Discrete Settings.” https://arxiv.org/abs/2109.13471.
Hünermund, Paul, and Elias Bareinboim. 2019. “Causal Inference and
Data Fusion in Econometrics.” arXiv Preprint
arXiv:1912.09104.
Imbens, Guido W. 2020. “Potential Outcome and Directed Acyclic
Graph Approaches to Causality: Relevance for Empirical Practice in
Economics.” Journal of Economic Literature 58 (4):
1129–79.
Pearl, Judea. 2009. Causality. Cambridge university press.
Sen, Maya, and Omar Wasow. 2016. “Race as a Bundle of Sticks:
Designs That Estimate Effects of Seemingly Immutable
Characteristics.” Annual Review of Political Science 19:
499–522.
Witte, Janine, Leonard Henckel, Marloes H Maathuis, and Vanessa Didelez.
2020. “On Efficient Adjustment in Causal Graphs.”
Journal of Machine Learning Research 21: 246.
Wright, Sewall. 1934.
“The Method of Path Coefficients.”
Ann. Math. Statist. 5 (3): 161–215.
https://doi.org/10.1214/aoms/1177732676.