Heterogeneous Effects
- Most tools described so far estimate a low dimensional summary of response to intervention
- Typically, an average potential outcome \(E[Y^x]\) (in our usual notation)
- Running theme of identification results has been avoidance of assumptions of homogeneity of effects
- Move past linear additive model, which is typically justified only by convenience
- If this is a concern, it would be nice to measure this heterogeneity
- How much do outcomes vary and how are they distributed?
- Can we predict this variation in the outcomes?
- These two tasks lead to two estimation challenges
- Outcome distribution modeling
- Conditional treatment effect estimation
- Heterogeneity matters to extent we care about features beyond average
- Preferences over inequality across individuals or risk over outcomes may make distribution inherently interesting
- Through aggregation, it may also impact group-level outcomes
- Prediction of effect heterogeneity may be used in interpretation, exploration, and decision-making
- Fundamentally a mix of causal and descriptive or predictive problems
- Arguably, main use is as input into one of several possible fully causal problems, so applicability depends on intended use
- Different targets if goal is multivariate causal effect, specification of causal pathways, or policy choice
Identifying Outcome Distributions
- Almost every method we have described for identifying an average potential outcome can be trivially converted into a method for identifying the distribution of potential outcomes
- Replace \(Y\) with \(G(Y,u)\) for a collection of functions \(\{G(\cdot,u):\ u\in\mathcal{U}\}\)
- Identification result for \(E[Y^x]\) also gives value for \(E[G(Y^x,u)]\) for any \(u\in\mathcal{U}\) if expectation is finite
- Holds for experiments, control, IV, RD, and almost any other method which does not impose functional form restrictions on outcome distribution
- Exception is for DiD, due to fundamentally functional-form dependent nature of parallel trends assumption
- See Athey and Imbens (2006) for results under distinct but related assumptions which do allow identifying distributions
- With \(G(Y,u)=1\{Y<u\}\) and \(\mathcal{U}\) the support of \(Y\), this identifies the CDF \(F_{Y^{x}}(u)\) of the counterfactual distribution (see the sketch after this list)
- From the CDF, can identify other representations of the distribution
- If distribution is absolutely continuous, density function is \(f_{Y^x}(u)=\frac{d}{du}F_{Y^{x}}(u)\)
- If \(Y\in\mathbb{R}\), inverse of distribution describes quantile function \(Q_{Y^x}(\tau)=\inf\{u\in\mathbb{R}: F_{Y^x}(u)\geq \tau\}\)
- \(G(Y,u)=Y^u\) for \(u=1,2,\ldots\) gives the moments of the distribution (when these exist)
- Alternate choices give various transforms of distribution
- Complex exponentials \(e^{iu(.)}\) give the Characteristic Function or Fourier Transform
- Other choices give a few less common but sometimes effective summaries like Wavelet Transform or Kernel Mean Embedding (Singh, Xu, and Gretton (2021))
- If goal is counterfactual distributions, main challenge is not identification but estimation
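- A minimal sketch (simulated data; all names illustrative): under random assignment, \(F_{Y^{x}}(u)=E[1\{Y<u\}|X=x]\), so the counterfactual CDF is estimated by the within-arm empirical CDF
#Sketch: counterfactual CDF from a simulated experiment (illustrative setup, not from any dataset)
set.seed(1)
n<-5000
x<-rbinom(n,1,0.5)                          #randomly assigned binary treatment
y<-ifelse(x==1,rnorm(n,1,1.5),rnorm(n,0,1)) #treatment shifts and spreads the outcome
u<-1
mean((y<u)[x==1])   #estimate of F_{Y^1}(u) at u=1
ecdf(y[x==1])(u)    #essentially the same number via the empirical CDF (which uses <=)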
Effects on Distributions vs Distributions of Effects
- By subtracting average potential outcomes, can obtain average treatment effect \(E[Y^x-Y^{x^\prime}]\) of change from \(x^\prime\) to \(x\)
- By subtracting averages of nonlinear transforms of potential outcomes, can measure effect on that transform
- \(F_{Y^{x}}(u)-F_{Y^{x^\prime}}(u)\) measures effect on probability of outcome being below \(u\)
- \(Q_{Y^x}(\tau)-Q_{Y^{x^\prime}}(\tau)\) is the Quantile treatment effect (at quantile \(\tau\))
- Measures the effect of change in \(x\) on \(\tau^{th}\) quantile of outcome
- E.g., for \(\tau=0.5\), this is the effect of treatment on the median outcome (see the sketch after this list)
- Measures of this form describe the effect of treatments on distributions
- Aside from ATE, they do not describe features of the distribution of effects \(Y^x-Y^{x^\prime}\)
- Reason for this is the fundamental problem of causal inference
- Even in experiments, never observe both \(Y^x\) and \(Y^{x^\prime}\) for the same unit
- Aside from ATE, no other features of effect distribution are identified
- If we care about effect distributions, have 3 possible options
- Learn bounds on feature of interest
- Apply conditioning to learn averages within a subset
- Impose additional assumptions on structural causal model
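- A minimal quantile treatment effect sketch (simulated experimental arms; all names illustrative): differences of marginal sample quantiles
#Sketch: quantile treatment effects from two simulated experimental arms
set.seed(1)
ytreat<-rlnorm(5000,meanlog=0.2)  #draws standing in for Y^x
yctrl<-rlnorm(5000,meanlog=0)     #draws standing in for Y^{x'}
taus<-c(0.1,0.25,0.5,0.75,0.9)
quantile(ytreat,taus)-quantile(yctrl,taus)  #QTE at each tau; grows across quantiles under this lognormal shift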
What can we learn about effect distributions?
- Counterfactuals depend on joint distribution \(F(t,s)=Pr(Y^{x}<t,Y^{x^\prime}<s)\)
- Experiments allow us to learn the marginal distributions \(F_x(t)=F(t,\infty)\), \(F_{x^\prime}(s)= F(\infty,s)\)
- What can we learn about a joint from its marginals? Fréchet-Hoeffding bounds provide one answer
- \(\max(F_x(t)+F_{x^\prime}(s)-1,0)\leq F(t,s)\leq \min(F_x(t),F_{x^\prime}(s))\) (form in Nelsen (2007) 2.5.1 or any other reference on copulas)
- Upper and lower bound correspond to deterministic increasing and decreasing matching of points in the \(Y^x\) and \(Y^{x^\prime}\) distributions respectively
- Applying this result to the difference in potential outcomes, obtain Makarov bounds on \(F_{Y^{x}-Y^{x^\prime}}(s)\) (computed in the sketch after this list)
- \(\underset{t}{\sup}(\max(F_{x}(t)-F_{x^\prime}(t-s),0))\leq F_{Y^{x}-Y^{x^\prime}}(s)\leq \underset{t}{\inf}(\min(F_{x}(t)-F_{x^\prime}(t-s)+1,1))\)
- These bounds are tight for CDF at any given point \(s\), though Firpo and Ridder (2019) provide stronger results for bounds over multiple points
- One can say substantially more if willing to impose functional form assumptions on \(f_Y(x,U)\), the structural function generating \(Y\)
- Linearity, additivity, or monotonicity, and restrictions on dimension can all (nearly) pin down distribution of treatment effects
- You should have substantive reasons to impose these assumptions if you are going to rely on them for identification and not just as approximations
- A few cases where functional form assumption of structural function may be reasonable
- Known source of heterogeneity due to structure of measurement instrument: e.g., validation using exact ground truth of error structure induced by your microscope/biomedical test/survey instrument
- Case where potential outcomes known to decision-maker, in which case preferences may ensure monotonicity, as in some economic models of choice (French and Taber (2011))
- Bounds can be strengthened using conditional distributions of counterfactuals
- Conditionals interesting in their own right and we will return to them later
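- A minimal Makarov bounds sketch (simulated marginals; all names illustrative): bounds on \(F_{Y^{x}-Y^{x^\prime}}(s)\) at a single point \(s\)
#Sketch: Makarov bounds on Pr(Y^x - Y^{x'} <= s) from estimated marginal CDFs
set.seed(1)
y1<-rnorm(5000,mean=1)  #sample standing in for Y^x
y0<-rnorm(5000,mean=0)  #sample standing in for Y^{x'}
F1<-ecdf(y1); F0<-ecdf(y0)
tgrid<-seq(-6,6,length.out=1000)
s<-1
lower<-max(pmax(F1(tgrid)-F0(tgrid-s),0))   #sup_t max(F_x(t)-F_x'(t-s),0)
upper<-min(pmin(F1(tgrid)-F0(tgrid-s)+1,1)) #inf_t min(F_x(t)-F_x'(t-s)+1,1)
c(lower,upper) #any effect CDF consistent with these marginals lies between the bounds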
Outcome Distribution Estimation
- Because identification results are based on same formulas with new outcome variables \(G(Y,u)\) in place of \(Y\), one can repurpose the same estimators
- The main difference is that conditional mean estimators for \(E[Y|X]\), \(E[Y|X,Z]\) should be replaced by ones which perform well across functions \(G(Y,u), u\in\mathcal{U}\)
- In the case \(G(Y,u)=1\{Y\leq u\}\), regression estimates for \(1\{Y\leq u\}\) across different values of \(u\) are called distribution regression (see the sketch after this list)
- Because the outcome is binary, can apply a probit \(P(Y\leq u |X)=\Phi(X^{\prime}\beta(u))\), logit \(P(Y\leq u |X)=\frac{\exp(X^{\prime}\beta(u))}{1+\exp(X^{\prime}\beta(u))}\), or linear probability model
- Estimated by MLE/regression applied separately at each value of \(u\) so requires no special programs
- Uniform convergence across \(u\) and inference methods in Chernozhukov, Fernández-Val, and Melly (2013)
- Other choices of \(G(Y,u), u\in\mathcal{U}\), such as basis functions, can be estimated by similar procedure
- Variety of alternative conditional distribution estimators which are not defined as collection of conditional means
- Representation using quantiles needs new procedure: quantile regression
- Representations based on densities have different properties
- Kernel, projection, and likelihood based estimators available
- Generally harder to estimate, in terms of slower rates
- Statistical properties matter when the estimates are used as one component of a subsequent formula
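- A minimal distribution regression sketch (simulated data; all names illustrative): separate logit fits of \(1\{Y\leq u\}\) on a grid of thresholds
#Sketch: distribution regression via separate logit fits at each threshold u
set.seed(1)
n<-1000
x<-rnorm(n)
y<-1+0.5*x+rnorm(n)
ugrid<-quantile(y,probs=seq(0.1,0.9,by=0.1))
betahat<-sapply(ugrid,function(u) coef(glm(I(y<=u)~x,family=binomial(link="logit"))))
betahat #each column is (intercept, slope) of the model for P(Y<=u|X)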
Quantile Regression
- Versatile, widely applied tool for modeling conditional quantiles \(Q_{Y|X}(\tau)\)
- Fact that enables use is that a quantile is elicitable using the check function (or pinball) loss \(\rho_{\tau}(y,f)=(y-f)(\tau-1\{y-f<0\})\)
- A feature \(j(F)\) of a distribution \(F\) is elicitable if there is some loss function \(\ell(\cdot,\cdot)\) such that the risk \(E_F[\ell(y,f)]\) for that loss function with respect to distribution \(F\) of a prediction rule \(f\) is (uniquely) minimized by \(f=j(F)\)
- Claim: Suppose \(y\) has density \(p(y)\): then \(Q_y(\tau) = \underset{f\in\mathbb{R}}{\arg\min}E[\rho_{\tau}(y,f)]\)
- Proof: Write \(E[\rho_{\tau}(y,f)]=(\tau-1)\int_{-\infty}^{f}(y-f)dP(y)+\tau\int_{f}^{\infty}(y-f)dP(y)\) and differentiate using the Leibniz rule
- \(\frac{d}{df}E[\rho_{\tau}(y,f)]=(\tau-1)\left((f-f)p(f)-\int_{-\infty}^{f}dP(y)\right)-\tau\left((f-f)p(f)+\int_{f}^{\infty}dP(y)\right)\)
- \(=(1-\tau)P(f)-\tau(1-P(f))=P(f)-\tau\)
- Setting \(\frac{d}{df}E[\rho_{\tau}(y,f)]=0\), obtain \(P(f)=\tau\), or \(f\) is the \(\tau^{th}\) quantile of the \(y\) distribution
- Without the existence of a density, subdifferential argument shows that set of minimizers contains the quantile
- Identical proof using conditional density shows that \(Q_{y|x}(\tau) = \underset{f(x)\in\mathbb{R}}{\arg\min}E[\rho_{\tau}(y,f(x))|X=x]\)
- Result is that the check function loss can be used with any procedure that minimizes a loss function to find a predictor which approximates the conditional quantile (a numerical check appears after this list)
- Quantile regression (Koenker (2005), Koenker (2017)) applies empirical risk minimization with the check function loss over the class of linear functions \(f(x)=x^\prime\beta\)
- \(\widehat{\beta}(\tau)=\underset{\beta}{\arg\min}\frac{1}{n}\sum_{i=1}^{n}\rho_{\tau}(y_i,x_i^\prime\beta)\)
- Note that in special case of the median \(\tau=0.5\), \(\rho_{\tau}(y,f)=0.5|y-f|\) and so quantile regression is called Least Absolute Deviations (LAD) regression
- Sometimes used as alternative to OLS due to robustness to outliers
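- A minimal numerical check of the claim above (simulated data): minimizing the empirical check loss over a constant predictor recovers the sample quantile
#Sketch: minimizing the pinball loss recovers the tau-th quantile
set.seed(1)
y<-rexp(10000)
tau<-0.25
pinball<-function(f) mean((y-f)*(tau-(y-f<0)))
optimize(pinball,interval=range(y))$minimum #approximately the empirical 0.25 quantile
quantile(y,tau)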
Properties of quantile regression estimator
- Variants of the standard extremum estimator arguments show that quantile regression is consistent and asymptotically normal for any \(\tau\in(0,1)\)
- Some complications due to nondifferentiability of the loss function
- Because the check function is piecewise linear and the model is linear in \(\beta\), the optimization problem is a linear program, so reasonably fast code exists: rq() in library quantreg, with standard R formula syntax, e.g. rq(y~z1+z2,tau=0.2)
- quantgen extends to L1-penalized Lasso version, which is also a linear program
- Major advantage of quantile regression is that by running it for different values of \(\tau\), one can obtain \(\widehat{\beta}(\tau)\) for the entire quantile function (illustrated in the sketch after this list)
- Obtain uniform convergence to a Gaussian process over quantiles \([\epsilon,1-\epsilon]\) (Chernozhukov, Fernández-Val, and Melly (2013), Belloni et al. (2017))
- The linear form is correctly specified if the conditional distribution follows a location-scale model \(y_i=x_i^\prime\alpha+(x_i^\prime\gamma)u_i\), \(u_i\perp x_i\), \(u_i\overset{iid}{\sim}F(u)\)
- Then \(Q_{y_i|x_i}(\tau)=x_i^\prime\alpha+(x_i^\prime\gamma)Q_{u_i}(\tau)=x_i^\prime\beta(\tau)\) for \(\beta(\tau)=\alpha+\gamma Q_{u_i}(\tau)\)
- Result based on monotone invariance lemma: suppose \(h(\cdot)\) is monotone increasing: then \(Q_{h(y)}(\tau)= h(Q_{y}(\tau))\)
- Proof: \(Pr(y\leq Q_{y}(\tau))=\tau\) (def of quantile) so \(Pr(h(y)\leq h(Q_{y}(\tau)))=\tau\) (monotonicity) so \(h(Q_{y}(\tau))\) is the \(\tau\) quantile of \(h(y)\).
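- A minimal quantreg sketch (simulated location-scale data; all names illustrative): \(\widehat{\beta}(\tau)\) across a grid of quantiles
#Sketch: quantile regression across a grid of quantiles on location-scale data
library(quantreg)
set.seed(1)
n<-2000
z1<-runif(n); z2<-runif(n)
y<-1+z1+2*z2+(0.5+z1)*rnorm(n) #location-scale form: z1 enters the scale, so its coefficient varies with tau
qrfit<-rq(y~z1+z2,tau=c(0.1,0.25,0.5,0.75,0.9))
coef(qrfit) #beta(tau) for each quantile; z1 coefficient changes across tau, z2 coefficient does not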
Challenges with quantile regression
- The need for a continuous density can result in optimization and inference issues when \(y\) is discrete or has widely spaced support
- Results in slow or inaccurate optimization, and somewhat variable coefficients, as reflected in dependence of asymptotic variance on density
- Quantiles at values near 0 or 1 become very difficult to estimate due to small fraction of data in extremes
- If a proper quantile function over the whole unit interval is needed, one may extrapolate into the tails using a parametric model
- In case of misspecification, or even just due to sampling, estimated process may not preserve properties of a quantile function
- \(x^\prime\hat{\beta}(\tau)\) may fail to be increasing in \(\tau\) for some values of \(x\), leading the curves \(x\to x^\prime\hat{\beta}(\tau)\) to not be strictly ordered with respect to \(\tau\): called quantile crossing
- While calculating the process at nearby \(\tau\) values may result in some crossing, its presence can be a sign of misspecification
- A simple fix, if one needs, e.g., to invert the quantile function to recover a CDF, is monotone rearrangement (Chernozhukov, Fernández-Val, and Galichon (2010)): sort the values of \(x^\prime\hat{\beta}(\tau)\) from smallest to largest (see the sketch after this list)
- As the linear functional form can be restrictive, flexibility can be increased by including nonlinear terms in \(x\)
- As with mean regression, there is a tradeoff between complexity and ability to obtain fast rates of convergence and valid asymptotic inference
- Machine learning literature has applied the check loss in high dimensional and regularized settings including Lasso, neural nets, and so on
- Inference methods for coefficients based on normal approximations still require bounds on degree of complexity of x, restricting to low dimensional predictors
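- A minimal monotone rearrangement sketch (simulated data; all names illustrative): sort fitted quantiles at a fixed \(x\) to remove any crossing
#Sketch: monotone rearrangement of fitted quantiles at a single covariate value
library(quantreg)
set.seed(1)
n<-200 #small sample, so crossing is more likely
z1<-runif(n); z2<-runif(n)
y<-1+z1+2*z2+(0.5+z1)*rnorm(n)
taus<-seq(0.05,0.95,by=0.05)
qrfit<-rq(y~z1+z2,tau=taus)
qhat<-predict(qrfit,newdata=data.frame(z1=0.95,z2=0.05)) #may not be monotone in tau
sort(qhat) #rearranged values form a valid (monotone) quantile function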
Conditional Density Estimation
- Unlike CDF or quantile estimation, where estimation of the whole function can be achieved at the same rate as a parametric model, density estimation at even a single point requires substantially more data
- Local methods, like kernels, approximate distributions by averaging (smoothed) measures of the fraction of points within a window
- Kernel conditional density estimation for \(f(y|x)\) takes the ratio of a kernel density estimate of the joint density \(f(y,x)\) divided by an estimate of the marginal \(f(x)\)
- An estimator of \(f(y|x)\) of the form \(\widehat{f}(y|x)=\frac{\sum_{i=1}^{n}\frac{1}{h_y}K(\frac{y-Y_i}{h_y})\prod_{j=1}^{d_x}\frac{1}{h_j}K(\frac{x_j-X_{i,j}}{h_j})}{\sum_{i=1}^{n}\prod_{j=1}^{d_x}\frac{1}{h_j}K(\frac{x_j-X_{i,j}}{h_j})}\) achieves rate \(n^{-2/(d_x+5)}\) with optimal bandwidth (Hall, Racine, and Li (2004), npcdens in R library np); a hand-rolled version appears after this list
- Rate gets worse with number of predictors
- Alternate estimators use likelihood models for densities, which may be tailored to the application
- Lightly parameterized estimators may achieve high efficiency, at cost of high degree of fragility if misspecified
- Models where the number of parameters is effectively unbounded or grows with the sample size restore flexibility, at the cost of generally retaining dimension-dependent rates
- Examples include mixture models with a growing number of components, and exponential family models with log density proportional to a sum of coefficients times basis functions
- Bayesian nonparametric estimators based on these likelihoods, such as Dirichlet process mixture models and Gaussian process exponential family models, provide a way to work with and regularize such density models, but do not overcome the slow rate of convergence (Ghosal and Van der Vaart (2017))
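- A minimal hand-rolled version of the ratio-of-kernels estimator above (simulated data, Gaussian kernels, ad hoc bandwidths; npcdens in library np automates bandwidth choice)
#Sketch: kernel conditional density estimate f(y|x) as a ratio of kernel averages
set.seed(1)
n<-2000
x<-rnorm(n)
y<-x+rnorm(n)
fhat<-function(y0,x0,hy=0.3,hx=0.3){
  ky<-dnorm((y0-y)/hy)/hy
  kx<-dnorm((x0-x)/hx)/hx
  sum(ky*kx)/sum(kx)
}
fhat(0,0)    #estimate of f(y=0|x=0)
dnorm(0,0,1) #true value, since y|x=0 ~ N(0,1) in this simulation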
Weighting method: Trees
- One of the most generically useful classes of methods for conditional mean and distribution estimation is Random Forests
- These are made by combining a large number of simpler estimators called (decision) trees
- A (decision) tree is a piecewise constant function, defined sequentially
- Pick some predictor \(j\in 1\ldots d_x\), split space in two regions, where \(x_{j,i}\) is above or below a threshold \(s_1\)
- Within each region \(x_{j,i}<s_1\), \(x_{j,i}\geq s_1\), pick another predictor in \(1\ldots d_x\), and another threshold \(s_2\), and split the region in two
- Continue choosing and splitting regions in two until space is split into \(J\) non-overlapping rectangular subregions \(\{R_j\}_{j=1}^{J}\) called “leaves”
- Choice of subsequent predictor and threshold by “greedy” algorithm: recursive binary splitting
- For each predictor \(j=1\ldots d_x\), pick a threshold \(s\) to split into regions \(R_1=\{X:\ X_j<s\}\), \(R_2=\{X:\ X_j\geq s\}\) to minimize a loss function based on differences in the distribution of the outcome in \(R_1\) and \(R_2\)
- Choose the split \(j\in1\ldots d_x\) with the smallest loss to define the chosen region
- While trees are usually used for predicting conditional means or probabilities, more generally useful in that they create a partition of a space, and so can define a data-adaptive measure of which points are “close” in a sense relevant to the outcome by assigning each point \(x\) to \(j(x)\), the leaf in which it is contained
- Final tree contains a collection of weights for any point in the space: \(w_i(x)=\frac{1\{x_i\in R_{j(x)}\}}{|\{ i:\ x_i\in R_{j(x)}\}|}\)
- Estimate \(E[Y|X=x]\) by \(\sum_i w_i(x)Y_i\), the average value within a leaf (see the sketch after this list)
- If performing inference, it may be helpful to split the samples used to estimate the weights and the conditional means, giving rise to “honest” trees (Wager and Athey (2018), R library grf)
- Loss functions may be defined based on difference in means in case where that is a goal, or based on tests for identical distributions
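- A minimal sketch of the leaf-weight view (simulated data, rpart tree; all names illustrative): the leaf containing a point determines which observations receive weight
#Sketch: a fitted tree's leaves define weights; the weighted average reproduces predict()
library(rpart)
set.seed(1)
n<-1000
x1<-runif(n); x2<-runif(n)
y<-2*(x1>0.5)+x2+rnorm(n)
dat<-data.frame(y,x1,x2)
tree<-rpart(y~x1+x2,data=dat)
leaf<-tree$where                      #leaf index of each training observation
w<-(leaf==leaf[1])/sum(leaf==leaf[1]) #weights w_i(x) for x = first observation
sum(w*y)                              #leaf average
predict(tree,dat[1,])                 #same number via the usual interface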
Bagging
- Bagging, short for “bootstrap aggregation”, is a way of performing regularization by averaging a model with itself
- Idea is that model fit to one set of data can overfit, producing high variability
- If we had \(2^{nd}\) data set, could fit model on that data and average prediction with that of the first model
- Even though both overfit, they do so to different data, so shared component reflects systematic variation
- In practice, only have one data set, but can “pull yourself up by your own bootstraps” by resampling the data
- From original data set \(\mathcal{O}_{n}\) draw observations \((y_i,X_i)\) independently with replacement to create new data set \(\mathcal{O}_{b}\)
- Data set will on average have only about \(\frac{2}{3}\) of original observations, with rest as duplicates
- Fit any model \(\widehat{f}_b\) on data \(\mathcal{O}_{b}\), e.g., a decision tree by recursive partitions
- Then draw with replacement again to create another data set and fit the same model again to the new data set
- Repeat \(B\) times, and use as the final model the average of all the models fit on the different data sets \(\mathcal{O}_b\) (see the sketch after this list)
- \(\widehat{f}^{\text{Bagged}}=\frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b\)
- Note each bagged predictor is fit to only part of the data set, so each one gains somewhat less from large samples
- But because errors produced this way nonsystematic, may get substantial out of sample improvement, especially if using a complex model prone to overfitting like a deep decision tree
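- A minimal bagging sketch (simulated data, rpart trees; all names illustrative): average the predictions of trees fit to bootstrap resamples
#Sketch: bagging by hand -- fit a tree to each bootstrap resample and average predictions
library(rpart)
set.seed(1)
n<-500
x<-runif(n)
y<-sin(2*pi*x)+rnorm(n,sd=0.3)
dat<-data.frame(y,x)
xgrid<-data.frame(x=seq(0,1,length.out=100))
B<-50
preds<-replicate(B,{
  boot<-dat[sample(n,replace=TRUE),]          #bootstrap resample
  predict(rpart(y~x,data=boot),newdata=xgrid) #single-tree prediction on a grid
})
bagged<-rowMeans(preds)                       #bagged prediction on the grid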
Random Forests
- Bagging can be applied to any procedure, but does well when fit model is a tree
- Large trees can incorporate info from many predictors, but variability can be high, which is helped by bagging
- Further, because average of trees doesn’t have to be a tree, bagging can also increase expressivity of model
- In practice, because data is shared between bags, tree fit within each tends to be pretty similar
- To increase expressivity of ensemble as a whole, force models to be different by only using a random subset of features
- In each bag, sample \(m<d_x\) random elements of \((x_{1,i},x_{2,i},\ldots,x_{d_x,i})\) and fit the tree using only those features
- Typically use \(m\approx \frac{d_x}{3}\) for regression, \(m\approx \sqrt{d_x}\) for classification
- Since “top” features may be removed, forced to find best prediction with other features instead
- Each individual tree predicts less well, but with \(B\) large, combination accounts for all features by including them in the average
- Final predictor is based on random collection of trees, and so is called a Random Forest
- Weights \(w_i(x)\) in random forest are average over bags of weights in individual trees, and prediction of \(E[Y|X=x]\) is \(\sum_i w_i(x)Y_i\)
- A quantile random forest (Meinshausen and Ridgeway (2006), R library quantregForest) is a random forest applied to \(1\{Y_i<u\}:\ u\in\mathcal{U}\) (see the sketch after this list)
- A distributional random forest (Ćevid et al. (2021), R library drf) computes the weights based on a test of identical distributions applicable to high dimensional data
- The weights can then be applied to means, kernels, and other outcomes
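- A minimal quantile random forest sketch (simulated data; assumes the quantregForest interface where predict accepts a vector of quantiles through the what argument)
#Sketch: quantile random forest for conditional quantiles (quantregForest package)
library(quantregForest)
set.seed(1)
n<-2000
X<-data.frame(x1=runif(n),x2=runif(n))
y<-X$x1+(0.5+X$x2)*rnorm(n)
qrf<-quantregForest(x=X,y=y)
predict(qrf,newdata=data.frame(x1=0.5,x2=0.5),what=c(0.1,0.5,0.9)) #conditional quantiles at a point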
Adjustment
- Counterfactual CDF \(F_{Y^x}(u)\) is identifiable under conditional random assignment as the averaged conditional CDF \(\int E[1\{Y<u\}|X=x,Z=z]dP(z)\)
- When applying the adjustment formula, one can use averaged conditional mean estimation, inverse propensity weighting, or augmented IPW doubly robust approaches (a minimal IPW sketch appears after this list)
- For distribution or quantile regression as conditional mean estimator, results for averaged conditional mean estimation are in Chernozhukov, Fernández-Val, and Melly (2013)
- IPW estimation proposed by DiNardo, Fortin, and Lemieux (1996), though without limit theory
- For AIPW approach, including cross-fitting to allow high dimensional predictors in the conditional mean or propensity estimates, uniformity results are provided in Belloni et al. (2017)
- Note that while the distribution of \(Z\) is averaged over, \(u\) is not, so the difference from the mean estimation case is that estimation and inference are for a function, not just a scalar
- Statistical efficiency depends on the complexity of the function class \(\{G(Y,u): u\in\mathcal{U}\}\)
- This is still small, essentially parametric, for applications like CDF or quantiles of outcome distribution
- With averaged mean or IPW estimator, \(\sqrt{n}\) rates go through if nuisance estimators also parametric and low complexity
- If we use Neyman orthogonal cross-fit estimator like AIPW, avoid additional dependence on complexity of nuisance functions even if conditioning set is high dimensional and nuisances estimated by something like, e.g., quantile random forests or L1-penalized distribution regression
- Under reasonable conditions, the AIPW estimate is asymptotically Gaussian over the space of the process, and distributions can be estimated by a bootstrap procedure
- Can test hypotheses about a function with Gaussian limit distribution by comparing maximum (Kolmogorov-Smirnov) or average (Cramer-von Mises) difference to null hypothesis function.
- This standard limit behavior breaks down when we estimate the outcome distribution by a density instead of a CDF or quantiles, as densities are not estimable at parametric rates
- DiNardo, Fortin, and Lemieux (1996) applies IPW to kernel density estimate of outcome (wage) distribution
- Kennedy, Balakrishnan, and Wasserman (2021) estimate finite dimensional projections of densities, which can be made robust since essentially parametric
- Precisely what can be said about when averaged density estimation can do better than plug-in still appears to be an open question
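- A minimal IPW sketch (simulated data with a single confounder; logit propensity and a hand-computed weighted empirical CDF)
#Sketch: IPW estimate of the counterfactual CDF F_{Y^1}(u) under conditional random assignment
set.seed(1)
n<-5000
z<-rnorm(n)                              #confounder
x<-rbinom(n,1,plogis(z))                 #treatment probability depends on z
y<-x+z+rnorm(n)                          #outcome depends on treatment and z
phat<-fitted(glm(x~z,family=binomial))   #estimated propensity score
F1ipw<-function(u) mean((x/phat)*(y<=u)) #weighted mean of the indicator
F1ipw(1)                                 #estimate of F_{Y^1}(1); truth here is pnorm(1,mean=1,sd=sqrt(2))=0.5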
Data Example: Wage Distribution Response to Labor Supply Shock
- Breza, Kaur, and Shamdasani (2021) experimentally vary (residual) labor supply in 60 villages by randomizing temporary hiring of large number of workers
- Average wage of remaining workers rises if shock is in peak agricultural season, but not substantially in off-peak season, a fact the authors attribute to surplus labor (per Arthur Lewis)
- Heterogeneity is of interest: how does wage distribution respond to shock?
- Simplest approach: conditional empirical CDF (valid with discrete treatment and predictors only)
#Libraries
library(quantreg) #quantile regression
library(quantregForest) #quantile random forests
library(ggplot2) #Plots
library(drf) #distributional random forests
library(Counterfactual) #Conditional counterfactual distributions
library(npcausal) #Conditional counterfactual distributions
library(haven) #Stata data
library(dplyr) #data manipulation
#Data set from Breza Kaur and Shamdasani (2021)
#In my case this is saved at the path below
rationing<-read_dta("Data/rationing_worker.dta")
spilloversample<-dplyr::filter(rationing,surv_el==1 & sample_spillover==1 & e_manwage==1)
spilloversample$treated<-factor(spilloversample$treat)
spilloversample$peakseason<-factor(spilloversample$peak, labels=c("Off-Peak Season","Peak Season"))
ggplot(spilloversample)+facet_wrap(~peakseason)+
stat_ecdf(aes(x=dailywagetot,color=treated))+
ggtitle("Wage Distribution Response to Labor Supply Shock: Conditional CDF",
subtitle="Breza, Kaur, and Shamdasani (2021)")+
ylab("Fraction of Workers")+xlab("Daily Total Wage (Rupees)")
- Alternately represent by conditional kernel density, which applies smoothing and may improve interpretation, but is less precisely estimated due to slower rate of density vs CDF estimation
ggplot(spilloversample)+facet_wrap(~peakseason)+
geom_density(aes(x=dailywagetot,fill=treated,color=treated),alpha=0.5)+
ggtitle("Wage Distribution Response to Labor Supply Shock: Conditional Kernel Density",
subtitle="Breza, Kaur, and Shamdasani (2021)")+
ylab("Probability Density of Workers")+xlab("Daily Total Wage (Rupees)")
