- Most tools described so far estimate a low dimensional summary of response to intervention
- Typically, an average potential outcome \(E[Y^x]\) (in our usual notation)

- Running theme of identification results has been avoidance of assumptions of homogeneity of effects
- Move past linear additive model, which is typically justified only by convenience

- If this is a concern, it would be nice to measure this heterogeneity
- How much do outcomes vary and how are they distributed?
- Can we predict this variation in the outcomes?

- These two tasks lead to two estimation challenges
- Outcome distribution modeling
- Conditional treatment effect estimation

- Heterogeneity matters to extent we care about features beyond average
- Preferences over inequality across individuals or risk over outcomes may make distribution inherently interesting
- Through aggregation, it may also impact group-level outcomes

- Prediction of effect heterogeneity may be used in interpretation, exploration, and decision-making
- Fundamentally a mix of causal and description or prediction problems
- Arguably, main use is as input into one of several possible fully causal problems, so applicability depends on intended use
- Different targets if goal is multivariate causal effect, specification of causal pathways, or policy choice

- Almost every method we have described for identifying an average potential outcome can be trivially converted into a method for identifying the distribution of potential outcomes
- Replace \(Y\) with \(G(Y,u)\) for a collection of functions \(\{G(\cdot,u):\ u\in\mathcal{U}\}\)
- Identification result for \(E[Y^x]\) also gives value for \(E[G(Y^x,u)]\) for any \(u\in\mathcal{U}\) if expectation is finite
- Holds for experiments, control, IV, RD, and almost any other method which does not impose functional form restrictions on outcome distribution
- Exception is for DiD, due to fundamentally functional-form dependent nature of parallel trends assumption
- See Athey and Imbens (2006) for results under distinct but related assumptions which do allow identifying distributions

- \(G(Y,u)=1\{Y\leq u\}\), \(\mathcal{U}\) the support of \(Y\), identifies the CDF \(F_{Y^{x}}(u)\) of the counterfactual distribution
- From the CDF, can identify other representations of the distribution
- If distribution is absolutely continuous, **density function** is \(f_{Y^x}(u)=\frac{d}{du}F_{Y^{x}}(u)\)
- If \(Y\in\mathbb{R}\), inverse of distribution describes **quantile function** \(Q_{Y^x}(\tau)=\inf\{u\in\mathbb{R}: F_{Y^x}(u)\geq \tau\}\)
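To make the transform idea concrete, here is a minimal sketch on simulated experimental data (all names and parameters hypothetical): averaging the indicator transform \(1\{Y\leq u\}\) within a randomized arm estimates the counterfactual CDF, and the inverse of the estimated CDF gives quantiles.

```r
# Sketch (simulated experiment): estimate F_{Y^x}(u) by averaging G(Y,u) = 1{Y <= u}
# within each randomized arm, then invert the CDF for quantiles
set.seed(42)
n <- 10000
x <- rbinom(n, 1, 0.5)               # randomized treatment
y <- rnorm(n, mean = 1 + x, sd = 1)  # outcome with unit treatment effect
u <- 1.5
F1_hat <- mean((y <= u)[x == 1])     # estimate of F_{Y^1}(1.5)
F0_hat <- mean((y <= u)[x == 0])     # estimate of F_{Y^0}(1.5)
# Quantile function as generalized inverse of the estimated CDF
Q1_hat <- quantile(y[x == 1], 0.5, type = 1)  # median of Y^1
c(F1 = F1_hat, F0 = F0_hat, Q1 = unname(Q1_hat))
```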

- \(G(Y,u)=Y^u\) for \(u=1,2,\ldots\) gives the **moments** of the distribution (when these exist)
- Alternate choices give various transforms of the distribution
- Complex exponentials \(e^{iu(.)}\) give the Characteristic Function or Fourier Transform
- Other choices give a few less common but sometimes effective summaries like Wavelet Transform or Kernel Mean Embedding (Singh, Xu, and Gretton (2021))

- If goal is counterfactual distributions, main challenge is not identification but estimation

- By subtracting average potential outcomes, can obtain average treatment effect \(E[Y^x-Y^{x^\prime}]\) of change from \(x^\prime\) to \(x\)
- By subtracting averages of nonlinear transforms of potential outcomes, can measure effect on that transform
- \(F_{Y^{x}}(u)-F_{Y^{x^\prime}}(u)\) measures effect on probability of outcome being below \(u\)
- \(Q_{Y^x}(\tau)-Q_{Y^{x^\prime}}(\tau)\) is the **quantile treatment effect** (at quantile \(\tau\))
- Measures the effect of change in \(x\) on the \(\tau^{th}\) quantile of the outcome
- E.g., for \(\tau=0.5\), this is the effect of treatment on median result
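As a hedged illustration (simulated data, not from any application in the text), quantile treatment effects can be read off as differences of marginal sample quantiles; here treatment widens the spread of outcomes, so QTEs are negative in the left tail and positive in the right tail while the median effect is near zero.

```r
# Quantile treatment effects as differences of marginal quantiles (simulated)
set.seed(1)
y0 <- rnorm(5000, mean = 0, sd = 1)  # draws from Y^{x'} distribution
y1 <- rnorm(5000, mean = 0, sd = 2)  # draws from Y^{x} distribution
taus <- c(0.1, 0.5, 0.9)
qte <- quantile(y1, taus) - quantile(y0, taus)  # QTE at each tau
qte
```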

- Measures of this form describe the effect of treatments on distributions
- Aside from ATE, they do not describe features of the distribution of effects \(Y^x-Y^{x^\prime}\)
- Reason for this is the fundamental problem of causal inference
- Even in experiments, never observe both \(Y^x\) and \(Y^{x^\prime}\) for the same unit

- Aside from ATE, **no** other features of the effect distribution are identified

- Even in experiments, never observe both \(Y^x\) and \(Y^{x^\prime}\) for the same unit
- If we care about effect distributions, have 3 possible options
- Learn **bounds** on the feature of interest
- Apply conditioning to learn averages within a subset
- Impose additional assumptions on the structural causal model


- Counterfactuals depend on joint distribution \(F(t,s)=Pr(Y^{x}<t,Y^{x^\prime}<s)\)
- Experiments allow us to learn the marginal distributions \(F_x(t)=F(t,\infty)\), \(F_{x^\prime}(s)=F(\infty,s)\)
- What can we learn about a joint from its marginals? **Fréchet-Hoeffding** bounds provide one answer
- \(\max(F_x(t)+F_{x^\prime}(s)-1,0)\leq F(t,s)\leq \min(F_x(t),F_{x^\prime}(s))\) (form in Nelsen (2007) 2.5.1 or any other reference on copulas)

- Upper and lower bounds correspond to deterministic increasing and decreasing matching of points in the \(Y^x\) and \(Y^{x^\prime}\) distributions, respectively
- Applying this result to the *difference* in potential outcomes, obtain Makarov bounds on \(F_{Y^{x}-Y^{x^\prime}}(s)\)
- \(\underset{t}{\sup}\max(F_{x}(t)-F_{x^\prime}(t-s),0)\leq F_{Y^{x}-Y^{x^\prime}}(s)\leq \underset{t}{\inf}\min(F_{x}(t)-F_{x^\prime}(t-s)+1,1)\)
- These bounds are tight for CDF at any given point \(s\), though Firpo and Ridder (2019) provide stronger results for bounds over multiple points
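A minimal sketch of the Makarov bounds (simulated marginals, with the sup/inf approximated on an ad hoc grid of \(t\) values): plug the two estimated marginal CDFs into the bound formulas.

```r
# Makarov bounds on F_{Y^x - Y^{x'}}(s) from estimated marginal CDFs (simulated)
set.seed(2)
y1 <- rnorm(5000, 1, 1)  # draws from Y^x
y0 <- rnorm(5000, 0, 1)  # draws from Y^{x'}
F1 <- ecdf(y1); F0 <- ecdf(y0)
makarov <- function(s, tgrid = seq(-10, 10, by = 0.01)) {
  d <- F1(tgrid) - F0(tgrid - s)
  c(lower = max(pmax(d, 0)), upper = min(pmin(d + 1, 1)))
}
makarov(0)  # bounds on Pr(Y^x - Y^{x'} <= 0)
```

For marginals that overlap this much, the bounds are wide, which is exactly the point: marginals alone say little about the distribution of effects.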

- One can say substantially more if willing to impose functional form assumptions on \(f_Y(x,U)\), the structural function generating \(Y\)
- Linearity, additivity, or monotonicity, and restrictions on dimension can all (nearly) pin down distribution of treatment effects
- You should have substantive reasons to impose these assumptions if you are going to rely on them for identification and not just as approximations

- A few cases where functional form assumption of structural function may be reasonable
- Known source of heterogeneity due to structure of measurement instrument: e.g., validation using exact ground truth of error structure induced by your microscope/biomedical test/survey instrument
- Case where potential outcomes known to decision-maker, in which case preferences may ensure monotonicity, as in some economic models of choice (French and Taber (2011))

- Bounds can be strengthened using conditional distributions of counterfactuals
- Conditionals interesting in their own right and we will return to them later

- Because identification results are based on same formulas with new outcome variables \(G(Y,u)\) in place of \(Y\), one can repurpose the same estimators
- The main difference is that conditional mean estimators for \(E[Y|X]\), \(E[Y|X,Z]\) should be replaced by ones which perform well across functions \(G(Y,u), u\in\mathcal{U}\)
- In the case \(G(Y,u)=1\{Y\leq u\}\), regression estimates for \(1\{Y\leq u\}\) across different values of \(u\) are called **distribution regression**
- Because the outcome is binary, can apply probit \(P(Y\leq u|X)=\Phi(X^{\prime}\beta(u))\), logit \(P(Y\leq u|X)=\frac{\exp(X^{\prime}\beta(u))}{1+\exp(X^{\prime}\beta(u))}\), or a linear probability model
- Estimated by MLE/regression applied separately at each value of \(u\) so requires no special programs
- Uniform convergence across \(u\) and inference methods in Chernozhukov, Fernández-Val, and Melly (2013)
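A sketch of distribution regression on simulated data (variable names hypothetical): fit a separate logit for the binary outcome \(1\{Y\leq u\}\) at each threshold \(u\) using plain `glm()`.

```r
# Distribution regression: separate binary-outcome logit at each threshold u
set.seed(3)
n <- 2000
x <- rnorm(n)
y <- 1 + x + rnorm(n)
ugrid <- quantile(y, c(0.25, 0.5, 0.75))  # a few thresholds u
# P(Y <= u | X) = exp(X'b(u)) / (1 + exp(X'b(u))), fit by MLE at each u
fits <- lapply(ugrid, function(u) glm(I(y <= u) ~ x, family = binomial()))
sapply(fits, coef)  # beta(u) varies freely across thresholds
```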

- Other choices of \(G(Y,u), u\in\mathcal{U}\), such as basis functions, can be estimated by similar procedure

- Variety of alternative conditional distribution estimators which are not defined as collection of conditional means
- Representation using quantiles needs new procedure: quantile regression
- Representations based on densities have different properties
- Kernel, projection, and likelihood based estimators available
- Generally harder to estimate, in terms of slower rates

- Statistical properties matter when the estimates are used as one component of a subsequent formula

- Versatile, widely applied tool for modeling conditional quantiles \(Q_{Y|X}(\tau)\)
- Fact that enables use is that a quantile is *elicitable* using the **check function** (or **pinball**) loss \(\rho_{\tau}(y,f)=(y-f)(\tau-1\{y-f<0\})\)
- A feature \(j(F)\) of a distribution \(F\) is *elicitable* if there is some loss function \(\ell(\cdot,\cdot)\) such that the risk \(E_F[\ell(y,f)]\) of a prediction rule \(f\) under that loss is (uniquely) minimized by \(f=j(F)\)

- **Claim**: Suppose \(y\) has density \(p(y)\): then \(Q_y(\tau) = \underset{f\in\mathbb{R}}{\arg\min}\,E[\rho_{\tau}(y,f)]\)
- **Proof**: Write \(E[\rho_{\tau}(y,f)]=(\tau-1)\int_{-\infty}^{f}(y-f)dP(y)+\tau\int_{f}^{\infty}(y-f)dP(y)\) and take the FOC by the Leibniz rule
- \(\frac{d}{df}E[\rho_{\tau}(y,f)]=(\tau-1)\left((f-f)p(f)-\int_{-\infty}^{f}dP(y)\right)-\tau\left((f-f)p(f)+\int_{f}^{\infty}dP(y)\right)\)
- \(=(1-\tau)P(f)-\tau(1-P(f))=P(f)-\tau\)
- Setting \(\frac{d}{df}E[\rho_{\tau}(y,f)]=0\), obtain \(P(f)=\tau\), or \(f\) is the \(\tau^{th}\) quantile of the \(y\) distribution
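The claim is easy to check numerically (simulated data; grid minimization stands in for the first-order condition):

```r
# Numeric check: the minimizer of mean check-function loss over a grid
# matches the tau-th sample quantile
set.seed(4)
y <- rnorm(2000)
tau <- 0.3
risk <- function(f) mean((y - f) * (tau - (y - f < 0)))  # empirical check loss
fgrid <- seq(-3, 3, by = 0.001)
fhat <- fgrid[which.min(sapply(fgrid, risk))]
c(minimizer = fhat, sample_quantile = unname(quantile(y, tau, type = 1)))
```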

- Without the existence of a density, a subdifferential argument shows that the *set* of minimizers contains the quantile
- Identical proof using the conditional density shows that \(Q_{y|x}(\tau) = \underset{f(x)\in\mathbb{R}}{\arg\min}\,E[\rho_{\tau}(y,f(x))|X=x]\)
- Result is that the check function loss can be used with any procedure that minimizes a loss function to find a predictor which approximates the conditional quantile
- **Quantile regression** (Koenker (2005), Koenker (2017)) applies empirical risk minimization with the check function loss over the class of linear functions \(f(x)=x^\prime\beta\)
- \(\widehat{\beta}(\tau)=\underset{\beta}{\arg\min}\frac{1}{n}\sum_{i=1}^{n}\rho_{\tau}(y_i,x_i^\prime\beta)\)

- Note that in the special case of the median \(\tau=0.5\), \(\rho_{\tau}(y,f)=0.5|y-f|\), and so quantile regression is called **Least Absolute Deviations** (LAD) regression
- Sometimes used as an alternative to OLS due to robustness to outliers

- Variants of the standard extremum estimator arguments show that quantile regression is consistent and asymptotically normal for any \(\tau\in(0,1)\)
- Some complications due to nondifferentiability of the loss function

- By linearity, the optimization problem for linear quantile regression is a linear program, so reasonably fast code exists: `rq()` in library `quantreg`, with standard R formula syntax, e.g., `rq(y~z1+z2,tau=0.2)`

- `quantgen` extends to an L1-penalized Lasso version, which is also a linear program

- Major advantage of quantile regression is that by running it for different values of \(\tau\), can obtain \(\widehat{\beta}(\tau)\) for the entire *quantile function*
- Obtain uniform convergence to a Gaussian process over quantiles \([\epsilon,1-\epsilon]\) (Chernozhukov, Fernández-Val, and Melly (2013), Belloni et al. (2017))

- The linear form is correctly specified if conditional distribution follows a location-scale model \(y_i=x_i^\prime\alpha+(x_i^\prime\gamma)u_i\), \(u_i\perp x_i\) \(u_i\overset{iid}{\sim}F(u)\)
- Then \(Q_{y_i|x_i}(\tau)=x_i^\prime\alpha+(x_i^\prime\gamma)Q_{u_i}(\tau)=x_i^\prime\beta(\tau)\) for \(\beta(\tau)=\alpha+\gamma Q_{u_i}(\tau)\)
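A simulation sketch of the location-scale implication (hypothetical parameters; a binary regressor so conditional quantiles can be computed directly from subsamples):

```r
# Location-scale model: Q_{y|x}(tau) = x*alpha + (x*gamma)*Q_u(tau)
set.seed(5)
n <- 20000
x <- sample(c(1, 2), n, replace = TRUE)
alpha <- 1; gamma <- 0.5
y <- x * alpha + (x * gamma) * rnorm(n)  # location-scale with F = standard normal
tau <- 0.9
res <- sapply(c(1, 2), function(xv) {
  c(empirical = unname(quantile(y[x == xv], tau)),      # Q_{y|x}(tau)
    implied   = xv * alpha + (xv * gamma) * qnorm(tau)) # x*alpha + (x*gamma)*Q_u(tau)
})
res
```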

- Result based on *monotone invariance* lemma: suppose \(h(\cdot)\) is monotone increasing: then \(Q_{h(y)}(\tau)= h(Q_{y}(\tau))\)
- **Proof**: \(Pr(y\leq Q_{y}(\tau))=\tau\) (def of quantile), so \(Pr(h(y)\leq h(Q_{y}(\tau)))=\tau\) (monotonicity), so \(h(Q_{y}(\tau))\) is the \(\tau\) quantile of \(h(y)\)
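The lemma can be spot-checked in R with a strictly increasing \(h\) (here `exp`) and type-1 sample quantiles, which are order statistics and so commute exactly with increasing transformations:

```r
# Monotone invariance: Q_{h(y)}(tau) = h(Q_y(tau)) for increasing h
set.seed(6)
y <- rnorm(500)
tau <- 0.3
q_hy <- quantile(exp(y), tau, type = 1)  # Q_{h(y)}(tau)
h_qy <- exp(quantile(y, tau, type = 1))  # h(Q_y(tau))
all.equal(unname(q_hy), unname(h_qy))
```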

- The need for a continuous density can result in optimization and inference issues when \(y\) is discrete or widely spaced
- Results in slow or inaccurate optimization, and somewhat variable coefficients, as reflected in dependence of asymptotic variance on density

- Quantiles at values near 0 or 1 become very difficult to estimate due to small fraction of data in extremes
- If needing a proper function on entire space, may seek to extrapolate using a parametric model

- In case of misspecification, or even just due to sampling, the estimated process may not preserve properties of a quantile function
- \(x^\prime\hat{\beta}(\tau)\) may fail to be increasing in \(\tau\) for some values of \(x\), leading curves \(x\to x^\prime\hat{\beta}(\tau)\) to not be strictly ordered with respect to \(\tau\): called **quantile crossing**
- While calculating the process at nearby \(\tau\) values may result in some crossing, its presence can be a sign of misspecification
- A simple fix, if needing, e.g., to invert the quantile to get back a CDF, is **monotone rearrangement** (Chernozhukov, Fernández-Val, and Galichon (2010)): sort the values of \(x^\prime\hat{\beta}(\tau)\) from smallest to largest
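A toy sketch of monotone rearrangement (hypothetical estimated quantile curve with crossings): sorting restores monotonicity in \(\tau\) without changing the set of fitted values.

```r
# Monotone rearrangement of a quantile-curve estimate that crosses
taus <- seq(0.1, 0.9, by = 0.1)
qhat <- c(1.0, 1.3, 1.2, 1.5, 1.6, 1.55, 1.8, 2.0, 2.4)  # crossings at tau = 0.3, 0.6
qfix <- sort(qhat)  # rearranged estimate, nondecreasing in tau
rbind(original = qhat, rearranged = qfix)
```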

- As the linear functional form can be restrictive, flexibility can be increased by including nonlinear terms in \(x\)
- As with mean regression, there is a tradeoff between complexity and ability to obtain fast rates of convergence and valid asymptotic inference
- Machine learning literature has applied check loss in high dimensional and regularized settings including Lasso, neural nets, and so on
- Inference methods for coefficients based on normal approximations still require bounds on degree of complexity of x, restricting to low dimensional predictors

- Unlike CDF or quantile estimation, where estimation of the whole function can be achieved at the same rate as a parametric model, density estimation at even a single point requires substantially more data
- Local methods, like kernels, approximate distributions by averaging (smoothed) measures of the fraction of points within a window
- Kernel conditional density estimation for \(f(y|x)\) takes the ratio of a kernel density estimate of the joint density \(f(y,x)\) divided by an estimate of the marginal \(f(x)\)
- An estimator of \(f(y|x)\) of form \(\frac{\sum_{i=1}^{n}\frac{1}{h_y}K(\frac{y-Y_i}{h_y})\Pi_{j=1}^{d_x}\frac{1}{h_j}K(\frac{x_j-X_{i,j}}{h_j})}{\sum_{i=1}^{n}\Pi_{j=1}^{d_x}\frac{1}{h_j}K(\frac{x_j-X_{i,j}}{h_j})}\) achieves rate \(n^{-2/(d_x+5)}\) with optimal bandwidth (Hall, Racine, and Li (2004); `npcdens` in R library `np`)
- Rate gets worse with number of predictors
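A minimal base-R version of the ratio estimator above, with Gaussian kernels and ad hoc (not optimally chosen) bandwidths, just to show the mechanics:

```r
# Kernel conditional density: f(y0|x0) = sum_i K(y0-Y_i) K(x0-X_i) / sum_i K(x0-X_i)
set.seed(7)
n <- 5000
X <- rnorm(n)
Y <- X + rnorm(n)                      # so Y | X = x0 is N(x0, 1)
K <- function(z, h) dnorm(z / h) / h   # scaled Gaussian kernel
fcond <- function(y0, x0, hy = 0.3, hx = 0.3) {
  wx <- K(x0 - X, hx)                  # kernel weights in the x direction
  sum(K(y0 - Y, hy) * wx) / sum(wx)
}
fcond(0, 0)  # compare with the true density dnorm(0, 0, 1), up to smoothing bias
```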

- Alternate estimators use likelihood models for densities, which may be tailored to the application
- Lightly parameterized estimators may achieve high efficiency, at cost of high degree of fragility if misspecified
- Models where the number of parameters is effectively unbounded or growing with the sample size restore flexibility, at cost of generally retaining dimension-dependent rates
- Examples include mixture models with growing numbers of components, and exponential family models with log density proportional to the sum of a set of coefficients times basis functions
- Bayesian nonparametric estimators based on these likelihoods, such as Dirichlet process mixture models and Gaussian process exponential family models, provide a way to work with and regularize such density models, but do not overcome the slow rate of convergence (Ghosal and Van der Vaart (2017))

- One of the most generically useful classes of methods for conditional mean and distribution estimation is **random forests**
- These are made by combining a large number of simpler estimators called (decision) trees

- A (decision) **tree** is a piecewise constant function, defined sequentially
- Pick some predictor \(j\in 1\ldots d_x\), split the space in two regions, where \(x_{j,i}\) is above or below a threshold \(s_1\)
- Within each region \(x_{j,i}<s_1\), \(x_{j,i}\geq s_1\), pick another predictor in \(1\ldots d_x\) and another threshold \(s_2\), and split the region in two
- Continue choosing and splitting regions in two until the space is split into \(J\) non-overlapping rectangular subregions \(\{R_j\}_{j=1}^{J}\) called "leaves"

- Choice of subsequent predictor and threshold by "greedy" algorithm: **recursive binary splitting**
- For each predictor \(x_{j,i}\), \(j=1\ldots d_x\), pick a threshold \(s\) to split into regions \(R_1=\{X:\ X_j<s\}\), \(R_2=\{X:\ X_j\geq s\}\) to minimize a loss function based on differences in the distribution of the outcome in \(R_1\) and \(R_2\)
- Choose the split \(j\in1\ldots d_x\) with the smallest loss to define the chosen region
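A bare-bones sketch of one step of recursive binary splitting with squared-error loss (simulated data where the true split is on the first predictor at 0.5; the brute-force search is for illustration, not efficiency):

```r
# One step of recursive binary splitting: search predictors and thresholds
set.seed(8)
n <- 500
X <- matrix(runif(n * 2), n, 2)
y <- ifelse(X[, 1] < 0.5, 0, 2) + rnorm(n, sd = 0.3)  # true split: X1 at 0.5
best_split <- function(X, y) {
  best <- list(loss = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))[-1]) {  # candidate thresholds
      left <- X[, j] < s
      loss <- sum((y[left] - mean(y[left]))^2) +  # within-region squared error
              sum((y[!left] - mean(y[!left]))^2)
      if (loss < best$loss) best <- list(j = j, s = s, loss = loss)
    }
  }
  best
}
best_split(X, y)[c("j", "s")]
```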

- While trees are usually used for predicting conditional means or probabilities, they are more generally useful in that they create a *partition* of a space, and so can define a data-adaptive measure of which points are "close" in a sense relevant to the outcome by assigning each point \(x\) to \(j(x)\), the leaf in which it is contained
- Final tree contains a collection of weights for any point in the space: \(w_i(x)=\frac{1\{x_i\in R_{j(x)}\}}{|\{ i:\ x_i\in R_{j(x)}\}|}\)
- Estimate \(E[Y|X=x]\) by \(\sum_i w_i(x)Y_i\), the average value within a leaf

- If performing inference, it may be helpful to split the samples used to estimate the weights and the conditional means, giving rise to "honest" trees (Wager and Athey (2018), R library `grf`)

- Loss functions may be defined based on difference in means in case where that is a goal, or based on tests for identical distributions

- Bagging, short for "bootstrap aggregation", is a way of performing regularization by *averaging a model with itself*
- Idea is that a model fit to one set of data can overfit, producing high variability
- If we had \(2^{nd}\) data set, could fit model on that data and average prediction with that of the first model
- Even though both overfit, they do so to different data, so shared component reflects systematic variation

- In practice, only have one data set, but can "pull yourself up by your own bootstraps" by **resampling** the data
- From original data set \(\mathcal{O}_{n}\), draw observations \((y_i,X_i)\) independently *with replacement* to create new data set \(\mathcal{O}_{b}\)
- Data set will on average have only about \(\frac{2}{3}\) of original observations, with rest as duplicates
- Fit any model \(\widehat{f}_b\) on data \(\mathcal{O}_{b}\), e.g., a decision tree by recursive partitions
- Then draw with replacement again to create another data set and fit the same model again to the new data set
- Repeat \(B\) times, and use as the final model the average of all the models on different data sets \(\mathcal{O}_b\)
- \(\widehat{f}^{\text{Bagged}}=\frac{1}{B}\sum_{b=1}^{B}\widehat{f}_b\)
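A toy illustration of the procedure (the base learner here is a deliberately crude one-split "stump", purely for illustration): fit on bootstrap resamples and average the predictions.

```r
# Bagging: average a simple base learner over B bootstrap resamples
set.seed(9)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
fit_stump <- function(x, y) {            # crude base learner: split at the median
  s <- median(x)
  list(s = s, mL = mean(y[x < s]), mR = mean(y[x >= s]))
}
predict_stump <- function(st, x0) ifelse(x0 < st$s, st$mL, st$mR)
B <- 200
preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)       # bootstrap resample with replacement
  predict_stump(fit_stump(x[idx], y[idx]), 0.25)
})
mean(preds)  # bagged prediction at x = 0.25
```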

- Note each predictor fit to part of data set, so don’t gain quite as much for each from large samples
- But because errors produced this way nonsystematic, may get substantial out of sample improvement, especially if using a complex model prone to overfitting like a deep decision tree

- Bagging can be applied to any procedure, but does well when fit model is a tree
- Large trees can incorporate info from many predictors, but variability can be high, which is helped by bagging
- Further, because average of trees doesn’t have to be a tree, bagging can also increase expressivity of model
- In practice, because data is shared between bags, tree fit within each tends to be pretty similar
- To increase expressivity of the ensemble as a whole, force models to be different by only using a *random subset of features*
- In each bag, sample \(m<d_x\) random elements of \((x_{1,i},x_{2,i},\ldots,x_{d_x,i})\) and fit tree using only those features
- Typically use \(m\approx \frac{d_x}{3}\) for regression, \(m\approx \sqrt{d_x}\) for classification

- Since “top” features may be removed, forced to find best prediction with other features instead
- Each individual tree predicts less well, but with \(B\) large, the **combination** accounts for all features by including them in the average
- Final predictor is based on a random collection of trees, and so is called a **random forest**
- Weights \(w_i(x)\) in a random forest are the average over bags of the weights in individual trees, and the prediction of \(E[Y|X=x]\) is \(\sum_i w_i(x)Y_i\)
- A **quantile random forest** (Meinshausen and Ridgeway (2006), R library `quantregForest`) is a random forest applied to \(1\{Y_i\leq u\}:\ u\in\mathcal{U}\)
- A **distributional random forest** (Ćevid et al. (2021), R library `drf`) computes the weights based on a test of identical distributions applicable to high dimensional data
- These can then be applied to means, kernels, and other outcomes

- Average conditional CDF \(F_{Y^x}(u)\) identifiable under conditional random assignment as \(\int E[1\{Y<u\}|X=x,Z]dP(z)\)

- When applying the adjustment formula result, one can apply averaged conditional mean estimation, inverse propensity weighting, or augmented IPW doubly robust approaches
- For distribution or quantile regression as conditional mean estimator, results for averaged conditional mean estimation are in Chernozhukov, Fernández-Val, and Melly (2013)
- IPW estimation proposed by DiNardo, Fortin, and Lemieux (1996), though without limit theory
- For AIPW approach, including cross-fitting to allow high dimensional predictors in the conditional mean or propensity estimates, uniformity results are provided in Belloni et al. (2017)

- Note that while distribution of \(Z\) averaged over, \(u\) is not, so the difference from mean estimation case is that estimation and inference are for a function, not just a scalar
- Statistical efficiency depends on complexity of function class \(\{G(Y,u): u\in\mathcal{U}\}\)
- This is still small, essentially parametric, for applications like CDF or quantiles of outcome distribution
- With averaged mean or IPW estimator, \(\sqrt{n}\) rates go through if nuisance estimators also parametric and low complexity
- If we use a Neyman orthogonal cross-fit estimator like AIPW, avoid *additional* dependence on complexity of nuisance functions even if the conditioning set is high dimensional and nuisances estimated by something like, e.g., quantile random forests or L1-penalized distribution regression
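A sketch of the IPW version of the adjustment formula on simulated confounded data (all names hypothetical): weight the treated indicators \(1\{Y\leq u\}\) by the inverse estimated propensity score.

```r
# IPW estimate of the counterfactual CDF F_{Y^1}(u) under confounding
set.seed(10)
n <- 20000
z <- rnorm(n)                         # confounder
x <- rbinom(n, 1, plogis(z))          # treatment depends on z
y <- z + x + rnorm(n)                 # so Y^1 = z + 1 + noise ~ N(1, sqrt(2)^2)
phat <- fitted(glm(x ~ z, family = binomial()))   # estimated propensity score
F1_ipw <- function(u) mean(x * (y <= u) / phat)   # IPW average of 1{Y <= u}
c(ipw = F1_ipw(1), naive = mean(y[x == 1] <= 1),
  truth = pnorm(1, 1, sqrt(2)))
```

The naive empirical CDF among the treated is biased by confounding, while the IPW average recovers the counterfactual CDF.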

- Under reasonable conditions, the AIPW estimate is asymptotically Gaussian over the space of the process, and distributions can be estimated by a bootstrap procedure
- Can test hypotheses about a function with Gaussian limit distribution by comparing maximum (Kolmogorov-Smirnov) or average (Cramer-von Mises) difference to null hypothesis function.

- Where this standard limit behavior is not the case is when we estimate outcome by density instead of CDF or quantiles, as density is not estimable at parametric rates
- DiNardo, Fortin, and Lemieux (1996) applies IPW to kernel density estimate of outcome (wage) distribution
- Kennedy, Balakrishnan, and Wasserman (2021) estimate finite dimensional projections of densities, which can be made robust since essentially parametric

- Precisely what can be said about when averaged density estimation can do better than plug-in still appears to be an open question

- Breza, Kaur, and Shamdasani (2021) experimentally vary (residual) labor supply in 60 villages by randomizing temporary hiring of large number of workers
- Average wage of remaining workers rises if shock is in peak agricultural season, but not substantially in off-peak season, a fact the authors attribute to surplus labor (per Arthur Lewis)
- Heterogeneity is of interest: how does wage distribution respond to shock?
- Simplest approach: conditional empirical CDF (valid with discrete treatment and predictors only)

```
#Libraries
library(quantreg) #quantile regression
library(quantregForest) #quantile random forests
library(ggplot2) #Plots
library(drf) #distributional random forests
library(Counterfactual) #Conditional counterfactual distributions
library(npcausal) #Conditional counterfactual distributions
library(haven) #Stata data
library(dplyr) #data manipulation
#Data set from Breza Kaur and Shamdasani (2021)
#In my case this is saved at
rationing<-read_dta("Data/rationing_worker.dta")
spilloversample<-dplyr::filter(rationing,surv_el==1 & sample_spillover==1 & e_manwage==1)
spilloversample$treated<-factor(spilloversample$treat)
spilloversample$peakseason<-factor(spilloversample$peak, labels=c("Off-Peak Season","Peak Season"))
ggplot(spilloversample)+facet_wrap(~peakseason)+
stat_ecdf(aes(x=dailywagetot,color=treated))+
ggtitle("Wage Distribution Response to Labor Supply Shock: Conditional CDF",
subtitle="Breza, Kaur, and Shamdasani (2021)")+
ylab("Fraction of Workers")+xlab("Daily Total Wage (Rupees)")
```

- Alternately represent by conditional kernel density, which applies smoothing and may improve interpretation, but is less precisely estimated due to slower rate of density vs CDF estimation

```
ggplot(spilloversample)+facet_wrap(~peakseason)+
geom_density(aes(x=dailywagetot,fill=treated,color=treated),alpha=0.5)+
ggtitle("Wage Distribution Response to Labor Supply Shock: Conditional Kernel Density",
subtitle="Breza, Kaur, and Shamdasani (2021)")+
ylab("Probability Density of Workers")+xlab("Daily Total Wage (Rupees)")
```