- An experiment is *the* paradigmatic example of a manipulation that will be described as “causal”
- The benchmark via which we define and assess causality in other settings

- Corresponds to deliberate manipulation of some variable in a system to set it to a value
- We call the manipulated variable the **treatment**: I will denote it \(X\)

- Goal is to measure results in some other variable(s) that result from this manipulation
- Call measured variables the **outcome** or **response**: I will denote it \(Y\)

- System will often have other variables measured before assigning the treatment that correspond to attributes of the units
- Call them (baseline or pre-treatment) **covariates** or **controls**: I will denote them \(Z\)
- Precise meaning clearer in structural equation model: “before” means not a descendant of \(X\) (no directed path from \(X\) to \(Z\))

- Notation varies: some authors replace \((Y,X,Z)\) with \((Y,A,X)\), \((Y,W,X)\), or \((Y,A,L)\), or denote the treatment by \(T\) or \(D\), or …
- In an experiment, \(X\) is determined by an **assignment mechanism** or **design**
- A function which replaces \(f_x()\), the mechanism that determines \(X\) without deliberate manipulation
- Typically contains some source of (exogenous) random variation, in which case it is a “randomized experiment”

- Typical target in experiment is a causal effect of the treatment: a functional of the potential outcomes distributions
- In a model defined by (finite or infinite dimensional) parameters \(\Theta\) (generically, the structural functions \(f_1,\ldots,f_p\) and the distribution of the noise \(F_{U_1,\ldots,U_p}\)), which in turn index the set of distributions of all observed and counterfactual quantities, \(\mathcal{P}_\Theta:=\{P_\theta(Y,\{Y^x\}_{x\in\mathcal{X}},X,Z):\ \theta\in\Theta\}\), a **functional** is a map \(\psi(\cdot):\mathcal{P}_\Theta\to\mathbb{R}\)
- We will drop reference to \(\theta\) when \(\theta=\theta_0\), the “true” structure assumed to generate the model, and to which distribution we are using when clear from context (and often when not clear)

- Example: **Average Treatment Effect** (ATE) (at some values of \(X\))
- \(E[Y^{X=x_1}-Y^{X=x_0}]\) if \(X\) set to discrete values. If there are only 2, w.l.o.g. we set \(x_1=1\), \(x_0=0\)
- \(\frac{d}{dx}E[Y^{X=x}]\) if \(X\) continuous

- Sometimes we are interested in conditioning on a subgroup, e.g.
- \(E[Y^{X=x_1}-Y^{X=x_0}|Z=z]\) or \(\frac{d}{dx}E[Y^{X=x}|Z=z]\) is the **Conditional Average Treatment Effect** (CATE) at \(z\)

- Features other than mean: \(Pr(Y^x<y)=E[1\{Y^x<y\}]\) CDF of potential outcomes, quantiles, higher moments, etc.
- These features generally do not fully describe entire structural model
- If \(Y_1 = f_1(Y_2,Y_3,Y_4,...,U_1)\), ATE given \(do(Y_2=x)\) is \(\frac{d}{dx}E[f_1(x,Y_3^x,Y_4^x,...,U_1)]\)

- We say a feature \(\psi\) of a model is **identified** if for any \(\theta,\theta^\prime\) such that \(\psi(\theta^\prime)\neq\psi(\theta)\) the marginal distributions of *observed* variables (here \((Y,X,Z)\)), \(P_\theta(Y,X,Z)\) and \(P_{\theta^\prime}(Y,X,Z)\), are distinct
- The feature can then be written as a function of the observable distribution alone, \(\psi(P(Y,X,Z))\), at which point we can omit reference to \(\theta\) entirely, as is standard in the literature

- In SCM, if \(X=f(U_x)\) with \(U_x\) independent of all other noise terms, then \(P(Y|do(X=x))=P(Y|X=x)\) for any \(x\) in support of distribution
- Similarly \(P(Y|do(X=x),Z=z)=P(Y|X=x,Z=z)\) in joint support of \(X,Z\) (*Not* \(P(Y|do(X=x,Z=z))\): CATE is not a causal effect of \(Z\))

- In potential outcomes notation, random assignment condition is \(Y^{X=x}\perp X\) for all \(x\) in support of distribution
- \(E[g(Y)|X=x]=E[g(Y^x)|X=x]=E[g(Y^x)]\) by independence, for any function \(g\)
- By applying to class of \(g\) generating distribution, learn entire distribution

- Experiments don’t identify:
- Features of the joint distribution of \(Y^{x_1},Y^{x_0}\)
- Includes counterfactuals which condition potential outcomes on other potential outcomes

- Individual-level effects, e.g. **Individual Treatment Effect** \(Y^1_i-Y^0_i\) for unit \(i\)
- May learn CATE for narrow subgroup, but observationally identical individuals may have heterogeneous ITEs

- Identification concept itself is proxy for estimability, though these are not always the same
- Distribution of observables can be approximated, often arbitrarily well in the limit of a large random sample
- Observed sample may or may not be large or random or have obvious sense in which limit can be taken
- Further technical concerns about continuity with respect to approximation can come up (hold this thought until instrumental variables lecture)

- Saw before that \(E[Y|X=1]-E[Y|X=0]=E[Y^1-Y^0|X=1]+E[Y^0|X=1]-E[Y^0|X=0]\): ATT (first term) + Selection (last two terms)
- Could, e.g., consider plausible size of selection effect and report results for each. “Sensitivity analysis”
- Reasonable if you have some sense of reasons for deviation from random assignment
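
The decomposition into ATT plus selection can be checked numerically. The sketch below uses a made-up data-generating process (the trait `c`, the effect sizes, and the selection rule are all assumptions for illustration, not taken from any application) in which units with higher untreated outcomes are more likely to take treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP: an unobserved trait c raises both the untreated
# outcome and the probability of taking treatment.
c = rng.normal(size=n)
y0 = c + rng.normal(size=n)
y1 = y0 + 1.0 + 0.5 * c                        # heterogeneous effect, ATE = 1
x = (c + rng.normal(size=n) > 0).astype(int)   # self-selected treatment
y = np.where(x == 1, y1, y0)                   # consistency: one outcome observed

naive = y[x == 1].mean() - y[x == 0].mean()
att = (y1 - y0)[x == 1].mean()
selection = y0[x == 1].mean() - y0[x == 0].mean()
ate = (y1 - y0).mean()

print(f"naive difference in means: {naive:.3f}")
print(f"ATT + selection          : {att + selection:.3f}")
print(f"ATE                      : {ate:.3f}")
```

The first two printed numbers agree by construction, since the identity holds in any sample; neither matches the ATE here because assignment is not random. A sensitivity analysis would report the implied effect over a range of assumed values of `selection`, which is not observable in real data.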

- Could write down an economically plausible model, with agent decisions, all the mechanisms you can think of, with standard functional forms, unknown parameters estimated from data
- Suppose that given the structure of this model, estimated with data on observables, there is a unique value of treatment effect
- Then the model must restrict some features of unobservable potential outcomes
- How justifiable are those restrictions? Answer can’t be based on data alone

- Could someone modify the model and change the result without changing the data distribution?
- Common case: add unobserved common causes to any case where there is a directed arrow
- Either \(X\) causes \(Y\), or \(X\) and \(C\) both cause \(Y\) and \(C\) causes \(X\). \(C\) is a “*confounder*”

- When can we rule out this kind of confounding?
- \(X\) determined by known and sufficiently rigid institutional rule (Maybe encoded in software, regulation, or law)
- Physical restrictions on influence of unknown variables
- Strong behavioral invariance justified by theory and intervention

- All of above are strong, and not ensured by “I don’t know what else would be there”
- Need to be sure rule is in fact followed to the letter
- Theoretical invariances surprisingly often contradicted (e.g. Modigliani-Miller on debt vs equity, standard tax incidence theory on nominal tax liability,…)

Suppose we know nothing about assignment mechanism: what can we say?

Suppose all we know is \(Y^x\in[0,1]\) (w.l.o.g.: with bounds [\(\underline{Y}\), \(\bar{Y}\)] replace \(Y\) by \(\frac{Y-\underline{Y}}{\bar{Y}-\underline{Y}}\))

- Finite bounds needed as otherwise effect could be arbitrarily large or small

**Theorem**: \[\begin{gather*} E[Y^1-Y^0]\in[\left\{E[Y|X=1]P(X=1)-E[Y|X=0](1-P(X=1))\right\}-P(X=1), \\ \left\{E[Y|X=1]P(X=1)-E[Y|X=0](1-P(X=1))\right\}+(1-P(X=1))] \end{gather*}\]

**Corollary**: Width of possible interval learnable from data is 1 (as opposed to 2 without data) and is \([0,1]\) at largest, \([-1,0]\) at smallest, so worst case interval always contains 0.

**Proof**: \(E[Y^1-Y^0]=E[Y^1X+Y^1(1-X)]-E[Y^0X+Y^0(1-X)]\) \(=E[Y^1X]-E[Y^0(1-X)]+E[Y^1(1-X)]-E[Y^0X]\)

Have \(E[Y^1X]= E[Y|X=1]P(X=1)\), \(E[Y^0(1-X)]=E[Y|X=0](1-P(X=1))\)

Largest possible effect when \(Y^1=1\) when \(X=0\) and \(Y^0=0\) when \(X=1\), so \(E[Y^1(1-X)]-E[Y^0X]=1-P(X=1)\)

Smallest possible effect when \(Y^1=0\) when \(X=0\) and \(Y^0=1\) when \(X=1\), so \(E[Y^1(1-X)]-E[Y^0X]=-P(X=1)\)

Upper and lower intervals in corollary follow when \(E[Y|X=1]=1, E[Y|X=0]=0\) vs \(E[Y|X=1]=0, E[Y|X=0]=1\), respectively
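
The bounds in the theorem depend only on quantities that can be estimated from data, so they are easy to compute. A minimal sketch, assuming outcomes already rescaled to \([0,1]\), a binary treatment, and at least one unit in each arm (the function name and the toy data are illustrative):

```python
import numpy as np

def worst_case_ate_bounds(y, x):
    """Worst-case bounds on E[Y^1 - Y^0] for y in [0, 1] and binary x.

    Plug-in version of the theorem: the identified part is
    E[Y|X=1]P(X=1) - E[Y|X=0]P(X=0); the unobserved counterfactual
    means are replaced by their extreme values 0 and 1.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=int)
    p1 = x.mean()
    identified = y[x == 1].mean() * p1 - y[x == 0].mean() * (1 - p1)
    return identified - p1, identified + (1 - p1)

# Toy data: treated mean 0.75, control mean 0.25, half the sample treated
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
x = np.array([1, 1, 1, 1, 0, 0, 0, 0])
lower, upper = worst_case_ate_bounds(y, x)
print(lower, upper)
```

For this toy sample the bounds are \([-0.25, 0.75]\): width 1 and containing 0, as the corollary says, even though the naive difference in means is 0.5.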

- You need theory to even get the sign right, at least in worst case.
- Why consider worst case?
- Persuade an adversarial audience: if you don’t, your referee or seminar audience will be happy to do it for you
- If you can add a minimal set of assumptions where worst case is informative, can just report that, without defending auxiliary parts of the model that your audience may dispute

- Anecdote: As an aspiring development economist, I used to attend seminars where Abhijit Banerjee, visiting that year, would sit in the audience. For every paper presented, he would wait until about halfway through and results and model presented, then propose some feature or behavioral pattern not measured in the data but highly plausible based on theory and institutional background, which he usually knew better than the presenter, which was consistent with all the data presented so far but would reverse the sign of the policy effect, and usually convince the audience that the speaker had their conclusions exactly backward.
- Lesson: The fact that you wrote down a model which matches observed data with a particular sign of effect doesn’t mean you have measured the effect. The fact that you can’t rationalize the opposite will mostly be taken as a sign that you are not as clever as Abhijit Banerjee.

- Try out a policy: Does it work? “Program evaluation”
- Estimation of treatment effect of a policy most relevant when plan is to actually implement that policy
- In program evaluation, end goal is often decision to implement or not
- Average causal effect is right thing to estimate if \(Y\) is measure of outcome relevant to goal of program, welfare is measured by average \(Y\) and population over which experiment is run is representative of population to which program applied

- Test a theory: Does an effect exist?
- Ideal experiment directly discriminates between key implication of theory and some alternatives

- Ideal experiment directly discriminates between key implication of theory and some alternatives
- Learn about new pathways: See something that wasn’t supposed to happen
- May want to collect extensive set of outcome measures to re-evaluate and form new theories

- Learn a structural parameter: How big is that elasticity?
- Using direct variation isolates measurement from other sources of variation and possibility of misspecification of auxiliary model components
- E.g., if studying effect of a policy, may need model of political economy to account for selection into implementation

| Paper | Topic | Theoretical Motivation | Intervention | Outcome |
|---|---|---|---|---|
| Kremer and Glennerster (2011) | Demand | Law of demand | price of health goods | quantity \(\downarrow\) |
| Jensen and Miller (2008) | Demand | Giffen goods | coupons for grain by income | quantity \(\uparrow\) |
| Caunedo and Kala (2021) | Industrialization | Lewis dual sector models | Subsidize tractor rental | non-farm labor supply \(\uparrow\) |
| Crépon et al. (2013) | Labor Search | Matching function | job search assistance by city | congestion \(\uparrow\) |
| Breza, Kaur, and Shamdasani (2018) | Wage Rigidity | Bewley (1999) | unequal wages within teams | productivity \(\downarrow\) |
| Balboni et al. (2021) | Poverty trap | inflection point in returns | give people asset (cows) | inflection point found |

- Suppose binary treatment assigned randomly: \((Y^0,Y^1) \perp X\)
- Difference in conditional expectations then equals ATE
- Further suppose we have a sample from the model of size \(n\), \((Y_i,X_i)_{i=1}^{n}\) with \(n_0\) units with \(X_i=0\) and \(n_1\) with \(X_i=1\)
- Maintain that potential outcomes \((Y^0_i,Y^1_i)\) satisfy causal consistency \(Y_i=X_iY^1_i+(1-X_i)Y_i^0\)

- Simplest estimator is \[\frac{1}{n_1}\sum_{i=1}^{n}Y_i 1\{X_i=1\}-\frac{1}{n_0}\sum_{i=1}^{n}Y_i 1\{X_i=0\}\]
- Difference in means is (conditionally on \(X\)) unbiased for and under a LLN consistently estimates ATE
- In small samples, estimate not exact
- May have drawn sample where unobserved variables differ between treatment and control groups

- Write potential outcomes model in more familiar form \[Y_i=Y_i^0+(Y_i^1-Y_i^0)X_i\]
- Define \(\beta_{0,i}=Y_i^0\), \(\beta_{1,i}=Y_i^1-Y_i^0\), then \[Y_i=\beta_{0,i}+\beta_{1,i}X_i\]
- Slope is treatment effect, intercept is value if not treated
- Result is a linear model with *random coefficients*
- Like linear model, but slope terms no longer constant

- Taking averages, can write as
- \(\beta_{0,i}:=\bar{\beta}_0+e_{0i}\)
- \(\beta_{1,i}:=\bar{\beta}_1+e_{1i}\)
- \(E[e_{0i}]=E[e_{1i}]=0\)

- Random coefficients model becomes \[Y_i=\bar{\beta}_0+\bar{\beta}_1X_i+e_{0i}+X_ie_{1i}\]
- A standard linear model with heteroskedastic errors
- Slope coefficient \(\bar{\beta}_1\) is ATE
- *Endogeneity*: under nonrandom assignment, residual may be correlated with \(X_i\)

- If X assigned randomly
- \(X_i\perp e_{0i}\) No selection bias
- \(X_i\perp e_{1i}\) Treatment effect independent of treatment assignment

- \(\hat{\beta}_1\) OLS estimator same as difference in means
- Heteroskedasticity has meaningful interpretation
- Variance of residual \(e_{0i}+X_ie_{1i}\) depends on \(X\) so long as \(e_{1i}\neq 0\)
- “Heterogeneous treatment effects”

- Model looks exactly like linear regression with heteroskedasticity
- Suggests OLS with robust standard errors gives valid inference on ATE
- Equivalent to two-sample t-test on difference in means with unequal variances

- While that’s usually not a bad choice, there is a little bit of subtlety here
- What are the problems?
- Choice of hypothesis: sharp vs weak null, conditional on data or not
- Finite sample properties
- Experimental design considerations mean \(X\) chosen independent of \(Y\) but often *not* i.i.d.

- Use of White robust SEs valid under iid sampling, with asymptotically correct coverage with respect to distribution of Ys and Xs, for (weak) null hypothesis that ATE=0.
- Finite sample properties could be improved: use modified degree of freedom adjustment (HC2 or HC3) instead of version without (HC0) or using number of regressors (HC1)
- Or bootstrap…
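
The link between OLS with robust standard errors and the two-sample comparison of means can be verified directly. The sketch below uses simulated data with made-up parameters and plain NumPy (no regression library is assumed). It computes the OLS slope and its HC0 and HC2 sandwich variances by hand and compares them with the difference in means and the unequal-variance estimate \(s_1^2/n_1+s_0^2/n_0\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.permutation(np.repeat([0, 1], [120, 80]))   # 80 treated, 120 control
# Hypothetical outcomes: effect 2, larger noise among the treated
y = 1.0 + 2.0 * x + rng.normal(scale=np.where(x == 1, 3.0, 1.0))

# Difference in means and unequal-variance estimate of its variance
y1, y0 = y[x == 1], y[x == 0]
diff = y1.mean() - y0.mean()
var_unequal = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size

# OLS of y on a constant and x
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of hat matrix

def sandwich(weights):
    """Heteroskedasticity-robust covariance with per-observation weights."""
    meat = (X * (weights * resid**2)[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv

var_hc0 = sandwich(np.ones(n))[1, 1]
var_hc2 = sandwich(1.0 / (1.0 - leverage))[1, 1]

print(f"slope {beta[1]:.4f}  vs difference in means {diff:.4f}")
print(f"HC2 variance {var_hc2:.5f} vs unequal-variance formula {var_unequal:.5f}")
print(f"HC0 variance {var_hc0:.5f}")
```

With a single binary regressor the HC2 variance coincides with the unequal-variance formula, while HC0 is slightly smaller, which is one way to see why the leverage-adjusted versions are preferred in small samples.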

- Alternative Method: randomization test

- For \(j=1...J\), draw \(\{X^j_i\}_{i=1}^{n}\) from known assignment mechanism (assumed independent of \((Y_i^1,Y_i^0)_{i=1}^{n}\))
- Compute \(\{\widehat{\beta}_1^{j}\}_{j=1}^{J}\) by computing difference in means as if \(\{X^j_i\}_{i=1}^{n}\) had been the true realization of \(X\)
- Calculate \(p=\frac{1}{J}\sum_{j=1}^{J} 1\{|\widehat{\beta}_1^j|\geq|\widehat{\beta}_1|\}\), the share of draws at least as extreme as the observed estimate
- If \(p<\alpha\) reject sharp null
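
The steps above can be sketched in a few lines. This version assumes the design was complete randomization with a fixed number treated, so that redrawing from the assignment mechanism amounts to permuting the observed treatment vector; other designs would need their own redraw step. The p-value is the share of redrawn assignments whose absolute difference in means is at least as large as the observed one. Function names and the simulated data are illustrative.

```python
import numpy as np

def diff_in_means(y, x):
    return y[x == 1].mean() - y[x == 0].mean()

def randomization_test(y, x, n_draws=5000, seed=0):
    """Two-sided randomization p-value for the sharp null of no effect.

    Assumes complete randomization: each redraw permutes x, holding the
    observed outcomes fixed (valid because, under the sharp null,
    Y_i does not depend on the assignment).
    """
    rng = np.random.default_rng(seed)
    observed = diff_in_means(y, x)
    draws = np.array([diff_in_means(y, rng.permutation(x))
                      for _ in range(n_draws)])
    p_value = np.mean(np.abs(draws) >= np.abs(observed))
    return observed, p_value

# Simulated example: 50 treated, 50 control
rng = np.random.default_rng(42)
x = rng.permutation(np.repeat([0, 1], [50, 50]))
noise = rng.normal(size=100)
est_effect, p_effect = randomization_test(2.0 * x + noise, x)   # true effect 2
est_null, p_null = randomization_test(noise, x)                 # sharp null true

print(f"with effect: estimate {est_effect:.3f}, p = {p_effect:.4f}")
print(f"under null : estimate {est_null:.3f}, p = {p_null:.4f}")
```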

- Tests (“Fisher”) sharp null: \(H_0:\) \(Y_i^1=Y_i^0\) for all \(i\) as opposed to (“Neyman”) weak null \(H_0:\) \(E[Y_i^1-Y_i^0]=0\)
- No heterogeneity in treatment effects

- Has probability of rejection exactly the nominal size \(\alpha\) with respect to \(P(\{X_i\}_{i=1}^{n}|\{Y_i^0,Y_i^1\}_{i=1}^{n})\)
- Distribution of treatment *holding fixed potential outcomes*
- Under random assignment

- “Design-based” paradigm: measure uncertainty in treatment assignment only, not error in extrapolation to new units
- Useful method for exactly measuring uncertainty due to random assignment
- Not directly comparable to ATE mean inference
- Power properties also not directly ranked: see Ding (2017)

- Common to collect covariates \(Z\) even with random assignment of \(X\) independent of \(Z\)
- How can/should we use them?

- Table 1 of most experimental papers reports **covariate balance**
- Mean of \(Z_j\) in \(X=1,X=0\) groups
- Under random \(X\), means of \(Z\) distributions should on average be similar
- If not, might be sign of fault in random number generator: may worry that unobservables also not balanced
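
A balance table is simple to produce. The sketch below reports the mean of each covariate by arm and a standardized difference (difference in means divided by the pooled standard deviation, one common convention and an assumption here). The covariate names and distributions are hypothetical.

```python
import numpy as np

def balance_table(Z, x, names):
    """Per-covariate means by arm and standardized mean differences."""
    rows = []
    for j, name in enumerate(names):
        z1, z0 = Z[x == 1, j], Z[x == 0, j]
        pooled_sd = np.sqrt((z1.var(ddof=1) + z0.var(ddof=1)) / 2)
        rows.append((name, z1.mean(), z0.mean(),
                     (z1.mean() - z0.mean()) / pooled_sd))
    return rows

rng = np.random.default_rng(7)
n = 2000
# Hypothetical baseline covariates
Z = np.column_stack([rng.normal(40, 10, size=n),     # "age"
                     rng.lognormal(10, 1, size=n)])  # "income"
x = rng.permutation(np.repeat([0, 1], n // 2))       # assigned independently of Z

table = balance_table(Z, x, ["age", "income"])
for name, m1, m0, std_diff in table:
    print(f"{name:>8}: treated {m1:12.2f}  control {m0:12.2f}  std. diff {std_diff:+.3f}")
```

Under random assignment the standardized differences should be close to zero, with sampling noise of order \(1/\sqrt{n}\).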

- Difference in means remains unbiased, consistent, simple, when \(Z\) present, but may be inefficient
- Can do instead OLS with added covariates (1) or with covariates also interacted with treatment (2)
\[(1)\quad Y_i=\beta_0+\beta_1 X_i+Z_i^\prime\gamma+u_i\]
\[(2)\quad Y_i=\beta_0+\beta_1 X_i+Z_i^\prime\gamma+X_i(Z_i-\bar{Z})^\prime\delta+u_i\]

- Amazingly, \(\widehat{\beta}_1\) remains consistent even with misspecification in both settings (Imbens and Rubin (2015) Ch7)

**Proof** for case (1)

- Limit objective is \(E[(Y_i-\beta_0-\beta_1 X_i-Z_i^\prime\gamma)^2]\)
- Equivalent to \(E[(Y_i-\tilde{\beta}_0-\beta_1 X_i-(Z_i-E[Z_i])^\prime\gamma)^2]\) (with \(\tilde{\beta}_0=\beta_0+E[Z_i]^{\prime}\gamma\)) \[= E[(Y_i-\tilde{\beta}_0-\beta_1 X_i)^2]+E[((Z_i-E[Z_i])^\prime\gamma)^2]-2E[(Y_i-\tilde{\beta}_0-\beta_1 X_i)(Z_i-E[Z_i])^\prime\gamma]\]
- Last term equals, by independence of \(X\) and \(Z\), \[-2E[Y_i(Z_i-E[Z_i])^\prime\gamma]+2E[\tilde{\beta}_0+\beta_1 X_i]E[(Z_i-E[Z_i])^\prime\gamma]\] \[=-2E[Y_i(Z_i-E[Z_i])^\prime\gamma]\]
- Total sum equals \(E[(Y_i-\tilde{\beta}_0-\beta_1 X_i)^2]\) plus terms not dependent on \(\beta\)
- So arg min with respect to \(\beta_1\) is exactly as for univariate regression, which is the ATE! **QED**
- Proof for case (2) similar.
- Reason this works is that OLS is first example we will see of a *multiply robust* estimator
- Partitioned regression formula: \(\widehat{\beta}_1=(\tilde{X}^\prime\tilde{X})^{-1}\tilde{X}^{\prime}Y\) where \(\tilde{X}=(I-P_Z)X\) is the residual from regression of \(X\) on \(Z\) and a constant
- Consistent with covariates that correctly model probability of treatment *or* that correctly specify conditional expectation function
- In RCT, treatment’s best predictor is a constant, so accounted for by constant term in OLS

- Variance formulas show asymptotic efficiency gains, with variance never larger
- Variances replaced by residual variances after prediction with \(Z\)
- If \(Z\) predictive of outcome, can increase precision to include it
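
A small Monte Carlo illustrates both claims: robustness to misspecification and the precision gain. In this sketch, with an invented data-generating process, the outcome depends on \(Z\) through \(\exp(Z)\), so the linear adjustment in (1) is misspecified; treatment is completely randomized and the true ATE is 1.

```python
import numpy as np

rng = np.random.default_rng(3)

def one_experiment(n=200):
    z = rng.normal(size=n)
    x = rng.permutation(np.repeat([0, 1], n // 2))   # randomized, independent of z
    # Hypothetical outcome: nonlinear in z, so a linear control is misspecified
    y = 1.0 * x + np.exp(z) + rng.normal(size=n)
    unadjusted = y[x == 1].mean() - y[x == 0].mean()
    X = np.column_stack([np.ones(n), x, z - z.mean()])
    adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]   # coefficient on x
    return unadjusted, adjusted

estimates = np.array([one_experiment() for _ in range(2000)])
mean_unadj, mean_adj = estimates.mean(axis=0)
sd_unadj, sd_adj = estimates.std(axis=0, ddof=1)

print(f"unadjusted: mean {mean_unadj:.3f}, sd {sd_unadj:.3f}")
print(f"adjusted  : mean {mean_adj:.3f}, sd {sd_adj:.3f}")
```

Both estimators are centered near the true effect, and the adjusted one has a visibly smaller spread even though the regression function is wrong, because the linear term still absorbs part of the outcome variation.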

- Some finite sample bias (\(O(\frac{1}{n})\)) remains unless correctly specified (e.g. saturated interactions with discrete covariates)
- In very small samples, without highly predictive covariates, may prefer not to adjust

- May want to model outcome with nonlinear estimators like logit, probit, etc
- No multiple robustness property: need to correctly specify conditional distribution
- Precision advantages if correctly specified, but otherwise major bias, even in RCT

- If target is some more general parameter, as in a structural model, need to consider carefully how experiment influences outcome
- What parameters change if results do? Will depend on both your model and estimation method
- If ultimate goal is downstream, like policy recommendation, maybe treat this as estimand of interest

- Before running experiment, should make sure that changing experimental results changes conclusions in model
- Otherwise, model and experiment are poor fit
- Maybe you need to experiment on different measure that theory says is actually discriminative
- Maybe you need to change your model to allow sources of variation seen in the data or use a more robust estimation method

- Model simulation is a straightforward way to assess this
- Forward simulate from your model under different parameter values to see how experimental measures change
- Apply your estimation method to simulated data from experiments to see if parameters can be recovered
- Andrews, Gentzkow, Shapiro provide quantitative sensitivity measures

- One of the biggest strengths of experimental methods is that you get to choose many features of data to inform results
- Biggest question is which experiment to run: what intervention will best answer substantive question
- If you have a model, consider what will happen if your theory is right, and if it’s wrong.
- What will it look like in both cases? Simulate your analysis under both.

- For testing, above procedure leads to *power analysis*
- Want experiment features so that if \(X=1\) is better, have high probability of choosing that, and if \(X=0\) better, high probability of choosing that
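
Power can be approximated by simulating the planned analysis many times. A minimal sketch, assuming normally distributed outcomes with a common standard deviation, equal arm sizes, and a two-sided test of the difference in means against the normal critical value 1.96 (all simplifying assumptions, as are the effect size and sample sizes tried):

```python
import numpy as np

def simulated_power(effect, n_per_arm, sd=1.0, n_sims=2000, crit=1.96, seed=0):
    """Share of simulated experiments in which the null of no effect is rejected."""
    rng = np.random.default_rng(seed)
    y1 = rng.normal(effect, sd, size=(n_sims, n_per_arm))
    y0 = rng.normal(0.0, sd, size=(n_sims, n_per_arm))
    diff = y1.mean(axis=1) - y0.mean(axis=1)
    se = np.sqrt(y1.var(axis=1, ddof=1) / n_per_arm
                 + y0.var(axis=1, ddof=1) / n_per_arm)
    return np.mean(np.abs(diff / se) > crit)

# Power to detect a 0.2 standard deviation effect at several sample sizes
power = {n: simulated_power(0.2, n) for n in (25, 100, 400)}
size_under_null = simulated_power(0.0, 100)

for n, pw in power.items():
    print(f"n per arm = {n:4d}: power = {pw:.2f}")
print(f"rejection rate when effect is zero: {size_under_null:.3f}")
```

Replacing the data-generating step with a forward simulation from a structural model, and the test with the planned estimator, gives the model-based version of the same exercise.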

- Tools like DeclareDesign (Blair et al. (2019)) let you simulate through many of the statistical aspects of these choices
- Realistic analysis of power for different choices will quickly show you why the binary or sometimes ternary experiment remains overwhelmingly the most common
- To distinguish reliably between more options needs much more data
- Fitting complicated models likely needs more data than you think: even an interaction term may require massive increase in sample size

- Things to consider:
- Which variables to intervene on
- Number of levels of intervention
- Sample size in each level
- Randomization strategy

- Criteria to assess your design
- How precise are your estimates?
- Can you distinguish between theories with your sample size?
- Cost
- Feasibility
- *Ethics*

- Structure will depend on your budget and question of interest
- Web experiments can be fast, easy, and so are ripe for sophisticated methods
- Field experiments can be expensive, slow, rife with implementation problems, so strive for simplicity and advance planning
- Lab experiments can be in between

- As an experimenter, you have a choice of distribution
- Allow independence, but also aim for efficiency of analysis

- Simplest random assignment rule is \(X_i\overset{iid}{\sim}\text{Bernoulli}(p)\)
- But poor properties: may assign most or all observations to one of treatment or control
- If all, effect not even estimable. If most, variance is huge

- An alternative that avoids this is the “completely randomized” scheme
- Fix \(n_0,n_1\), and choose uniformly among all \(\left(\begin{array}{c}n\\n_{1}\end{array}\right)\) assignments with fixed numbers
- \(X_i\) no longer independent, since drawing \(1\) reduces chance next \(X\) is \(1\)
- But any method that conditions on \(X\) (permutation test, OLS) unaffected, and versions of LLN still apply
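
The difference between the two schemes shows up most clearly in small samples. The sketch below (sample size and number of repetitions chosen for illustration) draws many assignments under each rule and tabulates the number of treated units.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n1 = 10, 5

def bernoulli_assignment(n, p=0.5):
    """Independent coin flips: the number treated is random."""
    return rng.binomial(1, p, size=n)

def complete_randomization(n, n1):
    """Uniform draw among all assignments with exactly n1 treated."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=n1, replace=False)] = 1
    return x

bern_counts = np.array([bernoulli_assignment(n).sum() for _ in range(10_000)])
comp_counts = np.array([complete_randomization(n, n1).sum() for _ in range(10_000)])

lopsided = np.mean((bern_counts <= 2) | (bern_counts >= 8))
print(f"Bernoulli: treated count ranges from {bern_counts.min()} to {bern_counts.max()}")
print(f"Bernoulli: share of draws with at most 2 or at least 8 treated = {lopsided:.3f}")
print(f"Complete : treated count always {n1}: {(comp_counts == n1).all()}")
```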

- With covariates can stratify, sampling conditionally on value of covariate
- If completely randomized within stratum (level of covariate) with identical probability in each, achieve no correlation, but avoid covariate imbalance
- Level of precision is increasing with fineness of conditioning, all the way down to randomizing within pairs
- Or, if criterion is expected ex ante MSE on average over possible distributions Kasy (2016) suggests a non-random assignment
- This is a little extreme (hard to analyze data), but similar effects achievable by re-randomizing until table passes a balance test
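
Stratified (blocked) assignment is complete randomization applied separately within each covariate level. A minimal sketch, assuming a single discrete stratifying covariate and half of each stratum treated (the stratum labels are hypothetical):

```python
import numpy as np

def stratified_assignment(strata, rng):
    """Completely randomize within each stratum, treating half of each.

    Strata with an odd number of units get one extra control unit.
    """
    strata = np.asarray(strata)
    x = np.zeros(strata.size, dtype=int)
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        treated = rng.choice(idx, size=idx.size // 2, replace=False)
        x[treated] = 1
    return x

rng = np.random.default_rng(11)
strata = np.repeat(["rural", "urban", "peri-urban"], [40, 30, 10])
x = stratified_assignment(strata, rng)

for s in np.unique(strata):
    in_s = strata == s
    print(f"{s:>10}: {x[in_s].sum()} treated of {in_s.sum()}")
```

Because the treated share is the same in every stratum, the stratifying covariate is balanced across arms by construction rather than only on average.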

- With group level units, like households or villages, there are cluster randomized versions of above
- Main difference is deciding how to assign within-cluster variation when treatment varies at both group and individual level

- In order to make plans, often better to iterate: think first, then get data
- “Mathematics is the part of [science] where experiments are cheap” - V.I. Arnol’d

- But may need to refine results: do lit review, use observational data
- In between: pilot study:
- Go there, see if intervention/measurement scheme can even be set up as planned
- Use the results for further planning: choice of sample size, treatment values, implementation details, etc

- Being on the ground and talking to people will usually turn up features of setting that you didn’t think of

- On the frontier are automated methods for updating results from experiments adaptively
- Defines a sequential protocol for how to change based on preliminary results
- “Bandit algorithms” widely used in tech to find highest value treatment in fewest number of samples
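
As one concrete example of a bandit algorithm, the sketch below implements Thompson sampling for binary outcomes with independent Beta(1, 1) priors. This is one common choice among many, not a method specified in these notes, and the arm success rates are made up. Each round draws a success probability for every arm from its posterior and assigns the next unit to the arm with the highest draw.

```python
import numpy as np

def thompson_sampling(true_means, n_rounds=5000, seed=0):
    """Beta-Bernoulli Thompson sampling; returns the number of pulls per arm."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)            # Beta(1, 1) prior on each arm
    failures = np.ones(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Sample a plausible success rate per arm, play the best-looking arm
        arm = int(np.argmax(rng.beta(successes, failures)))
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.3, 0.5, 0.7])   # hypothetical success rates
print("pulls per arm:", pulls)
```

Most units end up on the best arm, which is the point of the algorithm, but it also means the worse arms are sampled rarely and in a data-dependent way, which is the source of the inference problems for adaptively collected data.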

- Adaptivity creates challenges for ex post inference, since data no longer iid
- Gupta, Lipton, and Childers (2021) define adaptive methods when target is defined by GMM as in structural models
- Case where bandit method used is harder: “Off-policy” setting creates bias

- There are a lot of analysis and design choices going into trials, and seemingly innocuous choices can change results
- If you are trying to get significant results, often have ways to do so
- When systematic, creates bias in distribution of reported effects, usually to larger values

- One check against this is to publicly preregister your trial and precise analysis method, so decisions known to be made in advance
- Guards against ex-post specification searches, enhances credibility of results, helps convince regardless of outcome
- Does limit flexibility to discover new outcomes.
- But: you can include both pre-registered and novel analyses, clearly separated

- Abaluck et al. (2021) ran a cluster-randomized experiment on 600 villages in Bangladesh providing masks and public encouragement to wear them, measuring rates of compliance and then covid tests
- Recht (2021) asks several critical questions about results
- How is effect (test positivity) computed and averaged across units?
- Method is “a generalized linear model (GLM) with a normal family and identity link.” Is that robust for estimation? Inference?
- Claim “P-values and confidence intervals associated with a regression are valid only if the model is true. What if the model is not true? If the model is wrong, the error bars are meaningless.”

- Is Recht right in this setting? How about in general?
- In a later post, he shows in simulations that in an RCT analyzed with logistic regression, including the wrong set of covariates may give a severely biased coefficient on the treatment

- Note that “(GLM) with a normal family and identity link” is exactly OLS, with treatment indicator at the village level, and the authors computed p-values 2 ways: by randomization test (per the randomization method, which was stratified at cross-and-within village level) and by heteroskedasticity-robust standard errors clustered at the village level

- Experiments make the usually difficult identification task of causal inference easy
- But choosing an experiment to run and executing it takes many careful choices
- Analyses of experiments can be made robust

Abaluck, Jason, Laura H Kwong, Ashley Styczynski, Ashraful Haque, Md Alamgir Kabir, Ellen Bates-Jefferys, Emily Crawford, et al. 2021. “The Impact of Community Masking on COVID-19: A Cluster-Randomized Trial in Bangladesh.” J-PAL.

Balboni, Clare A, Oriana Bandiera, Robin Burgess, Maitreesh Ghatak, and Anton Heil. 2021. “Why Do People Stay Poor?” National Bureau of Economic Research.

Bewley, Truman F. 1999. *Why Wages Don’t Fall During a Recession*. Harvard University Press.

Blair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. “Declaring and Diagnosing Research Designs.” *American Political Science Review* 113: 838–59. https://declaredesign.org/paper.pdf.

Breza, Emily, Supreet Kaur, and Yogita Shamdasani. 2018. “The Morale Effects of Pay Inequality.” *The Quarterly Journal of Economics* 133 (2): 611–63.

Caunedo, Julieta, and Namrata Kala. 2021. “Mechanizing Agriculture.” National Bureau of Economic Research.

Crépon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora. 2013. “Do Labor Market Policies Have Displacement Effects? Evidence from a Clustered Randomized Experiment.” *The Quarterly Journal of Economics* 128 (2): 531–80.

Ding, Peng. 2017. “A Paradox from Randomization-Based Causal Inference.” *Statistical Science*, 331–45.

Gupta, Shantanu, Zachary C. Lipton, and David Childers. 2021. “Efficient Online Estimation of Causal Effects by Deciding What to Observe.” *NeurIPS*.

Imbens, Guido W, and Donald B Rubin. 2015. *Causal Inference in Statistics, Social, and Biomedical Sciences*. Cambridge University Press.

Jensen, Robert T, and Nolan H Miller. 2008. “Giffen Behavior and Subsistence Consumption.” *American Economic Review* 98 (4): 1553–77.

Kasy, Maximilian. 2016. “Why Experimenters Might Not Always Want to Randomize, and What They Could Do Instead.” *Political Analysis* 24 (3): 324–38.

Kremer, Michael, and Rachel Glennerster. 2011. “Improving Health in Developing Countries: Evidence from Randomized Evaluations.” In *Handbook of Health Economics*, 2:201–315. Elsevier.

Recht, Ben. 2021. “Arg Min Blog: Effect Size Is Significantly More Important Than Statistical Significance.” Stanford University. http://www.argmin.net/2021/09/13/effect-size/.