**Regression discontinuity** (Thistlethwaite and Campbell (1960)) is a conceptually simple way to leverage a common situation in real data
- Treatment assignment is not haphazard or random but instead follows a strict and simple **rule**
- Bureaucracies, institutions, and algorithms allocating a discrete resource systematically may make decisions by a well-defined *threshold*
- Admission by standardized test, election by majority vote, program eligibility by cutoff in income, assets, or age, drug recommendation by checklist (blood pressure, metabolites, etc.)

- Key to using such thresholds is the idea that they may be in some sense *arbitrary*
- The criterion in the rule is usually based on something specifically chosen to be relevant to the outcome: blood pressure and drug effectiveness, test score and school performance, vote share and politician quality
- But the exact *level* of the threshold is less meaningful

- Discrete nature of treatment (by simplicity or necessity) turns infinitesimal differences between units into large changes in treatment
- This contrast between *small* changes in unit attributes and *big* changes in treatment makes units on either side of the threshold “close to” comparable
- Requires assumptions on the outcome process, but fairly weak ones
- Just that “small” changes in the assignment variable do not lead to “big” differences between units, so that units on either side are comparable except for treatment

- Formalizing “small” and “big” is a major practical challenge in using these methods
- Applicability and form of analysis depend on quantifying these notions

- Statistical literature buries substantive assumptions about relative size and shape of effects in technical conditions and default behavior of easy-to-use software
- Not their fault: need for precision requires jargon and mathematics, and defaults are sensible choices for typical social science applications
- But you should know what you’re doing and why, and actively interrogate and test assumptions

- Goal is the causal effect of a binary treatment \(X\) on an outcome \(Y\)
- Treatment assignment determined by a deterministic rule
- A continuous variable \(R\) determines treatment \(X\): called the **running variable** or **assignment variable**
- Units with \(R\geq c\) get treatment (\(X=1\)); units with \(R<c\) don’t (\(X=0\))
- **Assignment rule**: \(X=1\{R\geq c\}\)

- Assume, as usual, **causal consistency**: \(Y=Y^1X+Y^0(1-X)\)
- By the assignment rule, \((Y^1,Y^0)\perp X|R\), trivially: conditional on \(R\), \(X\) is constant and so is independent of any other variable

- But by the deterministic rule, it is also the case that \(P(X=1|R=r)\) is 0 or 1 for all \(r\)
- We cannot use \(R\) to identify effect by adjustment: overlap fails by construction

- Additional assumption that will allow identification: **continuity** of the conditional potential outcomes as functions of \(R\) at \(c\)
- \(\underset{e\searrow 0}{\lim}E[Y^1|R=c+e]=E[Y^1|R=c]\) and \(\underset{e\searrow 0}{\lim}E[Y^0|R=c-e]=E[Y^0|R=c]\)
- Stronger versions possible, e.g. continuity of \(F(Y^1|R=r)\), \(F(Y^0|R=r)\)

**Regression discontinuity**: \(\tau^{RD}:=\underset{e\searrow 0}{\lim}E[Y|R=c+e]-\underset{e\searrow 0}{\lim}E[Y|R=c-e]\)

- Under consistency, assignment rule, and continuity of potential outcomes
- \(\tau^{RD}=E[Y^1-Y^0|R=c]\): the Local Average Treatment Effect at the threshold

**Proof**:
- \(E[Y|R=r]=E[Y^1|R=r]1\{r\geq c\}+E[Y^0|R=r]1\{r<c\}\) (consistency and assignment rule), so
- \(\tau^{RD}=\underset{e\searrow 0}{\lim}E[Y^1|R=c+e]-\underset{e\searrow 0}{\lim}E[Y^0|R=c-e]\)
- \(=E[Y^1|R=c]-E[Y^0|R=c]\) (continuity)

- Lee and Lemieux (2010) illustrate this visually

- Estimated quantity is a treatment effect only conditional on being at the cutoff: marginal units only
- May or may not be representative of effect at other levels of \(R\): patients just barely sick enough to get drug, etc
- Because the conditioning group is observable, can at least identify those units and their attributes to see how they differ from other units

- Continuity says structural equation for \(Y\) can depend on running variable, either directly or indirectly through associated confounders
- But magnitude of effects, at least locally near the cutoff, can’t jump in the way treatment does

- Bound on the degree of change in the outcome for small changes in \(R\)
- Maybe the biological mechanisms via which a metabolite is associated with disease severity involve proportionate changes

- Especially reasonable when running variable is noisy or imperfect proxy for causally meaningful effects
- Test scores proxy for academic ability, but 1 point difference mostly noise, and in any case “real” ability difference probably not large enough to be meaningful

- Unreasonable when other things change discontinuously at cutoff
- Administrative borders: lots of laws, regulations, taxes change as soon as you enter an administrative unit

- A common reason for discontinuous changes at cutoff is that existence of cutoff itself influences behavior discontinuously
- If people can control their \(R\), and they care about whether they are treated, they might do so
- To extent that preferences or ability to do this are heterogeneous, manipulation may induce differences in \(Y^x\) on different sides of the jump
- This makes discontinuities in, e.g. tax and welfare code, likely a bad fit for RD (though manipulation may itself be informative)

- If manipulation is imprecise, maybe not a problem: suppose units can only choose \(r^*\) but \(R=r^*+\epsilon\) for continuously distributed independent \(\epsilon\)
- Then \(E[Y^x|R=r]=\int\int y^xf(y^x|R=r,\epsilon=e)dy^xf_\epsilon(e)de=\int\int y^xf(y^x|r^*=u)f_\epsilon(r-u)dy^xdu\) which can be continuous in \(r\) even if \(f(y^x|r^*=u)\) is discontinuous
- E.g., if people can study for a test, but not learn answers exactly, still preserve continuity (though this does affect properties of marginal group)

- Sometimes see RD model/method analyzed and described in terms of “local randomization” (Cattaneo, Frandsen, and Titiunik (2015))
- Near cutoff, whether you are on one side or another is “effectively random” and you can treat data “as if” an experiment near the cutoff
- Formalization: \(Y^{x,r}=Y^{x}\) for \(r\in [\underline{c},\bar{c}]\) \(\underline{c}<c<\bar{c}\) and \(R\perp(Y^0,Y^1)|R\in [\underline{c},\bar{c}]\)
- In this situation, **no** effect of or association of the running variable with outcomes in a window
- This is a *much stronger* assumption than continuity
- Asks for the potential outcomes functions to be *exactly flat* near the cutoff
- Maybe plausible if the assignment rule is based on a variable *not at all* important to the outcome
- Aside from deliberate random assignment, I struggle to think of cases where applicable
- But if you do have that, `rdlocrand` implements Fisher-style tests

- Continuity is fairly weak notion formalizing small changes in \(R\) leading to small changes in \(Y^x\)
- For every \(\epsilon>0\), \(\exists \delta>0\) s.t. \(d(r,c)<\delta\implies d(E[Y^x|R=r],E[Y^x|R=c])<\epsilon\)

- So, to get a \(2\epsilon\)-accurate estimate of \(\tau^{RD}\), estimate the conditional expectation functions for \(r\) within \(\delta\) of \(c\)
- Simple implementation is to take window of width \(\delta\) and estimate expectations by conditional means
- \(\widehat{\beta}_0^{-}=\frac{\sum_{i=1}^{n}Y_i1\{c-\delta<R_i<c\}}{\sum_{i=1}^{n}1\{c-\delta<R_i<c\}}\), \(\widehat{\beta}_0^{+}=\frac{\sum_{i=1}^{n}Y_i1\{c\leq R_i<c+\delta\}}{\sum_{i=1}^{n}1\{c\leq R_i<c+\delta\}}\)
- \(\widehat{\tau}^{RD}=\widehat{\beta}^{+}_0-\widehat{\beta}^{-}_0\)
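
A minimal sketch of this window-mean estimator, in Python with numpy; the function name, window width, and simulated data are purely illustrative, not taken from any particular application:

```python
import numpy as np

def rd_local_constant(y, r, c, delta):
    """Difference in mean outcomes within a window of width delta on each side of cutoff c."""
    below = (r >= c - delta) & (r < c)   # untreated units just below the cutoff
    above = (r >= c) & (r < c + delta)   # treated units just above the cutoff
    return y[above].mean() - y[below].mean()

# Illustrative simulated data: smooth conditional mean plus a jump of 2 at c = 0
rng = np.random.default_rng(0)
r = rng.uniform(-1, 1, 5000)
y = 1 + 0.5 * r + 2 * (r >= 0) + rng.normal(0, 1, r.size)
print(rd_local_constant(y, r, c=0.0, delta=0.1))  # should land near the true jump of 2
```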

- By law of large numbers and continuity (so long as density in region \(>0\)), holding fixed \(\delta\),
- \(\widehat{\beta}_0^{-}\overset{p}{\to}\int E[Y|R=r]f(R=r|c-\delta<R<c)dr\in[E[Y^0|R=c]-\epsilon,E[Y^0|R=c]+\epsilon]\)
- Likewise for \(\widehat{\beta}_0^{+}\), so \(\text{plim }\widehat{\tau}^{RD}\) is in \(E[Y^1-Y^0|R=c]\pm2\epsilon\)

- This gets called the local constant or Nadaraya-Watson estimator
- Not bad: for any \(\epsilon\), we have estimator that converges to something \(2\epsilon\) close to true effect
- Problem 1: continuity doesn’t tell us anything about \(\delta\) we need
- Solution: need stronger conditions quantifying \(\delta\): differentiability, smoothness

- Problem 2: would like to converge to exact value
- Solution: shrink \(\delta\) with sample size so that bias disappears

- Issue of what to assume about local response of conditional mean, and based on it, how much to shrink window, is main source of difficulty

- What we actually need is an estimate of the conditional expectation function at a point
- We can get this from any consistent estimator

- OLS consistent, unbiased estimate of a conditional expectation function if correctly specified
- \(Y_i=\beta_0+\tau1\{R_i\geq c\} + \beta_1 R_i +u_i\): additive effect of treatment, linear in running variable
- \(Y_i=\beta_0+\tau1\{R_i\geq c\} + \beta_1 (R_i-c)1\{R_i< c\}+\beta_2(R_i-c)1\{R_i\geq c\} +u_i\): additive effect of treatment, linear in running variable with different slopes on each side
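
For concreteness, a hedged sketch of fitting the second (separate-slopes) specification by OLS with plain numpy; names and simulated data are illustrative:

```python
import numpy as np

def rd_parametric(y, r, c):
    """OLS of Y on an intercept, a treatment jump at c, and separate linear slopes
    in (R - c) below and above the cutoff; returns the estimated jump tau."""
    d = (r >= c).astype(float)
    X = np.column_stack([
        np.ones_like(r),      # beta_0
        d,                    # tau: discontinuity at the cutoff
        (r - c) * (1 - d),    # slope below the cutoff
        (r - c) * d,          # slope above the cutoff
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(1)
r = rng.uniform(-1, 1, 5000)
y = 1 + 0.5 * r + 2 * (r >= 0) + rng.normal(0, 1, r.size)
print(rd_parametric(y, r, c=0.0))  # near the true jump of 2 when the specification is correct
```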

- Problem with this is that if incorrectly specified, all you get is best linear predictor
- Average is with respect to integrated mean squared error over entire distribution of \(R_i\)
- Goal is estimate of \(\tau\), but estimator adjusts slope to get smaller errors in regions far away
- Helpful to use this information if global shape known, as you can extrapolate from data anywhere in sample
- But still optimizes the wrong criterion for prediction at a point

- Early attempts to solve this suggested adding polynomials to make shape more flexible and reduce bias
- This is no longer recommended (Gelman and Imbens (2019)) due to potential to magnify bias

- With improved methods, parametric RD mainly relegated to situations where some nonstandard aspect of the analysis makes calculations with local methods difficult

- The tried and true estimator for nonparametric RD
- Classical, easily interpretable, well-studied

- Idea: to estimate function near a point, use function estimation methods, but use only data near that point
- Choose a window around a point \(r\) and run a (weighted) regression in that window
- Account for slope and curvature and extrapolate locally by using polynomial of order \(p\)
- \(p=0\) is *local constant*, \(p=1\) is *local linear*, \(p=2\) is *local quadratic*
- \(p=1\) most common choice for RD, default in a lot of software
- Let \(K(u)\) be a **kernel** function weighting points within the window
- Boxcar/Uniform: \(K(u)=1\{|u|\leq1\}\), Triangular: \(K(u)=\max\{0,(1-|u|)\}\), Epanechnikov: \(K(u)=\max\{0,\frac{3}{4}(1-u^2)\}\)

- Choose how spread out to make the average by adjusting bandwidth \(h\)
- (One-sided) local polynomial regression estimators of order \(p\) with bandwidth \(h\)
- \(\widehat{\beta}^{-}(r):=(\widehat{\beta}^{-}_0(r),\widehat{\beta}^{-}_1(r),\ldots)=\underset{(\beta_0,\ldots,\beta_p)}{\arg\min}\sum_{i=1}^{n}1\{R_i<c\}K(\frac{R_i-r}{h})(Y_i-\sum_{j=0}^{p}\beta_j(R_i-r)^j)^2\)
- \(\widehat{\beta}^{+}(r):=(\widehat{\beta}^{+}_0(r),\widehat{\beta}^{+}_1(r),\ldots)=\underset{(\beta_0,\ldots,\beta_p)}{\arg\min}\sum_{i=1}^{n}1\{R_i\geq c\}K(\frac{R_i-r}{h})(Y_i-\sum_{j=0}^{p}\beta_j(R_i-r)^j)^2\)
- \(\widehat{\tau}^{RD}=\widehat{\beta}^{+}_0(c)-\widehat{\beta}^{-}_0(c)\)

- With uniform kernel, this is just regression in a window
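
A sketch of the two one-sided local linear fits with a triangular kernel (numpy only). The bandwidth, kernel, and simulated data are illustrative, and no bias correction or data-driven bandwidth selection is attempted:

```python
import numpy as np

def triangular_kernel(u):
    return np.maximum(0.0, 1.0 - np.abs(u))

def local_linear_at(y, r, point, h, kernel=triangular_kernel):
    """Weighted least squares of Y on (1, R - point) with kernel weights;
    the intercept estimates E[Y | R = point]."""
    w = kernel((r - point) / h)
    keep = w > 0
    X = np.column_stack([np.ones(keep.sum()), r[keep] - point])
    W = np.diag(w[keep])
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[keep])
    return beta[0]

def rd_local_linear(y, r, c, h):
    """One-sided local linear fits on each side of the cutoff; tau is the gap between intercepts."""
    above = r >= c
    return local_linear_at(y[above], r[above], c, h) - local_linear_at(y[~above], r[~above], c, h)

rng = np.random.default_rng(2)
r = rng.uniform(-1, 1, 5000)
y = 1 + 0.5 * r - r**2 + 2 * (r >= 0) + rng.normal(0, 1, r.size)
print(rd_local_linear(y, r, c=0.0, h=0.2))  # close to the true jump of 2
```

Swapping in the boxcar kernel recovers the plain regression-in-a-window described above.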

- Key choices: **bandwidth** \(h\), **kernel** \(K(\cdot)\), polynomial **order** \(p\)
- Hard part is how to pick these and why

- Various concerns, assumptions, tradeoffs make this difficult and context-dependent
- Plenty of work on “optimal” choices, all of which uses different criteria and makes different assumptions
- You have to decide which assumptions and goals are relevant to your setting
- Statistical criteria (bias, variance, coverage, interval width, robustness, etc) well-quantified
- You also want your work to be transparent, impactful, trusted, and trustworthy
- Visualization is crucial: plot should allow users to assess results and their quality
- Possibility of specification search favors methods which are conventional
- Using “default” methods whenever possible: RD software often has fairly reasonable choices implemented, even if not exactly the ones you would have made
- Have clear and convincing explanations for modifications to defaults based on features of your setting
- Display results under range of plausible choices to demonstrate robustness

- An assumption under which local polynomial estimators are a good choice is differentiability
- More specifically, the quality of approximation of a Taylor expansion of order \(p\)

- Heuristic argument: consider local linear regression with uniform kernel and bandwidth \(h\)
- Assume \(E[Y|R=r]\) is twice differentiable with second derivative bounded
- Taylor expansion is then \(E[Y|R=r]=f(r)=f(c)+f^\prime(c)(r-c)+\frac{1}{2}f^{\prime\prime}(\tilde{r})(r-c)^2\) and \(Y_i=f(R_i)+u_i\)

- Normalizing \(c=0\) and letting \(\vec{R}_i=(1,R_i,...,R_i^p)\)
- \(\widehat{\beta}_0(0)=e_1^\prime(\sum_{i=1}^{n}K(\frac{R_i}{h})\vec{R}_i\vec{R}_i^\prime)^{-1}\sum_{i=1}^{n}K(\frac{R_i}{h})\vec{R}_iY_i=\sum_{i=1}^{n}w(R_i,h)Y_i\): a linear combination with weights that do not depend on \(Y\)

- \(E[E[\widehat{\beta}_0(0)|R]]-E[Y|R=0]=\frac{1}{2}E[\sum_{i=1}^{n}w(R_i,h)f^{\prime\prime}(\tilde{R}_i)R_i^2]\leq Ch^2\)
- Because OLS is unbiased for linear functions, and the second derivative and the weights are bounded

- By uniform kernel and bandwidth \(h\), density of \(R\) bounded above and below in neighborhood of 0, number of observations given non-zero weight is proportional to \(nh\)
- Since OLS has variance proportional to inverse of sample size, variance is proportional to \(\frac{1}{nh}\)

- Mean squared error of estimate: \(E[(\widehat{\beta}_0(0)-E[Y|R=0])^2]=\text{Bias}(\widehat{\beta}_0(0))^2+Var(\widehat{\beta}_0(0))\propto h^4+\frac{1}{nh}\)
- Taking FOC to minimize yields \(h\propto n^{-1/5}\) and MSE \(\propto n^{-4/5}\): proportionality depends on kernel choice, \(R\) distribution, smoothness bound
- Compare to correctly specified OLS, bias 0, variance = MSE \(\propto n^{-1}\)
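- Writing the two terms with explicit constants \(C_1,C_2\) (absorbing the kernel, density, and smoothness factors), the first-order condition makes the rate explicit: \(\frac{d}{dh}\left[C_1h^4+\frac{C_2}{nh}\right]=4C_1h^3-\frac{C_2}{nh^2}=0\implies h^*=\left(\frac{C_2}{4C_1n}\right)^{1/5}\propto n^{-1/5}\), and substituting back in gives \(MSE(h^*)\propto n^{-4/5}\)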

- Can show this is the optimal minimax rate for the *Hölder class* \(\Lambda^2(C)\) of functions with bounded second derivatives (Tsybakov (2009))
- \(\underset{\widehat{f}}{\inf}\,\underset{f\in\Lambda^2(C)}{\sup}\,MSE(\widehat{f},f)\propto n^{-4/5}\), so the estimate is pretty good

- As matter of practice, constant is unknown, so we can’t choose optimal bandwidth
- Kernel also affects outcome, but enters into constant and not rate, so less important overall
- In this setting, for local linear regression estimator, Imbens and Kalyanaraman (2012) describe estimator for the constant term
- Relies on preliminary higher order local polynomial to get higher derivatives, variance parameters, etc

- This is the default in most older RD software and still probably a tolerable default for point estimates
- Cross-validation with MSE criterion also feasible
- That said, why use method optimal for exactly 2 derivatives, instead of more or less?
- Minimax MSE with \(d\) derivatives \(\propto n^{\frac{-2d}{2d+1}}\)

- Probable reason: 2 derivatives is class for which local linear regression optimal
- Procedure simple, transparent, visualizable, has connections to familiar OLS

- Higher order rates require methods with complications or drawbacks
- Higher order kernels may have worrisome properties like some negative weights
- Higher order polynomials may overextrapolate (though reducing bandwidth should help with this)

- *Adaptive* estimators have the serious advantage of achieving optimal rates without needing to know the order
- But the methods appear complicated and computationally challenging, and, since not widely used, lack reliable software implementations and communities of practice to address possibly unintended sources of fragility

- In practice you are expected to report sensitivity analyses to bandwidth, and maybe order of polynomial and/or choice of kernel anyway
- If results are not that sensitive to choice, optimality probably not main concern

- If goal is inference rather than point estimation, criterion changes entirely and so do procedures
- MSE equalizes weight placed on variance and squared bias, while inference prioritizes coverage
- With MSE-optimal bandwidth, local linear estimator converges in distribution to a normal at rate \(\sqrt{nh}\)
- \(\sqrt{nh}(\widehat{\beta}(r)-E[Y|R=r])=\sqrt{nh}(\widehat{\beta}(r)-E[\widehat{\beta}(r)])+\sqrt{nh}(E[\widehat{\beta}(r)]-E[Y|R=r])\)
- Because bias term is size \(h^2\) and optimal \(h\propto n^{-1/5}\), scaled bias term is \(O(1)\): doesn’t disappear in large samples
- Limiting normal distribution has non-zero mean!
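- To see this: with \(h\propto n^{-1/5}\), the scaled bias is \(\sqrt{nh}\cdot h^{2}=\sqrt{nh^{5}}\propto\sqrt{n\cdot n^{-1}}=O(1)\)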

- Given that bias is unknown, can’t just rely on variance estimates
- Option 1: undersmoothing: choose smaller than MSE-optimal bandwidth \(h=o(n^{-1/5})\)
- Bias term then asymptotically negligible, at cost of slower MSE overall
- Intervals wider since the \(\sqrt{nh}\) scaling is slower when \(h\) is smaller
- Asymptotically negligible doesn’t mean ignorable in small samples: may have poor coverage
- Bandwidth choice methods become inapplicable

- Option 2: use the MSE-optimal bandwidth (as in Imbens and Kalyanaraman (2012)), but estimate and correct for the bias term in the CI
- Calonico, Cattaneo, and Titiunik (2014) propose an estimate, but note that bias correction itself has variability
- They therefore propose an adjusted variance estimator taking into account all terms
- With implementation in the `rdrobust` R/Stata libraries, and extensive documentation and extensions, this may be the most popular and widely applicable default implementation

- Above inference procedures assumed local polynomial estimation and sufficient derivatives to estimate conditional expectation and its derivatives, to compute bandwidth and bias corrections
- How do assumptions relate to properties of inference?
- May compare procedures by (expected) length of (valid) confidence intervals, or power of tests

- Under the assumptions needed for identification of RD, **no** \(1-\alpha\) confidence interval can have finite expected length (Bertanha and Moreira (2020))
- This is true even if we assume *infinitely* differentiable functions
- Manifestation of the fact that, without quantifying its size, it is impossible to distinguish a discontinuity from a continuous and smooth but arbitrarily fast change
- Implication is that procedures like Calonico, Cattaneo, and Titiunik (2014), which estimate the degree of smoothness, must under-cover for some bad distributions in finite samples, even if for any fixed function they eventually converge
- Stronger model conditions can restore possibility of CIs with uniform coverage: an explicit upper bound on derivatives
- Armstrong and Kolesár (2018), Armstrong and Kolesár (2020) show that with an upper bound on the \(p^{th}\) derivative, one can bound the bias of a local estimator over all functions in the class obeying the bound, and construct critical values which provide \(1-\alpha\) coverage even in the worst case
- If \(\widehat{\tau}^{RD}\) has worst-case bias \(B(\widehat{\tau}^{RD})\) over class \(\mathcal{F}_M\), and standard error \(\widehat{se}(\widehat{\tau}^{RD})\), the CI is \(\widehat{\tau}^{RD}\pm cv_\alpha\left(B(\widehat{\tau}^{RD})/\widehat{se}(\widehat{\tau}^{RD})\right)\widehat{se}(\widehat{\tau}^{RD})\), where \(cv_\alpha(b)\) is the \(1-\alpha\) quantile of the \(|N(b,1)|\) distribution (computed numerically in the sketch below)
- Derive the max bias for function classes such as assuming the slope near the cutoff is bounded by \(M\), or that \(|f(r)-f(c)-f^\prime(c)(r-c)|\leq \frac{M}{2}(r-c)^2\), for local linear with some choice of kernel and bandwidth, and for other weighting estimators
- You provide your intuition on how large slope/curvature could be in worst case
- Show that optimal choices for CI length, power, MSE, all fairly similar in worst case, and little scope for improvement by adapting
- Implemented in the R library `RDHonest`
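
A small numerical sketch of the critical value \(cv_\alpha(b)\) defined above, in Python with scipy; this just inverts the coverage condition for \(|N(b,1)|\) and is not a call into the `RDHonest` package itself:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def cv_alpha(b, alpha=0.05):
    """1 - alpha quantile of |N(b, 1)|: the critical value used to widen the CI
    when the worst-case bias is b standard errors."""
    # P(|Z + b| <= cv) = Phi(cv - b) - Phi(-cv - b); find cv where this equals 1 - alpha
    coverage_gap = lambda cv: norm.cdf(cv - b) - norm.cdf(-cv - b) - (1 - alpha)
    return brentq(coverage_gap, 0.0, 10.0 + abs(b))

print(cv_alpha(0.0))  # about 1.96: with no bias, the usual normal critical value
print(cv_alpha(1.0))  # larger, to keep 95% coverage despite bias up to one standard error
```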

- Bayesian estimation provides a way to express substantive knowledge about features of model in terms of average case properties instead of worst case belief
- Express these as a *prior distribution* over functions on either side of the cutoff
- To allow a flexible shape, with values near the cutoff informed mainly by local data, use a nonparametric prior like a **Gaussian Process** (Williams and Rasmussen (2006))
- Belief about the conditional mean at any points \(r_1,r_2,...\) is multivariate Gaussian, with covariance based on distance
- Choices of covariance can express beliefs about smoothness and degree of information sharing

- Branson et al. (2019) describe implementation and properties for use in RD
- Uncertainty expressed as posterior distribution not same as a confidence interval, though can sometimes be shown to converge (Vaart and Zanten (2008))
- Properties encoded by prior (and likelihood) for shape of function may be subtle, so simulate to make sure prior focuses on plausible shapes
- Not widely used but probably should be…

- All RD estimates should be accompanied by a plot showing function estimates on both sides of the cutoff
- Include both local estimate and function estimates away from the cutoff, as well as intervals if possible
- In sparse data settings, these should overlay a scatterplot of \((R_i,Y_i)\) values to allow comparison of fit to raw data
- With larger sample sizes, it may be hard to see much in the raw data, so overlay a **binscatter** instead (Cattaneo et al. (2021))
- Within narrow \(R\) bins, compute the average \(Y\) value and plot it for each bin, with bin width defined by levels or quantiles of \(R\)
- Usable as nonparametric estimator, but goal is a low-bias high-variance guide to patterns in data, so choose narrow bins that exhibit variability
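
A minimal binscatter sketch (numpy only; quantile-spaced bins and the default of 20 bins are illustrative choices, not the defaults of any particular binscatter package):

```python
import numpy as np

def binscatter(y, r, n_bins=20, use_quantiles=True):
    """Average Y within n_bins bins of R; returns bin centers and bin means for plotting."""
    if use_quantiles:
        edges = np.quantile(r, np.linspace(0, 1, n_bins + 1))
    else:
        edges = np.linspace(r.min(), r.max(), n_bins + 1)
    idx = np.clip(np.digitize(r, edges[1:-1]), 0, n_bins - 1)  # bin index for each observation
    centers = np.array([r[idx == b].mean() for b in range(n_bins)])
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    return centers, means

rng = np.random.default_rng(3)
r = rng.uniform(-1, 1, 5000)
y = 1 + 0.5 * r + 2 * (r >= 0) + rng.normal(0, 1, r.size)

# For an RD plot, bin each side separately so that no bin straddles the cutoff
below = binscatter(y[r < 0], r[r < 0])
above = binscatter(y[r >= 0], r[r >= 0])
```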

- Example from Lee (2008) study of incumbency effect on re-election

- Visual ability to distinguish slopes of curves from a scatterplot less precise than formal statistical estimates
- But when jump is visually detectable, result is much more impactful and likely to be trusted
- Clear visibility criterion prioritizes size over power: may “miss” results which are well-distinguished with data
- This is arguably a feature, not a bug: to extent that statistical methods rely on manipulable choices, they may “go away” with alternate assumptions on bandwidth, order, etc, and a conservative criterion raises robustness to that
- Visibility also allows assessing other features of data, like variability, curvature, degree of change in these features away from jump
- Provides a visual check on tuning parameter choices (does the data start to curve near or far from the jump?) and a “placebo” check on what things look like where there is no jump
- Visualization choices can impact how striking the result appears: raw data is very noisy (power low), while a parametric model is very tight but doesn’t provide extra checks. In between are undersmoothed methods with low bias: binscatter bin width interpolates between these extremes.
- Plot window including just area near cutoff (see it better) vs wider range (assess global properties)
- “Default” is to plot parametric fit and binscatter, with some default bin width, on area wider than window, cut off only if wide R support obscures area near jump
- May raise or lower this, but people may question your choice

- Binscatter outliers right near the jump: suggests overfitting
- Parametric curve deviating strongly from binscatter right near the discontinuity suggests extrapolation is doing a lot of work
- Especially bad with quadratic/cubic fits, which can extrapolate curvature from other parts of the space, even producing curvature in the opposite direction near the discontinuity
- Part of reason why Gelman and Imbens (2019) suggest avoiding RD using higher order polynomials entirely

- Gelman referred to the following RD estimate of the effect of air pollution on mortality, using a geographic border where regulations change, as a “regression discontinuity disaster”
- Your goal as an applied researcher is to make sure your work is never highlighted on Gelman’s blog…