## Motivation

• Regression discontinuity (Thistlethwaite and Campbell (1960)) is a conceptually simple way to leverage a common situation in real data
• Treatment assignment is not haphazard or random but instead follows a strict and simple rule
• Bureaucracies, institutions, algorithms allocating a discrete resource systematically may make decisions by a well-defined threshold
• Admission by standardized test, election by majority vote, program eligibility by cutoffs in income, assets, or age, drug recommendation by checklist (blood pressure, metabolites, etc.)
• Key to using such thresholds is idea that they may be in some sense arbitrary
• Criterion in rule usually based on something specifically chosen to be relevant to the outcome: blood pressure and drug effectiveness, test score and school performance, vote share and politician quality
• But the exact level of threshold is less meaningful
• Discrete nature of treatment (by simplicity or necessity) turns infinitesimal differences between units into large changes in treatment
• This contrast between small changes in unit attributes and big changes in treatment makes units on either side “close to” comparable
• Requires assumptions on outcome process, but fairly weak ones
• Just that “small” changes in the assignment variable do not lead to “big” differences in units, so that units on either side are comparable except for treatment
• Formalizing “small” and “big” is major practical challenge in using these methods
• Applicability and form of analysis depends on quantifying these notions
• Statistical literature buries substantive assumptions about relative size and shape of effects in technical conditions and default behavior of easy-to-use software
• Not their fault: need for precision requires jargon and mathematics, and defaults are sensible choices for typical social science applications
• But you should know what you’re doing and why, and actively interrogate and test assumptions

## Problem Setup: (Sharp) Regression Discontinuity

• Goal is the causal effect of binary treatment $$X$$ on outcome $$Y$$
• Treatment assignment determined by a deterministic rule
• Continuous variable $$R$$ determines treatment $$X$$: called running variable or assignment variable
• Units with $$R\geq c$$ get treatment ($$X=1$$), units with $$R<c$$ don’t ($$X=0$$)
• Assignment rule $$X=1\{R\geq c\}$$
• Assume, as usual, causal consistency: $$Y=Y^1X+Y^0(1-X)$$
• By the assignment rule, $$(Y^1,Y^0)\perp X|R$$, trivially since conditional on $$R$$, $$X$$ is constant and so is independent of any other variable
• But by deterministic rule, also case that $$P(X=1|R=r)$$ is 0 or 1 for all $$r$$
• We cannot use $$R$$ to identify effect by adjustment: overlap fails by construction
• Additional assumption that will allow identification: continuity of conditional potential outcomes as functions of $$R$$ at $$c$$
• $$\underset{e\searrow 0}{\lim}E[Y^1|R=c+e]=E[Y^1|R=c]$$, $$\underset{e\searrow 0}{\lim}E[Y^0|R=c-e]=E[Y^0|R=c]$$
• Stronger versions possible, e.g. continuity of $$F(Y^1|R=r)$$, $$F(Y^0|R=r)$$
• Regression discontinuity $$\tau^{RD}:=\underset{e\searrow 0}{\lim}E[Y|R=c+e]-\underset{e\searrow 0}{\lim}E[Y|R=c-e]$$

## Identification

• Under consistency, assignment rule, and continuity of potential outcomes
• $$\tau^{RD}=E[Y^1-Y^0|R=c]$$, the Local Average Treatment Effect at the threshold
• Proof:
• $$E[Y|R=r]=E[Y^1|R=r]1\{r\geq c\}+E[Y^0|R=r]1\{r<c\}$$ (consistency and assignment rule) so
• $$\tau^{RD}=\underset{e\searrow 0}{\lim}E[Y^1|R=c+e]-\underset{e\searrow 0}{\lim}E[Y^0|R=c-e]$$
• $$=E[Y^1|R=c]-E[Y^0|R=c]$$ (continuity)
• Lee and Lemieux (2010) illustrate this visually
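The identification argument can be checked in a small simulation (all names and parameter values below are illustrative, not from any package): under the deterministic rule and a CEF that is continuous through the cutoff, one-sided means in a narrow window recover the effect at the threshold.

```python
import numpy as np

# Sharp RD with cutoff c = 0 and true effect tau = 2 at the threshold.
rng = np.random.default_rng(0)
n = 50_000
R = rng.uniform(-1, 1, n)                  # running variable
X = (R >= 0).astype(float)                 # deterministic rule X = 1{R >= c}
EY0 = 1.0 + 0.5 * R                        # E[Y^0|R]: continuous through c
Y = EY0 + 2.0 * X + rng.normal(0, 0.3, n)  # consistency: Y = Y^1 X + Y^0 (1-X)

# One-sided limits approximated by means just above and just below c.
e = 0.05
tau_hat = Y[(R >= 0) & (R < e)].mean() - Y[(R < 0) & (R > -e)].mean()
print(tau_hat)  # close to 2, up to smoothing bias of order e
```

The window width `e` stands in for the limit operation; the remaining bias comes from the slope of $$E[Y^0|R]$$ within the window.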

## Intuition

• Estimated quantity is a treatment effect only conditional on being at the cutoff: marginal units only
• May or may not be representative of effect at other levels of $$R$$: patients just barely sick enough to get drug, etc
• Because conditioning group observable, can at least identify them and their attributes to see how they differ from other units
• Continuity says structural equation for $$Y$$ can depend on running variable, either directly or indirectly through associated confounders
• But magnitude of effects, at least locally near the cutoff, can’t jump in the way treatment does
• Bound on the degree of change in the outcome for small changes in $$R$$
• Maybe the biological mechanisms via which a metabolite is associated with disease severity involve proportionate changes
• Especially reasonable when running variable is noisy or imperfect proxy for causally meaningful effects
• Test scores proxy for academic ability, but 1 point difference mostly noise, and in any case “real” ability difference probably not large enough to be meaningful
• Unreasonable when other things change discontinuously at cutoff
• Administrative borders: lots of laws, regulations, taxes change as soon as you enter an administrative unit
• A common reason for discontinuous changes at cutoff is that existence of cutoff itself influences behavior discontinuously
• If people can control their $$R$$, and they care about whether they are treated, they might do so
• To extent that preferences or ability to do this are heterogeneous, manipulation may induce differences in $$Y^x$$ on different sides of the jump
• This makes discontinuities in, e.g. tax and welfare code, likely a bad fit for RD (though manipulation may itself be informative)
• If manipulation is imprecise, maybe not a problem: suppose units can only choose $$r^*$$ but $$R=r^*+\epsilon$$ for continuously distributed independent $$\epsilon$$
• Then $$E[Y^x|R=r]=\int\int y^xf(y^x|R=r,\epsilon=e)dy^xf_\epsilon(e)de=\int\int y^xf(y^x|r^*=u)f_\epsilon(r-u)dy^xdu$$ which can be continuous in $$r$$ even if $$f(y^x|r^*=u)$$ is discontinuous
• E.g., if people can study for a test, but not learn answers exactly, still preserve continuity (though this does affect properties of marginal group)

## Aside: Local randomization

• Sometimes see RD model/method analyzed and described in terms of “local randomization” (Cattaneo, Frandsen, and Titiunik (2015))
• Near cutoff, whether you are on one side or another is “effectively random” and you can treat data “as if” an experiment near the cutoff
• Formalization: $$Y^{x,r}=Y^{x}$$ for $$r\in [\underline{c},\bar{c}]$$ with $$\underline{c}<c<\bar{c}$$, and $$R\perp(Y^0,Y^1)|R\in [\underline{c},\bar{c}]$$
• In this situation, no effect of or association of running variable with outcomes in a window
• This is much stronger assumption than continuity
• Asks for potential outcomes functions to be exactly flat near the cutoff
• Maybe plausible if assignment rule based on variable not at all important to outcome
• Aside from deliberate random assignment, I struggle to think of cases where applicable
• But if you do have that, rdlocrand implements Fisher-style tests

## Estimation: Local regression

• Continuity is fairly weak notion formalizing small changes in $$R$$ leading to small changes in $$Y^x$$
• For every $$\epsilon>0$$, $$\exists \delta>0$$ s.t. $$d(r,c)<\delta\implies d(E[Y^x|R=r],E[Y^x|R=c])<\epsilon$$
• So, to get $$2\epsilon$$-accurate estimate of $$\tau^{RD}$$, estimate conditional expectation functions for $$r$$ within $$\delta$$
• Simple implementation is to take window of width $$\delta$$ and estimate expectations by conditional means
• $$\widehat{\beta}_0^{-}=\frac{\sum_{i=1}^{n}Y_i1\{c-\delta<R_i<c\}}{\sum_{i=1}^{n}1\{c-\delta<R_i<c\}}$$, $$\widehat{\beta}_0^{+}=\frac{\sum_{i=1}^{n}Y_i1\{c\leq R_i<c+\delta\}}{\sum_{i=1}^{n}1\{c\leq R_i<c+\delta\}}$$
• $$\widehat{\tau}^{RD}=\widehat{\beta}^{+}_0-\widehat{\beta}^{-}_0$$
• By law of large numbers and continuity (so long as density in region $$>0$$), holding fixed $$\delta$$,
• $$\widehat{\beta}_0^{-}\overset{p}{\to}\int E[Y|R=r]f(R=r|c-\delta<R<c)dr\in[E[Y^0|R=c]-\epsilon,E[Y^0|R=c]+\epsilon]$$
• Likewise for $$\widehat{\beta}_0^{+}$$, so $$\text{plim }\widehat{\tau}^{RD}$$ is in $$E[Y^1-Y^0|R=c]\pm2\epsilon$$
• This gets called the local constant or Nadaraya-Watson estimator
• Not bad: for any $$\epsilon$$, we have estimator that converges to something $$2\epsilon$$ close to true effect
• Problem 1: continuity doesn’t tell us anything about $$\delta$$ we need
• Solution: need stronger conditions quantifying $$\delta$$: differentiability, smoothness
• Problem 2: would like to converge to exact value
• Solution: shrink $$\delta$$ with sample size so that bias disappears
• Issue of what to assume about local response of conditional mean, and based on it, how much to shrink window, is main source of difficulty
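The local constant estimator above can be sketched by hand (illustrative parameters), using a curved conditional mean so the fixed-window bias is visible and shrinking $$\delta$$ reduces it:

```python
import numpy as np

# Curved CEF sin(2R) plus a jump of tau = 1 at the cutoff c = 0.
rng = np.random.default_rng(1)
n = 200_000
R = rng.uniform(-1, 1, n)
tau = 1.0
Y = np.sin(2 * R) + tau * (R >= 0) + rng.normal(0, 0.2, n)

def tau_local_constant(delta):
    beta_minus = Y[(R < 0) & (R > -delta)].mean()  # one-sided mean below c
    beta_plus = Y[(R >= 0) & (R < delta)].mean()   # one-sided mean above c
    return beta_plus - beta_minus

# A wide window averages over the curvature of sin(2R); a narrow one doesn't.
print(tau_local_constant(0.5), tau_local_constant(0.02))
```

The wide-window estimate is badly biased while the narrow one is close to 1, at the cost of using far fewer observations: exactly the bias-variance tradeoff driving the bandwidth problem.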

## Estimation: Parametric Models

• What we actually need is an estimate of the conditional expectation function at a point
• We can get this from any consistent estimator
• OLS consistent, unbiased estimate of a conditional expectation function if correctly specified
• $$Y_i=\beta_0+\tau1\{R_i\geq c\} + \beta_1 R_i +u_i$$: additive effect of treatment, linear in running variable
• $$Y_i=\beta_0+\tau1\{R_i\geq c\} + \beta_1 (R_i-c)1\{R_i< c\}+\beta_2(R_i-c)1\{R_i\geq c\} +u_i$$: additive effect of treatment, linear in running variable with different slopes
• Problem with this is that if incorrectly specified, all you get is best linear predictor
• Average is with respect to integrated mean squared error over entire distribution of $$R_i$$
• Goal is estimate of $$\tau$$, but estimator adjusts slope to get smaller errors in regions far away
• Helpful to use this information if global shape known, as you can extrapolate from data anywhere in sample
• But still optimizes the wrong criterion for prediction at a point
• Early attempts to solve this suggested adding polynomials to make shape more flexible and reduce bias
• This is no longer recommended (Gelman and Imbens (2019)) due to potential to magnify bias
• With improved methods, parametric RD mainly relegated to situations where some nonstandard aspect of the analysis makes calculations with local methods difficult
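The misspecification problem can be seen in a small illustrative simulation: a global linear fit on a curved CEF adjusts its slope to reduce errors far from the cutoff, biasing the estimated jump, while the same regression run only in a narrow window does not.

```python
import numpy as np

# Curved CEF sin(2.5R) plus a true jump of tau = 1 at c = 0.
rng = np.random.default_rng(5)
n = 100_000
R = rng.uniform(-1, 1, n)
tau = 1.0
Y = np.sin(2.5 * R) + tau * (R >= 0) + rng.normal(0, 0.2, n)

D = (R >= 0).astype(float)
Z = np.column_stack([np.ones(n), D, R])   # model: y ~ 1 + tau D + beta R
tau_global = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

h = 0.1
m = np.abs(R) < h                          # same regression, local window only
tau_local = np.linalg.lstsq(Z[m], Y[m], rcond=None)[0][1]
print(tau_global, tau_local)  # global estimate visibly biased, local one near 1
```

The global fit optimizes integrated squared error over the whole support of $$R$$, so the slope coefficient is pulled by curvature far from the cutoff and the jump estimate pays for it.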

## A happy medium: Local Polynomials

• The tried and true estimator for nonparametric RD
• Classical, easily interpretable, well-studied
• Idea: to estimate function near a point, use function estimation methods, but use only data near that point
• Choose a window around the point $$r$$ and run (weighted) regression in that window
• Account for slope and curvature and extrapolate locally by using polynomial of order $$p$$
• $$p=0$$ is local constant, $$p=1$$ is local linear, $$p=2$$ is local quadratic
• $$p=1$$ most common choice for RD, default in a lot of software
• Let $$K(u)$$ be a kernel function weighting points within window
• Boxcar/Uniform: $$K(u)=1\{|u|\leq1\}$$, Triangular: $$K(u)=\max\{0,(1-|u|)\}$$, Epanechnikov: $$K(u)=\max\{0,\frac{3}{4}(1-u^2)\}$$
• Choose how spread out to make the average by adjusting bandwidth $$h$$
• (One-sided) local polynomial regression estimators of order $$p$$ with bandwidth $$h$$
• $$\widehat{\beta}^{-}(r):=(\widehat{\beta}^{-}_0(r),\widehat{\beta}^{-}_1(r),\ldots)=\underset{(\beta_0,\ldots,\beta_p)}{\arg\min}\sum_{i=1}^{n}1\{R_i<c\}K(\frac{R_i-r}{h})(Y_i-\sum_{j=0}^{p}\beta_j(R_i-r)^j)^2$$
• $$\widehat{\beta}^{+}(r):=(\widehat{\beta}^{+}_0(r),\widehat{\beta}^{+}_1(r),\ldots)=\underset{(\beta_0,\ldots,\beta_p)}{\arg\min}\sum_{i=1}^{n}1\{R_i\geq c\}K(\frac{R_i-r}{h})(Y_i-\sum_{j=0}^{p}\beta_j(R_i-r)^j)^2$$
• $$\widehat{\tau}^{RD}=\widehat{\beta}^{+}_0(c)-\widehat{\beta}^{-}_0(c)$$
• With uniform kernel, this is just regression in a window
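A hand-rolled local linear estimator ($$p=1$$) with a triangular kernel, written as weighted least squares on each side of the cutoff; a sketch with illustrative parameters, not the rdrobust implementation.

```python
import numpy as np

# CEF with slope and curvature, plus a true jump of tau = 1.5 at c = 0.
rng = np.random.default_rng(2)
n = 100_000
c, tau, h = 0.0, 1.5, 0.2
R = rng.uniform(-1, 1, n)
Y = 0.3 * R + 0.5 * R**2 + tau * (R >= c) + rng.normal(0, 0.3, n)

def intercept(side):
    mask = (R >= c) if side == "+" else (R < c)
    Rm, Ym = R[mask], Y[mask]
    w = np.maximum(0.0, 1.0 - np.abs((Rm - c) / h))  # triangular kernel K(u)
    Z = np.column_stack([np.ones(Rm.size), Rm - c])  # local linear design
    beta = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Ym))
    return beta[0]                                   # beta_0 = estimated limit at c

tau_hat = intercept("+") - intercept("-")
print(tau_hat)  # close to 1.5
```

Points with zero kernel weight drop out automatically, so with the uniform kernel this reduces to OLS in a window, as the last bullet notes. Here the curvature is the same on both sides, so the one-sided smoothing biases largely cancel in the difference; in general they need not.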

## Tuning parameters

• Bandwidth $$h$$, kernel $$K(\cdot)$$, polynomial order $$p$$
• Hard part is how to pick these and why
• Various concerns, assumptions, tradeoffs make this difficult and context-dependent
• Plenty of work on “optimal” choices, all of which use different criteria and make different assumptions
• You have to decide which assumptions and goals are relevant to your setting
• Statistical criteria (bias, variance, coverage, interval width, robustness, etc) well-quantified
• You also want your work to be transparent, impactful, trusted, and trustworthy
• Visualization is crucial: plot should allow users to assess results and their quality
• Possibility of specification search favors methods which are conventional
• Using “default” methods whenever possible: RD software often has fairly reasonable choices implemented, even if not exactly the ones you would have made
• Have clear and convincing explanations for modifications to defaults based on features of your setting
• Display results under range of plausible choices to demonstrate robustness

## Analysis: Rates

• An assumption under which local polynomial estimators are a good choice is differentiability
• More specifically, the quality of approximation of a Taylor expansion of order $$p$$
• Heuristic argument: consider local linear regression with uniform kernel and bandwidth $$h$$
• Assume $$E[Y|R=r]$$ is twice differentiable with second derivative bounded
• Taylor expansion is then $$E[Y|R=r]=f(r)=f(c)+f^\prime(c)(r-c)+\frac{1}{2}f^{\prime\prime}(\tilde{r})(r-c)^2$$ for some $$\tilde{r}$$ between $$c$$ and $$r$$, and $$Y_i=f(R_i)+u_i$$
• Normalizing $$c=0$$ and letting $$\vec{R}_i=(1,R_i,...,R_i^p)$$
• $$\widehat{\beta}_0(r)=e_1^\prime(\sum_{i=1}^{n}K(\frac{R_i}{h})\vec{R}_i\vec{R}_i^\prime)^{-1}\sum_{i=1}^{n}K(\frac{R_i}{h})\vec{R}_iY_i$$ $$=\sum_{i=1}^{n}w(R_i,h)Y_i$$ linear combination for some weight that doesn’t depend on $$Y$$
• $$E[E[\widehat{\beta}_0(0)|R]]-E[Y|R=0]=\frac{1}{2}E[\sum_{i=1}^{n}w(R_i,h)f^{\prime\prime}(\tilde{R}_i)R_i^2]\leq Ch^2$$
• Because OLS is unbiased for linear functions, and second derivatives and weights are bounded
• By uniform kernel and bandwidth $$h$$, with density of $$R$$ bounded above and below in a neighborhood of 0, the number of observations given non-zero weight is proportional to $$nh$$
• Since OLS has variance proportional to inverse of sample size, variance is proportional to $$\frac{1}{nh}$$
• Mean squared error of estimate $$E[(\widehat{\beta}_0(0)-E[Y|R=0])^2]=$$ $$\text{Bias}(\widehat{\beta}_0(0))^2+Var(\widehat{\beta}_0(0))$$ $$\propto h^4+\frac{1}{nh}$$
• Taking FOC to minimize yields $$h\propto n^{-1/5}$$ and MSE $$\propto n^{-4/5}$$: proportionality depends on kernel choice, $$R$$ distribution, smoothness bound
• Compare to correctly specified OLS, bias 0, variance = MSE $$\propto n^{-1}$$
• Can show this is optimal minimax rate for Hölder class $$\Lambda^2(C)$$ of functions with bounded second derivatives (Tsybakov (2009))
• $$\underset{\widehat{f}}{\inf}\underset{f\in\Lambda^2(C)}{\sup}MSE(\widehat{f},f)\propto n^{-4/5}$$, so estimate is pretty good
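The rate heuristic can be checked numerically (constants normalized to 1): minimizing $$MSE(h)=h^4+\frac{1}{nh}$$ gives $$h^*\propto n^{-1/5}$$ and minimized MSE $$\propto n^{-4/5}$$.

```python
def h_star(n):
    # From the FOC: 4 h^3 = 1 / (n h^2), so h^5 = 1 / (4n).
    return (1.0 / (4.0 * n)) ** 0.2

def mse(n, h):
    return h**4 + 1.0 / (n * h)  # squared bias + variance, constants = 1

# Increasing n by 1000x shrinks h* by 1000^(1/5) and MSE* by 1000^(4/5).
print(h_star(1e3), mse(1e3, h_star(1e3)))
print(h_star(1e6), mse(1e6, h_star(1e6)))
```

Both terms of the minimized MSE are proportional to $$n^{-4/5}$$ at $$h^*$$, which is why the optimal bandwidth exactly balances squared bias against variance.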

## Implementation Choices

• As a matter of practice, the constant is unknown, so we can’t compute the optimal bandwidth directly
• Kernel also affects outcome, but enters into constant and not rate, so less important overall
• In this setting, for the local linear regression estimator, Imbens and Kalyanaraman (2012) describe an estimator for the constant in the optimal bandwidth
• Relies on preliminary higher order local polynomial to get higher derivatives, variance parameters, etc
• This is the default in most older RD software and still probably a tolerable default for point estimates
• Cross-validation with MSE criterion also feasible
• That said, why use method optimal for exactly 2 derivatives, instead of more or less?
• Minimax MSE with $$d$$ derivatives $$\propto n^{\frac{-2d}{2d+1}}$$
• Probable reason: 2 derivatives is class for which local linear regression optimal
• Procedure simple, transparent, visualizable, has connections to familiar OLS
• Higher order rates require methods with complications or drawbacks
• Higher order kernels may have worrisome properties like some negative weights
• Higher order polynomials may overextrapolate (though reducing bandwidth should help with this)
• Adaptive estimators have serious advantage of achieving optimal rates without needing to know order
• But methods appear complicated, computationally challenging, and since not widely used, lack reliable software implementations and communities of practice to address possibly unintended sources of fragility
• In practice you are expected to report sensitivity analyses to bandwidth, and maybe order of polynomial and/or choice of kernel anyway
• If results are not that sensitive to choice, optimality probably not main concern

## Inference

• If goal is inference rather than point estimation, criterion changes entirely and so do procedures
• MSE equalizes weight placed on variance and squared bias, while inference prioritizes coverage
• With MSE-optimal bandwidth, local linear estimator converges in distribution to a normal at rate $$\sqrt{nh}$$
• $$\sqrt{nh}(\widehat{\beta}(r)-E[Y|R=r])=\sqrt{nh}(\widehat{\beta}(r)-E[\widehat{\beta}(r)])+\sqrt{nh}(E[\widehat{\beta}(r)]-E[Y|R=r])$$
• Because bias term is size $$h^2$$ and optimal $$h\propto n^{-1/5}$$, scaled bias term is $$O(1)$$: doesn’t disappear in large samples
• Limiting normal distribution has non-zero mean!
• Given that bias is unknown, can’t just rely on variance estimates
• Option 1: undersmoothing: choose smaller than MSE-optimal bandwidth $$h=o(n^{-1/5})$$
• Bias term then asymptotically negligible, at cost of slower MSE overall
• Intervals wider since $$\sqrt{nh}$$ scaling is slower when $$h$$ shrinks faster
• Asymptotically negligible doesn’t mean ignorable in small samples: may have poor coverage
• Bandwidth choice methods become inapplicable
• Option 2: use MSE-optimal bandwidth (as in Imbens and Kalyanaraman (2012)), but estimate and correct for the bias term in the CI
• Calonico, Cattaneo, and Titiunik (2014) propose an estimate, but note that bias correction itself has variability
• They therefore propose an adjusted variance estimator taking into account all terms
• With implementation in rdrobust R/Stata libraries and extensive documentation and extensions, may be most popular and widely applicable default implementation
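The non-vanishing bias term can be verified directly from the rates in the previous section: with the MSE-optimal bandwidth $$h\propto n^{-1/5}$$,

```latex
\sqrt{nh}\, h^{2}
\;\propto\; \sqrt{n \cdot n^{-1/5}}\cdot n^{-2/5}
\;=\; n^{2/5} \cdot n^{-2/5}
\;=\; 1 \;=\; O(1),
```

while undersmoothing, $$h=o(n^{-1/5})$$, makes $$\sqrt{nh}\,h^2=\sqrt{nh^5}\to 0$$, which is exactly the tradeoff described under Option 1.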

## Impossibility/Optimality of Inference

• Above inference procedures assumed local polynomial estimation and sufficient derivatives to estimate conditional expectation and its derivatives, to compute bandwidth and bias corrections
• How do assumptions relate to properties of inference?
• May compare procedures by (expected) length of (valid) confidence intervals, or power of tests
• Under assumptions needed for identification of RD, no $$1-\alpha$$ confidence interval can have finite expected length (Bertanha and Moreira (2020))
• This is true even if we assume infinitely differentiable functions
• Manifestation of fact that without quantifying size, impossible to distinguish a discontinuity from a continuous and smooth but arbitrarily fast change
• Implication is that procedures like Calonico, Cattaneo, and Titiunik (2014) which estimate degree of smoothness must under-cover for some bad distributions in finite samples, even if for any fixed function they eventually converge
• Stronger model conditions can restore possibility of CIs with uniform coverage: an explicit upper bound on derivatives
• Armstrong and Kolesár (2018), Armstrong and Kolesár (2020) show that with an upper bound on the $$p^{th}$$ derivative, one can upper bound the bias of the local estimator over all functions in the class obeying the bound, and construct critical values which provide $$1-\alpha$$ coverage even in the worst case
• If $$\widehat{\tau}^{RD}$$ has worst case bias $$B(\widehat{\tau}^{RD})$$ over class $$\mathcal{F}_M$$, and standard error $$\widehat{se}(\widehat{\tau}^{RD})$$, CI is $$\widehat{\tau}^{RD}\pm cv_\alpha\big(B(\widehat{\tau}^{RD})/\widehat{se}(\widehat{\tau}^{RD})\big)\cdot\widehat{se}(\widehat{\tau}^{RD})$$ where $$cv_\alpha(b)$$ is the critical value of the $$|N(b,1)|$$ distribution
• Derive max bias for function classes such as assuming slope near cutoff bounded by $$M$$, or that $$|f(r)-f(c)-f^\prime(c)(r-c)|\leq \frac{M}{2}(r-c)^2$$, for local linear with some choice of kernel and bandwidth, and other weighting estimators
• You provide your intuition on how large slope/curvature could be in worst case
• Show that optimal choices for CI length, power, MSE are all fairly similar in the worst case, with little scope for improvement by adapting
• Implemented in R library RDHonest
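The critical value $$cv_\alpha(b)$$ is easy to compute by bisection on the normal CDF; a sketch, not the RDHonest implementation.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cv(b, alpha=0.05):
    """1 - alpha quantile of |N(b, 1)|: solve P(|N(b,1)| <= c) = 1 - alpha."""
    lo, hi = 0.0, b + 10.0
    for _ in range(200):  # bisection on the monotone coverage function
        mid = 0.5 * (lo + hi)
        if Phi(mid - b) - Phi(-mid - b) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return lo

print(round(cv(0.0), 2), round(cv(1.0), 2))  # 1.96 and 2.65
```

With no worst-case bias this recovers the usual 1.96; as the bias-to-standard-error ratio grows, the interval widens to maintain coverage in the worst case.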

## Alternate Perspective: Bayes

• Bayesian estimation provides a way to express substantive knowledge about features of the model in terms of average case properties instead of worst case bounds
• Express these as a prior distribution over functions on either side of cutoff
• To allow flexible shape with values near cutoff informed mainly by local data, use nonparametric prior like Gaussian Process (Williams and Rasmussen (2006))
• Belief about conditional mean at any points $$r_1,r_2,...$$ is multivariate Gaussian, with covariance based on distance
• Choices of covariance can express beliefs about smoothness and degree of information sharing
• Branson et al. (2019) describe implementation and properties for use in RD
• Uncertainty expressed as posterior distribution not same as a confidence interval, though can sometimes be shown to converge (van der Vaart and van Zanten (2008))
• Properties encoded by prior (and likelihood) for shape of function may be subtle, so simulate to make sure prior focuses on plausible shapes
• Not widely used but probably should be…
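A minimal Gaussian-process regression sketch for one side of the cutoff (squared-exponential covariance; hyperparameters are illustrative, and this is not the Branson et al. implementation): the posterior for the left limit $$E[Y^0|R=c]$$ comes from standard GP conditioning formulas.

```python
import numpy as np

# Data left of the cutoff c = 0, with true CEF sin(2R), so f(c-) = 0.
rng = np.random.default_rng(3)
m = 200
Rl = np.sort(rng.uniform(-1, 0, m))
Yl = np.sin(2 * Rl) + rng.normal(0, 0.2, m)

def k(a, b, ell=0.3, s2=1.0):
    # Squared-exponential covariance: nearby points share more information.
    return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

K = k(Rl, Rl) + 0.2**2 * np.eye(m)   # prior covariance + observation noise
r0 = np.array([0.0])                 # evaluation point: the cutoff
kv = k(r0, Rl)
post_mean = float(kv @ np.linalg.solve(K, Yl))
post_var = float(k(r0, r0) - kv @ np.linalg.solve(K, kv.T))
print(post_mean, post_var)  # posterior for the left limit; truth is sin(0) = 0
```

The lengthscale `ell` plays a role analogous to the bandwidth: it governs how far from the cutoff data still informs the limit, so simulating draws from the prior to check plausibility (as the bullet above suggests) is worthwhile.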

## Visualization

• All RD estimates should be accompanied by a plot showing function estimates on both sides of the cutoff
• Include both local estimate and function estimates away from the cutoff, as well as intervals if possible
• In sparse data settings, these should overlay a scatterplot of $$(R_i,Y_i)$$ values to allow comparison of fit to raw data
• With larger sample sizes, may be hard to see much in data, so overlay binscatter instead (Cattaneo et al. (2021))
• Within narrow $$R$$ bins, compute average $$Y$$ value, plot for each bin, with width defined by levels or quantiles of $$R$$
• Usable as nonparametric estimator, but goal is a low-bias high-variance guide to patterns in data, so choose narrow bins that exhibit variability
• Example from Lee (2008) study of incumbency effect on re-election
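A hand-rolled binscatter sketch (quantile bins, illustrative parameters; not the binsreg package), computed separately on each side so no bin straddles the cutoff:

```python
import numpy as np

# Linear CEF plus a true jump of 1 at c = 0.
rng = np.random.default_rng(4)
n = 20_000
R = rng.uniform(-1, 1, n)
Y = 0.4 * R + 1.0 * (R >= 0) + rng.normal(0, 0.5, n)

def binscatter(Rs, Ys, n_bins=20):
    # Quantile bin edges, bin membership, then within-bin means of R and Y.
    edges = np.quantile(Rs, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, Rs, side="right") - 1, 0, n_bins - 1)
    xs = np.array([Rs[idx == b].mean() for b in range(n_bins)])
    ys = np.array([Ys[idx == b].mean() for b in range(n_bins)])
    return xs, ys

xl, yl = binscatter(R[R < 0], Y[R < 0])
xr, yr = binscatter(R[R >= 0], Y[R >= 0])
print(yr[0] - yl[-1])  # rough jump estimate from the two bins nearest c
```

Plotting `(xl, yl)` and `(xr, yr)` over a scatter of the raw data gives the low-bias, high-variance guide to patterns described above; narrower bins show more variability, wider bins smooth more.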

## Eyeballing it: pros and cons

• Visual ability to distinguish slopes of curves from a scatterplot less precise than formal statistical estimates
• But when jump is visually detectable, result is much more impactful and likely to be trusted
• Clear visibility criterion prioritizes size over power: may “miss” results which are well-distinguished with data
• This is arguably a feature, not a bug: to extent that statistical methods rely on manipulable choices, they may “go away” with alternate assumptions on bandwidth, order, etc, and a conservative criterion raises robustness to that
• Visibility also allows assessing other features of data, like variability, curvature, degree of change in these features away from jump
• Expresses visual check on tuning parameter choices (does data start to curve near or far from jump) and “placebo” check on what things look like without a jump
• Visualization choices can impact how striking the result appears: raw data is very noisy, so power is low; a parametric model is very tight but doesn’t provide extra checks; in between are undersmoothed methods with low bias, with binscatter width interpolating between these extremes
• Plot window including just area near cutoff (see it better) vs wider range (assess global properties)
• “Default” is to plot parametric fit and binscatter, with some default bin width, on area wider than window, cut off only if wide R support obscures area near jump
• May raise or lower this, but people may question your choice

## Features of an RD plot that indicate a questionable fit

• Binscatter outliers right near the jump: suggests overfitting
• Parametric curve deviating strongly from binscatter right near discontinuity:
• Suggests extrapolation is doing a lot of work:
• Especially bad with quadratic/cubic fits, which can extrapolate curvature from other parts of the space, possibly even curvature in the opposite direction, to near the discontinuity
• Part of reason why Gelman and Imbens (2019) suggest avoiding RD using higher order polynomials entirely
• Gelman referred to the following RD estimate of the effect of air pollution on mortality, using a geographic border where regulations change, as a “regression discontinuity disaster”
• Your goal as an applied researcher is to make sure your work is never highlighted on Gelman’s blog…