- Causal effect of interest is often single feature of much more complicated model
- When identified, can write it as a *functional* of the distribution: \(\psi=\Psi(P):\ \mathcal{P}\to\mathbb{R}^d\)
- \(\mathcal{P} = \{\mathbb{P}_\theta:\ \theta\in\Theta\}\) is class of distributions, indexed by parameters, which may be infinite dimensional (expectation functions, densities, etc)

- What is reasonable estimator for \(\psi\) given data \(\{x_i\}_{i=1}^{n}\) with empirical distribution \(\mathbb{P}_n\)?

- \(\widehat{\theta}_{MLE}=\underset{\theta\in\Theta}{\arg\max}\log \ell(\{x_i\}_{i=1}^{n},\theta)\)
- \(\widehat{\psi}_{MLE}=\Psi(\mathbb{P}_{\widehat{\theta}_{MLE}})\)
- If \(\Theta\) is finite dimensional, under some regularity conditions it is asymptotically optimal to maximize the likelihood and plug the MLE into the functional
- Limit theory review (we’ll use this form): Take FOC and Taylor expand

\[0=\frac{1}{n}\sum_{i=1}^{n}\frac{d}{d\theta}\log\ell(x_i,\widehat{\theta})\approx \frac{1}{n}\sum_{i=1}^{n}\frac{d}{d\theta}\log\ell(x_i,\theta_0)+\frac{1}{n}\sum_{i=1}^{n}\frac{d^2}{d\theta^2}\log\ell(x_i,\bar{\theta})(\widehat{\theta}-\theta_0)\]

- Rearrange, apply LLN and Slutsky
\[\sqrt{n}(\widehat{\theta}^{MLE}-\theta_0)= -\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\mathbb{E}\left[\frac{d^2}{d\theta^2}\log\ell(x_i,\bar{\theta})\right]\right)^{-1}\frac{d}{d\theta}\log\ell(x_i,\theta_0)+o_p(1)\]
- Supposing \(\Psi\) differentiable in \(\theta\), apply a first order Taylor expansion again (“delta method”)
\[\sqrt{n}(\widehat{\psi}^{MLE}-\psi_0)= -\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{d\psi}{d\theta}\left(\mathbb{E}\left[\frac{d^2}{d\theta^2}\log\ell(x_i,\bar{\theta})\right]\right)^{-1}\frac{d}{d\theta}\log\ell(x_i,\theta_0)+o_p(1)\]

- We say \(g_\psi(x_i)=-\frac{d\psi}{d\theta}\left(\mathbb{E}\left[\frac{d^2}{d\theta^2}\log\ell(x_i,\bar{\theta})\right]\right)^{-1}\frac{d}{d\theta}\log\ell(x_i,\theta_0)\) is the *influence function* of \(\widehat{\psi}\)
- By the CLT, \(\sqrt{n}(\widehat{\psi}^{MLE}-\psi_0)\overset{d}{\to}N(0,E[g_\psi g_\psi^{\prime}])\)
- By the Cramér-Rao lower bound, this is smallest asymptotic variance of any regular estimator of \(\psi_0\)
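As a quick numeric check of the delta method (a sketch with a hypothetical Bernoulli example, not part of the derivation above), the asymptotic standard error can be compared to the Monte Carlo spread of the plug-in MLE \(\widehat{\psi}=\widehat{\theta}^2\):

```python
import numpy as np

# Monte Carlo check of the delta method: x ~ Bernoulli(theta0), psi = theta^2.
# Delta method: Avar(psi_hat) = (dpsi/dtheta)^2 * theta0*(1 - theta0) / n.
rng = np.random.default_rng(0)
theta0, n, reps = 0.3, 2000, 2000
psi_hats = np.empty(reps)
for r in range(reps):
    x = rng.binomial(1, theta0, size=n)
    psi_hats[r] = x.mean() ** 2          # plug-in MLE of psi

se_delta = 2 * theta0 * np.sqrt(theta0 * (1 - theta0) / n)
se_mc = psi_hats.std()
print(f"delta-method SE {se_delta:.5f} vs Monte Carlo SE {se_mc:.5f}")
```

The two standard errors agree to within simulation noise, as the expansion above predicts.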

Problem: we typically don’t have a correctly specified parametric model of the full data distribution

Partition \(\Theta=(\theta,\beta)\) where \(\beta\) is a *nuisance* parameter, which may be complicated: \(\beta\) may be, e.g., a conditional expectation function, a density, etc., that we don’t want to restrict

Can we avoid depending on \(\beta\) if we just care about \(\theta\)?

If \(\beta\) indexes (conditionally) independent part of model, often yes

- Conditional likelihood: \(\ell(y,x,\theta,\beta)=\ell(y|x,\theta)\ell(x,\beta)\) can just optimize first part

In more general case, can’t avoid some dependence on \(\beta\)

- Simple case: \(\beta\) estimable without knowing \(\theta\)
- Suppose we have an identifying formula \(E[\phi(x_i,\beta)]=\theta\)
- Suppose we have an estimator \(\widehat{\beta}\) of \(\beta\)
- Estimate \(\widehat{\theta}^{\text{plugin}}=\frac{1}{n}\sum_{i=1}^{n}\phi(x_i,\widehat{\beta})\)
- Ex: adjustment formula under conditional ignorability \(\theta=\int E[y|x,z]dP(z)\)
- \(\beta:=E[y|x,z]\) is an unknown function, estimated by \(\widehat{\beta}(x,z)\)

- Suppose \(\Vert \widehat{\beta}-\beta_0\Vert=O_p(g(n))\) for some rate \(g(n)\to 0\), and \(\widehat{\beta}\in\mathcal{B}\) w.p.a. 1
- Let \(\mathcal{F}=\{\phi(.,\beta):\ \beta\in\mathcal{B}\}\) satisfy a ULLN
- Define \(\partial_\beta f(\beta)|_{\beta_0}[\beta-\beta_0]=\frac{\partial}{\partial t}f(\beta_0+t(\beta-\beta_0))\big|_{t=0}\) to be the directional (Gateaux) derivative of a function \(f\)
- Suppose \(\partial_{r}E[\phi(x_i,\beta_0+r(\beta-\beta_0))][\beta-\beta_0]\leq C\Vert\beta-\beta_0\Vert\): the derivative is uniformly bounded
- Then \[|\widehat{\theta}^{\text{plugin}}-\theta_0|\leq \underset{\beta\in\mathcal{B}}{\sup}\left|\frac{1}{n}\sum_{i=1}^{n}\phi(x_i,\beta)-E[\phi(x_i,\beta)]\right|+\left|E[\phi(x_i,\widehat{\beta})]-\theta_0\right|\]
- By the ULLN, the first term is \(o_p(1)\); by Taylor’s theorem with integral remainder, the second is \[\left|E[\phi(x_i,\beta_0)]+\int_0^1\partial_{r}E[\phi(x_i,\beta_0+r(\widehat{\beta}-\beta_0))]dr-\theta_0\right|\]
- Since \(E[\phi(x_i,\beta_0)]=\theta_0\), by the bounded-derivative condition \[|\widehat{\theta}^{\text{plugin}}-\theta_0|\leq C\Vert\widehat{\beta}-\beta_0\Vert+o_p(1)=O_p(g(n))\]
- Result: convergence rate is same as that of \(\widehat{\beta}\)
- If estimate is nonparametric, convergence rate of \(\theta\) is nonparametric
- Even though \(\theta\) is finite dimensional, \(\beta\) may be function
- Typically not possible to get \(\sqrt{n}\) asymptotically normal inference
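A minimal numpy sketch of this plug-in estimator (hypothetical DGP; k-nearest-neighbor regression standing in for an arbitrary nonparametric first stage):

```python
import numpy as np

# Plug-in g-formula sketch: theta = E[mu(1, Z)], with mu(1, z) = E[y | x=1, z]
# estimated by k-NN on the treated subsample (hypothetical DGP).
rng = np.random.default_rng(1)
n, k = 2000, 100
z = rng.uniform(0, 1, n)
x = rng.binomial(1, 0.3 + 0.4 * z)              # treatment, confounded by z
y = 1 + 2 * z + x + 0.5 * rng.standard_normal(n)
theta0 = 3.0                                     # true E[Y^1] = 1 + 2*E[z] + 1

zt, yt = z[x == 1], y[x == 1]                    # treated subsample
dists = np.abs(z[:, None] - zt[None, :])         # each z_i vs treated z's
idx = np.argpartition(dists, k, axis=1)[:, :k]   # k nearest treated neighbors
mu1_hat = yt[idx].mean(axis=1)                   # muhat(1, z_i)

theta_plugin = mu1_hat.mean()
naive = y[x == 1].mean()                         # confounded comparison
print(f"plug-in {theta_plugin:.3f}, naive treated mean {naive:.3f}, truth {theta0}")
```

The plug-in estimate is close to the truth while the naive treated mean is biased upward, but the plug-in inherits the k-NN convergence rate rather than \(\sqrt{n}\).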

- Take half of data: observations \(i=1,\ldots,N_1\) vs \(i=N_1+1,\ldots,n\)
- Estimate \(\widehat{\beta}^{N_1}\) on the first half; define \(\widehat{\theta}^{N_2}=\frac{1}{n-N_1}\sum_{i=N_1+1}^{n}\phi(x_i,\widehat{\beta}^{N_1})\)
- Above proof then goes through with ULLN replaced by standard LLN by independence
- ULLNs for nonparametric function classes a pain to verify, and rule out some estimators

- Lose half of sample size, but easy fix: switch split and estimate \(\widehat{\beta}^{N_2}\) and \(\widehat{\theta}^{N_1}\)
- Define \(\widehat{\theta}^{\text{crossfit}}=\frac{1}{2}(\widehat{\theta}^{N_1}+\widehat{\theta}^{N_2})\)
- Still get same slow rates, bias, etc

- Can estimate c.e.f. \(E[y|x]\) or density \(p(x)\) without functional form restrictions
- Nonparametric methods: use local information around \(x\)
- Minimal assumptions: smoothness or other structure
- Cost: convergence at slower rates than \(\sqrt{n}\)
- Google “scikit-learn tutorial” or cross street to ML department to learn about these
- Typical result \(E[(\widehat{f}(x)-E[y|x])^2]=O_p(n^{-\alpha})\) for some \(\alpha<1\) depending on smoothness, dimension, etc
- Note: \(MSE = Bias(\widehat{f})^2 + Var(\widehat{f})\) means bias not asymptotically negligible, and total error larger than CLT error
- Plug in generally converges at same, slow rate, and does not have estimable asymptotic distribution
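The decomposition in the bullets above can be seen in simulation (a sketch with a hypothetical DGP and bandwidth): the Monte Carlo MSE of a Nadaraya–Watson smoother at a point splits exactly into squared bias plus variance, with the smoothing bias far from negligible.

```python
import numpy as np

# Bias-variance decomposition of a kernel (Nadaraya-Watson) smoother at x0:
# MSE = Bias^2 + Var, and here the bias term dominates the variance.
rng = np.random.default_rng(2)
n, h, x0, reps = 500, 0.2, 0.5, 3000
f = lambda u: np.sin(3 * u)                 # true regression function
fhats = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + 0.3 * rng.standard_normal(n)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)  # Gaussian kernel weights
    fhats[r] = (w * y).sum() / w.sum()      # NW estimate at x0

bias = fhats.mean() - f(x0)
var = fhats.var()
mse = ((fhats - f(x0)) ** 2).mean()
print(f"bias^2 {bias**2:.5f} + var {var:.5f} = MSE {mse:.5f}")
```

Unlike a correctly specified parametric estimator, the squared bias here does not shrink relative to the variance at this bandwidth, which is why plug-in confidence intervals based on the CLT alone would be invalid.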

- For conditional expectations \(E[y|x]\)
- \(x\) is moderate dimensional: random forests, xgboost, BART, factor models, Highly adaptive Lasso
- \(x\) is low dimensional but function strongly nonlinear: MARS, trend filtering, Lasso over wavelets
- \(x\) is image or video data: (convolutional) neural networks
- \(x\) is text data: Recurrent or transformer neural nets, word embeddings, Latent Dirichlet Allocation

- Many people like Lasso on tabular data because it “looks like” OLS and gives you coefficients
- Don’t listen to them: nobody knows how to interpret OLS coefficients and the predictions are terrible

- A small number of people like SVMs and RKHS methods
- These people are theorists who have never touched a real data set in their life

- Many people assume neural networks are best choice on everything
- Don’t listen to them: “vanilla” neural nets often not great on small data, shine on much bigger data sets

- For conditional distributions \(p(y|x)\): \(y\) discrete:
- Any estimator of \(E[y|x]\), plus regularized GLMs
- If goal is *inverse* probability, consider estimating it directly, via regularized GMM (Chernozhukov, Newey, and Singh (2021))
- Benefit is likely better robustness to weak overlap; disadvantage is that you can’t use generic ML

- For conditional densities \(p(y|x)\): \(y\) continuous:
- \(x\) low dimensional: kernel density, series
- \(x\) moderate dimensional: Distributional random forests? This area is pretty open

- Actual advice: try a bunch of models, and create an “ensemble” by taking linear combination of them
- Fit ensemble weights by cross validation: “super-learner”
- Include diverse models, including simple parametric models, in list
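A minimal sketch of the ensemble idea (hypothetical DGP; two toy candidates and a grid over convex weights stand in for a full super-learner library):

```python
import numpy as np

# "Super-learner" sketch: out-of-fold predictions from two candidate models,
# then choose the convex combination minimizing cross-validated MSE.
rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x + 0.5 * rng.standard_normal(n)

folds = np.arange(n) % 5
pred_mean = np.empty(n)   # candidate 1: constant (sample mean)
pred_lin = np.empty(n)    # candidate 2: simple linear regression
for f in range(5):
    tr, te = folds != f, folds == f
    pred_mean[te] = y[tr].mean()
    b1, b0 = np.polyfit(x[tr], y[tr], 1)
    pred_lin[te] = b0 + b1 * x[te]

grid = np.linspace(0, 1, 101)  # weight on the linear model
cv_mse = [((y - (w * pred_lin + (1 - w) * pred_mean)) ** 2).mean() for w in grid]
w_star = grid[int(np.argmin(cv_mse))]
print(f"ensemble weight on linear model: {w_star:.2f}")
```

On this linear DGP the cross-validated weight lands on the linear candidate; on less friendly data the weights hedge across the library, which is the point of including diverse models.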

*Neyman orthogonality*: dependence of influence function on \(\beta\) is *locally* close to 0

- May be achievable by defining a new influence function \(\phi\)
- Projects derivative wrt \(\theta\) on orthogonal complement of span of derivative wrt \(\beta\)
- Projection may require estimation of additional nuisance parameter, \(\mu\)

- Letting \(\eta=(\mu,\beta)\), a score function \(\phi(x_i,\theta,\eta)\) is Neyman orthogonal if \(\partial_\eta E[\phi(x_i,\theta_0,\eta_0)][\eta-\eta_0]=0\)
- where \(\partial_\eta\) is the Gateaux (directional) derivative \(\partial_\eta f(\eta)|_{\eta_0}[\eta-\eta_0]=\frac{\partial}{\partial t}f(\eta_0+t(\eta-\eta_0))\big|_{t=0}\)
- Implies that to first order, estimation error in \(\eta\) has no effect on moment
- Condition “debiases” plug-in by reducing dependence on unknown functions
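The definition can be checked by finite differences in a simulated example (hypothetical DGP; true nuisances are used so only the derivative is at issue): perturbing the outcome regression \(\mu(1,z)\) moves the naive plug-in moment at first order but leaves the AIPW moment for \(E[Y^1]\) essentially unchanged.

```python
import numpy as np

# Finite-difference check of Neyman orthogonality (hypothetical DGP).
# Perturb the outcome regression mu(1,z) in direction delta(z) and compare
# derivatives of the plug-in moment vs the AIPW moment at the true nuisances.
rng = np.random.default_rng(4)
n = 400_000
z = rng.uniform(0, 1, n)
pi = 0.3 + 0.4 * z                      # true propensity P(X=1|z)
x = rng.binomial(1, pi)
y = 1 + 2 * z + x + rng.standard_normal(n)
mu1 = 2 + 2 * z                         # true mu(1,z) = E[y | x=1, z]
delta = np.cos(2 * z)                   # arbitrary perturbation direction

def plugin_moment(t):                   # E_n[mu + t*delta]
    return (mu1 + t * delta).mean()

def aipw_moment(t):                     # E_n[m + x*(y - m)/pi], m = mu + t*delta
    m = mu1 + t * delta
    return (m + x * (y - m) / pi).mean()

t = 0.1
d_plugin = (plugin_moment(t) - plugin_moment(-t)) / (2 * t)
d_aipw = (aipw_moment(t) - aipw_moment(-t)) / (2 * t)
print(f"plug-in derivative {d_plugin:.4f}, AIPW derivative {d_aipw:.4f}")
```

The correction term \(x(y-m)/\pi\) cancels the first-order effect of the perturbation in expectation, which is exactly the "debiasing" described above.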

- Finding a Neyman orthogonal score can be challenging, and we usually won’t be deriving it ourselves, so we’ll come back to how to find one after first describing what we can do with it
- See Chernozhukov et al. (2018), Hines et al. (2021), Chernozhukov et al. (2017) for elaboration and examples

- Focus on case where functional has “plug in” form and for which estimators of \(\eta\) exist.
- Avoids some ugly algebra and some even uglier technicalities
- Upshot is, either estimate nuisances at each fixed \(\theta\) value (Chernozhukov et al. (2018), Belloni et al. (2017)), or further modify influence function to avoid this (Kallus, Mao, and Uehara (2020)).

- Let \(\phi(x_i,\theta,\eta)=q(x_i,\eta)- \theta\) be an influence function for our target functional \(\theta\) with nuisance parameters \(\eta\) including \(\beta\).
- Split sample into \(I_1\) and \(I_2\). On \(I_1\), estimate \(\hat{\eta}^{(1)}\) using some method (maybe machine learning based). On \(I_2\), estimate \(\hat{\theta}^{(2)}=E_n q(x_i,\widehat{\eta}^{(1)})\)
- Switch \(I_1\) and \(I_2\) and repeat previous step, and average \(\hat{\theta}^{DML}=\frac{1}{2}\hat{\theta}^{(1)}+\frac{1}{2}\hat{\theta}^{(2)}\)
- Alternately, pool both and set \(\hat{\theta}\) to solve sample moment condition
- Variants: split \(>2\) times, take median rather than average, etc.
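A numpy-only sketch of this recipe (hypothetical DGP with a binary confounder, so the nuisance-estimation step reduces to cell means computed on the other fold):

```python
import numpy as np

# Two-fold cross-fit AIPW sketch: binary confounder z, so the nuisances
# mu(x, z) and pi(z) are just cell means estimated on the other fold.
rng = np.random.default_rng(5)
n = 20_000
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.3 + 0.4 * z)            # propensity depends on z
y = z + x + rng.standard_normal(n)            # true ATE = 1

def nuisances(z_tr, x_tr, y_tr):
    mu = {(xv, zv): y_tr[(x_tr == xv) & (z_tr == zv)].mean()
          for xv in (0, 1) for zv in (0, 1)}
    pi = {zv: x_tr[z_tr == zv].mean() for zv in (0, 1)}
    return mu, pi

def aipw_ate(z_ev, x_ev, y_ev, mu, pi):
    m1 = np.array([mu[(1, zi)] for zi in z_ev])
    m0 = np.array([mu[(0, zi)] for zi in z_ev])
    p = np.array([pi[zi] for zi in z_ev])
    phi = m1 - m0 + x_ev * (y_ev - m1) / p - (1 - x_ev) * (y_ev - m0) / (1 - p)
    return phi.mean()

half = n // 2
idx1, idx2 = np.arange(half), np.arange(half, n)
ests = []
for tr, ev in [(idx1, idx2), (idx2, idx1)]:   # fit on tr, average score on ev
    mu, pi = nuisances(z[tr], x[tr], y[tr])
    ests.append(aipw_ate(z[ev], x[ev], y[ev], mu, pi))
theta_dml = float(np.mean(ests))
print(f"cross-fit AIPW ATE estimate: {theta_dml:.3f} (truth 1)")
```

In practice the cell means would be replaced by a machine learning regression and classifier fit on the training fold; the split/swap/average structure is unchanged.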

**Theorem** (Chernozhukov et al. (2018), mostly): If we have that

1. \(Eq(x_i,\eta_0)=\theta_0\) (moment condition)
2. \(\partial_{\eta}Eq(x_i,\eta_0)[\eta-\eta_0]=0\) (our influence function is “Neyman orthogonal”)
3. \(\int_0^1\partial^2_{r}E[\phi(x_i,\theta,\eta_0+r(\widehat{\eta}-\eta_0))]dr\leq C\,E[\Vert\widehat{\eta}-\eta_0\Vert^2]\) (second order remainder)
4. \(E[\Vert\widehat{\eta}-\eta_0\Vert^2]=o_p(n^{-1/2})\) (MSE rate condition)
5. \(\phi(x_i,\theta,\eta^{\prime})-\phi(x_i,\theta,\eta)\leq L(x_i)\Vert\eta^{\prime}-\eta\Vert\) (moment is \(L\)-Lipschitz w.r.t. \(\eta\))

- Then \(\sqrt{n}(\widehat{\theta}-\theta_0)\overset{d}{\to}N(0,Var(\phi(x_i,\theta_0,\eta_0)))\)

- Sample splitting reduces overfitting by ensuring that the noise in \(\hat{\eta}\) is independent of the \(\hat{\theta}\) estimation step
- The moment is evaluated at a fixed value of \(\hat{\eta}\), rather than fitting both jointly on the same data

- Second order remainder satisfied exactly if moment is quadratic form, as in some common examples, usually reasonable otherwise
- Lipschitz condition ensures that convergence of nuisance ensures similarly rapid convergence of moment
- Neyman orthogonality requires choice of moment to identify correct \(\theta_0\) but avoid influence of \(\eta\)
- Rate condition: satisfied if nuisance functions \(\widehat{\eta}\) converge fast enough
- Needs the true function to be simple enough to admit a fast estimator
- Also needs an estimator achieving those rates: usual nonparametric methods still too slow

- Depending on form of remainder, rate may involve combination of rates for different subcomponents of \(\eta\)
- E.g. AIPW “double robustness”: the remainder is bounded by the product of the \(\mu\) and \(\pi\) convergence rates

- Moment form can be relaxed to \(E[\psi_a(x,y,z,\eta)\theta_0+\psi_b(x,y,z,\eta)]=0\)
- \(\theta_0=-\frac{E[\psi_b(x,y,z,\eta)]}{E[\psi_a(x,y,z,\eta)]}\)
- Still allows estimating \(\eta\) without first knowing \(\theta\), so can treat ML part as black box

- \(\sqrt{n}(\widehat{\theta}-\theta_0)=\sqrt{n}E_n\phi(x_i,\theta_0,\widehat{\eta})\)
- \(=\sqrt{n}E_n\phi(x_i,\theta_0,\eta_0)+\)
- \(\left(\sqrt{n}E_n[\phi(x_i,\theta_0,\widehat{\eta})-\phi(x_i,\theta_0,\eta_0)]-\sqrt{n}E[\phi(x_i,\theta_0,\widehat{\eta})-\phi(x_i,\theta_0,\eta_0)]\right)+\)
- \(\sqrt{n}E\phi(x_i,\theta_0,\widehat{\eta})\)

- \(=(a)+(b)+(c)\)
- Call these terms the *influence function* term, the *empirical process* term, and the *remainder* term
- \((a)\overset{d}{\to}N(0,Var(\phi(x_i,\theta_0,\eta_0)))\) by the CLT and condition (1)
- This will be dominant term: limit as if we knew \(\eta_0\)

- \((c)= \sqrt{n}E\phi(x_i,\theta_0,\eta_0)+\sqrt{n}\partial_{\eta}E\phi(x_i,\theta_0,\eta_0)[\widehat{\eta}-\eta_0]+\sqrt{n}\frac{1}{2}\int_0^1\partial^2_{r}E[\phi(x_i,\theta_0,\eta_0+r(\widehat{\eta}-\eta_0))]dr\)
- \(=0+0+\sqrt{n}\frac{1}{2}\int_0^1\partial^2_{r}E[\phi(x_i,\theta_0,\eta_0+r(\widehat{\eta}-\eta_0))]dr\)
- First term by moment definition (1), second by Neyman orthogonality (2)
- \(\leq \sqrt{n}*o_p(n^{-1/2})=o_p(1)\)
- By remainder condition (3) and rate condition (4)

- Term (b) \(=\sqrt{n}(E_n-E)[\phi(x_i,\theta_0,\widehat{\eta})-\phi(x_i,\theta_0,\eta_0)]\) is where sample splitting is used
- Without sample splitting, converges to 0 iff \(\mathcal{F}:=\{\phi(x_i,\theta_0,\eta)-\phi(x_i,\theta_0,\eta_0):\ \eta\in B_{\epsilon}(\eta_0)\}\) is *Donsker*
- See Van Der Vaart and Wellner (1996) for the gory details
- Rough summary: holds if function class is “not too complex.” Category includes most models with finite number of parameters, nonparametric models of sufficient smoothness (>d/2 derivatives), well-behaved compositions.
- May not hold if function class is sample size dependent: doesn’t play well with adaptive estimators.

- With sample splitting, just need rate results
- Prove for \(\widehat{\theta}^{(2)}\), \(n=n_2\), rest follows by symmetry

**Lemma** (Kennedy, Balakrishnan, and G’Sell (2020), Lemma 2): writing \(\Delta_i:=\phi(x_i,\theta_0,\widehat{\eta})-\phi(x_i,\theta_0,\eta_0)\), we have \((b)=\sqrt{n}(E_n-E)[\Delta_i]=O_P(E[\Delta_i^2]^{1/2})\)

**Proof**:

- \(E[E_n[\Delta_i]|\mathcal{I}_1]=E[\Delta_i|\mathcal{I}_1]\) since \(\widehat{\eta}\) is fixed given \(\mathcal{I}_1\) and the folds are independent, so \(E[(b)|\mathcal{I}_1]=0\)
- \(Var[(E_n-E)[\Delta_i]|\mathcal{I}_1]=Var[E_n[\Delta_i]|\mathcal{I}_1]=\frac{1}{n}Var[\Delta_i|\mathcal{I}_1]\leq\frac{1}{n}E[\Delta_i^2|\mathcal{I}_1]\)
- By Chebyshev, \(\forall t>0\), \(P\left(\frac{|(E_n-E)[\Delta_i]|}{\frac{1}{\sqrt{n}}E[\Delta_i^2|\mathcal{I}_1]^{1/2}}\geq t\ \middle|\ \mathcal{I}_1\right)\leq\frac{1}{t^2}\); taking expectations over \(\mathcal{I}_1\) gives the same unconditional bound
- For any \(\epsilon>0\), set \(t=\frac{1}{\sqrt{\epsilon}}\) to bound the probability by \(\epsilon\)

**Corollary**: By the Lipschitz condition and the rate result, \((b)=o_p(1)\)

- How do we get moments that satisfy above conditions?
- Typically occurs when \(\phi(x_i,\theta,\eta)\) is the *efficient influence function* of the functional \(\theta\)
- Recall that for parametric models with well-behaved likelihoods, the smallest possible asymptotic variance is the *Cramér-Rao* lower bound
- Roughly, if \(\sqrt{n}(\hat{\psi}-\psi)\overset{d}{\to}N(0,V)\), then \(V\geq E[g_\psi g_\psi^{\prime}]=\frac{d\psi}{d\theta^\prime} E[\frac{d}{d\theta}\log\ell(x_i,\theta_0)\frac{d}{d\theta^\prime}\log\ell(x_i,\theta_0)]^{-1}\frac{d\psi}{d\theta}\)
- Under some regularity conditions on \(\hat{\psi}\)

- For semiparametric models, with nuisance parameter \(\beta\in\mathcal{B}\), if \(\{\beta(t):\ t\in T\}\subset\mathcal{B}\) is a regular parametric submodel, Cramér-Rao lower bound should still hold
- Thus, the supremum over all parametric submodels of Cramér-Rao bounds should form a lower bound on the worst case variance of the semiparametric model
- A function \(\phi(x_i,\eta)\) is an *efficient influence function* for \(\psi\) if \(E[\phi(x_i,\eta_0)^2]\) equals this worst case lower bound
- In general, properties of such functions are beyond the scope of this class: see Van der Vaart (2000)
- In parametric case, it is just influence function of MLE
- Nonparametric case: \(\mathcal{P}\) is the space of all possible distributions \(P\) of data and \(\psi=\Psi(P_0)\)
- When it exists, the influence function is \(\frac{d\Psi((1-r)P_0+r\delta_x)}{dr}\Big|_{r=0}\), the Gateaux derivative of the functional with respect to a point mass at a data point \(x\)
- Ichimura and Newey (2021), Hines et al. (2021) discuss how to perform these calculations and use to derive orthogonal moment conditions: generally nontrivial calculation

- Case with both infinite and finite components generally requires even more delicate methods (eg Chen and Santos (2018))
- Today: just go over a few that others have derived, and point you to the literature

- AIPW formula for \(E[Y^x]\) by adjustment: \(\eta=(\mu,\pi)\), \(\phi()=\mu(x,Z_i)+\frac{(Y_i-\mu(x,Z_i))1\{X_i=x\}}{\pi(x|Z_i)}\)
- Shown to be Neyman orthogonal in adjustment lecture

- Functionals of regression \(\beta=E[Y|X,Z]\), \(\theta=E[m(Y,X,Z,\beta)]\) for some function \(m\) linear in \(\beta\)
- \(\eta=(\beta,\mu)\) for \(\mu\) s.t. \(E[m(Y,X,Z,s(X,Z))-\mu_0(X,Z)s(X,Z)]=0\) for any \(s(X,Z)\)
- Orthogonal moment is \(E[m(Y,X,Z,\beta)+\mu(X,Z)(Y-\beta(X,Z))]-\theta=0\)
- Chernozhukov, Newey, and Singh (2021) suggest (regularized) GMM over set of test functions \(\{s_j()\}_{j=1}^J\) to estimate \(\mu\), ML regression to estimate \(\beta\)
- Special cases of \(m\): \(m=\beta(x,Z)\) returns the ATE; \(m=w(X)\frac{d}{dX}\beta(X,Z)\) gives a weighted average derivative
- In ATE case, above is just AIPW with sieve GMM for the inverse propensity

- Farrell, Liang, and Misra (2021) give more general formula for initial nuisance function defined by M-estimator
- Let \(\beta_0(z):=\underset{\beta\in\mathcal{B}}{\arg\min}E[\ell(Y,X,\beta(Z))]\) for a loss function \(\ell\),
- Let \(\psi_0=E[H(Z,\beta_0(Z),x^*)]\) be target functional for known \(H\)
- \(\mu(z)=E[\ell_{\beta\beta}(y,x,\beta(z))|Z=z]\) is an additional function to estimate
- Then \(H(z,\beta_0(z),x^*)-H_\beta(z,\beta_0(z),x^*)\mu(z)^{-1}\ell_\beta(y,x,\beta(z))-\psi_0\) is a Neyman-orthogonal score
- Show cases for logit, Tobit, etc and suggest neural network first stage

- “Sufficient statistics approach” (Chetty (2009)) uses the envelope theorem to note that policy effects near an optimum have derivative 0 w.r.t. structural parameters
- Optimal policy estimates are effectively Neyman orthogonal semiparametric estimates
- Debiasing approach may obtain similar robustness to nuisance features in more general economic settings

- \(y = \theta x + \beta(z) + \epsilon\)
- \(E[\epsilon|x,z]=0\), \(\theta\in\mathbb{R}\), \(\beta(z)\) unrestricted function
- \(E[y|z] = \theta E[x|z] + \beta(z)\)
- \(y-E[y|z] = \theta(x - E[x|z]) +\epsilon\)

- \(\eta = (\ell(z),m(z)):=(E[y|z],E[x|z])\) are *nuisance functions*: can learn without knowing \(\theta\)
- Score function is \(\psi(y,x,z,\eta,\theta)=(y-\ell(z)-\theta(x-m(z)))(x-m(z))\)
- Perturbing \(\ell\) in a direction \(\delta_\ell(z)\) gives \(\partial_\ell E[\psi]=-E[\delta_\ell(z)(x-m_0(z))]=0\); perturbing \(m\) in a direction \(\delta_m(z)\) gives \(\partial_m E[\psi]=\theta E[\delta_m(z)(x-m_0(z))]-E[\delta_m(z)(y-\ell_0(z)-\theta(x-m_0(z)))]=0\), both by iterated expectations, so the score is Neyman orthogonal
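This orthogonality can be confirmed numerically by finite differences (hypothetical DGP; the perturbation directions are arbitrary functions of \(z\)):

```python
import numpy as np

# Finite-difference check (hypothetical DGP) that the partialling-out score
# psi = (y - l(z) - theta*(x - m(z))) * (x - m(z)) has zero Gateaux
# derivative at the true nuisances l0(z) = E[y|z], m0(z) = E[x|z].
rng = np.random.default_rng(6)
n = 400_000
theta0 = 2.0
z = rng.standard_normal(n)
x = 0.5 * z + rng.standard_normal(n)          # m0(z) = 0.5 z
beta = np.sin(z)                              # nonparametric part beta(z)
y = theta0 * x + beta + rng.standard_normal(n)
l0 = theta0 * 0.5 * z + beta                  # l0(z) = E[y|z]
m0 = 0.5 * z
dl, dm = np.cos(z), z ** 2                    # arbitrary perturbation directions

def score_mean(t):
    l, m = l0 + t * dl, m0 + t * dm
    return ((y - l - theta0 * (x - m)) * (x - m)).mean()

t = 0.1
deriv = (score_mean(t) - score_mean(-t)) / (2 * t)
print(f"directional derivative at truth: {deriv:.4f}")
```

The derivative is zero up to sampling noise, so first-order errors in \(\widehat{\ell}\) and \(\widehat{m}\) do not propagate into the moment.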

- \(\theta= \frac{E[(y-E[y|z])(x-E[x|z])]}{E[(x-E[x|z])^2]}\)
- Interpretation: regress \(y\) and \(x\) on \(z\), then regress the \(y\) residuals on the \(x\) residuals
- If \((\ell(z),m(z))\) are linear functions of \(z\), this is exactly the Frisch-Waugh-Lovell decomposition
- \(\widehat{\theta}^{OLS}=((M_zX)^{\prime}(M_zX))^{-1}(M_zX)^{\prime}M_zY\)
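The Frisch-Waugh-Lovell equivalence is easy to verify numerically (simulated data; names hypothetical):

```python
import numpy as np

# FWL check: the OLS coefficient on x from the full regression equals the
# slope from regressing y-residuals on x-residuals after partialling out z.
rng = np.random.default_rng(7)
n = 500
z = rng.standard_normal((n, 3))
x = z @ np.array([1.0, -0.5, 0.3]) + rng.standard_normal(n)
y = 2.0 * x + z @ np.array([0.7, 0.2, -1.0]) + rng.standard_normal(n)

# Full OLS: y on [x, z, intercept]
X_full = np.column_stack([x, z, np.ones(n)])
coef_full = np.linalg.lstsq(X_full, y, rcond=None)[0][0]

# Partialling out: residualize y and x on [z, intercept], regress residuals
Zc = np.column_stack([z, np.ones(n)])
ry = y - Zc @ np.linalg.lstsq(Zc, y, rcond=None)[0]
rx = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]
coef_fwl = (rx @ ry) / (rx @ rx)
print(f"full OLS {coef_full:.6f} vs FWL {coef_fwl:.6f}")
```

The two coefficients agree to machine precision; double ML replaces the linear projections on \(z\) with cross-fit machine learning predictions.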

- Visualization: plot \(y_i-\widehat{\ell}(z_i)\) against \(x_i-\widehat{m}(z_i)\)
- Allows evaluation of linearity of \(x\) effect, visually displays uncertainty

- Implementation in `DoubleML` in R/Python
- Nuisances by any machine learning algorithm in `mlr3` / `scikit-learn`
- Lasso version in `hdm`; kernel version (no crossfitting) in `npplreg` in `np`

- Simulated data
- \(d_i = m_0(x_i) + s_1 v_i\), \(y_i = \alpha d_i + g_0(x_i) + s_2 \zeta_i\)
- \(m_0(x_i) = a_0 x_{i,1} + a_1 \frac{\exp(x_{i,3})}{1+\exp(x_{i,3})}\), \(g_0(x_i) = b_0 \frac{\exp(x_{i,1})}{1+\exp(x_{i,1})} + b_1 x_{i,3}\)
- \(a_0=1, a_1=0.25, s_1=1, b_0=1, b_1=0.25, s_2=1\)

```
library(DoubleML) #Cross-fitting
library(mlr3) #ML algorithms and fitting framework
library(mlr3learners) #Extra ML algorithms
library(data.table) #data format this needs for some reason
library(ggplot2) #Plotting software
lgr::get_logger("mlr3")$set_threshold("warn")
set.seed(1111)
```

```
#Choose Random Forest for nuisance function estimates
learner = lrn("regr.ranger", num.trees = 100, mtry = 20, min.node.size = 2, max.depth = 5)
ml_g = learner$clone() #Use random forest for prediction of outcome
ml_m = learner$clone() #use random forest for prediction of treatment
#Simulate data from partially linear model with effect 0.5, 20 nuisance predictors
data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=500, dim_x=20, return_type='data.table')
obj_dml_data = DoubleMLData$new(data, y_col="y", d_cols="d")
#Partially linear regression with cross-fitting
dml_plr_obj = DoubleMLPLR$new(obj_dml_data, ml_g, ml_m)
dml_plr_obj$fit(store_predictions=TRUE)
print(dml_plr_obj)
```


```
## ================= DoubleMLPLR Object ==================
##
##
## ------------------ Data summary ------------------
## Outcome variable: y
## Treatment variable(s): d
## Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20
## Instrument(s):
## No. Observations: 500
##
## ------------------ Score & algorithm ------------------
## Score function: partialling out
## DML algorithm: dml2
##
## ------------------ Machine learner ------------------
## ml_g: regr.ranger
## ml_m: regr.ranger
##
## ------------------ Resampling ------------------
## No. folds: 5
## No. repeated sample splits: 1
## Apply cross-fitting: TRUE
##
## ------------------ Fit summary ------------------
## Estimates and significance testing of the effect of target variables
## Estimate. Std. Error t value Pr(>|t|)
## d 0.48562 0.04127 11.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
Vhat<-data$d-dml_plr_obj$predictions$ml_m #Treatment residual: d - mhat(z)
Zetahat<-data$y-dml_plr_obj$predictions$ml_g #Outcome residual: y - lhat(z)
residuals<-data.frame(Vhat,Zetahat)
ggplot(data=residuals,aes(x=Vhat,y=Zetahat))+
geom_point(alpha=0.5)+
geom_smooth(method="lm",formula="y~x")+
ggtitle("Y Residuals vs X residuals",subtitle = "Line fit by OLS on full sample")
```

- Case where main effect nonparametric but still additive is a *Generalized Additive Model*: \(y=f(x)+\sum_j g_j(z_j)+u\)
- `mgcv`, `gam` use splines for each nonparametric component: see https://noamross.github.io/gams-in-r-course/
- Gain flexibility, visualization of components, but still restricts functional form
- Lose \(\sqrt{n}\)-asymptotic normality

- Full generality requires interactions: \(y=g(x,z)+\epsilon\)
- For discrete \(x\), use AIPW (`DoubleMLIRM` (for “Interactive Regression Model”) with `score="ATE"` in `DoubleML`)
- For interactions and continuous \(x\), plugin possible but rates suffer curse of dimensionality
- Even worse, standard nonparametric and ML methods don’t let you single out \(x\)
- Result: your smoother or regularizer might create large bias in the estimated effect of \(x\)

- One possibility: pick out low dimensional summary of \(x\) effect
- e.g. Weighted average: \(\int\int w(x)f(x,z)dxdz\), weighted average derivative: \(\int\int w(x)\partial_xf(x,z)dxdz\)
- Semiparametric estimator with robust moments can remain consistent

- Projection onto a parametric model: take the estimand to be the population minimizer of the criterion function that \(\hat{\theta}\) minimizes in sample
- Derive efficient influence function for minimizer under nonparametric model (eg via Gateaux derivative method)
- Allows arbitrary misspecification, interpretation as best (partially) linear predictor

- In between: find semiparametric estimate of average kernel-weighted function \(\int\int K(\frac{x-u}{h})f(x,z)dxdz\)
- Then take bandwidth to 0 to get semiparametric kernel estimator
- Approach taken in Colangelo and Lee (2020) and Kennedy et al. (2017)
- Lose \(\sqrt{n}-\) asymptotic normality, but consistent and faster than plugin

```
library(dplyr)
xvars<-dplyr::select(data,starts_with("X"))
```

- Applying the Kennedy et al. (2017) kernel-based estimator to data from the above experiment, we find a noisier but still reasonable estimate of the average potential outcome

```
library(npcausal)
kernreg<-ctseff(data$y,data$d,xvars,bw.seq = seq(.5, 2, length.out = 10),
sl.lib=c("SL.mean","SL.ranger"))
plot.ctseff(kernreg)+title("Kernel Estimate of Average Potential Outcome ")
```


- Papers using debiased influence function + sample splitting approach now come out at too fast a pace to follow
- Take your favorite model with unknown function as a component
- Replace that by nonparametric function
- Derive efficient influence function for low-dimensional target

- Still no fully automated way to do this, which is why every application gets a new paper
- Sieve GMM approach of Ai and Chen (2003) (and subsequent works by Chen et al) probably most general framework, if you can express your model as conditional GMM
- Original version requires joint GMM estimation of nuisance functions and parameters, so imposes specific (basis-function-based) nonparametric estimators
- Follow-ups allow multi-step estimates, with other estimators for nuisance functions
- Verifying regularity conditions and deriving variance formulas may be nearly as challenging as deriving influence functions from scratch…

- Example double ML applications
- IV, DiD, missing data, mediation, selection, (dynamic) discrete choice, quantile regression, density estimation, survival models, (policy and value functions in) reinforcement learning, online estimation, (almost?) any effect in NPSEM-IE models, etc
- If model widely used, somebody has done it or is working on it

- Semiparametric methods allow precise estimation and inference for low-dimensional objects with minimal assumptions on shape of high dimensional objects
- In causal inference, usually summary of treatment, with nuisance given by shape of other relationships
- Parametric estimators are generally inconsistent if incorrectly specified
- Plug-in nonparametric estimators pass on slow rates and bias to low-dimensional summaries
- Use of orthogonal moments reduces sensitivity to estimation error
- Usually obtain \(\sqrt{n}-\) asymptotic normality with \(o_p(n^{-1/4})\) rates for nuisances

- Wide variety of machine learning estimators can estimate conditional means, etc at these rates

- Sample splitting can reduce overfitting and dependence on exact structure of nuisance estimator beyond rate
- Obtain orthogonal moments by calculating influence functions

Ai, Chunrong, and Xiaohong Chen. 2003. “Efficient Estimation of Models with Conditional Moment Restrictions Containing Unknown Functions.” *Econometrica* 71 (6): 1795–1843.

Belloni, Alexandre, Victor Chernozhukov, Iván Fernández-Val, and Christian Hansen. 2017. “Program Evaluation and Causal Inference with High-Dimensional Data.” *Econometrica* 85 (1): 233–98.

Chen, Xiaohong, and Andres Santos. 2018. “Overidentification in Regular Models.” *Econometrica* 86 (5): 1771–1817.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. 2017. “Double/Debiased/Neyman Machine Learning of Treatment Effects.” *American Economic Review* 107 (5): 261–65. https://doi.org/10.1257/aer.p20171038.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/debiased machine learning for treatment and structural parameters.” *The Econometrics Journal* 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.

Chernozhukov, Victor, Whitney K Newey, and Rahul Singh. 2021. “Automatic Debiased Machine Learning of Causal and Structural Effects.” http://arxiv.org/abs/1809.05224.

Chetty, Raj. 2009. “Sufficient Statistics for Welfare Analysis: A Bridge Between Structural and Reduced-Form Methods.” *Annu. Rev. Econ.* 1 (1): 451–88.

Colangelo, Kyle, and Ying-Ying Lee. 2020. “Double Debiased Machine Learning Nonparametric Inference with Continuous Treatments.” *arXiv Preprint arXiv:2004.03036*.

Farrell, Max H, Tengyuan Liang, and Sanjog Misra. 2021. “Deep Learning for Individual Heterogeneity.” *arXiv Preprint arXiv:2010.14694*.

Hines, Oliver, Oliver Dukes, Karla Diaz-Ordaz, and Stijn Vansteelandt. 2021. “Demystifying Statistical Learning Based on Efficient Influence Functions.” *arXiv Preprint arXiv:2107.00681*.

Ichimura, Hidehiko, and Whitney Newey. 2021. “The Influence Function of Semiparametric Estimators.” *Quantitative Economics*.

Kallus, Nathan, Xiaojie Mao, and Masatoshi Uehara. 2020. “Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond.” http://arxiv.org/abs/1912.12945.

Kennedy, Edward H, Sivaraman Balakrishnan, and Max G’Sell. 2020. “Sharp Instruments for Classifying Compliers and Generalizing Causal Effects.” *The Annals of Statistics* 48 (4): 2008–30.

Kennedy, Edward H, Zongming Ma, Matthew D McHugh, and Dylan S Small. 2017. “Non-Parametric Methods for Doubly Robust Estimation of Continuous Treatment Effects.” *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 79 (4): 1229–45.

Van der Vaart, Aad W. 2000. *Asymptotic Statistics*. Vol. 3. Cambridge university press.

Van Der Vaart, Aad W, and Jon Wellner. 1996. *Weak Convergence and Empirical Processes: With Applications to Statistics*. Springer Science & Business Media.