David Childers on David Childers
/
Recent content in David Childers on David Childers
Hugo -- gohugo.io
en-us
© 2018
Sun, 15 Oct 2017 00:00:00 -0400

Top Papers 2020
/post/toppapers2020/
Wed, 30 Dec 2020 00:00:00 +0000
/post/toppapers2020/
<p>The following is a look back at my reading for 2020, identifying a totally subjective set of the top 10 papers I read this year. My reading patterns, as usual, have not been so systematic, so if your brilliant work is missing it either slipped past my attention or is living in an ever-expanding set of folders and browser tabs on my to-read list. I’ll exclude papers I refereed, for privacy purposes (a fair number if you include conferences, which also cuts out a lot of the macroeconomics from my list). Themes I focused on were Bayesian computation, the optimal policy estimation/dynamic treatment regime/offline reinforcement learning space, and survival/point process models, all more-or-less project-related and in all of which I’m sure I’m missing some foundational understanding. I spent a brief time in March mostly reading about basic epidemiology, which I am led to believe many others did as well, but didn’t take it anywhere.</p>
<p>Papers, in alphabetical order</p>
<ul>
<li>Adusumilli, Geiecke, Schilter. <a href="https://arxiv.org/abs/1904.01047">Dynamically optimal treatment allocation using reinforcement learning</a>
<ul>
<li>Approximation methods for estimating viscosity solutions of HJB equations and their resulting optimal policies from data. These methods will form a key step in taking continuous time dynamic macro models (see <a href="https://benjaminmoll.com/lectures/">Moll lecture notes</a>) to data.</li>
</ul></li>
<li>Andrews & Mikusheva <a href="https://scholar.harvard.edu/iandrews/publications/optimaldecisionrulesweakgmm">Optimal Decision Rules for Weak GMM</a>
<ul>
<li>The Generalized Method of Moments defines a semiparametric estimator implicitly, making it often unclear what the form of the nuisance parameter being ignored actually is, especially in cases of irregular identification. This paper takes a middle ground between the fully Bayesian semiparametric approach which puts a (usually Dirichlet Process) prior over the infinite dimensional nuisance space and the regular frequentist approach which ignores it entirely, showing weak convergence to a Gaussian Process, which is tractable enough to characterize and apply to obtain approximate Bayesian tests and decision rules without strong identification.</li>
</ul></li>
<li>Cevid, Michel, Bühlmann, & Meinshausen <a href="https://arxiv.org/abs/2005.14458">Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression</a>
<ul>
<li>Conditional density estimation by random forests with splits by (approximate) kernel MMD distribution tests. Produces a set of conditional weights that can be used to represent and visualize possibly multivariate conditional distributions. An <a href="https://github.com/lorismichel/drf">R package</a> is available and this really quickly became one of my go-to data exploration tools.</li>
<li>See also: Lee and Pospisil have a <a href="https://github.com/tpospisi/rfcde">related method</a> splitting by sieve <span class="math inline">\(L^2\)</span> distance tests which is more or less similar, though more tailored to low dimensional outputs.</li>
</ul></li>
<li>Gelman, Vehtari, Simpson, Margossian, Carpenter, Yao, Kennedy, Gabry, Bürkner, Modrák. <a href="http://arxiv.org/abs/2011.01808">Bayesian Workflow</a>
<ul>
<li>A comprehensive overview of what Bayesian statisticians actually do when analyzing data, as opposed to the mythology in our intro textbooks (roughly, the likelihood is given to you by God, you think real hard and come up with a prior, then you apply Bayes rule and are done). It includes all the bits of sequential model expansion and checking and computational diagnostics and compromise between simplicity, convention, and domain expertise you actually go through to build a Bayesian model from scratch. The contrarian in me would love to see more frequentist analysis of this paradigm. A lot of the checks are there to make sure you’re not fooling yourself; how well do they work in practice?</li>
<li>See also Michael Betancourt’s <a href="https://betanalpha.github.io/writing/">notebooks</a> for worked examples of this process.</li>
</ul></li>
<li>Giannone, Lenza, Primiceri <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1483826">Priors for the Long Run</a>
<ul>
<li>Exact rank constraints for cointegration are often uncertain, making pure VECM modeling a bit fraught, but standard priors on the VAR form are not strongly constraining of long run relationships, and improper treatment of initial conditions can lead to spurious inference on trends. This proposes a simple class of priors which allow “soft” constraints.</li>
</ul></li>
<li>Kallus and Uehara <a href="http://arxiv.org/abs/1908.08526">Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes</a>
<ul>
<li>Characterizes the semiparametric efficiency bound for the value of a dynamic policy and provides a doubly robust estimator combining the appropriate variants of a regression statistic and a (sequential) probability weighting statistic, allowing use of nonparametric and (with sample splitting) machine learning estimates in reinforcement learning while retaining parametric convergence rates.</li>
<li>See also companion papers on estimating the <a href="https://arxiv.org/abs/2002.04014">policy and policy gradient</a> and extending to the case of <a href="http://arxiv.org/abs/2006.03900">deterministic policies</a> (which require smoothing) among others, or watch <a href="https://www.youtube.com/watch?v=n5ZoxT_WmHo">the talk</a> for an overview.</li>
</ul></li>
<li>Sawhney & Crane <a href="https://dl.acm.org/doi/abs/10.1145/3386569.3392374">Monte Carlo Geometry Processing: A Grid-Free Approach to PDE-Based Methods on Volumetric Domains</a>
<ul>
<li>I don’t usually read papers in computer graphics, but I do care a lot about computing <a href="https://donskerclass.github.io/post/whylaplacians/">Laplacians</a> and this paper offers a clever new Monte Carlo based method that allows computation on much more complicated domains. It’s not yet obvious to me whether the method generalizes to the PDE classes I and other macroeconomists <a href="https://benjaminmoll.com/wpcontent/uploads/2019/07/PDE_macro_translated.pdf">tend to work with</a>, but even if not it should still be handy for many applications.</li>
</ul></li>
<li>Schmelzing <a href="https://www.bankofengland.co.uk/workingpaper/2020/eightcenturiesofglobalrealinterestratesrgandthesupraseculardecline13112018">Eight Centuries of Global Real Interest Rates, R-G, and the ‘Suprasecular’ Decline, 1311–2018</a>
<ul>
<li>An enormous data collection process and public good which will be informing research on interest rates for years to come. As with any such effort at turning messy historical data into aggregate series, many contestable choices go into data selection, standardization, and normalization, and I don’t think the author’s simple trend estimates of a several hundred year decline will be the last word on the statistical properties or future implications of this data, but now that it’s out there we have a basis for discussion and testing.</li>
<li>See also: lots of useful historical macro data collection (going not quite so far back) by the folks at the Bonn <a href="http://www.macrohistory.net/">Macrohistory Lab</a>.</li>
</ul></li>
<li>Wolf <a href="https://www.aeaweb.org/articles?id=10.1257/mac.20180328">SVAR (Mis)Identification and the Real Effects of Monetary Policy</a>
<ul>
<li>A nice practical application of Bayesian model checking, applying SVAR methods to simulated macro data when the (usually a bit suspect) identifying restrictions need not hold exactly. It finds that early sign-restricted BVARs with uniform (Haar) priors tend to be biased toward finding monetary neutrality, and do not in fact provide noteworthy evidence contradicting the implied shock responses of typical central bank monetary DSGEs. Of course, such models have many other problems and not being contradicted by one test is not dispositive, but macro debates would be elevated if people would check to make sure that their contradictory evidence is in fact contradictory (respectively, supportive).</li>
</ul></li>
<li>Wang and Blei <a href="http://arxiv.org/abs/1905.10859">Variational Bayes under Model Misspecification</a>
<ul>
<li>Describes what (mean field) variational Bayes ends up targeting, at least in cases where a Bernstein-von Mises approximation works well enough. Also covers the much less trivial case with latent variables.</li>
<li>I will judiciously refrain from comment on other recent works by this pair (discussion <a href="https://casualinfer.libsyn.com/fairnessinmachinelearningwithsherriroseepisode03">1</a>, <a href="https://casualinfer.libsyn.com/episode15drbetsyogburn">2</a>) except to say that dimensionality reduction in causal inference deserves more study and this <a href="https://drive.google.com/file/d/1aN1cK_UEffkBT34a2aNtrZKIJFw_xibX/view?usp=sharing">manifold learning approach</a> to create a nonparametric version of interactive fixed effects estimation looks like a useful supplement to the standard panel data toolbox.</li>
</ul></li>
</ul>

Some issues with Bayesian epistemology
/post/someissueswithbayesianepistemology/
Sat, 05 Sep 2020 00:00:00 +0000
/post/someissueswithbayesianepistemology/
<p>In this post, I’d like to lay out a few questions and concerns I have about Bayesianism and Bayesian decision theory as a <em>normative</em> theory of inductive inference. As a positive theory, of what people do, psychology is full of demonstrations of cases where people do not use Bayesian reasoning (the entire “heuristics and biases” area), which is interesting but not my target. There are no new ideas here, just a summary of some old concerns which merit more consideration, and not even necessarily the most important ones, which are better covered elsewhere.</p>
<p>My main concerns are, effectively, computational. As I understand computer science, the processing of information <em>requires real resources</em>, (mostly time, but also energy, space, etc) and so any theory of reasoning which <em>mandates</em> isomorphism between statements for which computation is required to demonstrate equivalence is effectively ignoring real costs that are unavoidable and so must have some impact on decisions. Further, as I understand it, there is no way to get around this by simply adding this cost as a component of the decision problem.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> The problem here is that determination of this cost and reasoning over it is also computationally nontrivial, and so the costs of this determination must be taken into account. But determining these is also costly, ad infinitum. It may be the case that there is some way around this infinite regress problem via means of some kind of fixed point argument, though it is not clear that the limit of such an argument would retain the properties of Bayesian reasoning.</p>
<p>The question of these processing costs becomes more interesting to the extent that they are quantitatively nontrivial. As somebody who spends hours running and debugging MCMC samplers and does a lot of reading about Bayesian computation, my takeaway from this literature is that the limits are fundamental. In particular, there are classes of distributions such that the Bayesian update step is hard, for a variety of hardness classes. This includes many distributions where the update step is NP-complete, so that our best understanding of P vs NP suggests that the time to perform the update can be exponential in the size of the problem (sampling from spin glass models is an archetypal example, though really any unrestricted distribution over long strings of discrete bits will do). I suppose a kind of trivial example of this is the case with prior mass 1, in which case the hardness reduces to the hardness of the deterministic computation problem, and so encompasses every standard problem in computer science. More than just exponential time (which can mean use of time longer than the length of the known history of the universe for problems of sizes faced practically by human beings every day, like drawing inferences from the state of a high resolution image), some integration problems may even be uncomputable in the Turing sense, and so not just wildly impractical but impossible to implement on any physical substrate (at least if the Church-Turing hypothesis is correct). Amusingly, this extends to the problem above of determining the costs of practical transformations, as determining whether a problem is computable in finite time is itself the classic example of a problem which is not computable.</p>
<p>So, exact Bayesianism for all conceivable problems is physically impossible, which makes it slightly less compelling as a normative goal. What about approximations? This will obviously depend on what counts as a reasonable approximation; if one accepts the topology in which all decisions are equivalent, then sure, “approximate” Bayesianism is possible. If one gets to stronger senses of approximation, such as requiring computation to within a constant factor, for cases where this makes sense, there are inapproximability results suggesting that for many problems, one still has exponential costs. Alternately, one could think about approximation in the limit of infinite time or information; this then gets into the literature on Bayesian asymptotics, though I guess with the goal exactly reversed. Rather than attempt to show Bayes converges to a fixed truth in the limit, one would try to show that a feasible decision procedure converges to Bayes in the limit. For the former goal, impossibility results are available in the general case, with positive results, like Schwartz’s theorem and its quantitative extensions (<a href="http://www.math.leidenuniv.nl/~avdvaart/BNP/">notes</a> and <a href="https://www.cambridge.org/core/books/fundamentalsofnonparametricbayesianinference/C96325101025D308C9F31F4470DEA2E8">monograph</a>) relying on compactness conditions that are more or less unsurprising given what is known on minimax lower bounds from information theory on what cannot be learned in a frequentist sense. For the other direction (whatever that might mean), I don’t know what results have been shown, though I expect, given the computational limitations in worst case prior-likelihood settings, that no universally applicable procedure is available.</p>
<p>How about if we restrict our demands from Bayesianism for any prior and likelihood to Bayesian methods for some reasonable prior or class of priors? In small enough restricted cases, this seems obviously feasible: we can all do Beta-Bernoulli updating at minimal cost, which is great if the only information we ever receive is a single yes/no bit. If we want Bayesianism to be a general theory of logical decision making, it probably has to go beyond that. Some people like the idea of <a href="https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_inductive_inference">Solomonoff induction</a>, which proposes Bayesian inference with a “universal prior” over all <em>computable</em> distributions, avoiding the noncomputability problem in some sense. This proposes a prior mass on all programs exponentially decreasing in their length expressed in bits in the Kolmogorov complexity sense. Aside from the problem that it runs into computational hardness results for determining Kolmogorov complexity and so is not itself computable, running into the above issues again, there are some additional questions.</p>
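<p>To make the “minimal cost” of the Beta-Bernoulli case concrete, here is a small sketch in Python. The function names and the uniform Beta(1, 1) prior are assumptions chosen for illustration; the only fact relied on is the standard conjugacy result that a Beta(a, b) prior combined with s successes and f failures yields a Beta(a + s, b + f) posterior.</p>

```python
from fractions import Fraction


def beta_bernoulli_update(alpha, beta, observations):
    """Return posterior Beta parameters after a sequence of 0/1 observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    # Conjugacy: the whole update is two additions.
    return alpha + successes, beta + failures


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact with Fraction."""
    return Fraction(alpha, alpha + beta)


# Assumed example: uniform prior, then three successes and one failure.
alpha_post, beta_post = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
print(alpha_post, beta_post, posterior_mean(alpha_post, beta_post))
```

<p>The update is linear in the number of observations and needs no integration at all, which is exactly what stops being true once the model leaves the conjugate family.</p>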
<p>This exponentially decreasing tail condition seems to embed the space of all programs into a hyperrectangle obeying summability conditions sufficient to satisfy Schwartz’s theorem for frequentist consistency of Bayesian inference. Hyperrectangle priors are fairly well studied in Bayesian nonparametrics: lower bounds are provided by the <a href="https://www.stat.berkeley.edu/~binyu/summer08/yu.assoua.pdf">Assouad’s lemma</a> construction and upper bounds are known and in fact reasonably small: by Brown and Low’s <a href="https://projecteuclid.org/euclid.aos/1032181159">equivalence results</a>, they are equivalent to estimation of a Hölder-smooth function, for which an appropriately integrated Brownian motion prior provides near-minimax rates. This seems to be saying that universal induction as a frequentist problem is slightly easier than one of the easier single problems in Bayesian nonparametrics. This seems… a little strange, maybe. One way to look at this is to accept, and say that the infinities contemplated by nonparametric inference are the absurd thing, or to marvel that a simple Gaussian Process regression is at least as hard as understanding all laws deriving the behavior of all activity in the universe and be grateful that it only takes cubic time. The other alternative is to suggest that this implies that the prior, while covering the entire space in principle, is satisfying a tightness condition that is so restrictive that it effectively restricts you to a topologically meager set of programs, ruling out in some sense the vast majority of the entire space (this sense is that of <a href="https://en.wikipedia.org/wiki/Baire_category_theorem">Baire category</a>) in the same way that any two Gaussian process priors with distinct covariance functions are mutually singular. In this sense, it is an incredibly strong restriction and hard to justify ex ante, certainly at least as contestable as justifying an ex ante fixed smoothing parameter for your GP prior. 
(If you’ve ever run one of these, you know that’s a dig: people make so many ugly blurry maps.)</p>
<p>Alternatives might arise in fixed tractable inference procedures, or the combination of tractable procedures and models, though all of these have quite a few limitations. MCMC has the same hardness problems as above if you ask for “eventual” convergence, and fairly odd properties if run up to a fixed duration (including nondeterministic outcomes and a high probability that those outcomes exhibit various results often called logical fallacies or biases, which I suppose is not surprising since common definitions of biases or fallacies appear to essentially require Bayesian reasoning to begin with). Variational inference likewise has these issues: even with the variational approximation, note that optimization to reach the modified objective may still be costly or even arbitrarily hard in some cases. Various neuroscientists seem to have taken up <a href="https://en.wikipedia.org/wiki/Free_energy_principle">some forms of variational inference</a> as a descriptive model of brain activity. Without expertise in neuroscience, I will leave well enough alone and say it seems like something that merits further empirical exploration. But as somebody who runs variational inference in practice, with mixed and sometimes surprising results, and computational improvements that don’t always suggest that the issue of cost is resolved, it also doesn’t seem like a full solution. I’m happy my model takes an hour rather than two days to run; I’m not sure this makes the method a compelling basis for decision-making.</p>
<p>I was going to extend this to say something about Bayes Nash equilibrium, but my problems with that concept are largely distinct, coming from the “equilibrium” rather than the “Bayes”,<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> and I think I’ve conveyed my basic concerns. I don’t know that I have a compelling alternative, except that it may be the case that while an acceptable and actually feasible theory of decision making may have internal states of some kind, I see no reason that one has to have “beliefs” of any kind, at least as objects which reduce in some way to classical truth values over statements. One can simply have actions, which may on occasion correspond to binary decisions over sets that could in principle be assigned a truth value, though usually they won’t. This seems to have connections to the idea of lazy evaluation in nonparametric Bayes, which permits computations consistent with Bayes rule over a high-dimensional space to be retrieved via query without maintaining the full set of possible responses to such queries in memory. But this is only possible in a tractable way while still permitting the results to follow Bayesian inference for fairly limited classes of problems involving conjugacy. More generally, a theory which fully incorporates computational costs will likely have to await further developments in characterizing these costs, a still not fully solved problem in computer science.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>This is something like what theories of “rational inattention” do. However, information processing costs in these theories are taken over a channel between information which still has the representation as a random variable on both sides. The agent is assumed to be on one side of this channel and so is effectively still dealing with information in a fully probabilistic form over which the optimization criterion still requires reasoning to be Bayesian. That is to say, rational inattention is a theory of the information available to an agent, not a theory of the reasoning and decision-making process given available information.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>Roughly, even a computationally unlimited Bayesian agent could not reason itself to Bayes Nash equilibrium, unless it had the “right” priors. Where these priors come from is left unspoken (except that in the model they are “true”), which is a practical problem that drives a lot of differences between applied computational mechanism design, which is forced to answer this question, and the theory we teach our grad students.<a href="#fnref2" class="footnoteback">↩</a></p></li>
</ol>
</div>

Posterior Samplers for Turing.jl
/post/posteriorsamplersforturingjl/
Sun, 28 Jun 2020 00:00:00 +0000
/post/posteriorsamplersforturingjl/
<p>Prompted by a question on the slack for <a href="https://turing.ml/" target="_blank">Turing.jl</a> about when to use which Bayesian sampling algorithms for which kinds of problems, I compiled a quick off-the-cuff summary of my opinions on specific samplers and how and when to use them. Take these with a grain of salt, as I have more experience with some than with others, and in any case the nice thing about a framework like Turing is that you can switch out samplers easily and test for yourself which is best for your application.</p>
<p>To get a good visual sense of how different samplers explore a parameter space, the animations <a href="https://chifeng.github.io/mcmcdemo/" target="_blank">page by Chi Feng</a> is a great resource.</p>
<p>The following list covers mainly the samplers included by default in Turing. There’s a lot of work in Bayesian computation with different algorithms or implementations of these algorithms which could lead to different conclusions.</p>
<ol>
<li><p>Metropolis Hastings (MH): Explores the space randomly. Extremely simple, extremely slow, but can “work” in most models. Mainly worth a try if everything else fails.</p></li>
<li><p>HMC/NUTS: Gradient-based exploration, meaning the parameter space needs to be differentiable. It’s fast if that’s true, and so almost always the right choice if you can make your model differentiable (and sometimes so much better that it’s worth making your model differentiable even if your initial model isn’t in order to use it, e.g. by marginalizing out discrete parameters). There are relatively minor differences between NUTS and the default HMC algorithm.</p></li>
<li><p>Gibbs sampling: A “coordinate-ascent” like sampler which samples in blocks from conditional distributions. It used to be popular along with factorizable models where conditional distributions could be updated in closed form due to conjugacy. It’s still useful if you can do this, but slow for general models. Its main use now is for combining samplers, for example HMC for the differentiable parameters and something else for the nondifferentiable parameters.</p></li>
<li><p>SMC/“Particle Filtering”: A method based on importance sampling, reweighting draws from an initial draw and repeatedly updating. It is designed to work well if the proposal distribution and updates are close to the targets. The number of particles should be large for reasonable accuracy. Turing’s implementation does this parameter by parameter starting at the prior and updating, which is close to what you want for the most common intended use, state space models with sequential structure, which is the main use case where I would even consider this. That said, tuning the proposals is really important, and more customizable SMC methods are useful in many cases where one has a computationally tractable approximate posterior you want to update to be closer to an exact posterior. This tends to be model-specific and not a good use case for generic PPLs, though.</p></li>
<li><p>Particle Gibbs (PG), or “Conditional SMC”: like SMC, but modified to be compatible with Metropolis sampling steps. Its main use I can see is as a step in a Gibbs sampler, where it can be used for a discrete parameter, with HMC for the other parts. The number of particles doesn’t have to be overwhelmingly large, due to sampling, but it can cause problems if the number is too small.</p></li>
<li><p>Stochastic Gradient methods (SGLD/SGHMC/SGNHT): Gradient based methods that subsample the data to get less costly but less accurate gradients for an approximation of deterministic gradient based methods (SGHMC approximates HMC, SGLD approximates Langevin descent which also uses gradients but is simpler and usually slightly worse than HMC). These are designed for large data applications where going through a huge data set each iteration may be infeasible. They are popular for Bayesian neural network applications, where optimization methods also rely on data subsampling.</p></li>
<li><p>Variational Inference: Not a sampler per se. It comes up with a parametric model for the posterior shape and then optimizes the fit to the posterior according to a computationally feasible criterion (i.e., one which doesn’t require computing the normalizing constant in Bayes rule), allowing you to optimize instead of sampling. In general, this has no guarantee of reaching the true posterior, no matter how long you run it, but if you want a slightly wrong answer very fast it can be a good choice. It’s also popular for Bayesian neural networks, and other “big” models like high dimensional topic models.</p></li>
</ol>
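<p>To make the first entry in the list concrete, here is a minimal sketch of random-walk Metropolis in plain Python. It is not Turing’s implementation: the function name, the Gaussian proposal, the step size, and the standard normal target are all assumptions picked for illustration.</p>

```python
import math
import random


def metropolis_hastings(log_density, x0, n_steps, step_size, rng):
    """Random-walk Metropolis for a one-dimensional target.

    log_density: unnormalized log density of the target.
    Returns the list of draws and the fraction of proposals accepted.
    """
    x = x0
    log_p = log_density(x)
    draws = []
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_density(proposal)
        # The Gaussian proposal is symmetric, so the acceptance
        # probability is min(1, p(proposal) / p(x)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
            accepted += 1
        draws.append(x)
    return draws, accepted / n_steps


# Assumed toy target: a standard normal, known only up to a constant.
rng = random.Random(0)
draws, accept_rate = metropolis_hastings(
    lambda x: -0.5 * x * x, x0=0.0, n_steps=20000, step_size=2.4, rng=rng
)
kept = draws[2000:]  # discard an initial stretch as warm-up
mean = sum(kept) / len(kept)
var = sum((d - mean) ** 2 for d in kept) / len(kept)
print(f"mean={mean:.2f} var={var:.2f} accept_rate={accept_rate:.2f}")
```

<p>Note that only the unnormalized log density is needed, which is why this approach “works” in most models; the price is that blind proposals explore slowly, especially as the dimension grows.</p>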

Estimating Treatment Effects with Observed Confounders and Mediators
/publication/confoundersmediators/
Tue, 23 Jun 2020 00:00:00 -0400
/publication/confoundersmediators/

Automated Solution of Heterogeneous Agent Models
/publication/automatedsolution/
Sun, 03 Nov 2019 00:00:00 -0400
/publication/automatedsolution/

On Online Learning for Economic Forecasts
/post/ononlinelearningforeconomicforecasts/
Tue, 29 Oct 2019 00:00:00 +0000
/post/ononlinelearningforeconomicforecasts/
<p>Jérémy Fouliard, Michael Howell, and Hélène Rey have just released <a href="http://conference.nber.org/conf_papers/f130922.pdf">an update of their working paper</a> applying methods from the field of Online Learning to forecasting of financial crises, demonstrating impressive performance in a difficult forecasting domain using some techniques that appear to be unappreciated in econometrics. Francis Diebold provides <a href="https://fxdiebold.blogspot.com/2019/10/machinelearningforfinancialcrises.html">discussion</a> and <a href="https://fxdiebold.blogspot.com/2019/10/onlinelearningvstvpforecast.html">perspectives</a>. This work is interesting to me as I spent much of the earlier part of this year designing and running a <a href="/Forecasting.html">course on economic forecasting</a> which attempted to offer a variety of perspectives beyond the traditional econometric approach, including that of Online Learning.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> This perspective has been widely applied by machine learning practitioners and businesses that employ them, particularly major web companies like <a href="https://ai.google/research/pubs/pub41159">Google</a>, <a href="https://vowpalwabbit.org/">Yahoo, and Microsoft</a>, but has not seen widespread take-up by more traditional economic forecasting consumers and practitioners like central banks and government institutions.</p>
<p>The essence of the online learning approach has less to do with particular algorithms (though there are many) than with a reconsideration of the choice of <a href="Forecasting/Evaluation.html">evaluation framework</a>. Consider a quantity to be forecast <span class="math inline">\(y_{t+h}\)</span>, like an indicator equal to 1 in the presence of a financial crisis. A forecasting rule <span class="math inline">\(f(\cdot)\)</span> depending on currently available data <span class="math inline">\(\mathcal{Y}_t\)</span> produces a forecast <span class="math inline">\(\widehat{y}_{t+h}=f(\mathcal{Y}_t)\)</span> which can be evaluated ex post according to a loss function <span class="math inline">\(\ell(y_{t+h},\widehat{y}_{t+h})\)</span> which measures how close the forecast was to being correct. Since we don’t know the true value <span class="math inline">\(y_{t+h}\)</span> until it is observed, to make a forecast we must instead come up with a criterion which compares rules. Traditional econometric forecasting looks at measures of statistical <em>risk</em>,</p>
<p><span class="math display">\[E[\ell(y_{t+h},\widehat{y}_{t+h})]\]</span></p>
<p>where expectation is taken with respect to a (usually unknown) true probability distribution. Online learning, in contrast, aims to provide estimators which have low <em>regret</em> over sequences of outcomes <span class="math inline">\(\{y_{t+h}\}_{t=1}^{T}\)</span> relative to a comparison class <span class="math inline">\(\mathcal{F}\)</span> of possible rules,
<span class="math display">\[\text{Regret}(\{\widehat{y}_{t+h}\}_{t=1}^{T})=\sum_{t=1}^{T} \ell(y_{t+h},\widehat{y}_{t+h})-\underset{f\in\mathcal{F}}{\inf}\sum_{t=1}^{T}\ell(y_{t+h},f(\mathcal{Y}_{t}))\]</span></p>
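<p>As a sketch of how the regret criterion is computed once outcomes are in, the following Python snippet evaluates it for a finite comparison class under squared loss. The function names, the loss, and the toy numbers are assumptions for illustration; each rule in the class is represented directly by the sequence of forecasts it would have made.</p>

```python
def squared_loss(y, y_hat):
    """Loss of forecast y_hat when the realized outcome is y."""
    return (y - y_hat) ** 2


def regret(outcomes, forecasts, comparison_forecasts, loss=squared_loss):
    """Cumulative loss of `forecasts` minus that of the best comparison rule.

    comparison_forecasts: one forecast sequence per rule in a finite class F,
    so the infimum over F is just a minimum over these sequences.
    """
    own = sum(loss(y, f) for y, f in zip(outcomes, forecasts))
    best = min(
        sum(loss(y, f) for y, f in zip(outcomes, seq))
        for seq in comparison_forecasts
    )
    return own - best


# Assumed toy data: a crisis indicator over five periods.
outcomes = [0, 0, 1, 0, 1]
forecasts = [0.5] * 5  # the procedure being evaluated
comparison = [[0.0] * 5, [0.4] * 5, [1.0] * 5]  # constant rules in F
print(regret(outcomes, forecasts, comparison))
```

<p>Here the procedure’s cumulative loss is 1.25 and the best constant rule (always forecasting 0.4) achieves 1.2, so the regret is about 0.05. Nothing in the calculation refers to a probability distribution; only the realized sequence and the comparison class enter.</p>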
<p>This criterion looks a little odd from the perspective of traditional forecasting rules: <a href="https://fxdiebold.blogspot.com/2017/02/predictivelossvspredictiveregret.html">Diebold has expressed skepticism</a>. First, regret is a relative rather than absolute standard; to even be defined you need to take a stand on rules you might compare to, which can look somewhat arbitrary. If you choose a class of rules that predict poorly, a low regret procedure will not do well in an absolute sense. Second, there is no expectation or probability, just a sequence of outcomes. Diebold frames this as ex ante vs ex post, as the regret cannot be computed until <em>after</em> the data is observed, while risk can be computed without seeing the data. But this does not accord with how regret is applied in the theoretical literature. Risk can be computed only with respect to a probability measure, which has to come from somewhere. One can build a model and ask that this be the “true” probability measure describing the sequence generating the data, but this is also unknown. To get ex ante results for a procedure, one needs to take a stand on a model or class of models. Then one can evaluate results either <em>uniformly</em> over models in the class (this is the classic <a href="Forecasting/StatisticalApproach.html">statistical approach</a>, used implicitly in much of the forecasting literature, like Diebold’s <a href="https://www.sas.upenn.edu/~fdiebold/Teaching221/econ221Penn.html">textbook</a>) or <em>on average</em> over models, where the distribution over which one averages is called a <em>prior distribution</em> and leads to <a href="Forecasting/Bayes.html">Bayesian forecasting</a>. In the online learning context, in contrast, one usually seeks guarantees which apply for <em>any</em> sequence of outcomes, as opposed to over a distribution. So the results are still ex ante, with the difference being whether one needs to take a stance on a model class or a comparison class. 
There are reasons why one might prefer either approach. For one, <a href="https://itzhakgilboa.weebly.com/uploads/8/3/6/3/8363317/gilboa_notes_for_introduction_to_decision_theory.pdf">standard decision theory</a> requires use of probability in “rational” decision making. But the probabilistic framework is often extremely restrictive in terms of the guarantees it provides on the type of situations in which a procedure will perform well. In general, one must have a model which is correctly specified, or at least not too badly misspecified.</p>
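<p>To make the regret criterion concrete, here is a minimal Python sketch. It assumes squared error loss and a tiny finite comparison class of fixed forecast paths; both are illustrative choices of mine rather than anything taken from FHR, and the function name <code>regret</code> and the toy data are hypothetical.</p>

```python
import numpy as np

def regret(y, forecasts, comparators, loss=lambda y, f: (y - f) ** 2):
    """Cumulative loss of `forecasts` minus that of the best comparator in hindsight.

    `comparators` is a finite list of forecast paths standing in for the class F
    (an assumption made for illustration; F is often infinite in practice).
    """
    y = np.asarray(y, dtype=float)
    own_loss = loss(y, np.asarray(forecasts, dtype=float)).sum()
    best_loss = min(loss(y, np.asarray(c, dtype=float)).sum() for c in comparators)
    return own_loss - best_loss

# Toy binary outcomes and a forecaster that always hedges at 0.5.
y = np.array([1.0, 1.0, 1.0, 0.0, 1.0])
always_half = np.full(len(y), 0.5)
comparators = [np.zeros(len(y)), np.ones(len(y))]  # "always 0" and "always 1"

r = regret(y, always_half, comparators)
print(r)
```

<p>Note that no probability model appears anywhere in the calculation: the number depends only on the realized sequence and the chosen comparison class, which is exactly the feature under dispute above.</p>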
<p>Especially for areas where the economics is still in dispute, the confidence that one has that the models available to us encompass all likely outcomes maybe shouldn’t be so high. This is the content of the Queen’s question to which the title of the FHR paper refers: many or most economists before the financial crisis were using models which did not foresee a particularly high probability of such an event. For that reason, a procedure which allows us to perform reasonably over <em>any</em> sequence of events, not just those likely with respect to a particular model class, may be particularly desirable; a procedure with a low regret guarantee will do so, and be known to do so <em>ex ante</em>, as long as there is some comparator which performed well <em>ex post</em>. Ideally, we would like to remove that latter caveat, but as economists like to say, there is <a href="https://en.wikipedia.org/wiki/No_free_lunch_theorem">no free lunch</a>. One can instead do analysis based on risk without assuming one has an approximately correct model; this is the content of <a href="https://books.google.com/books?hl=en&lr=&id=EqgACAAAQBAJ&oi=fnd&pg=PR7&dq=Vapnik+statistical+learning+theory&ots=g3KhtaV29&sig=5p6V7MW49xnzKUQGoAf7gRJZow0#v=onepage&q=Vapnik%20statistical%20learning%20theory&f=false">statistical learning theory</a>. But this usually involves introducing a comparison class of models <span class="math inline">\(\mathcal{F}\)</span> to study a relative criterion, the <em>oracle risk</em> <span class="math inline">\(E[\ell(y_{t+h},\widehat{y}_{t+h})]-\underset{f\in\mathcal{F}}{\inf}E[\ell(y_{t+h},f(\mathcal{Y}_t))]\)</span>, or variants thereof.<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> This requires both a comparison class and some restrictions on distributions to get uniformity; Vapnik considered the i.i.d.
case, which is unsuitable for most time series forecasting applications; extensions need some version of stationarity and/or <a href="https://papers.nips.cc/paper/3489rademachercomplexityboundsfornoniidprocesses">weak</a> <a href="https://projecteuclid.org/download/pdf_1/euclid.aop/1176988849">dependence</a>, or strong conditions on the class of nonstationary processes allowed, which can be problematic when one does not know what kind of distribution shifts are likely to occur.</p>
<p>This brings us to the content of the forecast procedures used: FHR apply classic Prediction with Expert Advice algorithms, like versions of Exponential Weights (closely related to the “Hedge” algorithm of <a href="http://rob.schapire.net/papers/FreundSc95.pdf">Freund and Schapire 1997</a>) and Online Gradient Descent (<a href="https://www.aaai.org/Papers/ICML/2003/ICML03120.pdf">Zinkevich 2003</a>), which take a set of forecasts and form a convex combination of them with weights that update each round of predictions. Diebold <a href="https://fxdiebold.blogspot.com/2019/10/onlinelearningvstvpforecast.html">notes</a> that these are essentially versions of <a href="Forecasting/ModelCombination.html">model averaging procedures</a> which allow for time-varying weights, which are frequently studied by econometricians, complaining that “ML types seem to think they invented everything”. To this I have two responses. First, on a credit attribution level, the online learning perspective originates in the studies of sequential decision theory and game theory from people like Wald and Blackwell, squarely in the economics tradition, and the techniques became ubiquitous in the ML field through <a href="http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf">“Prediction, Learning, and Games”</a>, by Cesa-Bianchi and Lugosi, the latter of whom is in an Economics department. So if one wants to claim credit for these ideas for economics, go right ahead. Second, there are noteworthy distinctions between these ideas and statistical approaches to forecast combination. In particular, the uniformity over sequences of the regret criterion ensures that not only does it permit changes over time, these can be completely arbitrary and do not have to accord with a particular model of the way in which the shift occurs.
So while the approaches can be analyzed in terms of statistical properties, and may correspond to known algorithms for a particular model, the reason they are used is to ensure guarantees for arbitrary sequences, a property which is not shared by general statistical approaches. In fact, a classic result in online model combination (cf. Section 2.2 in <a href="https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf">Shalev-Shwartz</a>) shows that some approaches with reasonable risk properties, like picking the forecast with the best performance up to the current period, can give unbounded regret for particularly bad sequences. The fact that a combination procedure adapts to these “poorly behaved” sequences is more important than the fact that it gives time-varying weights per se.</p>
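<p>The contrast between picking the current best forecast and Exponential Weights can be seen in a small simulation. The following is a sketch under my own assumptions, not FHR’s implementation: two experts, losses in [0, 1], a hand-built alternating loss sequence designed to punish follow-the-leader, and the textbook learning rate <span class="math inline">\(\eta=\sqrt{8\ln N/T}\)</span>, for which the standard bound on Exponential Weights regret is <span class="math inline">\(\sqrt{T\ln N/2}\)</span>. All function and variable names are hypothetical.</p>

```python
import numpy as np

def follow_the_leader(losses):
    """Each round, follow the expert with the smallest cumulative loss so far."""
    cum = np.zeros(losses.shape[1])
    total = 0.0
    for loss_t in losses:
        total += loss_t[np.argmin(cum)]  # ties go to the first expert
        cum += loss_t
    return total

def exponential_weights(losses, eta):
    """Each round, incur the weighted-average loss with weights exp(-eta * cumulative loss)."""
    cum = np.zeros(losses.shape[1])
    total = 0.0
    for loss_t in losses:
        w = np.exp(-eta * (cum - cum.min()))  # subtract the min for numerical stability
        w /= w.sum()
        total += w @ loss_t
        cum += loss_t
    return total

# Adversarial sequence: after a small initial nudge, whichever expert is
# currently ahead suffers loss 1 in the next round.
T, N = 1000, 2
losses = np.empty((T, N))
losses[0] = [0.5, 0.0]
losses[1::2] = [0.0, 1.0]
losses[2::2] = [1.0, 0.0]

best_expert = losses.sum(axis=0).min()
eta = np.sqrt(8 * np.log(N) / T)
bound = np.sqrt(T * np.log(N) / 2)

ftl_regret = follow_the_leader(losses) - best_expert
hedge_regret = exponential_weights(losses, eta) - best_expert
print(ftl_regret, hedge_regret, bound)
```

<p>On this sequence follow-the-leader is wrong in nearly every round, so its regret grows linearly in <span class="math inline">\(T\)</span>, while the weighted average never commits fully to either expert and stays within the square-root bound. The learning rate here uses knowledge of <span class="math inline">\(T\)</span>, which is one of the tuning issues that adaptive versions of these algorithms try to remove.</p>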
<p>For these reasons, I think Online Learning approaches at minimum deserve more use in economic forecasting and I am pleased to see the promising results of FHR, as well as the growing application of minimax regret criteria in other areas of economics like <a href="https://arxiv.org/abs/1909.06853">inference and policy</a> under partial identification and areas like <a href="http://yingniguo.com/wpcontent/uploads/2019/09/slidesregulation.pdf">mechanism design</a> where providing a well-specified distribution over outcomes can be challenging.</p>
<p>There are still many issues that need more exploration, and there are important limitations. One thing existing online methods do not handle well is fully unbounded data; the worst case over all sequences leads to completely uninformative bounds, even for regret. For this reason, forecasting indicators is a good place to start. Whether it is even possible to extend these methods to data with unknown trends is still an open question, which may limit their suitability for many types of economic data. Tuning parameter selection is a topic of active research, with plenty of work on adapting these to the interval length and data features. Methods which perform well by regret criteria but also adapt to the case in which one does have stationary data and so could do well with a model-based algorithm are also a potentially promising direction. If one has real confidence in a model, it makes sense to rely on it, in which case statistical approaches are fine. But for many applications where the science is less settled and one might plausibly see data that doesn’t look like any model we have written down, we should keep online learning methods in our toolbox.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>For a better overview of the field than I can provide, see the survey of <a href="https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf">Shalev-Shwartz</a>, the monograph of <a href="https://ocobook.cs.princeton.edu/OCObook.pdf">Hazan</a>, or courses by <a href="http://www.mit.edu/~rakhlin/6.883/">Rakhlin</a> or <a href="http://www.mit.edu/~rakhlin/courses/stat928/">Rakhlin and Sridharan</a>.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>Econometricians are used to thinking of this from the perspective of misspecification a la <a href="https://www.jstor.org/stable/1912526">Hal White</a> in which one compares to risk under the “pseudo-true” parameter value of the best predictor in a class. An alternative, popular in machine learning, is to use a data-dependent comparator, the <em>empirical risk</em>, and prove bounds on the generalization gap. Here again we are effectively using the performance of a model or algorithm class for a relative measure.<a href="#fnref2" class="footnoteback">↩</a></p></li>
</ol>
</div>

An Empirical Heterogeneous Agents Models Reading List
/post/anempiricalheterogeneousagentsmodelsreadinglist/
Tue, 20 Nov 2018 00:00:00 +0000
<p>Inspired by a request by <a href="https://khakieconomics.github.io/" target="_blank">Jim Savage</a> asking for examples of recent work using heterogeneous agent models, I’ve put together a far from comprehensive list of papers demonstrating the range of work being done using these tools to understand a variety of issues at the intersection of macroeconomics and microeconomic data. While the field has a ways to go in terms of econometric modeling, the best recent work involves much more substantial use of data to discipline results and compare alternative hypotheses. While the days of putting together a model with some half-baked mechanism, “calibrating” a few parameters to whatever values some old person used in a paper that got published, and showing a table of randomly chosen model and data moments to compare for which a J test would soundly reject equality but which for some reason you call a “good fit” are, um, not quite over, recent work now commonly involves actually writing down models which can encompass multiple competing mechanisms, collecting microeconomic data which directly speaks to those mechanisms, and using that data and model to quantitatively evaluate the results. Some of the work is even capable of putting standard errors on the estimates!</p>
<p>The following short list gives a taste of where the field has been moving, and I have some hope that it will continue to move further in this direction.</p>
<h1 id="incomedistributionandtheeconomy">Income Distribution and the Economy</h1>
<p>Handbook Chapter bringing up to date and empirically evaluating the research efforts descending from foundational work of <a href="http://perseus.iies.su.se/~pkrus/ref_pub/250034.pdf" target="_blank">Krusell and Smith 1998</a>:</p>
<p>Krueger, Mitman, and Perri “<a href="https://www.dropbox.com/s/y9yv228pnaaie4i/HandbookKMP.pdf?raw=1" target="_blank">Macroeconomics and Household Heterogeneity</a>” 2016</p>
<h1 id="incomeandwealthdistributionmodeling">Income and Wealth Distribution Modeling</h1>
<p>Gabaix, Lasry, Lions, Moll “<a href="https://scholar.harvard.edu/files/xgabaix/files/dynamics_of_inequality.pdf" target="_blank">The Dynamics of Inequality</a>” <em>ECTA</em> 2016</p>
<p>Hubmer, Krusell, and Smith “<a href="https://economics.yale.edu/sites/default/files/files/Graduate/Uploads/HubmerKrusellSmith_Wealth2017.pdf" target="_blank">The Historical Evolution of the Wealth Distribution: A Quantitative-Theoretic Investigation</a>” 2017</p>
<h1 id="monetary">Monetary</h1>
<p>Most influential recent paper:</p>
<p>Kaplan, Moll, and Violante “<a href="http://violante.mycpanel.princeton.edu/Workingpapers/HANK_revision_MASTER.pdf" target="_blank">Monetary Policy According to HANK</a>” <em>AER</em> 2018</p>
<p><em>See also</em>:</p>
<p>Auclert “<a href="http://web.stanford.edu/~aauclert/mp_redistribution.pdf" target="_blank">Monetary Policy and the Redistribution Channel</a>” 2018
and <a href="https://aauclert.people.stanford.edu/research" target="_blank">several other papers by Auclert</a></p>
<p>Gornemann, Kuester and Nakajima “<a href="https://drive.google.com/file/d/0BxRm9kW6_YBqWFRrajRLWGJ0SHc/view" target="_blank">Doves for the Rich, Hawks for the Poor? Distributional Consequences of Monetary Policy</a>” <em>ECTA forthcoming</em></p>
<h1 id="development">Development</h1>
<p>The <a href="http://townsendthai.mit.edu/" target="_blank">Townsend Thai Project</a>, which collects extremely detailed spending, income and credit data from rural Thai villages, has inspired a large number of <a href="http://townsendthai.mit.edu/papers/" target="_blank">papers using the data</a> along with quantitative heterogeneous agent models to understand credit in rural economies. See as an example</p>
<p>Buera, Kaboski, and Shin “<a href="https://www3.nd.edu/~jkaboski/BKS_MacroMicro.pdf" target="_blank">The Macroeconomics of Microfinance</a>” 2017</p>
<h1 id="additionaltopics">Additional Topics</h1>
<p>Literally anything by <a href="https://voices.uchicago.edu/vavra/research/" target="_blank">Joe Vavra</a> is several standard deviations in quality above the rest of the field in terms of careful, detailed empirical modeling. See, e.g., his work on <a href="https://cpbusw2.wpmucdn.com/voices.uchicago.edu/dist/7/914/files/2018/03/housing8162017riidnybody23xw55t.pdf" target="_blank">House Prices and Consumer Spending</a> or <a href="https://cpbusw2.wpmucdn.com/voices.uchicago.edu/dist/7/914/files/2018/03/econometrica_body2kbvfl4.pdf" target="_blank">Durables Consumption over the Business Cycle</a> and his work using BLS micro pricing data with <a href="https://cpbusw2.wpmucdn.com/voices.uchicago.edu/dist/7/914/files/2018/03/qje_final_online2asbwxj.pdf" target="_blank">heterogeneous firm models of price dispersion</a>.</p>
<p><a href="https://sites.google.com/site/kyleherkenhoff/research" target="_blank">Kyle Herkenhoff</a> is another researcher doing detailed empirical work in this field, with a focus on consumer credit.</p>

Solution of Rational Expectations Models with Function Valued States
/publication/functionvaluedstates/
Mon, 15 Jan 2018 00:00:00 0500
<p>Previous versions of this paper circulated under the title <em>On the Solution and Application of Rational Expectations Models with Function-Valued States</em></p>

Top Papers 2017
/post/toppapers2017/
Thu, 07 Dec 2017 00:00:00 +0000
<p>
Inspired by Paul Goldsmith-Pinkham and following on <a href="https://www.bloomberg.com/view/articles/20171204/thebestbooksandresearchoneconomicsin2017">Noah Smith</a> and others in an end-of-year tradition, here is a not-quite-ordered list of the top 10-ish papers I read in 2017. I read too many arxiv preprints and older papers to choose ones based on actual publication date, so these are chosen from the “Read in 2017” folder of my reference manager, which tells me that I have somehow read 176 papers (so far) in 2017. There was a lot of chaff in this set, and many more good works still sitting in my “to read” folder, but I did manage to find a few gems, which I can list and briefly describe, in sparsely-permuted alphabetical order.
</p>
<ol style="liststyletype: decimal">
<li>
Arellano, Blundell, & Bonhomme <a href="http://www.ucl.ac.uk/~uctp39a/ABB_Ecta_May_2017.pdf">Earnings and Consumption Dynamics: A Nonlinear Panel Data Framework</a> Econometrica 2017
</li>
</ol>
<p>
This paper solves a problem which one would think would have been tackled a long time ago, but turned out to require several modern tools. What is the reduced form implied by intertemporally optimizing dynamic decision models of the kind underlying quantitative heterogeneous agent macro models, and can it be identified and estimated from micro panel data? They show that the form is a nonparametric Hidden Markov Model, and given long enough panels and some standard completeness conditions the distributions can be nonparametrically identified and estimated using conditional distribution estimation methods based on sieve quantile regressions. This seems like a good start to taking seriously the implications of these models and reformulating them to match micro data.
</p>
<p>
See also: their <a href="http://www.cemfi.es/~arellano/AR_Survey_Revised_2.pdf">survey</a> in the Annual Review of Economics showing how most dynamic models used in macro correspond to their framework.
</p>
<ol start="2" style="liststyletype: decimal">
<li>
Susan Athey & Stefan Wager <a href="http://arxiv.org/abs/1702.02896">Efficient Policy Learning</a>
</li>
</ol>
<p>
Part of a growing literature on learning optimal (economic) policies from data by directly estimating the policy to maximize welfare, rather than learning model parameters which are only ex post used in policy exercises without proper accounting for uncertainty. This paper focuses on the binary policy case, in the presence of unconfounded observational data on program results: find a policy rule <span class="math inline"><span class="math inline">\(\pi(X)\)</span></span> out of some class of policies which apply a program or not to an individual with covariates <span class="math inline">\(X\)</span>. This paper takes a minimax approach, using a doubly robust estimator and showing approximately optimal bounds on regret (relative to the unknown best policy in the class) over possible nonparametric structural parameters, using some novel bounds. These bounds rely strongly on the binary structure, so extending to more complicated policy problems may take some work, but the approach seems highly promising.
</p>
<p>
See also: <a href="https://arxiv.org/abs/1704.06431">Luedtke and Chambaz</a> who claim to achieve faster (<span class="math inline"><span class="math inline">\(\frac{1}{n}\)</span></span> instead of <span class="math inline"><span class="math inline">\(\frac{1}{\sqrt{n}}\)</span></span>) rates in the same problem. This appears to have motivated an update to the original version of the Athey-Wager paper showing that the <span class="math inline"><span class="math inline">\(\frac{1}{\sqrt{n}}\)</span></span> rate, and their bound, are (worst case) optimal in the shrinking signal regime, where the size of the treatment effect function is of the same order as the noise. That is the regime in which the measurement issue is of first-order importance for policy, unlike the fixed signal size regime, where statistical uncertainty has only a lower-order effect.
</p>
<ol start="3" style="liststyletype: decimal">
<li>
Max Kasy <a href="https://scholar.harvard.edu/kasy/publications/optimaltaxationmachinelearning">“A Machine Learning Approach to Optimal Policy and Taxation”</a>
</li>
</ol>
<p>
In the same vein as above, but allows continuous instead of binary policies, and takes a Bayesian instead of a minimax approach, advocating nonparametric Gaussian process methods for estimating unknown policy effects, which are shown to fit naturally in many optimal policy problems, and allow straightforward Bayesian decision analysis to be used, which has advantages for composition with other types of problems.
</p>
<ol start="4" style="liststyletype: decimal">
<li>
Andrii Babii <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2962746">Honest confidence sets in nonparametric IV regression and other ill-posed models</a>
</li>
</ol>
<p>
Confidence bands for NPIV and similar Tikhonov-regularized inverse problems. It’s well known that Tikhonov methods are optimal only up to a certain level of ill-posedness, and these bounds may be a bit conservative, but results should be quite useful in a variety of settings.
</p>
<ol start="5" style="liststyletype: decimal">
<li>
Beskos, Girolami, Lan, Farrell, & Stuart <a href="http://arxiv.org/abs/1606.06351">Geometric MCMC for Infinite-Dimensional Inverse Problems</a>
</li>
</ol>
<p>
Hamiltonian Monte Carlo for nonparametric Bayes! Introduces a family of dimension-free MCMC samplers which generalize the hits of finite dimensional MCMC, from Metropolis-Hastings to MALA to HMC, to the case of an infinite-dimensional parameter. The issue of mutual singularity of infinite-dimensional measures requires the posterior to be nonsingular with respect to the original, e.g., Gaussian Process prior (meaning, among other things, no tuning of length scale parameters), but within this restricted class, it allows models based on nonlinear transformations which make it applicable to computational inverse problems like PDE models and so greatly expands the class of feasible nonparametric Bayesian methods without relying on conjugacy or variational inference or many of the similar process-dependent or poorly understood tricks used to extend Bayes to the high-dimensional setting.
</p>
<p>
See also: Betancourt <a href="http://arxiv.org/abs/1701.02434">A Conceptual Introduction to Hamiltonian Monte Carlo</a>, a brilliant and beautifully illustrated overview of how (finite-dimensional) HMC works and how to apply it.
</p>
<ol start="6" style="liststyletype: decimal">
<li>
Chen, Christensen, Tamer <a href="http://arxiv.org/abs/1605.00499">MCMC Confidence Sets for Identified Sets</a>
</li>
</ol>
<p>
A clever idea for inference in partially identified models. They note that in many cases, even when structural parameters are nonidentified, the identified set itself is a regular parameter (or deviates from regularity in a tractable way) in the sense of inducing a (possibly nonclassical) Bernstein-von Mises theorem, so the Bayesian posterior for the identified set itself (though not the parameter) converges to a well-behaved distribution. For models defined by a (quasi-)likelihood, these sets are essentially the minima of the criterion function, and so CCT show that in the limit, the quantiles of the likelihood in MCMC samples from the (quasi-)posterior define a valid frequentist confidence interval for the identified set. As the fully Bayesian posterior will in general concentrate inside the identified set (this is what having prior information means), one can then easily extract both Bayesian and valid frequentist assessments of uncertainty from the same MCMC chain, even without identification. This approach is conservative if you want inference for the parameter itself, for which the available methods often involve a bunch of delicate tuning parameter manipulation, and does not seem applicable to many cases where the identified set is itself irregularly identified, or in many cases of weak identification, which seems dominant in the kinds of models I tend to work on, but this is why this field is so active.
</p>
<p>
See also: Too many papers to list on weak or partial identification in specific settings. Andrews & Mikusheva (2015) and Qu (2014) on DSGEs.
</p>
<ol start="7" style="liststyletype: decimal">
<li>
Rachael Meager <a href="http://economics.mit.edu/files/12292">Aggregating Distributional Treatment Effects: A Bayesian Hierarchical Analysis of the Microcredit Literature</a>
</li>
</ol>
<p>
A spearhead in a hopefully coming paradigm shift in economics toward taking seriously the issues of “tiny data” in the big-data era. Some clever mix of economically and structurally informed parametric modeling and Bayesian computation for aggregating information in quantile treatment curves across studies, with an application based on 7 data points (!) which themselves are estimated functions derived from medium-scale field experiments. This hierarchical paradigm of sharing information between flexibly modeled components for parts where data is abundant and more judiciously parameterized components where it is not, in a seamless way, seems to characterize a useful and principled approach to an omnipresent situation in economic data with applications far beyond the program evaluation context.
</p>
<ol start="8" style="liststyletype: decimal">
<li>
Ulrich Müller & Mark Watson <a href="http://www.princeton.edu/~umueller/lfcorr.pdf">Long Run Covariability</a>
</li>
</ol>
<p>
Speaking of tiny data, this is the latest in Müller and Watson’s series on long run and low frequency modeling for time series, using cosine expansions to explicitly bring out the small-sample nature of the problem of learning about long-run behavior, given that we are only getting new data points on 50 year periods approximately every 50 years. This paper extends from univariate to multivariate modeling, offering a simple alternative to cointegration based approaches which restrict our modeling of long term relationships based on a very particular parametric structure. The new methods don’t alleviate the need for simple parametric models in this tiny data space, but they do show explicitly how the problem is that of fitting a curve to a scatterplot with 10-12 points at most, and so permit use of explicit small sample methods to describe and analyze the data.
</p>
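<p>
To see where the “scatterplot with 10-12 points” comes from, here is a rough Python sketch of the low-frequency projection step. It assumes the cosine functions take the form <span class="math inline">\(\sqrt{2}\cos(j\pi s)\)</span> evaluated at <span class="math inline">\(s=(t-1/2)/T\)</span>, with <span class="math inline">\(q=12\)</span> retained frequencies, simulated data, and an uncentered correlation of the projections as a purely descriptive summary. These are my assumptions for illustration; the sketch does not attempt the paper’s inference procedures, and all function names are hypothetical.
</p>

```python
import numpy as np

def cosine_basis(T, q):
    """T x q matrix of the first q low-frequency cosine functions on a midpoint grid."""
    s = (np.arange(1, T + 1) - 0.5) / T
    j = np.arange(1, q + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(s, j))

def low_frequency_projections(x, q):
    """Coefficients from projecting a series onto the cosine basis."""
    x = np.asarray(x, dtype=float)
    Psi = cosine_basis(len(x), q)
    return Psi.T @ x / len(x)

# Simulated example: two noisy series sharing a slow-moving component.
T, q = 240, 12
rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(size=T)) / np.sqrt(T)
x = trend + rng.normal(size=T)
y = 0.5 * trend + rng.normal(size=T)

X_lf = low_frequency_projections(x, q)  # q numbers summarizing x's long-run variation
Y_lf = low_frequency_projections(y, q)

# Descriptive long-run correlation computed from only q pairs of points.
long_run_corr = (X_lf @ Y_lf) / np.sqrt((X_lf @ X_lf) * (Y_lf @ Y_lf))
print(X_lf.shape, long_run_corr)
```

<p>
However long the original series, everything about its long-run comovement is being read off <code>q</code> pairs of numbers, which is why explicit small-sample methods are needed downstream.
</p>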
<p>
See also: their survey paper on this research agenda, <a href="http://www.princeton.edu/~umueller/ULFE.pdf">“Low Frequency Econometrics”</a>
</p>
<ol start="9" style="liststyletype: decimal">
<li>
Mikkel Plagborg-Møller <a href="https://scholar.princeton.edu/sites/default/files/mikkelpm/files/irf_bayes.pdf">Bayesian Inference on Structural Impulse Response Functions</a>
</li>
</ol>
<p>
As an alternative to Bayesian VARs, which put priors on autoregression coefficients and then derive IRFs under some bizarre exact or partial exclusion restrictions, you can put priors on IRFs directly, essentially by putting priors on the moving average representation instead. This also has some nice asymptotic theory, showing that the autocovariances are the part which is actually identified, and the posterior converges to a distribution over the set of IRFs consistent with a given spectrum, with priors providing weights inside of that by marginalization to the identified set. The results here impose a finite order vector MA representation, which reduces comparability with VAR-based methods, though it doesn’t seem obviously worse.
</p>
<ol start="10" style="liststyletype: decimal">
<li>
Al-Shedivat, Wilson, Saatchi, Hu, and Xing <a href="http://arxiv.org/abs/1610.08936">Learning Scalable Deep Kernels with Recurrent Structure</a>
</li>
</ol>
<p>
While MCMC methods are now becoming practical for the small data regime in Bayesian nonparametrics, and conjugacy-based methods like Gaussian process regression allow work in the medium data regime, for even moderately large data sets, the cubic complexity of these methods makes them impractical, and so a huge literature on simplifications or approximations has developed. Restricting covariance processes allows reduction to quadratic (certain kernel approximations), linearithmic (spectral methods) or even linear time, but many of these methods cost a lot in terms of expressivity. A Gaussian process is already modeling the mean of the unobserved points as a linear function of the observed values, and approximations strongly restrict the coefficients. This paper, by, among others, some folks at CMU, offers a linear time GP approximation method which seems to offer a nice compromise. Using choice of points to approximate and some interpolation techniques, they can get the numerical approximation costs down a lot, but the method allows for quite complex kernels, including, in this case, a kernel parametrized by a (recurrent) neural network optimized along with the process, which allows quite a bit of flexibility. This kind of merger of classical nonparametric Bayes and neural network methods seems very promising, and this is just one bit of an explosion of approaches to combining the two methods, but this seems like a very practical one.
</p>
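<p>
For reference, the exact computation that these approximations are trying to avoid fits in a few lines. This is a sketch of textbook Gaussian process regression with a squared exponential kernel, not the method of the paper; the kernel choice, noise variance, and function names are my own illustrative assumptions. The Cholesky factorization of the <span class="math inline">\(n\times n\)</span> kernel matrix is the cubic-cost step, and the posterior mean is visibly a linear function of the observed values.
</p>

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel matrix between 1-d input arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise_var=1e-2):
    """Exact GP regression posterior mean at x_test."""
    n = len(x_train)
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)  # O(n^3): the bottleneck for large n
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # (K + noise I)^{-1} y
    return rbf_kernel(x_test, x_train) @ alpha  # linear in y_train

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 5.0, 30)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = np.linspace(0.0, 5.0, 7)

mean = gp_posterior_mean(x_train, y_train, x_test)
print(mean.shape)
```

<p>
Structured approximations replace the dense kernel matrix with something cheaper to factor or multiply, which is where the trade-off with expressivity described above comes in.
</p>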
<p>
See also: Many papers from the Blei lab at Columbia, with a variety of approaches for speeding up Bayes, often relying on variational inference, which is based on approximating posteriors via optimization of (normalization-constant-free) lower bounds. The Blei lab has a lot of work attempting to turn variational inference from black magic into science, though I still find it all quite mysterious. <a href="http://arxiv.org/abs/1601.00670">This tutorial</a> isn’t a bad intro for applications. The <a href="http://www.nowpublishers.com/article/Details/MAL001">classic book by Wainwright and Jordan</a>, which offers a theoretical perspective, is an imposing weight on my to-read list.
</p>
<div id="takeaways" class="section level2">
<h2>
Takeaways
</h2>
<p>
This year was big on Bayes for me, reflecting both my research interests and the movement in the profession, which is looking at principled approaches to mixing theory and data to allow the data to take precedence when it is abundant (nonparametric methods) and let theory pull some weight in the subcomponents of the problem where it is not (macro time series, for one, causal inference in observational data for another). This also reflects the long-awaited publication of the Ghosal and van der Vaart Bayesian Nonparametrics textbook, which sent me down several lines of inquiry and displaced reading papers for me for much of the summer.
</p>
</div>

Computational Methods for Economic Models with Function Valued States
/publication/thesis/
Sun, 01 May 2016 00:00:00 0400
<p>Portions of this PhD thesis have been adapted into the paper <em>Solution of Rational Expectations Models with Function Valued States</em>. The thesis contains additional results including a very different algorithm for the noncompact case, expanded analysis of the economic geography model, and additional numerical applications.</p>

A Nontechnical Introduction to My Thesis
/post/anontechnicalintroductiontomythesis/
Fri, 08 Jan 2016 00:00:00 +0000
<p><span style="fontsize: xsmall;"><u>Attention Conservation Notice</u>: Over 2000 words, written for an intended audience of non-economists, describing my thesis, which is supposed to be about Inequality, which you probably care about, but in fact is mostly about algorithms, which maybe only some of you care about. Most of the length is a discussion of rational expectations models which will be old news to economists and slightly bizarre to those who aren’t.</span><br />
<br />
Given that I am currently in the midst of completing graduate school and making the transition to other (potentially better) things, I thought it would be a good time to talk about what I’ve been doing with the past, say, 5 to 7 years of my life. Generally speaking, the process of a PhD thesis, especially in Economics, involves finding a Big Important Topic, then focusing on some specific aspect of that topic and reducing it down to a manageable, well-defined, technical question that can be answered, with some ingenuity and hard work, by a single student over a reasonable time frame, with the hope of adding some small bit to human understanding of the Big Important Topic.<br />
<br />
For me, the ‘Big Important Topic’ was <b>Inequality</b>. Some people are poor and some people rich. Why? How much? Who are the rich and the poor and where do they live? How has it changed over time, and when, and how will it change in the future? What can we do about it? What should we do about it? What does it mean for society, politics, the world? These are issues people care about, passionately, as evidenced by recent political movements and the astounding popular impact of work by <a href="http://www.amazon.com/CapitalTwentyFirstCenturyThomasPiketty/dp/067443000X/">Piketty</a> and others that attempts to make some progress in documenting and understanding recent and historical changes in inequality. In <a href="https://sites.google.com/site/davidbchilders/DavidChildersFunctionValuedStates.pdf">my thesis</a>, I provide the answer to, approximately, <i>none</i> of these questions.<br />
<br />
So, hold up, why not? What’s the problem? Well, the main issue is that figuring out what’s going on is <i>hard</i>. Although the profession is making rapid progress in the area of documenting inequality, and there is some admirable work beginning to untangle various aspects of the what, how, and why, this latter area is still far from consensus. I was pleased to see this morning a Paul Krugman post <a href="http://krugman.blogs.nytimes.com/2016/01/08/economistsandinequality/">acknowledging how little is still known</a> in the area of modeling the distribution of income and how and why it is changing, since this is precisely the shortcoming I’m trying to contribute, in some small way, to solving. How do we look at the historical relationships between all the things that go on in a society and economy (Krugman gives an example of one of these things as Piketty’s theory of ‘r-g and all that’) and the income distribution and go about figuring out what causes what and predicting the relationship in the future?<br />
<br />
There are various ways to go about looking at this question. One traditional way is to gather data on the income and wealth over time of (some representative sample of) individuals in the economy and come up with some descriptive model of how these are evolving and how people behave, then aggregate up to see what this implies for the distributions, maybe tweaking the model until the shape and behavior of the distribution and the behavior at the individual level match what we observe in reality. This approach is quite old, and arguably has taught us a lot about what kinds of patterns of behavior are needed to explain the shape of inequality, as documented going back at least to <a href="http://glineq.blogspot.com/2015/02/whatremainsofpareto.html">Pareto</a>. In particular, the contributions of, say, <a href="http://doi.org/10.2307/1911802">Mandelbrot (1961)</a> describing dynamic origins of thick tails in the distribution of income <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1257822">hold up reasonably well</a> as an approximate descriptive theory of the origins of upper tail inequality (i.e. the prevalence of the super-rich), though <a href="http://fguvenen.com/research/">more precise modern measurements</a> yield a slightly more nuanced picture with less stylized and homogeneous descriptions of individual behavior. Will this approach, suitably augmented with more precise modern data, yield the secrets of how we got here, where we’re going, and what we can do about it? I’ll offer a resolute ‘maybe.’<br />
<br />
To see what might be missing, I need to take a detour through, essentially, science fiction. A serious issue in the history of economic thought is what I like to call the ‘Foundation’ problem, after the work of science fiction writer Isaac Asimov, in which he envisioned future social scientists so advanced that they can accurately predict all the major events that will happen in a society decades or centuries into the future (clearly these were far-fetched works of speculative fiction). A major problem he envisioned in this work was that if you have such a theory, and people know this, it will affect their behavior, causing them to change how they act and invalidating the theory. In the books, he solves the issue by having social scientists keep their work secret, but this isn’t the only approach.<br />
<br />
Around the 1960s or so, some economists, worried about writing down self-invalidating theories, began to try another approach: write down a theory which contains itself, which can predict its own impact on society. This is not actually as hard as it sounds, or at least not usually. If you understand the computer science concept of recursion, or the mathematical concept of an eigenvector <a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> (or, more broadly, a fixed point), the theories incorporate a simple application of these techniques. Essentially, what models of this type, which have acquired the slightly misleading name of “rational expectations”, do is, instead of including a direct description of the behavior of each person, determined from data, describe what the person wants, what the person is capable of doing, and what they know, including a complete description of the model itself, and then work out what they will decide to do, given the environment around them, which is determined in part by what everyone else is deciding to do. People’s desires can then be inferred (in part) by looking at what they decide to do in different situations, and used to predict what they might do when faced with situations they haven’t yet experienced.<br />
<br />
In this way, economists hope to predict what happens when new policies are implemented or unprecedented situations arise, with predictions that don’t necessarily rely on people systematically ignoring the predictions of models which tell them what might happen and so how to avoid unfavorable outcomes. For example, say you have a person driving a car in a straight line along a highway. A crude behavioral model might observe this and presume that the car will always keep moving forward. But if there is a cliff straight ahead, a reasonable guess is that the driver will turn or stop before the cliff rather than continue to drive straight through and fall off. In practice, the earliest models of this type were quite crude, predicting that people would always avoid any preventable bad outcomes and so policy could not systematically make people better off. If you see a car going off a cliff, maybe the driver was Evel Knievel and he just loves driving over cliffs, or maybe they were suicidal and really wanted to fall down a ravine, but either way there’s no real point in intervening because the driver is getting just what they wanted. But of course sometimes an economy does fall off a proverbial cliff and based on the screams on the way down it certainly doesn’t seem like the driver is getting what they want. More sophisticated models incorporated a variety of possible impediments to people achieving the outcomes they want, such as ignorance, inability to cooperate, and so on. Sometimes the car’s windshield is fogged up and they don’t see the upcoming obstacle, or the brakes are out, or the driver doesn’t have the reaction time to avoid an obstacle in time, and even if they’re driving carefully, accidents can happen, and these ‘frictions’ are all more plausible guesses than that the driver was just going straight no matter what.<br />
<br />
This shift in focus, from crude behavioral rules to an investigation of the decision-making process and associated ‘frictions’ as sources of unfavorable outcomes led to widespread popularity of the rational expectations paradigm among academics across the ideological spectrum, and over time it became standard to incorporate in essentially any economic model. This was not without some dissent, since a case can be made that sometimes people really do just have no clue what they’re doing and behave according to <a href="http://www.amazon.com/ThinkingFastSlowDanielKahneman/dp/0374533555/">simple rules of thumb</a>. My personal read of this literature is that the criticism is entirely fair as a description of a wide range of human behavior, but that some people, sometimes, make incredibly sophisticated decisions and the tools of rational expectations modeling can help describe, at least approximately, these behaviors which may be extremely hard to extrapolate from simple rules observed in lab experiments. Maybe more pertinently, it can also highlight the cases when extrapolating the simple rules leads to really terrible decisions, in which case one might at least suspect that some subset of people will figure that out and decide to do something different. In any case, whether for good or bad reasons, in many parts of the economics profession, rational expectations modeling appears to be <a href="http://ineteconomics.org/ideaspapers/blog/matchingthemomentbutmissingthepoint"><i>de rigueur</i></a>, and it does appear to be the main tool used to ensure that our models not only match observed behavior, but also yield reasonable guesses of what might happen in new situations and under new policies.<br />
<br />
So, what does this have to do with inequality? Well, the tools used to figure out how an economy behaves under rational expectations are often rather computationally intensive. In fact, this is so much the case that it is often difficult or infeasible to solve such a model unless everybody behaves exactly the same way, or maybe no more than 2 or 3 different ways. Many models are in fact solved under the assumption of a single ‘representative agent’, to make the problem feasible. This had the side effect, beyond lack of realism, of making inequality completely disappear from macroeconomic models. If you wanted to study inequality and the macroeconomy, you either had to abandon the rational expectations paradigm, simplify the model drastically, possibly by requiring inequality not to change over time, find some impossibly clever special case in which the computational difficulties don’t arise, or some combination of these three. Much high quality work has been done in each of these cases, and economists interested in inequality certainly have put in substantial effort to adapt and learn within the limitations of each approach.<br />
<br />
To give an idea of the difficulty of the problem of studying inequality in a rational expectations setting, it has recently attracted the attention of Fields Medalist <a href="http://mfglabs.com/mathematicalresearch/">Pierre-Louis Lions</a>, who has built a framework for describing the evolution of inequality under rational expectations using coupled systems of forward and backward partial differential equations. While these results provide a general framework for describing the problem, so far results have been confined to the deterministic case, in which unpredictable random fluctuations which match the rough patterns over time observed in actual economic data are completely absent. As a result, the framework can only yield qualitative rather than quantitative predictions about how inequality will change, and so is unsuitable for modeling the detailed statistical data on inequality now being gathered by empirical economists.<br />
<br />
Finally, we get to my contribution. More or less, I came up with a way to solve models with inequality, rational expectations, and random shocks, in a computationally feasible way. Well, I came up with an algorithm to do so approximately, though with a provably accurate and precise approximation. In practice, this involves taking the description of the system and solving a special sort of eigenproblem, like an eigenvector problem, but over the functions representing the income distribution rather than finite vectors.<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> If this sounds familiar to you, you probably took a course in PDEs in college, or in some field that <a href="https://en.wikipedia.org/wiki/Wave_equation">studies</a> <a href="https://en.wikipedia.org/wiki/Heat_equation">applied</a> <a href="https://en.wikipedia.org/wiki/Schr%C3%B6dinger_equation">PDEs</a>, and that is essentially where I found the tools to find a solution. The hope is that this method can then be used to incorporate inequality into all sorts of economic models with variation in all sorts of economic variables, to compare to past variation and potential future outcomes, and evaluate different sources and effects of inequality quantitatively, in a framework compatible with economists’ standard tools for policy analysis. The dissertation also includes some very preliminary attempts to get started on this agenda, but the main contribution is the algorithm, which I hope will be of use to other economists who want to develop and test out their own theories of inequality, and make their own contributions to understanding this Big Important Topic.<br />
<br />
<br /></p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p><a href="http://www.scottaaronson.com/blog/?p=1820">Here</a> is a nice intuitive discussion explaining recursion, circular logic, and eigenvectors with some silly applications to the completely unserious topic of moral philosophy.<br /><a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>In fact, the results are significantly more general, and apply to any rational expectations model with some distribution changing over time, whether that’s income or wealth or something completely different like individual decisions or beliefs or locations or what have you. More generally, any functions can be used, so differences across firms or assets or cities or social network nodes can all be examined with the algorithm.<br /><a href="#fnref2" class="footnoteback">↩</a></p></li>
</ol>
</div>

Top Papers Read in 2015
/post/toppapersreadin2015/
Sun, 27 Dec 2015 00:00:00 +0000
/post/toppapersreadin2015/
<div>
So, inspired by <a href="http://www.econpointofview.com/2015/12/bestoldjournalarticlesireadin2015/">Brian</a> and the general spirit of end-of-year reflection, some thoughts on what I’ve read this year.
</div>
<div>
<br />
</div>
<div>
According to my reference manager software, I’ve read 183 papers this year, which is somewhat overstated because many were read last year but are dated incorrectly, and a substantial portion of the list contains slides, lecture notes, or other documents not quite meriting the status of article. I think as I’ve progressed through grad school I find it much easier to get through technical material than I once did, though some of the volume reflects a period in the spring when I was reading quite a bit to find a new research topic.
</div>
<div>
<br />
</div>
<div>
In, apparently, reverse alphabetical order, a list of papers I enjoyed this year, making no claim of endorsement of conclusions, just ones I enjoyed reading or from which I felt I learned a lot. A star indicates papers read directly for some sort of work.
</div>
<div>
<br />
</div>
<div>
Sanz-Solé, Marta (2008) <a href="http://www.ub.edu/plie/SanzSole/cursos/lecturenoteslondon.pdf">Applications of Malliavin Calculus to Stochastic Partial Differential Equations</a>
</div>
<div>
Rakhlin, Alexander & Karthik Sridharan (2015) <a href="http://arxiv.org/abs/1503.01212">Hierarchies of Relaxations for Online Prediction Problems with Evolving Constraints</a>
</div>
<div>
Paul, Arnab & Suresh Venkatasubramanian (2015) <a href="http://arxiv.org/abs/1412.6621">Why Does Deep Learning Work? - A Perspective From Group Theory</a>
</div>
<div>
Mukherjee, Sayan & John Steenbergen (2013) <a href="http://arxiv.org/abs/1310.5099">Random Walks on Simplicial Complexes and Harmonics</a>
</div>
<div>
Mohammed, Salah-Eldin, Tusheng Zhang, & Huaizhong Zhao (2008) <a href="http://bookstore.ams.org/MEMO196917">The Stable Manifold Theorem for Semilinear Stochastic Evolution Equations and Stochastic Partial Differential Equations</a>*
</div>
<div>
Méléard, Sylvie (1996) <a href="http://www.springerlink.com/index/G772652P33H7105R.pdf">Asymptotic Behaviour of some interacting particle systems; McKean-Vlasov and Boltzmann models</a>*
</div>
<div>
Kadri, Hachem et al. (2015) <a href="http://arxiv.org/abs/1510.08231">Operator-valued Kernels for Learning from Functional Response Data</a>
</div>
<div>
Jakab, Zoltan & Michael Kumhof (2015) <a href="http://www.bankofengland.co.uk/research/Documents/workingpapers/2015/wp529.pdf">Banks are not intermediaries of loanable funds — and why this matters</a>
</div>
<div>
Itô, Kiyosi (1983) <a href="http://www.springerlink.com/index/VLV462170NT62231.pdf">Distribution-valued processes arising from independent Brownian motions</a>*
</div>
<div>
Hairer, Martin (2014) <a href="http://www.hairer.org/notes/Regularity.pdf">Introduction to regularity structures</a>
</div>
<div>
Hahn, Jinyong, Guido Kuersteiner, Maurizio Mazzocco (2015) <a href="http://arxiv.org/abs/1507.04415">Estimation with Aggregate Shocks</a>
</div>
<div>
Guéant, Olivier, Jean-Michel Lasry, & Pierre-Louis Lions (2011) <a href="http://mfglabs.com/wpcontent/uploads/2012/12/parisprinceton.pdf">Mean field games and applications</a>*
</div>
<div>
Gao, Tingran (2015) <a href="http://arxiv.org/abs/1503.05459">Hypoelliptic diffusion maps I: tangent bundles</a>
</div>
<div>
Florens, Jean-Pierre & Sébastien Van Bellegem (2014) <a href="http://econpapers.repec.org/RePEc:eee:econom:v:186:y:2015:i:2:p:465476">Instrumental variable estimation in functional linear models</a>
</div>
<div>
Faust, Jon, & Eric M. Leeper (2015) <a href="https://www.kansascityfed.org/~/media/files/publicat/sympos/2015/econsymposiumfaustleeperpaper.pdf">The Myth of Normal: The Bumpy Story of Inflation and Monetary Policy</a>
</div>
<div>
Desmet, Klaus, Dávid Krisztián Nagy, & Esteban Rossi-Hansberg (2015) <a href="https://www.princeton.edu/~erossi/GD.pdf">The Geography of Development: Evaluating Migration Restrictions and Coastal Flooding</a>
</div>
<div>
Creal, Drew (2009) <a href="http://faculty.chicagobooth.edu/drew.creal/research/papers/creal2009survey.pdf">A survey of sequential Monte Carlo methods for economics and finance</a>
</div>
<div>
Cohen, Albert, Marc Hoffmann, & Markus Reiss (2004) <a href="http://math.uniheidelberg.de/studinfo/reiss/Publikationen/GalInvProb.pdf">Adaptive Wavelet Galerkin Methods for Linear Inverse Problems</a>
</div>
<div>
Chen, Xiaohong & Timothy Christensen (2015) <a href="http://dx.doi.org/10.1016/j.jeconom.2015.03.010">Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions I</a>*
</div>
<div>
Beylkin, G., Coifman, Ronald & Vladimir Rokhlin (1991) <a href="https://amath.colorado.edu/faculty/beylkin/papers/BECORO1991.pdf">Fast Wavelet Transforms and Numerical Algorithms I</a>*
</div>
<div>
Belloni, Alexandre, Victor Chernozhukov, & Kengo Kato (2014) <a href="http://economics.mit.edu/files/10492">Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems</a>
</div>
<div>
<br />
</div>
<div>
This list is by no means representative of my reading for the year, covering substantially less on topics where I read a lot due to research, especially on functional data analysis, and probably more on the mathematics than the economics side, I suppose because those results are more novel to me. The selection of papers is largely by whim and happenstance: I would do well to read more systematically, especially classic journal articles, rather than whatever recent work catches my eye. I could say a bit about why I liked each of these papers, but maybe better to keep this short, so just ask if you’re interested.
</div>
<div>
<br />
</div>
<div>
I will say that the standout, read in March and a clear winner for best paper I read this year, is <a href="http://wwwstat.wharton.upenn.edu/~rakhlin/">Rakhlin</a> and <a href="http://www.cs.cornell.edu/~sridharan/">Sridharan</a> on nonstationary prediction on graphs. It extends their <a href="http://wwwstat.wharton.upenn.edu/~rakhlin/courses/stat928/stat928_notes.pdf">essential earlier work</a> on regret-based statistical learning theory for nonstationary forecasting to learning about processes on graphs, bringing together results from a host of areas to elucidate the relationship between learning, structure, and computation. The work that Rakhlin and Sridharan are doing isn’t directly related to what I do day to day, but I think in the long run it stands a chance of completely changing the way we go about learning from and using data, so it’s very much worth keeping track of.
</div>

Aggregate shocks in cross-sectional data, or the alternative to a macroeconomic model isn't no macroeconomic model, it's a bad macroeconomic model
/post/aggregateshocksincrosssectionaldata/
Mon, 20 Jul 2015 00:00:00 +0000
/post/aggregateshocksincrosssectionaldata/
<p>Inspired by the release of <a href="http://arxiv.org/abs/1507.04415">a new and quite clear explainer</a> on the topic by Hahn, Kuersteiner, and Mazzocco (HKM) amid a growing trend of using microeconomic data to learn about macroeconomic or aggregate effects, I believe it’s a good time to write something about what microeconometricians and applied microeconomists ought to know about dealing with aggregate effects. Broadly, this refers to any time-dependent variability in a data-generating process that can’t be modeled as independent across individual observations. In economic data, this usually comes about from changes in prices, preferences, technologies, information, or institutions shared by all units of observation. Depending on the setting and the statistical methods used, even if one’s primary interest is in purely individual-level variation, this variability can affect the identification and estimation of micro-level parameters.</p>
<p>The primary way such effects are handled in microeconomic research, especially based on cross-sectional data, is to ignore them completely. The second most common way to handle variability in aggregates, when using panel or repeated cross-section data, is to include a set of time dummies or a time trend in the set of regressors. An even more sophisticated approach is to also allow for time-varying standard errors in the context of heteroskedasticity-robust inference. While each of these approaches is often reasonable, they implicitly embed assumptions regarding the aggregate environment which require justification if the estimates are to have a meaningful interpretation.</p>
<p>Consider the most common option, simply not incorporating any aggregate effects. If the sample is drawn from a specific population at a specific time, so that all units face the same aggregate environment, and a parameter is estimated from the variability within the population, a common interpretation is that the estimate is valid ‘conditional on’ the environment. Whether the estimate will hold outside of this environment is a question purely of external validity. Formally, using the notation of HKM, with units <span class="math inline">\(i=1,\ldots,n\)</span> all observed at time <span class="math inline">\(t\)</span>, a population of samples <span class="math inline">\(y_{it}\)</span> is drawn from a distribution <span class="math inline">\(f(y_{it}|\nu_t)\)</span>, where <span class="math inline">\(\nu_t\)</span> represents the characteristics of the time and environment in which the sample is drawn. Such a cross-sectional study can conceivably identify ‘parameters’ <span class="math inline">\(\gamma(f)\)</span> which are functionals of the distribution <span class="math inline">\(f(y_{it}|\nu_t)\)</span>. These functionals will in general depend on <span class="math inline">\(\nu_t\)</span>.
If they are estimated by a procedure with standard limiting properties, <a href="http://cowles.econ.yale.edu/~dwka/pub/p1153.pdf">Andrews (2005)</a> formalizes the argument that they are consistent ‘conditional on’ <span class="math inline">\(\nu_t\)</span> by applying the machinery of stable convergence, which provides a variant of weak convergence in which the limit distribution, instead of being, say, normal, is normal conditional on the information in <span class="math inline">\(\nu_t\)</span>.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> If the functional of interest does not depend on <span class="math inline">\(\nu_t\)</span>, or depends only on a subvector of <span class="math inline">\(\nu_t\)</span> which is shared with the target environment about which one wants to make a prediction, then the estimate can be considered to have external validity; otherwise it may not, which may or may not be a problem. For example, it may be that we are specifically interested in recovering some aspect of <span class="math inline">\(\nu_t\)</span>, like the return to education, which is determined by the structure of the labor market.</p>
<p>The above, however, is the most optimistic case, for two reasons. The first is that even if the true parameter is not a function of <span class="math inline">\(\nu_t\)</span>, its estimator might be, and so inference may be affected. Andrews provides limit theory for this case for linear regression and several similar estimators. More worryingly, one is often interested in parameters for which identification is affected by the presence of aggregate uncertainty. HKM offer a simple example about educational choices, archetypal for a broad class of problems in which one models decision-making. The basic idea is that we are interested in some variable over which individuals have a choice, like college major or corporate fixed capital expenditure, either because we are interested in how the decision is made or because we want to study the effects of this variable and do not have (quasi-)experimental variation. Standard practice then is to assume that the people making the decision are reasonably well-informed and make the decision with potential costs and benefits in mind, and try to infer those (potentially subjective) costs and benefits by observing the decisions and the outcomes thereof. Denoting the decision variable by <span class="math inline">\(y^1_{it}\)</span>, other individual-level outcomes and characteristics by <span class="math inline">\(y^2_{it}\)</span>, and the marginal benefits and costs as functions of these variables, up to unknown parameters <span class="math inline">\(\theta\)</span>, optimal decisions are characterized by the first order condition <span class="math inline">\(E[MB(y^1_{it},y^2_{it},\theta)]=E[MC(y^1_{it},y^2_{it},\theta)]\)</span>. The assumption that agents are reasonably well informed is often translated to assuming that the distribution of uncertainty in the outcomes of the choice matches the true distribution of outcomes (up to some distortions parameterized by the subjective cost and benefit functions). 
If this is the case, the parameters can be estimated by replacing the expectations with the empirical measure and finding the parameters which minimize the deviation from equality. This is the generalized method of moments, and this application is its raison d’être, for better or worse the source of interest in this method by econometricians.<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a></p>
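<p>To make the mechanics concrete, here is a minimal sketch of that estimation step on simulated data. The functional forms are my own assumptions for illustration, not HKM’s: marginal benefit is taken to be <code>theta * y2</code> and marginal cost to be <code>y1</code>, with instruments <code>(1, y2)</code>, an identity weighting matrix, and a crude grid search standing in for a proper optimizer.</p>

```python
import random

random.seed(0)

# Hypothetical setup (not from HKM): marginal benefit MB = theta * y2 and
# marginal cost MC = y1, so the first order condition is E[theta*y2 - y1] = 0.
THETA_TRUE = 1.5
n = 5000
y2 = [random.uniform(1.0, 2.0) for _ in range(n)]
y1 = [THETA_TRUE * b + random.gauss(0.0, 0.5) for b in y2]

# Sample moments computed once, so the objective is cheap to evaluate.
m_y1 = sum(y1) / n
m_y2 = sum(y2) / n
m_y1y2 = sum(a * b for a, b in zip(y1, y2)) / n
m_y2sq = sum(b * b for b in y2) / n


def sample_moments(theta):
    """Empirical analogue of E[(MB - MC) * z] for instruments z = (1, y2)."""
    g_const = theta * m_y2 - m_y1
    g_y2 = theta * m_y2sq - m_y1y2
    return g_const, g_y2


def gmm_objective(theta):
    """Squared deviation from equality, with identity weighting."""
    g_const, g_y2 = sample_moments(theta)
    return g_const ** 2 + g_y2 ** 2


# Crude grid search over theta in [0, 3]; a real application would use a
# numerical optimizer and an efficient weighting matrix.
grid = [k / 1000 for k in range(3001)]
theta_hat = min(grid, key=gmm_objective)
print(f"theta_hat = {theta_hat:.3f} (true value {THETA_TRUE})")
```

<p>With no aggregate shock in the simulated data, the estimate should land close to the true value, which is the benchmark the rest of the discussion departs from.</p>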
<p>So, what does including an aggregate shock <span class="math inline">\(\nu_t\)</span> do to this application? That depends on what it affects. Obviously if it enters directly into cost and benefit functions and can’t be represented by <span class="math inline">\(\theta\)</span>, this is an example of misspecification and causes bias, but nothing about this pitfall is unique to aggregate variables. The more interesting case is when the moments are correctly specified but the expectation is with regard to a set of outcomes which are affected by the aggregate uncertainty. The idea being, students decide to major in computer science, perceiving the expected wages to be high, but by the time they graduate, the industry slows down and the actual wages are not so high, or companies cut investment, perceiving that future sales will be low, but then demand picks up and the product is highly profitable. This could lead the naive researcher to two possible conclusions. If the cost and benefit functions are flexibly parameterized, the observation that most people made a decision with low observed returns could lead to estimates of subjective or unobserved benefits which are large, or costs which are low. The GMM estimates will say ‘students just really love computer science, so they take it even though the extrinsic rewards aren’t great.’ Alternatively, if fewer free parameters are provided, this could lead to the model specification being rejected, with the possible conclusion that individuals are not making informed decisions: students just don’t have a clue what their pay will be when they graduate and so choose majors which don’t pay. 
In both cases, the conclusion would be changed if the expectation were over a measure containing uncertainty in <span class="math inline">\(\nu_t\)</span>: the cross section estimates converge to <span class="math inline">\(E[MB(y^1_{it},y^2_{it},\theta)-MC(y^1_{it},y^2_{it},\theta)|\nu_t]\)</span>, which need not equal 0, even if <span class="math inline">\(E_\nu [E[MB(y^1_{it},y^2_{it},\theta)-MC(y^1_{it},y^2_{it},\theta)|\nu_t]]=0\)</span>.</p>
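<p>A small simulation can illustrate the computer science example. Everything here is a made-up toy model rather than anything estimated in HKM: students choose based on an expected wage <code>MU</code>, realized wages are <code>MU + nu_t</code>, and the researcher solves the sample moment <code>mean(theta * wage - choice) = 0</code>.</p>

```python
import random

random.seed(1)

THETA_TRUE = 1.0   # hypothetical preference parameter
MU = 2.0           # wage that students expect, E[wage]


def cross_section(nu_t, n):
    """Choices are made on the expected wage MU; wages are realized at MU + nu_t."""
    choices = [THETA_TRUE * MU + random.gauss(0.0, 0.3) for _ in range(n)]
    wages = [MU + nu_t + random.gauss(0.0, 0.3) for _ in range(n)]
    return choices, wages


def theta_from_moment(choices, wages):
    """Solve the sample moment mean(theta * wage - choice) = 0 for theta."""
    return (sum(choices) / len(choices)) / (sum(wages) / len(wages))


# One large cross section drawn in a bad aggregate state (nu_t = -0.5).
choices, wages = cross_section(nu_t=-0.5, n=20000)
theta_single = theta_from_moment(choices, wages)

# Many smaller cross sections whose aggregate shocks average out to zero.
all_choices, all_wages = [], []
for _ in range(200):
    c, w = cross_section(nu_t=random.gauss(0.0, 0.5), n=200)
    all_choices += c
    all_wages += w
theta_pooled = theta_from_moment(all_choices, all_wages)

print(f"single cross section: {theta_single:.3f}")
print(f"pooled over periods:  {theta_pooled:.3f}")
```

<p>In the single bad-state cross section the estimate is pushed toward <code>THETA_TRUE * MU / (MU - 0.5)</code> no matter how large the sample, which is the ‘students just really love computer science’ conclusion; averaging over many aggregate states pulls it back toward the true value.</p>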
<p>In the above example, in addition to more serious examples where the variation in aggregates must be incorporated explicitly into the model to be correctly specified, the solution is to use information on variation in the aggregates.<a href="#fn3" class="footnoteref" id="fnref3"><sup>3</sup></a> A simple way to do this is with long panels or repeated cross sections. A few words are in order regarding the asymptotics in these cases. An important point, emphasized by HKM, is that parameters identified by variation in aggregate variables generally have a distribution theory which is dominated by the aggregates. Consider the GMM example above, where aggregate variables do not enter explicitly into the formulation of the estimator at all. If the conditional distribution of the idiosyncratic variables is allowed to be affected by the aggregates in an arbitrary manner, the convergence of the empirical measure <span class="math inline">\(\frac{1}{NT}\sum_{t=1}^T\sum_{i=1}^N g(y_{it},\theta)\)</span> to the joint measure is at a rate which depends only on T! That is, if the aggregate variables are stationary and, say, mixing, and both N and T approach infinity, the error in the approximation is <span class="math inline">\(O(1/\sqrt{T})\)</span>. N doesn’t enter at all, so long as it grows to infinity.<a href="#fn4" class="footnoteref" id="fnref4"><sup>4</sup></a> The reason for this is pretty simple: these variables are not independent over i and t, since all variables at a given time are drawn from a distribution which depends on <span class="math inline">\(\nu_t\)</span>, and so effectively, each cross-section counts as a single observation of an aggregate.</p>
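<p>The rate claim is easy to check by simulation in the simplest case I can think of, which is an assumption of mine rather than HKM’s setup: <code>y_it = nu_t + e_it</code> with independent standard normal shocks, so the pooled mean has variance <code>1/T + 1/(N*T)</code>. The sketch below compares the Monte Carlo spread of the pooled mean when N grows versus when T grows.</p>

```python
import random
import statistics

random.seed(2)


def pooled_mean(n_units, n_periods):
    """Average of y_it = nu_t + e_it over all units and periods."""
    total = 0.0
    for _ in range(n_periods):
        nu_t = random.gauss(0.0, 1.0)  # aggregate shock shared by all units
        total += sum(nu_t + random.gauss(0.0, 1.0) for _ in range(n_units))
    return total / (n_units * n_periods)


def sd_of_pooled_mean(n_units, n_periods, reps=200):
    """Monte Carlo standard deviation of the pooled mean (true mean is 0)."""
    draws = [pooled_mean(n_units, n_periods) for _ in range(reps)]
    return statistics.pstdev(draws)


sd_base = sd_of_pooled_mean(n_units=10, n_periods=20)
sd_more_units = sd_of_pooled_mean(n_units=100, n_periods=20)    # 10x larger N
sd_more_periods = sd_of_pooled_mean(n_units=10, n_periods=80)   # 4x larger T

print(f"N=10,  T=20: {sd_base:.3f}")
print(f"N=100, T=20: {sd_more_units:.3f}")
print(f"N=10,  T=80: {sd_more_periods:.3f}")
```

<p>Multiplying N by ten barely moves the spread, while multiplying T by four roughly halves it, as the <code>1/sqrt(T)</code> rate suggests.</p>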
<p>The case of completely arbitrary dependence creates difficulty for empirical researchers, given the relative rarity of data sets with long duration. While some effects may only be identified by this kind of variation (especially those which are “really” macroeconomic, like responses to aggregate variables, even if at the individual level), one can learn a lot from cross-sectional variation even in the presence of aggregate shocks so long as there exists some kind of structure to the relationship. “Independence” is a very strong structure, but can reasonably be conjectured for some objects which have no reason to be affected by aggregate variables, or at least ones that vary at the scales one is interested in. More generally, additive effects constant across units are often imposed in estimation: in the case of panel data, this allows aggregate effects to be purged by the use of time dummies. Dropping the assumption of constant effect across units, time dummies and heteroskedasticity-robust variance together allow purging of purely additive effects. This assumption is quite powerful, allowing rates of convergence to go from depending only on T to depending only on N (in certain cases: in nonlinear models, purely idiosyncratic variability may induce dependence of estimates on T). Still, it is a strong assumption since it rules out any interaction of aggregates with other variables. Structured interactions (multiplication with a time dummy) can also be handled with little cost in rates of convergence.</p>
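<p>As a sketch of the additive case, suppose (purely for illustration) that <code>y_it = BETA * x_it + nu_t + e_it</code> and that the regressor is itself correlated with the aggregate shock. Subtracting period means from both variables gives the same slope as including a full set of time dummies, by the Frisch-Waugh-Lovell theorem, and purges <code>nu_t</code>.</p>

```python
import random

random.seed(3)

BETA = 2.0
N_UNITS, N_PERIODS = 100, 100


def ols_slope(xs, ys):
    """Slope from a regression of ys on xs with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Simulated panel: period -> list of (x, y) observations.
panel = {}
for t in range(N_PERIODS):
    nu_t = random.gauss(0.0, 1.0)              # additive aggregate effect
    panel[t] = []
    for _ in range(N_UNITS):
        x = nu_t + random.gauss(0.0, 1.0)      # regressor correlated with nu_t
        y = BETA * x + nu_t + random.gauss(0.0, 1.0)
        panel[t].append((x, y))

# Pooled regression that ignores the aggregate effect.
all_x = [x for obs in panel.values() for x, _ in obs]
all_y = [y for obs in panel.values() for _, y in obs]
beta_pooled = ols_slope(all_x, all_y)

# Demeaning within each period is equivalent to adding time dummies.
x_dm, y_dm = [], []
for obs in panel.values():
    mean_x = sum(x for x, _ in obs) / len(obs)
    mean_y = sum(y for _, y in obs) / len(obs)
    x_dm += [x - mean_x for x, _ in obs]
    y_dm += [y - mean_y for _, y in obs]
beta_dummies = ols_slope(x_dm, y_dm)

print(f"pooled OLS:        {beta_pooled:.3f}")
print(f"with time dummies: {beta_dummies:.3f} (true value {BETA})")
```

<p>The pooled slope absorbs part of the aggregate effect, while the demeaned slope does not. The trick works only because the aggregate effect is additive and common to all units; any interaction with <code>x</code> would survive the demeaning.</p>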
<p>If more general relationships, such as linear factor structure, are desired, identification generally requires growth in both N and T. Here aggregate shocks are allowed to have systematically different effects on different units, but effects are linear and dimensionality of the heterogeneity is restricted. This particular structure and its variants have spawned a large literature: the case in which the factors and their effects are a nuisance parameter and the object of interest is a regression coefficient for an observable covariate is referred to as the ‘interactive fixed effects’ model: see the many contributions of <a href="http://wwwbcf.usc.edu/~moonr/research.html">H.R. Moon</a> on this topic. If one thinks of unit fixed effects as a type of permanent temporal dependence unchanged at all times for an individual and time effects as a type of cross sectional dependence unchanged at all times across individuals, interactive fixed effects allow for combinations of these two extremes. If one is willing to assume all microeconomic and macroeconomic effects can be subsumed within a linear model, this offers very substantial generality.</p>
<p>It is also popular to overcome these issues by imposing a structure which allows sequential identification of parameters regarding the cross-section which can then be used in a time series context, or vice versa. These methods are particularly popular in the fields of finance and accounting, where reasonably long panel data sets are available for assets or firms, and <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1175614">a number of heuristic approaches</a> to multi-step inference have been developed. These are often called ‘Fama-MacBeth regressions’ after the approach used by Fama and MacBeth in tests of the CAPM, which involved running time-series regressions for each stock to find its beta and then using the estimated betas as data in a cross-section regression of returns on beta.</p>
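<p>The two-step procedure can be sketched on simulated returns. The data-generating process below (a one-factor model with made-up numbers) and all variable names are assumptions for illustration; the second step follows the common variant in which a cross-section regression is run each period and the slopes are averaged.</p>

```python
import random
import statistics

random.seed(4)

N_STOCKS, N_PERIODS = 50, 120

# Simulated data under a one-factor model (all numbers are made up):
# r_it = beta_i * f_t + e_it, with f_t the market excess return.
true_betas = [random.uniform(0.5, 1.5) for _ in range(N_STOCKS)]
market = [random.gauss(0.5, 2.0) for _ in range(N_PERIODS)]
returns = [[b * f + random.gauss(0.0, 1.0) for f in market] for b in true_betas]


def ols(xs, ys):
    """Return (intercept, slope) from regressing ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


# Step 1: a time-series regression per stock gives its estimated beta.
beta_hats = [ols(market, r_i)[1] for r_i in returns]

# Step 2: each period, regress the cross section of returns on the betas.
premia = [
    ols(beta_hats, [returns[i][t] for i in range(N_STOCKS)])[1]
    for t in range(N_PERIODS)
]

# Average the period-by-period slopes; their time-series spread gives the
# standard error.
premium_hat = statistics.fmean(premia)
premium_se = statistics.stdev(premia) / N_PERIODS ** 0.5

print(f"estimated risk premium: {premium_hat:.3f} (s.e. {premium_se:.3f})")
print(f"average market return:  {statistics.fmean(market):.3f}")
```

<p>Because the betas in step 2 are themselves estimates, the slope is subject to errors-in-variables attenuation, and the simple standard error here ignores that first-step noise, which is one of the reasons these approaches are described as heuristic.</p>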
<p>The case covered by HKM imposes a similar but more general class of structures, with the advantage of allowing any kind of nonlinearity in the cross-sectional and aggregate effects and also the advantage of properly accounting for the aggregate uncertainty when constructing standard errors, which standard approaches often ignore. To be precise, what they impose is a kind of separability assumption: given one data set with a large cross section but short T and one data set with a long span but only aggregate variables, one can identify the full set of model parameters. The examples they give have the kind of ‘triangular’ identification structure as in Fama-MacBeth, in which a subset of parameters are identifiable from one data set, and then once those parameters are known the other parameters are identifiable from the other, but the high level conditions they impose don’t require this, and the methods allow cases in which, for example, the cross section identifies the sum of two parameters and the time series their difference.</p>
<p>The idea is simple. Consider again the education choice example. While in principle the decision could depend on quite a variety of cultural, institutional, and macroeconomic factors, the most directly relevant one is the wages of occupations available to different majors. If we can model how these evolve and parameterize the decision rule as a function of wages, we may then hope to have a measure of reasonably expected costs and benefits of the choice at any fixed time which does not require observing a long time series of decisions in different environments, just observing the environment in the cross section we do have and plugging in a reasonable forecast given that environment into the decision rule. A nice feature of this method is that this forecast need not come from the same data set. If we have a separate data set with a long time series of wages by occupation, we can estimate a forecasting rule from that data and use the estimates to adjust the measures of expected costs and benefits in a data set with a cross-sectional sample of individuals. With this information, one can either back out the subjective benefits of the decision or more accurately assess whether the decisions were ex ante reasonable and well-informed. The downside of methods of this type relative to using a long panel under the assumption of arbitrary dependence is the need to specify a forecasting rule which can be estimated using only a time series of aggregate information (or possibly, such a time series and a small set of parameters which can be inferred from a cross section). This requires taking a stand on how to model aggregate variables and their effects on individuals, but then so does assuming no effect and proceeding with only cross-sectional data.</p>
<p>Inference in this setup takes a form which is very similar to standard two-step estimation, to account for the estimation error in the forecasting rule when estimating the cross-section parameters. The paper also expends quite a bit of effort on the case in which the forecasting rule involves a unit root or near-unit root process. In addition to the usual complications from unit root limit theory,<a href="#fn5" class="footnoteref" id="fnref5"><sup>5</sup></a> there is a conceptual issue in this case, in that the prediction from a unit root process is history dependent, with initial conditions never washing out in the limit. Since the cross-sectional data is observed at some point in the history of the process, the limiting uncertainty about the process is affected by the point in history at which the cross-section is observed.</p>
<p>The overall message of this line of inquiry is that ‘individual’ and ‘aggregate’ effects cannot always be cleanly separated, and that when one depends on the other, our knowledge of both may be limited by the one on which information is most scarce. Information can be economized on by imposing structure, but since generally speaking the information on aggregates is most limited, this may require explicit modeling of aggregates. In other words, seeking to answer microeconomic questions does not offer an escape from the need to think about macroeconomics. Dreary, right?</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Not to be confused with convergence to a <a href="https://en.wikipedia.org/wiki/Stable_distribution">stable distribution</a>. Stable limit theorems strictly speaking differ from convergence in distribution, as the limiting distribution in convergence in distribution need not live on the same probability space as the variables approaching the limit and so does not necessarily have any particular relationship with the conditioning variable. Stable convergence is, however, still weaker than the almost sure convergence ensured by, say, the <a href="https://en.wikipedia.org/wiki/Skorokhod%27s_representation_theorem">Skorokhod representation theorem</a> which ensures the existence of a representation of the data which is measurable with respect to the same sigma-field as the limiting variable, or strong coupling results like the <a href="https://en.wikipedia.org/wiki/Koml%C3%B3s%E2%80%93Major%E2%80%93Tusn%C3%A1dy_approximation">KMT strong approximation</a> which find a sequence of approximating variables which live on the same space as the data. Instead, the limit variable lives on a separate sigma-field which contains the sub-sigma-field generated by the conditioning variable but not the full sigma-field on which all the data live. (See definition 1 in HKM for a formal statement). The advantage of threading the needle in this way is that one can retain the influence of the ‘aggregate’ variables while relying only on conditions much closer to those used to ensure weak convergence, like a standard (martingale) central limit theorem. As most <a href="https://ideas.repec.org/h/eee/ecochp/436.html">limit theory in econometrics</a> relies on weak convergence results, partly for robustness and partly for historical reasons, this approach can be applied to most standard and many nonstandard estimators.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>It was later found that GMM offers an organizing framework for a wide variety of estimation procedures and nowadays it finds many applications divorced from this specific economic context. But Lars Hansen’s motivating examples were in asset pricing, where the consumption Euler equation takes precisely the form of equating subjective marginal cost and benefit.<a href="#fnref2" class="footnoteback">↩</a></p></li>
<li id="fn3"><p>As I understand it this is indeed being done in modern labor economics research, with substantial focus on how the decisions are influenced by the economic environment, making the example a bit of a straw man, but properly accounting for the uncertainty induced by this variation is still an important issue.<a href="#fnref3" class="footnoteback">↩</a></p></li>
<li id="fn4"><p>This result is similar to but not quite what HKM show in Section 3.1 where they discuss the issue informally, or in their stable limit theorem, which applies to a different case. In a class paper I wrote a few years ago there is a formal proof of a CLT for long panels with aggregate shocks by triangular array asymptotics which gives a precise statement. Interestingly, the proof is not my own: I explained the problem to Don Andrews during office hours and he derived the result within about 15 minutes.<a href="#fnref4" class="footnoteback">↩</a></p></li>
<li id="fn5"><p>Basically, establishing convergence to stochastic integrals by showing weak convergence by tightness in the <a href="https://en.wikipedia.org/wiki/C%C3%A0dl%C3%A0g#Skorokhod_space">Skorokhod topology.</a> While highly involved and nontrivial, the unit root case avoids one technical complication common to other models built on empirical processes, as the empirical process in the estimator for the unit root model is measurable in the sigma-field over the Skorokhod topology, while empirical processes in general need not be measurable, something of a complication for stable convergence. This complication shows up in a variety of nonsmooth or semiparametric estimators such as simulated method of moments or quantile regression which one might want to use for either the cross-section or time-series component of the model. In the standard case this is resolved by the machinery of weak convergence in outer measure; I suspect that it could be extended to stable convergence in a straightforward fashion, but it remains to be done.<a href="#fnref5" class="footnoteback">↩</a></p></li>
</ol>
</div>

Why I’m blogging
/post/whyimblogging/
Wed, 03 Jun 2015 00:00:00 +0000
/post/whyimblogging/
<p>In an effort to get into the habit of writing down my thoughts, I am, ill-advisedly, experimenting with starting a blog. I expect to cover mainly topics in statistics, econometrics, machine learning, and numerical computation, with some chance of also entertaining thoughts on how these relate to macroeconomics. Actual macroeconomics will be kept to a minimum, as that topic attracts sufficient attention online already, except in the case that I lack the self-control to avoid arguing on the internet. I also expect that unless I begin to use it for brief summaries and comments, this won’t be frequently updated, and may die an untimely death in the face of more pressing matters. The main uses are likely to be working through concepts that I only partially understand, in order to clarify my thoughts and to provide a benchmark so that when or if some kind of an understanding is reached, there will be a public record on the internet of my <strike>ignorance</strike> intellectual evolution. If the record of such a process also serves to draw the interest of the reader into the topics that preoccupy me and direct them to more reliable sources, all the better.</p>

Why Laplacians?
/post/whylaplacians/
Wed, 03 Jun 2015 00:00:00 +0000
/post/whylaplacians/
<div>
<span style="fontsize: xsmall;"><u>Attention Conservation Notice</u>: Over 5000 words about math that I don’t particularly understand, written mostly to clarify my thoughts. A reader familiar with the topic (roughly, spectral or harmonic theory on graphs and manifolds) will find little here new except possibly misconceptions, while a reader not familiar with the topic will find minimal motivation and poorly explained jargon. The ideal reader is a pedant or troll who can tell me why I’m wrong about everything. Otherwise, if you want to learn the topic, maybe go read a <a href="http://arxiv.org/find/all/1/all:+EXACT+graph_Laplacian/0/1/0/all/0/1">random</a> <a href="http://www.journals.elsevier.com/appliedandcomputationalharmonicanalysis/">article</a> instead? </span><br />
<br />
</div>
<p>In the rapidly developing realm of data analysis on graphs and manifolds, as well as in a variety of fields of applied and maybe pure mathematics, a primary concern is characterizing the geometric structure of a space and describing the properties of the objects that live there.</p>
<p>A typical example of a question this field tries to answer is manifold estimation or unfolding: given a bunch of points, sampled from a (sub)manifold in a higher-dimensional space, construct a reasonable estimate of that lower-dimensional shape on which the points live.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> Though methods for the general case are rather new,<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> when we have a <i>linear</i> subspace this is simply the familiar problem of Principal Components Analysis or Factor Analysis, dating back to the 1920s.</p>
<p>Alternately, we may only be interested in some limited set of geometric features of the space, such as its connected components: this is (one way to look at) the problem of clustering. If we are more ambitious, we might want to associate some type of data to the points of a space, which we may or may not know ex ante. Depending on the type of data, and on the underlying space, this goes by various names. When the space is known, it can be manifold regression, graph regression or smoothing, density estimation on manifolds or graphs, or <a href="https://lts2research.epfl.ch/gsp/">graph signal processing</a>. When the structure is unknown, but assumed to be approximately linear, one has reduced rank regression, principal components regression, or some category of factor model. To the extent that the assignment of the data bears some relationship to the underlying space (for example, the assignment is continuous or differentiable, or some <a href="http://en.wikipedia.org/wiki/Differentiable_manifold#Differentiable_functions">analogue thereof</a>), estimating this relationship must also take into account some of the geometric features of the space.</p>
<p>In an area well-explored in differential geometry but only <a href="https://web.math.princeton.edu/~hauwu/SingerWu2012.pdf">beginning</a> to be <a href="http://arxiv.org/abs/1503.05459">explored</a> in geometric data analysis, the object to be related to the space of interest may itself incorporate some structure which is related to the structure of the underlying space. For example, it may live on or be some kind of <a href="http://en.wikipedia.org/wiki/Fiber_bundle">fiber bundle</a>, such as a vector bundle or tangent bundle, which describe flows over a manifold (or graph) and their directions and speed. So, one may consider (archetypally) wind on the surface of the earth, or messages over a network, or <a href="http://arxiv.org/abs/1406.0013">workers changing jobs as moving around in the space of possible occupations</a>.<a href="#fn3" class="footnoteref" id="fnref3"><sup>3</sup></a> These spaces themselves have a manifold structure which can be estimated or exploited in characterizing the data. One can likewise imagine (if you’re the kind of person who does that sort of thing), extending this kind of analysis to <a href="http://www.math.upenn.edu/~ghrist/EAT/EATchapter9.pdf">more general structures.</a></p>
<p>Given the rather general nature of these spaces, one has many options for describing their geometric properties. A graph is defined in terms of its nodes and edges and a manifold in terms of its <a href="http://en.wikipedia.org/wiki/Atlas_(topology)">charts and transition maps</a>, but these are hard to work with and we would like an object from which we can easily compute whatever properties we need of the space. Some space of objects isomorphic to (or, in a pinch, surjective onto) the space of all (relevant) manifolds or graphs seems like the ideal choice, and if one can easily extract summaries, in some sense, even better.</p>
<div id="whatarelaplacians" class="section level3">
<h3>What are Laplacians?</h3>
<p>While not universal, for a large portion of the people working in this field, the object of choice is overwhelmingly the <i><a href="http://en.wikipedia.org/wiki/Laplace_operator">Laplacian</a></i>, denoted <span class="math inline">\(\Delta\)</span> or <span class="math inline">\(\nabla \cdot \nabla\)</span> or <span class="math inline">\(\nabla^2\)</span>, a differential (or, for graphs, difference) operator over functions defined on the space. This is usually defined as the divergence of the gradient: for a Euclidean space <span class="math inline">\(\mathbb{R}^n\)</span>, this is simply the sum of the second partial derivatives in each direction <span class="math inline">\(\Delta=\sum_{i=1}^n \partial_i^2\)</span>. Over a graph, if D is the matrix with the number of edges of each node (in some fixed but arbitrary order) on the diagonal and A the adjacency matrix with a 1 in position (i,j) if there is an edge between i and j, and a 0 otherwise (including on the diagonal), the Laplacian is simply <span class="math inline">\(L:=D-A\)</span>. While in the graph case it is easy to see that this object contains all the information in the structure of the graph, the form of a linear operator, and why this particular linear operator, always seemed a bit mysterious to me. So, I thought it would be worthwhile to lay out my thoughts on different ways of looking at this object that might explain how it’s used and why it takes pride of place in structured data analysis.</p>
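<p>To make the graph definition concrete, here is a small sketch that builds the degree matrix, adjacency matrix, and Laplacian for a toy graph. The graph and the function on its nodes are arbitrary choices for illustration. The last lines check one sense in which this matrix is a difference operator: its quadratic form is the sum of squared differences across edges.</p>

```python
import numpy as np

# Hypothetical undirected graph on 4 nodes: the path 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0        # symmetric adjacency, zeros on the diagonal

D = np.diag(A.sum(axis=1))         # number of edges at each node on the diagonal
L = D - A                          # graph Laplacian

f = np.array([1.0, 4.0, 2.0, 7.0]) # an arbitrary function on the nodes

energy = f @ L @ f                                       # quadratic form f' L f
edge_sum = sum((f[i] - f[j]) ** 2 for i, j in edges)     # squared edge differences
print(energy, edge_sum)
```

<p>Constant functions are sent to zero (every row of the matrix sums to zero), mirroring the fact that the continuous Laplacian annihilates constants.</p>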
One issue is that there are many different definitions of the Laplacian because there are many different definitions of a derivative, gradient, and divergence, depending on the space in which they act and other desiderata. Given some definition of a gradient, divergence is usually defined as the adjoint, with the inner product determined by the space, and so many cases are identical up to the choice of space and derivative. For example, the graph case corresponds to taking differences over connected vertices.<br />
<ul>
<li>
Over manifolds, where the derivative no longer can be represented as an element of the space, there are at least two common formulations:
</li>
<ul>
<li>
On Riemannian manifolds, the Laplace-Beltrami operator (or connection Laplacian) is defined in terms of the Levi-Civita connection, a covariant derivative in the direction of the tangent space, or equivalently as the trace of the Hessian (defined in terms of covariant derivatives).
</li>
<li>
Using instead the <a href="http://en.wikipedia.org/wiki/Exterior_derivative">exterior derivative</a> <span class="math inline">\(\partial\)</span> and codifferential <span class="math inline">\(\delta\)</span>, which act on the <a href="http://en.wikipedia.org/wiki/Exterior_algebra">exterior algebra</a> of differential forms of a space, one can construct the Laplace-de Rham operator (or Hodge Laplacian) on differential forms as <span class="math inline">\(\partial\delta +\delta\partial\)</span>.
</li>
</ul>
<li>
If the graph Laplacian is seen as generalizing the Laplace-Beltrami operator to graphs, the <a href="http://arxiv.org/pdf/1310.5099v1.pdf">combinatorial Laplacian</a> can be seen as generalizing the Laplace-de Rham operator to simplicial complexes, higher-order structures over graphs, with exterior derivative and codifferential replaced by boundary and coboundary maps, respectively, with differences among faces connected by a coface or cofaces connected by a face.
</li>
</ul>
Even more exotic spaces and derivatives give similar analogues, of varying degrees of applicability. Generating new Laplacians seems to be a hobby for mathematicians, so this is a highly incomplete list.<br />
<div>
<ul>
<li>
Malliavin calculus induces a gradient on stochastic processes, the Malliavin derivative, and a corresponding adjoint, the Skorohod integral, familiar in the case of adapted processes as the Itō integral. Successive application yields a stochastic analogue of the Laplacian, the Ornstein-Uhlenbeck operator, so named since it is the infinitesimal generator of a standard Ornstein-Uhlenbeck process, the “continuous AR(1)”.
</li>
<li>
For stochastic processes defined over more general manifolds, one may generalize the “usual” Malliavin derivative and OU operator analogously to how one generalizes the Euclidean derivative to manifolds, <a href="http://www.xuemei.org/ICMart.pdf">though apparently some difficulties arise</a>.
</li>
<li>
The <a href="http://en.wikipedia.org/wiki/Casimir_element">Casimir operator</a> plays the role of the Laplacian on a Lie algebra, and its <a href="http://en.wikipedia.org/wiki/Automorphic_form">eigenelements</a> play a role in harmonic analysis of these spaces similar to the role played by eigenfunctions of the Laplacian in standard harmonic analysis.
</li>
<li>
The <a href="https://video.ias.edu/speciallectures/1213/0326JeanMichelBismut">hypoelliptic Laplacian</a> does… something. Honestly I was hopelessly lost ten minutes into this presentation.
</li>
</ul>
</div>
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;">
<h3 id="applicationshowthelaplacianisused">Applications: How the Laplacian is Used</h3>
<p><br />
While Laplacians can be described for many spaces, data usually takes the form of a collection of points which may or may not be known to lie in (or near) the space of interest, which often must be estimated. To determine this space or functions on it, many procedures start by estimating the Laplacian or some of its properties, in particular its eigenvalues and eigenfunctions, and possibly some functions constructed from these functions, and these objects are taken to define the space of interest.<br />
<br />
The way this works is as follows: from a set of points, if they do not already have a known graph structure, one can construct a (possibly weighted) graph by connecting nearby points (in some metric, usually Euclidean). One then takes the graph Laplacian of this graph and calculates its eigenvectors. Under some standard conditions, if more and more points are drawn from a compact submanifold in Euclidean space, the graph Laplacian <a href="http://knight.cis.temple.edu/~latecki/Courses/RobotFall08/Papers/DiffusionMaps06.pdf">converges</a> to the Laplace-Beltrami operator of the manifold and the eigenvectors converge to its eigenfunctions. The proofs of this are remarkably straightforward, essentially copying convergence results for kernel density estimators.<br />
<br />
When the manifold to be considered has intrinsic dimension smaller than that of the ambient space, it can be well described by only the first few eigenvectors, which define a coordinate system over the manifold. For the linear case, this is the principle of multidimensional scaling; in the nonlinear case, it is the idea behind Laplacian Eigenmaps and Diffusion Maps as manifold estimation methods. In defining a coordinate system on the manifold or graph, these vectors can also be used to parameterize distance in the data set, and so give a sense of what parts of the data are closely connected.<br />
<br />
One of the most popular uses of this measure is to detect related communities or clusters in a data set. Spectral clustering takes the eigenvector corresponding to the second-smallest eigenvalue of the (possibly weighted or perturbed, depending on the estimation method) symmetrized graph Laplacian <span class="math inline">\(D^{-1/2} L D^{-1/2}\)</span> and partitions it (via any standard clustering algorithm, usually k-means). If the graph has multiple connected components, this recovers them: there are also consistency guarantees for recovering communities which are not completely disconnected, such as under the <a href="http://arxiv.org/abs/1312.1733">stochastic block model</a>, or a <a href="http://arxiv.org/abs/1406.3387v1">variety of clustering measures</a>.<br />
<br />
Distance measures also give a definition of local which is intrinsic to the structure of the data and possibly lower dimensional than the ambient space, which makes them useful to describe what it means for a function on the data to be continuous or smooth: it should not vary too much over small distances. Since the Laplacian is a differential operator, it acts on differentiable functions, and its eigenfunctions are differentiable and can be taken to <i>define</i> what it means to be smooth over a graph or manifold. The sequence of eigenfunctions then provides a Fourier-type basis which can be used to approximate arbitrary smooth functions over the space on which the data concentrates. One may also decompose these functions into local components at multiple scales to generate versions of wavelet analysis. On a known manifold, such a construction generates <a href="http://www.statslab.cam.ac.uk/~nickl/Site/__files/ptrf12.pdf">needlets</a>, on a fixed graph one can produce <a href="https://lts2research.epfl.ch/gsp/doc/demos/gsp_demo_wavelet.php">graph wavelets</a>, and using the estimates of the manifold provided by diffusion maps one can generate <a href="http://www.math.duke.edu/~mauro/Papers/DiffusionWavelets.pdf">diffusion wavelets</a>. These can then be used like conventional wavelets for smooth function and density estimation and compression.<br />
<br /></p>
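<p>The pipeline just described (points, then a weighted graph, then the symmetrized Laplacian, then a partition of the second eigenvector) can be sketched in a few lines. The point configuration, the Gaussian bandwidth, and the use of a simple sign split in place of k-means are all assumptions made to keep the example small; this is a toy version, not a reference implementation of any of the cited methods.</p>

```python
import numpy as np

# Hypothetical data: two 5x5 grids of points in the plane, separated horizontally.
g = np.linspace(0.0, 1.0, 5)
block = np.array([(a, b) for a in g for b in g])
pts = np.vstack([block, block + np.array([2.5, 0.0])])

# Weighted graph: Gaussian kernel weights between all pairs of points.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)

deg = W.sum(axis=1)
L = np.diag(deg) - W                         # weighted graph Laplacian D - W
L_sym = L / np.sqrt(np.outer(deg, deg))      # D^{-1/2} L D^{-1/2}

vals, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
fiedler = vecs[:, 1]                         # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)           # sign split standing in for k-means

print(vals[:3])
print(labels)
```

<p>The second eigenvalue comes out small but clearly nonzero, reflecting two groups that are weakly rather than not at all connected, and the sign pattern of its eigenvector separates the two grids. Which grid receives which label is arbitrary, since eigenvectors are only defined up to sign.</p>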
<h3 id="sowhydoesitwork">So why does it work?</h3>
<br />
One reason the Laplacian appears to be useful for data analysis is that it appears to be useful for tons of things: it is a core object in a number of mathematical subjects and captures the relationship between them. Each perspective provides a different justification for the Laplacian, and also suggests directions for generalization or improvement.
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;">
<ol style="liststyletype: decimal">
<li>Functional Analytic
</div>
<div style="textalign: left;">
<br />
</div></li>
</ol>
<p>One way to see the Laplacian is that it is a particular example of a linear operator on a function space, with some nice properties (it is self-adjoint) and some more peculiar properties (it is unbounded, acting continuously only on twice-differentiable functions). This small domain might also be an advantage, since it defines a subset of the class of all functions which is small<a href="#fn4" class="footnoteref" id="fnref4"><sup>4</sup></a> but dense. This space, in the Euclidean case, is a classical Sobolev space. More usefully, the spectral decomposition of this operator, by self-adjointness, generates an orthonormal eigenbasis whose elements can provide uniformly accurate finite approximations of functions in the space. In the 1-dimensional periodic case this is the Fourier basis, and the eigenvalues order the basis functions by frequency. This can be seen by noting that the second derivative of a sine or cosine is again a sine or cosine, up to sign and scale. This fact can be taken as a definition of the Fourier transform, and provides one means of generalizing the transform to more general spaces like graphs or manifolds.<a href="#fn5" class="footnoteref" id="fnref5"><sup>5</sup></a> Given the huge set of things one can do with a Fourier transform (filtering, wavelet analysis, sieve estimation, fast algorithms), the utility here is obvious.</p>
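<p>The discrete analogue of the periodic case is easy to check numerically. On a cycle graph, taken here as a stand-in for a discretized circle, sampled sines and cosines are exact eigenvectors of the graph Laplacian, with eigenvalue increasing in frequency up to half the number of nodes. The number of nodes and the frequency below are arbitrary.</p>

```python
import numpy as np

n = 12                                       # nodes on a cycle
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1.0                  # edge from each node to its successor
A[(idx + 1) % n, idx] = 1.0                  # and back, so A is symmetric
L = np.diag(A.sum(axis=1)) - A               # L = D - A for the cycle graph

k = 3                                        # an arbitrary frequency
theta = 2 * np.pi * k * idx / n
cos_k, sin_k = np.cos(theta), np.sin(theta)
lam_k = 2 - 2 * np.cos(2 * np.pi * k / n)    # candidate eigenvalue at frequency k

# Residuals of the eigenvector equations L v = lam_k v.
resid_cos = np.abs(L @ cos_k - lam_k * cos_k).max()
resid_sin = np.abs(L @ sin_k - lam_k * sin_k).max()
print(lam_k, resid_cos, resid_sin)
```

<p>Both residuals are zero up to rounding, which is the graph version of the statement that differentiating a sinusoid twice returns the same sinusoid up to sign and scale.</p>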
<br />
The question which this approach raises is what is special about the Laplacian, as opposed to some other operator which could generate an ordered orthonormal basis of eigenfunctions. Here I don’t have the greatest intuition: certainly the use of a differential operator encodes smoothness in a way which seems natural, with the idea that local changes are not too large meaning that the second derivative is not excessively large (on average over the space).<br />
<br />
Another heuristic justification for this kind of choice is that since the derivative gives local information, it is well-suited to applications based on data, since one will never obtain the value of a function except at a finite number of points, so if one wants global information, it is necessary to impose a structure which allows interpolating between points. However, this principle at best gives that a differential structure is sufficient, rather than necessary: ability to interpolate or estimate functions from points is controlled more by the size of the function space (according to <a href="http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdf2819.pdf">some complexity measure</a>) than smoothness per se, though the function class must be able to eventually distinguish functions based on evaluation at a finite but growing number of points.<br />
<br />
One aspect of this space beyond low complexity which could be useful for data analysis is that it forms a reproducing kernel Hilbert space. In principle this indicates that estimation, interpolation, and inverse problem solution can take advantage of the explosion of computationally convenient procedures for spaces of this type based on Kimeldorf and Wahba’s <a href="http://en.wikipedia.org/wiki/Representer_theorem">representer theorem</a> and Vapnik’s <a href="http://www.oneweirdkerneltrick.com/">kernel trick</a> that form the key ideas behind smoothing splines, support vector machines, and related kernel methods. In practice, these tools are mainly useful if one can easily calculate the reproducing kernel, which is why practitioners tend to rely on ad hoc choices like Gaussian, exponential, or polynomial kernels rather than the kernel induced by the Sobolev space. While it is possible to find this object for a specific space by solving a set of PDEs (Wahba does this for classical Sobolev spaces on the interval and on the sphere in her <a href="http://epubs.siam.org/doi/book/10.1137/1.9781611970128">classic book on splines</a>), it is again easier to apply methods which <a href="http://www.cs.columbia.edu/~risi/papers/SmolaKondor.pdf">induce an easily computable kernel function</a>, corresponding to tractable deformed versions of the Laplacian.<br />
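<p>As a rough sketch of what a computable kernel built from a “deformed” Laplacian can look like on a graph, the snippet below applies two decreasing functions to the Laplacian spectrum: an exponential, giving a diffusion-type kernel, and a resolvent, giving a regularized-Laplacian kernel. The toy graph and the tuning parameter are arbitrary assumptions, and this is meant as an illustration of the general idea rather than a reproduction of the constructions in the linked paper.</p>

```python
import numpy as np

# Small hypothetical graph: a path on 5 nodes.
n = 5
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A

vals, vecs = np.linalg.eigh(L)               # L is symmetric positive semidefinite

t = 0.7                                      # arbitrary tuning parameter
# Apply a function to the spectrum: K = V g(Lambda) V'.
K_diff = (vecs * np.exp(-t * vals)) @ vecs.T         # exp(-t L)
K_reg = (vecs * (1.0 / (1.0 + t * vals))) @ vecs.T   # (I + t L)^{-1}

print(np.linalg.eigvalsh(K_diff).min(), np.linalg.eigvalsh(K_reg).min())
```

<p>Both matrices are symmetric and positive definite, so either can be used as a kernel matrix between nodes, and both damp the high-frequency eigenvectors of the Laplacian more heavily than the low-frequency ones.</p>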
<br />
One property of the Fourier transform which is not in general preserved, at least for inhomogeneous spaces, is translation invariance. This is what makes Fourier analysis useful in the time-series context, preserving the features of a stationary series and allowing consistent inference from a single observed series. It might reasonably be argued, though, that if the space does not have homogeneity properties of some sort, one should not expect to be able to extrapolate outside of the range of observation, at least not without some kind of known or locally learnable global structure.<a href="#fn6" class="footnoteref" id="fnref6"><sup>6</sup></a> Since the eigenfunctions of the Laplacian are defined globally, they do impose some structure on functions over areas where data is not observed, including places which are structurally different from the area of observation. On the other hand, since they are defined in terms of smoothness, these global features are less informative far away from the area of observation, and I suspect that proper uncertainty quantification would therefore reflect higher uncertainty in these regions. This is likely to be particularly true of wavelet methods, which have substantially better localization properties.<br />
<br />
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;">
<ol start="2" style="liststyletype: decimal">
<li>Probabilistic
</div>
<div style="textalign: left;">
<br />
A view of Laplacian methods which provides a much stronger justification for the particular form of the operator is that it encodes a particular probabilistic model on the space. In the case of a graph, this is a standard random walk along the edges, with a normalized version of the Laplacian (<span class="math inline">\(I-D^{-1}L\)</span>) being the Markov transition matrix for a process defined on the vertices and transitioning at random along the edges from that vertex. In the continuum limit, the eigenfunctions of the Laplacian characterize a diffusion process in the ambient space (in particular, <a href="http://arxiv.org/abs/math/0506090">they are the eigenfunctions of the associated Fokker-Planck operator</a>), with drift determined by the density of points over the manifold.<br />
<br />
The random walk structure helps elucidate a number of properties of the Laplacian in characterizing the structure of the space. First, since a random walk is a local process, over short time scales it moves locally. Over long time scales, over finite graphs, the walk converges to a limit distribution, which describes the global structure. This may fail to be unique if, for example, the graph is not connected. This interpretation also explains the role of the spectrum of the Laplacian. A steady state is a (left) eigenvector of the Markov transition matrix corresponding to the eigenvalue 1, i.e. <span class="math inline">\(v^{\top}(I-D^{-1}L)=v^{\top}\)</span>, which exists by the <a href="http://en.wikipedia.org/wiki/Perron%E2%80%93Frobenius_theorem">Perron-Frobenius theorem</a>. For a Markov chain which does converge to a unique steady state, the speed of this convergence is given by the size of the second eigenvalue: if this is one, the process has multiple steady states, which occurs when there are disconnected components of the graph. If it is near one, convergence occurs very slowly, as would be expected if there are regions of a graph where a walk would remain stuck with high probability because there are few links outside and many links inside. These cases correspond to existence of disconnected or nearly disconnected clusters in a graph, and explain the usefulness of spectral clustering. Quantitatively, the control provided over the structure by this eigenvalue is given by the <a href="http://en.wikipedia.org/wiki/Expander_graph#Cheeger_inequalities">Cheeger inequality</a> (note that due to sign conventions, the distance from 1 of the second largest eigenvalue of the Markov transition matrix corresponds to the distance from 0 of the second smallest eigenvalue of the Laplacian).<br />
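<p>A small numerical sketch of these statements, on toy graphs chosen purely for illustration: the transition matrix is built from the adjacency matrix, the degree-proportional distribution is checked to be stationary, and the second eigenvalue is compared between a nearly disconnected graph and a densely connected one. The helper names are hypothetical.</p>

```python
import numpy as np

def two_cliques(m, bridge_weight):
    """Adjacency matrix for two m-node cliques joined by one weighted edge."""
    A = np.zeros((2 * m, 2 * m))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = bridge_weight
    return A

def second_eigenvalue(A):
    """Second-largest eigenvalue of the random walk matrix P = D^{-1} A.
    P is similar to the symmetric matrix D^{-1/2} A D^{-1/2}, so the
    eigenvalues are real and can be computed with a symmetric solver."""
    deg = A.sum(axis=1)
    S = A / np.sqrt(np.outer(deg, deg))
    return np.sort(np.linalg.eigvalsh(S))[::-1][1]

A = two_cliques(5, 0.1)
deg = A.sum(axis=1)
P = A / deg[:, None]                 # P = D^{-1} A = I - D^{-1} L, rows sum to one
pi = deg / deg.sum()                 # candidate stationary distribution

stationary_gap = np.abs(pi @ P - pi).max()           # pi is a left eigenvector of P
slow = second_eigenvalue(A)                          # nearly disconnected clusters
fast = second_eigenvalue(np.ones((10, 10)) - np.eye(10))   # complete graph on 10 nodes
print(stationary_gap, slow, fast)
```

<p>The weakly bridged cliques give a second eigenvalue just below one, so the walk mixes slowly, while the complete graph gives a second eigenvalue far from one and mixes almost immediately.</p>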
<br />
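As a rough numerical illustration of the paragraph above, here is a minimal sketch, assuming a toy graph of my own choosing (two 5-vertex cliques joined by one edge; the function names are hypothetical and not from any library). It builds the transition matrix as I - D^{-1}L, checks that the degree-proportional distribution is stationary, and looks at how close the second eigenvalue is to one.<br />

```python
import numpy as np

def two_cluster_graph(k=5):
    """Adjacency matrix of two k-cliques joined by a single bridging edge."""
    n = 2 * k
    A = np.zeros((n, n))
    A[:k, :k] = 1.0
    A[k:, k:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[k - 1, k] = A[k, k - 1] = 1.0  # the only link between the clusters
    return A

def walk_spectrum(A):
    """Return the random-walk matrix, its sorted eigenvalues, and a steady state."""
    d = A.sum(axis=1)                      # vertex degrees
    L = np.diag(d) - A                     # unnormalized graph Laplacian
    P = np.eye(len(d)) - L / d[:, None]    # I - D^{-1} L, rows sum to one
    # P is similar to the symmetric matrix D^{-1/2} A D^{-1/2},
    # so its eigenvalues are real and can be computed stably with eigvalsh.
    S = A / np.sqrt(np.outer(d, d))
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    pi = d / d.sum()                       # degree-proportional distribution
    return P, eigvals, pi

A = two_cluster_graph(5)
P, eigvals, pi = walk_spectrum(A)

# Removing the bridge disconnects the graph; eigenvalue 1 then appears twice.
A_cut = A.copy()
A_cut[4, 5] = A_cut[5, 4] = 0.0
_, eigvals_cut, _ = walk_spectrum(A_cut)
```

With the bridge in place the second eigenvalue should sit just below one (slow mixing between the clusters), and with the bridge removed it should equal one up to rounding.<br />
<br />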
The idea of the Laplacian as coming from a random walk also gives, to me, the most intuitive justification for why a second derivative is used: it comes from Itō’s lemma. As the step size goes to zero, a pure random walk on a space converges, by a functional CLT, to a standard Brownian motion, call it <span class="math inline">\(W_t\)</span>. Now consider an arbitrary twice differentiable function on, for simplicity, <span class="math inline">\(\mathbb{R}^n\)</span>, <span class="math inline">\(f(.):\mathbb{R}^n\to\mathbb{R}\)</span>. By Itō’s lemma, Taylor expanding <span class="math inline">\(f(.)\)</span>, we obtain <span class="math inline">\(df(W_t)= \frac{1}{2} \Delta f(W_t)dt+\nabla f(W_t)dW_t\)</span>, and we see that a random walk induces a drift for any function which is precisely (1/2 of) the Laplacian. Applied to the density <span class="math inline">\(p(x,t)\)</span>, this gives the adjoint Markov operator, or Fokker-Planck equation for the density, <span class="math inline">\(dp(x,t)=\frac{1}{2} \Delta p(x,t)dt\)</span>, the continuous analogue of the transpose of the Markov transition matrix, which instead of mapping state to state maps distributions over states to distributions over states.<a href="#fn7" class="footnoteref" id="fnref7"><sup>7</sup></a> <br />
<br />
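The drift claim can be checked by simulation. The following is only a sketch under assumptions of my own: the test function f(x) = |x|^2 (whose Laplacian is the constant 2n), a coordinate-wise plus/minus random walk as the approximation to Brownian motion, and arbitrary choices of dimension, horizon, and sample size. Taking expectations in Itō's lemma removes the dW term, so the mean of f(W_t) should be close to f(0) + (1/2)(2n)t.<br />

```python
import numpy as np

def f(x):
    # f(x) = x_1^2 + ... + x_n^2, whose Laplacian is the constant 2n
    return (x ** 2).sum(axis=-1)

rng = np.random.default_rng(0)
n, t, n_steps, n_paths = 3, 0.5, 50, 20000
dt = t / n_steps

# Scaled random walk: independent +/- sqrt(dt) steps in each coordinate.
increments = rng.choice([-1.0, 1.0], size=(n_paths, n_steps, n)) * np.sqrt(dt)
W_t = increments.sum(axis=1)            # approximate Brownian motion at time t

mc_mean = f(W_t).mean()                 # Monte Carlo estimate of E[f(W_t)]
ito_prediction = f(np.zeros(n)) + 0.5 * (2 * n) * t   # f(0) + (1/2) * Laplacian * t
```

For a quadratic f the prediction is exact, so any gap between `mc_mean` and `ito_prediction` is pure sampling error; for a general f the comparison would only hold to first order in t.<br />
<br />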
So, we can see that one question the Laplacian answers is, given a density over my space, how will it evolve under a random walk? This of course raises the question of why the process needs to be a pure random walk, as opposed to some other process, which might be driven by non-spherical noise, have some drift or jump component, or depend on the location in the space. The answer is that it depends on the application. If nothing about the local structure of the space is known, it makes sense to have a procedure which is locally agnostic. On the other hand, if one is using the induced decomposition of the space to characterize objects arising from a structurally biased process, it may be useful to take that into account. This is especially the case if the methods are being used to describe something which actually does diffuse over a network or manifold-like space, in which case the local distance ought to be in terms of the characteristics of this process, which may be asymmetric or behave differently at boundaries.<a href="#fn8" class="footnoteref" id="fnref8"><sup>8</sup></a> Modifications here should be domain specific, and indeed procedures like spectral clustering have been adapted in this way to different characterizations of communities.<a href="#fn9" class="footnoteref" id="fnref9"><sup>9</sup></a> <br />
<br />
As an example, consider the utility of a process consisting of a random walk plus jumps which land at a random point on the space. This will induce a nonlocal component to the movement, and so induce connections between disconnected or weakly connected components (which may appear disconnected under sampling). This additional component provides a sort of protection against unmeasured connectedness, and induces the commonly used regularized Laplacian. In particular, this generates the random surfer model underlying Google’s PageRank algorithm, arguably the most successful application of spectral graph methods.<br />
<br />
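A minimal sketch of the walk-plus-jumps idea, assuming uniform jumps with probability 0.15 (a conventional but here arbitrary choice) and a small disconnected toy graph of my own; the function names are hypothetical. It is meant only to show that adding jumps makes the steady state unique even when the underlying graph is disconnected, not to reproduce any production PageRank implementation.<br />

```python
import numpy as np

def teleporting_walk(A, jump_prob=0.15):
    """Random walk on A that, with probability jump_prob, jumps to a uniform vertex."""
    n = A.shape[0]
    d = A.sum(axis=1)
    P = A / d[:, None]                     # plain walk; assumes no isolated vertices
    return (1.0 - jump_prob) * P + jump_prob * np.ones((n, n)) / n

def stationary(G, tol=1e-12, max_iter=10000):
    """Power iteration for the left eigenvector of G with eigenvalue 1."""
    pi = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(max_iter):
        new = pi @ G
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

# Toy graph: a triangle (0, 1, 2) and a separate path (3 - 4 - 5).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

G = teleporting_walk(A, jump_prob=0.15)
pi = stationary(G)
# Moduli of the eigenvalues, largest first; the jumps cap all but the first at 0.85.
moduli = np.sort(np.abs(np.linalg.eigvals(G)))[::-1]
```

Without jumps the plain walk on this graph has eigenvalue 1 twice, one per component; with jumps every eigenvalue other than the leading one is shrunk by the factor 1 - jump_prob, so the iteration converges to a single distribution that puts positive mass on every vertex.<br />
<br />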
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;"></li>
<li>Differential equations<br />
<br />
Since the Laplacian is a differential operator, probably the most obvious perspective from which to consider it is that of (partial) differential equations, where second order equations in which the Laplacian is (part of) the leading term comprise the core of the subject. The Laplacian shows up in the canonical Poisson equation <span class="math inline">\(\Delta u(x) = f(x)\)</span>, the Heat equation <span class="math inline">\(\partial_t u(x,t) = \Delta u(x,t)\)</span>, the wave equation <span class="math inline">\(\partial_{tt} u(x,t) = \Delta u(x,t)\)</span>, and a large variety of equations with lower order terms, both linear and nonlinear.<br />
<br />
In economics (or at least the branches with which I am most familiar), second order PDEs show up most frequently in stochastic optimization problems, in which some noise follows a diffusion process, and an agent chooses a control to maximize a discounted reward over time. With discount rate <span class="math inline">\(r\)</span>, control <span class="math inline">\(c_t\)</span>, noise <span class="math inline">\(dx_t=\mu (x_t,c_t)dt+\Sigma dW_t\)</span> and reward <span class="math inline">\(u(x_t,c_t)\)</span>, this leads to the HJB equation for the value conditional on any given state as
<span class="math display">\[rV(x_t,t)=\underset{c_t}{\max} \{u(x_t,c_t) + \partial_t V(x_t,t)+\langle\nabla_x V(x_t,t), \mu (x_t,c_t)\rangle + \frac{1}{2}Tr(\Sigma^T H_x V(x_t,t)\Sigma)\}\]</span>
where the last term, the trace of the Hessian of <span class="math inline">\(V\)</span>, is the weighted Laplacian. One can add boundary conditions (which may also be controlled), constraints, and so on to describe the problem of interest. The simplest case, with a geometric Brownian motion as state variable, no instantaneous payoff, and the only control a binary choice over a terminal boundary condition <span class="math inline">\(V(x_T,T)=\max\{x_T-p,0\}\)</span> for constant <span class="math inline">\(p\)</span>, gives the well-known Black-Scholes formula for option pricing, though the idea is applicable in more general contexts.<br />
<br />
To the extent that solutions inherit features of the space from the second order differential structure, the value function <span class="math inline">\(V(x,t)\)</span> should be characterizable in terms of the properties of the Laplacian. This idea has <a href="http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_498.pdf">attracted attention</a> in the field of reinforcement learning (usually used to describe the discrete time version of this optimization problem), using Laplacian eigenfunctions and derived wavelets as a basis for value function approximation. In cases where the structure of the space is an unknown manifold, this method has the additional advantage of incorporating nonlinear dimension reduction, potentially providing substantial computational advantages over structured domains. As in the probabilistic case, when the driving process is not a pure random walk, it may be advantageous to incorporate structure by using, say, the Fokker-Planck operator and not just the standard Laplacian. However, in the case of substantially nonlinear reward or control, the linear component may provide a poor guide to the behavior of the solution, unlike in the Black-Scholes case where it is exact.<br />
<br />
For its uses in more general data analysis, the PDE which is most directly applicable is the Heat equation, <span class="math inline">\(\partial_t u(x,t) = \Delta u(x,t)\)</span>, which, up to a constant, is exactly what I called a Fokker-Planck equation above.<a href="#fn10" class="footnoteref" id="fnref10"><sup>10</sup></a> Applying standard methods for solution of initial value problems then gives a semigroup which tells us for any t>0 the distribution resulting from any given initial state <span class="math inline">\(u_0 (x)\)</span>, in the form of the Heat Kernel, denoted <span class="math inline">\(e^{t\Delta}\)</span>, which on Euclidean space takes the form <span class="math inline">\(e^{t\Delta} [u_0](y) = \int_{\mathbb{R}^n}(2t)^{-n/2}\phi\left(\frac{x-y}{\sqrt{2t}}\right)u_0 (x) dx\)</span>, where <span class="math inline">\(\phi\)</span> is the standard Gaussian density on <span class="math inline">\(\mathbb{R}^n\)</span>.<br />
<br />
In other words, the solution is convolution with an increasingly dispersed Gaussian, as should be expected from the interpretation of adding Gaussian noise to each point. This gives another way of seeing the connection with Fourier analysis, since by the convolution theorem, any convolution operator acts componentwise on the Fourier frequencies by multiplication. The heat kernel then can be viewed as a form of low-pass filter, which down-weights high frequency components by an amount increasing in t. Spectral methods based on the Laplacian, which restrict to the first few eigenfunctions and discard the rest, sharpen this characterization, generating an exact (continuous) low-pass filter. From a statistical perspective, this can be seen as a series estimator of the density. In contrast, applying the heat kernel can be seen directly as a kernel density estimator: given an initial condition which is noisy (say, an empirical distribution), the heat kernel outputs a regularized version which is smooth: the time parameter plays the role of the bandwidth. <br />
<br />
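To make the two filters concrete, here is a small sketch on a discrete circle (a cycle graph), with a noisy empirical distribution as the initial condition. The graph, the sample size, the time t = 4, and the cutoff k = 5 are all illustrative assumptions of mine, and the function names are hypothetical. Note the sign convention: the graph Laplacian L = D - A is positive semidefinite and plays the role of minus the continuum Laplacian, so the graph heat kernel is exp(-tL).<br />

```python
import numpy as np

def cycle_laplacian(n):
    """Graph Laplacian L = D - A of the n-vertex cycle (a discrete circle)."""
    A = np.zeros((n, n))
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0
    A[(idx + 1) % n, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def heat_smooth(L, u0, t):
    """Apply exp(-t L): each eigencomponent is damped by exp(-t * eigenvalue)."""
    lam, U = np.linalg.eigh(L)
    return U @ (np.exp(-t * lam) * (U.T @ u0))

def spectral_truncate(L, u0, k):
    """Keep only the k lowest-frequency eigenvectors (a hard low-pass filter)."""
    lam, U = np.linalg.eigh(L)     # eigenvalues returned in ascending order
    Uk = U[:, :k]
    return Uk @ (Uk.T @ u0)

rng = np.random.default_rng(0)
n = 64
L = cycle_laplacian(n)
counts = rng.multinomial(500, np.full(n, 1.0 / n)).astype(float)
u0 = counts / counts.sum()                  # noisy empirical distribution
u_heat = heat_smooth(L, u0, t=4.0)          # kernel-type estimate, t acts as bandwidth
u_trunc = spectral_truncate(L, u0, k=5)     # series-type estimate with 5 basis functions
```

In this sketch the heat-smoothed vector remains a probability distribution, while the truncated version is a projection, so applying the truncation a second time changes nothing; larger t or smaller k gives a smoother output.<br />
<br />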
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;"></li>
<li>Statistical<br />
<br />
The interpretation of Laplacian eigenfunctions and heat kernel smoothing as sieve and kernel methods respectively suggests evaluation of these methods for their statistical properties. Which of these methods is preferable depends on what one believes about the distribution over the graph or manifold, though as is generally the case with sieve versus kernel methods, asymptotic convergence rates are the same for smooth functions, and the decision of which to use usually comes down to other desiderata (e.g. the kernel method outputs a proper density, while the sieve filter is idempotent). More generally, the use of the heat equation picks out a single sieve basis (the Laplacian eigenfunctions) and a single kernel (the Gaussian), which can be compared to other sieves and kernels, which may have more desirable properties depending on the conjectured properties of the objects living on the space. It is well-known, for example, that the Gaussian kernel, as a proper kernel, yields suboptimal convergence rates for highly smooth densities. Since both produce isotropic representations, they also don’t deal with smoothness that may differ by direction (which is especially tricky in the non-Euclidean case since one often lacks canonical directions) or location. Sparse wavelet-based methods, as usual, can help here, though it seems to me that proper analysis of optimal wavelet denoising over graphs and manifolds remains understudied.<br />
<br />
As for more general alternatives, there are a large number of methods now in the machine learning and statistical literature, many of which rely on Laplacian-based methods, via sieve or Reproducing Kernel Hilbert Space methods, the latter including several Bayesian approaches. Estimation over graphs and manifolds remains a hot topic, with a variety of approaches, only some of which rely on Laplacian structure. As in every other aspect of machine learning these days, <a href="https://www.youtube.com/watch?v=xk17mfFxkag">neural networks over manifolds</a> provide a promising alternative. One also has a choice between penalization <a href="http://arxiv.org/abs/1410.7690">methods</a> of <a href="http://arxiv.org/abs/1411.7414">various</a> <a href="http://arxiv.org/abs/1505.00290">kinds</a> for graph-based predictions. The choice between these seems essentially similar to the choice of penalization in standard nonparametric prediction problems, with Laplacian-based methods corresponding to <span class="math inline">\(L^2\)</span> approaches like splines and support vector machines: a tool in the toolbox and a good first choice for many problems, but not the last.<a href="#fn11" class="footnoteref" id="fnref11"><sup>11</sup></a> <br />
<br />
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;"></li>
<li>Topological<br />
<br />
Beyond representing the space on which they live, part of the utility of Laplacian based methods is that important characteristics of the space can easily be “read off” of the Laplacian, especially the spectrum, for some definition of “important.” The use in clustering suggests that this is the degree of connection between the components of the space, a <i><a href="http://www.math.upenn.edu/~ghrist/notes.html">topological</a></i> feature, or approximately one. In particular, it provides a measure of the 0th homology group of the space. This does not seem to be coincidental. The homology of a topological space (components, holes, voids, etc) can be determined by constructing a simplicial complex from the space, considering not just links between pairs of points, but sets of 3, or 4, and so on, as higher analogues of edges of a graph, which is a first order complex. The homology of each order is determined by the simplices of that order, so graph information is informative about the connected components. By extending the approximating space to higher orders, one can determine also the higher order homologies. Constructing a Laplacian over these higher order connections to obtain the <a href="http://arxiv.org/abs/1310.5099">combinatorial Laplacian</a> then in fact gives an operator whose spectrum encodes the higher order homologies in the same way the graph Laplacian describes clustering.<br />
<br />
In practice, there doesn’t seem to be a lot in the way of “spectral homology” to give higher order analogues of spectral clustering. Instead, <a href="http://web.cse.ohiostate.edu/~tamaldey/course/CTDA/CTDA.html">topological data analysis</a> relies mostly on versions of persistent homology, which computes the homology group directly from the complex constructed from connecting nearby points at each length scale. The difference seems to be due to different notions of robustness to noise. In spectral clustering, one claims to have clusters even if the graph is connected so long as there are relatively few links between components, with “few” measurable by a small second eigenvalue of the Laplacian. Persistent homology will instead only show existence of disconnected components if the graph actually is disconnected, for edges connected between sufficiently close points. If I understand this properly (not likely), the analogue for 1st homology group would be to say that there exists an “approximate” hole in the space if there are few paths across a gap relative to paths around it, where few could be measured by the second eigenvalue of a higher order Laplacian. Whether this is useful depends on the model for the space: for sampling exactly from a manifold, one will never get points in the “wrong” place and so persistent homology should describe its holes. If the data is instead drawn from a model with noise (interesting question: is there a simplicial analogue of the stochastic block model?), one will expect outliers and so even if data concentrates on a torus, the persistent homology may not find the bulk of the torus before the hole in the point cloud is completely filled in. To put it visually, the relative merit of these methods for different applications depends on whether you want your algorithm to tell you that a <a href="http://www.thefreshloaf.com//files/u15122/bialy640.jpg">bialy</a> is approximately a bagel, or approximately an onion roll.
</div>
<div style="textalign: left;">
<br />
</div>
<div style="textalign: left;">
<br />
<u><b>Summary</b></u><br />
<br />
Considering these methods, it gets easier to see why, if you knew not so much about a space, you would first take a look at the Laplacian over it. It describes local and global properties, gives a reasonable definition of smoothness and random processes on a space, and permits translation of the wide variety of second order (Hilbert space) statistical smoothing methods to strange-looking domains. It should be modified if a little more is known about the way things move on the space, and is probably <a href="http://en.wikipedia.org/wiki/Fourier_transform#Uncertainty_principle">exactly the wrong approach</a> for looking at individual points, but it can tell us a lot if we just want to know how our data are put together.
</div></li>
</ol>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>For the purposes of data analysis, one usually thinks of a manifold as a smooth, curved, lower-dimensional shape in higher-dimensional Euclidean space, like a piece of paper wiggling in the air, or folded into a tube, or a wavy line drawn on a sheet of paper, or, maybe, the set of possible newspaper articles from the business section in the space of strings of letters or numbers. That last case may or may not satisfy the “smooth” or “low dimensional” characterization: we take such a geometric structure as a model, or <i><a href="http://colah.github.io/posts/201403NNManifoldsTopology/">hypothesis</a></i>. The space can instead be defined without reference to Euclidean space, and probably ought to be, but we know, <a href="http://www.iflscience.com/physics/everyworldgrainsandjohnnashsastonishinggeometry">thanks to Nash</a>, that this characterization doesn’t lose us anything.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>To my knowledge, the recent explosion of interest in the idea of modeling data as living on an arbitrary submanifold begins with the 2000 publication in <i>Science</i> of <a href="http://isomap.stanford.edu/">Isomap</a> and <a href="http://www.cs.nyu.edu/~roweis/lle/">Locally Linear Embedding</a>, though plenty of antecedents exist including <a href="http://www.slac.stanford.edu/cgiwrap/getdoc/slacr276.pdf">various</a> nonlinear <a href="http://en.wikipedia.org/wiki/Kernel_principal_component_analysis">generalizations</a> of PCA, and the field of <a href="http://qh.eng.ua.edu/e_paper/e_book/Ameri_book_Methods_of_Information_Geometry.pdf">information geometry</a>, which views the space of statistical <i>models</i> as a manifold.<br /><a href="#fnref2" class="footnoteback">↩</a></p></li>
<li id="fn3"><p>As usual, this generalizes a linear structure, the idea of the <a href="http://www.econ.yale.edu/~gm76//bmdynamicNBER_MA.pdf">job</a> <a href="http://www.econ.yale.edu/~gm76/job_ladder_GR.pdf">ladder</a> that workers move up or down, to allow motion in a more general space. <br /><a href="#fnref3" class="footnoteback">↩</a></p></li>
<li id="fn4"><p>An infinite dimensional space can be “small” if it can fit easily <a href="http://en.wikipedia.org/wiki/Sobolev_inequality#Sobolev_embedding_theorem">inside a bigger space</a>, or if you <a href="http://www.cambridge.org/us/academic/subjects/mathematics/abstractanalysis/entropycompactnessandapproximationoperators">can fill it up with not too many things</a>.<br /><a href="#fnref4" class="footnoteback">↩</a></p></li>
<li id="fn5"><p>Other characterizations provide features which may preserve other useful properties of the Fourier transform in different kinds of spaces. Translation invariance and the convolution theorem, useful characteristics of the Fourier transform on the circle (viewable as a group with addition as the group operation), can be generalized to other groups by <a href="http://en.wikipedia.org/wiki/Pontryagin_duality">Pontryagin Duality</a>, allowing also generalization of the idea of a stationary stochastic process and its spectral decomposition by <a href="http://en.wikipedia.org/wiki/Bochner%27s_theorem">Bochner’s theorem</a> to processes invariant with respect to other transformations. For <a href="http://en.wikipedia.org/wiki/Lie_group">groups with a manifold structure</a>, one has a choice of generalization, though in many cases the Fourier decomposition given by the Laplacian and by Pontryagin duality coincide, e.g., on a sphere, where both interpretations give the spherical harmonics.<br /><a href="#fnref5" class="footnoteback">↩</a></p></li>
<li id="fn6"><p>One could of course also say this about the time series case, as stationarity is a very strong assumption. This provides motivation for using instead methods which, to the extent that they use global structure, do not impose the specific global structure implied by stationarity: see e.g. <a href="http://wwwstat.wharton.upenn.edu/~rakhlin/papers/hierarchies.pdf">Rakhlin and Sridharan</a> on forecasting nonstationary processes on graphs.<br /><a href="#fnref6" class="footnoteback">↩</a></p></li>
<li id="fn7"><p>Does this result extend from <span class="math inline">\(\mathbb{R}^n\)</span> to arbitrary Riemannian manifolds? Apparently, <a href="http://mathoverflow.net/questions/126368/referenceneededdonskersinvarianceprincipleforriemannianmanifolds">yes</a>, with the Laplacian on <span class="math inline">\(\mathbb{R}^n\)</span> replaced by the Laplace-Beltrami operator, as desired, though it seems to take a little bit of careful thinking to formulate a random walk with no preferred direction.<br /><a href="#fnref7" class="footnoteback">↩</a></p></li>
<li id="fn8"><p>To the extent that the stochastic structure is (partially) unknown, this adds a hyperparameter tuning problem to these methods, though one that seems no more or less amenable to standard methods than usual in nonparametric methods: cross-validation, empirical and hierarchical Bayes procedures used for Gaussian process estimation, and probably even Lepskii-type upper bound methods used for anisotropic kernels, all ought to have analogues. <a href="http://wwwstat.wharton.upenn.edu/~tcai/PapersbyTopics.html#Wavelet.Thresholding">Shrinkage-based adaptive methods</a> may be a bit less generally applicable, since while uncertainty about the process which involves pure rescaling can be accommodated with a fixed set of basis functions, generic perturbations to the Laplace operator will not preserve eigenfunctions and so related models will not have a hierarchical structure.<br /><a href="#fnref8" class="footnoteback">↩</a></p></li>
<li id="fn9"><p>Albeit mostly to versions of the stochastic block model, which is itself kind of a silly little toy model, but attracts interest as a starting point for <a href="http://arxiv.org/abs/1312.7857">much</a> <a href="http://bactra.org/notebooks/graphlimits.html">deeper</a> <a href="https://terrytao.wordpress.com/2012/12/03/thespectralproofoftheszemerediregularitylemma/">theory</a>.<br /><a href="#fnref9" class="footnoteback">↩</a></p></li>
<li id="fn10"><p>It is also, after a change of variables, the evolution equation for the price in the Black-Scholes model.<br /><a href="#fnref10" class="footnoteback">↩</a></p></li>
<li id="fn11"><p>Interestingly, given their use in unsupervised dimension-reduction approaches, it’s not clear how much of an advantage Laplacian-penalized approaches provide for prediction when the predictors are hypothesized to lie on (or near) an <i>unknown</i> manifold, even in the case where auxiliary data can be used to estimate the manifold structure. <a href="http://repository.cmu.edu/compsci/1030/">Lafferty and Wasserman</a> find that standard kernel regression with a properly chosen bandwidth already achieves optimal low-dimensional rates if predictors lie on a manifold, and <a href="http://arxiv.org/abs/1305.0617">similar results exist</a> for Gaussian Process regression. On the other hand, <a href="http://jmlr.org/papers/v14/niyogi13a.html">recent</a> <a href="http://arxiv.org/abs/1204.1685">work</a> has found more refined conditions under which knowledge of the manifold may help prediction.<a href="#fnref11" class="footnoteback">↩</a></p></li>
</ol>
</div>