David Childers on David Childers
/
Recent content in David Childers on David Childers
© 2018
Sun, 15 Oct 2017 00:00:00 -0400

Local Causal Discovery for Estimating Causal Effects
/publication/localcausaldiscovery/
Fri, 17 Feb 2023 00:00:00 -0500

Differentiable State Space Models and Hamiltonian Monte Carlo Estimation
/publication/differentiablestatespace/
Thu, 06 Oct 2022 00:00:00 -0400

Automated Solution of Heterogeneous Agent Models
/publication/automatedsolution/
Wed, 13 Jul 2022 00:00:00 -0400

SED 2022: Notes on contemporary macro
/post/sed2022/
Sat, 02 Jul 2022 00:00:00 +0000
<p>I just got back from an excellent <a href="https://www.economicdynamics.org/sedam_2022/">meeting of the Society for Economic Dynamics</a>, a top conference for work in dynamic economics, principally but not exclusively in macroeconomics. As one of the first in-person conferences I’ve been to since 2020 (last year they were hybrid and I presented from home), it was a chance to catch up not just with colleagues and friends but also with the state of modern academic macro, after some time focusing more on other things. While the conference is fresh in my mind, I thought I’d jot down a few bigger picture notes to start thinking about where the field is and where it might be headed.</p>
<p>Of course, the conference has so many parallel sessions that I’m sure no two people had the same experience, aside from the plenary talks, and my particular focus, mostly on computational and econometric methods, is a specialized niche within the whole. But since it’s particularly valuable for methodologists to have a sense of what applied problems people currently work on and how they’re going about them, I did try to explore a little more. That said, even within what I saw, these are just first impressions and themes.</p>
<p><strong>HANK models have matured</strong></p>
<p>Research literatures in macroeconomics seem to have a life cycle that goes in stages.</p>
<p>First, some creative thinkers come up with a concept and implement an early version showing it can be done. For the heterogeneous agent New Keynesian (HANK) literature, around the early to mid 2010s, the idea was to merge our <a href="https://donskerclass.github.io/post/anempiricalheterogeneousagentsmodelsreadinglist/">benchmark incomplete markets models</a> of inequality and individual spending and saving behavior with our <a href="https://press.princeton.edu/books/hardcover/9780691164786/monetarypolicyinflationandthebusinesscycle">benchmark New Keynesian models</a> of monetary policy, inflation, and business cycles to start to answer questions about how they interact.</p>
<p>Second, the research community enters a stage of twiddling with all the knobs, investigating all the features of a type of model and understanding which features are important for which outcomes and why, and how they interact. Some of the choices in the initial model may have been just placeholder first guesses which, after a period of trial and error over different specifications, will be swapped out for something more robust or tractable until the literature settles down on a small set of benchmarks. HANK had been actively in this stage in the late 2010s, with many competing variations working out questions about specification in terms of fiscal transfers, portfolio choices, preferences and so on, along with the role of methods (discrete vs continuous time, MIT shocks vs Krusell-Smith vs perturbation, sequence versus state space, etc). There’s still some of this settling of basic questions going on, but there seemed to be more of this in the previous 2 or 3 SED meetings.</p>
<p>Third, after enough knob twiddling, people understand the framework well enough to put the model to work, as a tool for basic measurement and for policy analysis. HANK now seems to be entering this mature stage, what Kuhnians would call “normal science”, with lots of applications to understanding the effects of particular policy proposals or shocks, measuring and quantifying different sources of inequality, and as a baseline for incorporating new proposed deviations or frictions. I went to a lot of talks where the format was “here’s a policy issue or fact to explain. We motivate with some simple empirics and maybe a toy model with just the one force, then embed it in a quantitative HANK model to measure that it explains x percent of this trend, or implies that this policy is x percent more/less effective” and fewer that were trying to resolve basic issues of the form “what happens if we take a HANK model and swap out sticky prices vs wages, or real vs nominal debt, etc”.</p>
<p>I suppose beyond those stages, other literatures in macroeconomics that have trodden the path before (representative agent DSGE?) maintain a long lifetime of continued routine use, often fading a bit from the academic spotlight but continuing to be useful to policymakers, both for day-to-day measurement and as a mature and reliable way to get a first pass at the pressing issues of the day. Beyond that, either continued probing of points of empirical dissatisfaction or merging with ideas from some previously disjoint strand leads to new strands of research ideas. For HANK, I suspect much of the original interest was from the desire to reconcile the profession’s incompatible benchmark models of individual behavior and of business cycle aggregates, with the most notable empirical problem being dissatisfaction with the MPC implications of the typical representative agent <a href="http://noahpinionblog.blogspot.com/2014/01/theequationatcoreofmodernmacro.html">Euler equation</a>. I think there are many years of both basic exploration and normal science left to do with HANK, though pattern matching suggests that something will eventually come along to form the next generation. It seems too early to specify what that might be. Despite widespread unease with other aspects of the New Keynesian paradigm for monetary economics and many proposals for modifications or replacements, it has proved remarkably persistent by serving as a baseline framework for encompassing divergent views on mechanisms and policies. While disagreements remain, the days when “freshwater” and “saltwater” macroeconomists expressed disagreement mostly through <a href="https://en.wikiquote.org/wiki/Robert_Lucas_Jr.#Quotes:~:text=The%20main%20development,submitted%20any%20more.">laughter</a> have largely been replaced by conversations about parameter values in shared model classes between researchers who cannot be clustered nearly so easily into ideological camps.</p>
<p><strong>Plenaries</strong></p>
<p>The plenary talks were a good chance to get bigger picture overviews of different subfields, at a range of stages of maturity.</p>
<ul>
<li><p>Giuseppe Moscarini gave a talk on his work over the past decade in the area of cross-sectional wage dynamics, which has developed since the seminal work of <a href="https://www.jstor.org/stable/2527292">Burdett and Mortensen</a> into a mature area providing a foundation for studies of employer-firm matched data, wage inequality, monopsony, career progression, and so on. He started with an overview of his work with Postel-Vinay on the role of job-to-job switching in wage growth. Then, continuing the theme of the New-Keynesianization of everything, while the more tractable Diamond-Mortensen-Pissarides model of <em>aggregate</em> labor market flows had been merged into monetary models, he presented new work merging the disaggregated “job-ladder” style models into a NK framework suggesting that <a href="https://campuspress.yale.edu/moscarini/data/">aggregate job-to-job recruiting</a>, as opposed to just unemployment, is an important and cyclically distinct determinant of aggregate wage inflation.</p></li>
<li><p>Esteban Rossi-Hansberg presented work that strikes me as very much in the new paradigm stage, on integrating regional heterogeneity into integrated assessment climate models. The question is compelling: while the carbon cycle is global, impacts and adaptation efforts are highly diverse across places, and figuring out how locations which may face very different flooding, extreme weather, temperature changes, and so on will adapt economically is important for measuring global costs and coordinating mitigation efforts. With recent progress in <a href="https://rossihansberg.economics.uchicago.edu/QSE.pdf">quantitative spatial economics</a> and computational methods applicable to high resolution heterogeneity, models can now incorporate detailed spatial economic data along with high resolution climate data and simulations, and Rossi-Hansberg and collaborators have provided some noteworthy examples. But as he emphasized in the talk, there’s still a lot to learn about how the basic economic mechanisms work, given their current difficulty, and I suspect there is a lot of “knob-twiddling” work to do just to figure out what are the important aspects to put into such a model and how to specify and solve them, before these reach the normal science stage where we can just focus on arguing over a few crucial parameters like climate macroeconomists working with aggregate models have been up to for years now. This talk inspired me, though I currently don’t do any work in climate, to attend some of the climate sessions later on in the conference, where young researchers are working hard to figure it out.</p></li>
<li><p>On the last day, IMF chief economist Gita Gopinath gave a talk on how open economy macroeconomic research informs the current work of the Fund and its policy framework. The talk was surprisingly academic in style, with a discussion of models and empirics in a way that you don’t usually get out of public-facing speeches by policymakers, but directed at informing working economists about the role the work plays in the policy process. This involves aggregating a long history of work on available policy tools to synthesize policy recommendations, not any single model or study but a systematic review of many, with some modeling work done mainly to quantify and reconcile models of competing effects each described individually. The resulting framework reflects a very gradual evolution of the Fund’s views, from its 1990s Mundell-Fleming inspired recommendations for exchange rate flexibility as a stabilizing buffer, to incorporating decades of work since the Asian Financial Crisis of 1997-98 on models of borrowing constraints, sudden stops, and financial frictions suggesting that in some contingent cases capital controls may be a desirable measure. This stance broadly was already conventional wisdom by the time I graduated with an International Economics degree in 2009, but putting it in an official IMF policy document collecting a large number of careful studies of pros and cons represents a long process. As a bonus, she also gave a brief overview of one of her pre-IMF era research contributions, on dominant currency pricing in open economy New Keynesian models. This is a clear example of valuable knob-twiddling research, showing that the symmetric pricing assumption used in early models largely out of convenience was not only implausible but also consequential, with likely implications for global trade volumes during the current Fed tightening cycle.</p></li>
</ul>
<p><strong>Miscellaneous thoughts</strong></p>
<ul>
<li><p>On the econometrics side, after watching a presentation by Ashesh Rambachan on IRF interpretation (<a href="https://asheshrambachan.github.io/assets/files/arns_commontimeseries_causal.pdf">paper</a>, <a href="https://donskerclass.github.io/CausalEconometrics/TimeSeries.html">my notes</a>), I saw the implications all over other talks. Micro people have been reckoning with the need to precisely define the counterfactual path of a shock in dynamic models, as measured IRFs can be a mixture of other things. Some talks with IRFs gave this serious thought, formally or not, others not so much; thankfully, audiences seemed willing to provide helpful feedback in those cases.</p></li>
<li><p><a href="https://scholar.harvard.edu/straub/publications/usingsequencespacesolveandestimateheterogeneousagentmodels">Sequence space methods</a> for heterogeneous agents models are seeing really fast adoption, from a fairly technical 2021 Econometrica paper to a relatively common approach. In addition to speed, I think this in part reflects interpretability, since it lets economists derive equilibrium conditions which can be informative even before fully solving numerically.</p></li>
<li><p>Wisconsin cheese curds are better than I expected.</p></li>
</ul>

Papers I Liked 2021
/post/papersiliked2021/
Fri, 31 Dec 2021 00:00:00 +0000
<p>A list of 10 papers I read and liked in 2021. As in previous years, this is by date read rather than released or published, and selection is in no particular order. Overall, my list reflects my interests this year, prompted by research and teaching, in online learning, micro-founded macro, and causal inference, and, to the extent possible, intersections of these areas. As usual, I’m likely to have missed a lot of great work even in areas on which I focus, so absence likely indicates that I didn’t see it, or it’s on my ever-expanding to-read list, so ping me with your recommendations!</p>
<ul>
<li>Block, Dagan, and Rakhlin <a href="http://arxiv.org/abs/2102.01729">Majorizing Measures, Sequential Complexities, and Online Learning</a>
<ul>
<li>Sequential versions of metric entropy type conditions which are the bread and butter of iid data analysis extended to the setting of online estimation.</li>
<li>See also: This builds on Rakhlin, Sridharan, and Tewari (2015)’s <a href="https://link.springer.com/content/pdf/10.1007/s0044001305455.pdf">essential earlier work</a> on uniform martingale laws of large numbers by sequential versions of Rademacher complexity. More generally, there were many advances in inference and estimation for online-collected data this year: see the papers at the NeurIPS <a href="https://sites.google.com/view/causalsequentialdecisions/home">Causal Inference Challenges in Sequential Decision Making</a> workshop for a few.</li>
</ul></li>
<li>Klus, Schuster, and Muandet <a href="https://arxiv.org/abs/1712.01572">Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces</a>
<ul>
<li>A Koopman operator <span class="math inline">\(\mathcal{K}[f_t](x)\)</span>, mapping <span class="math inline">\(f\to E[f(X_{t+\tau})\mid X_t=x]\)</span>, is a way of summarizing a possibly nonlinear and high dimensional dynamical system using a linear operator, which allows summarization and computation using linear algebra tools. Since this is effectively an evaluation operator, it pairs nicely with kernel mean embeddings and RKHS theory, which gives precisely the properties needed to make these objects well behaved, allowing analysis in arbitrarily high dimension at no extra cost.</li>
<li>See also: Budišić, Mohr, and Mezić (2012) <a href="https://doi.org/10.1063/1.4772195">Applied Koopmanism</a> for an intro to Koopman analysis of dynamical systems.</li>
</ul></li>
<li>Antolín-Díaz, Drechsel, and Petrella <a href="http://econweb.umd.edu/~drechsel/papers/advances.pdf">Advances in Nowcasting Economic Activity: Secular Trends, Large Shocks and New Data</a>
<ul>
<li>Classic linear time series models used in forecasting, causal, and structural macroeconomics have taken a beating in the past two years with the huge fluctuations due to the pandemic. But a dirty secret known to forecasters is that black box ML models designed to be much more flexible have, if anything, an even more dismal track record. This work adds carefully specified and empirically validated mechanisms for shifts, outliers, mean and volatility changes, and so on to the kind of dynamic factor models that have substantially outperformed, offering a chance to improve fit and handling of big shifts while retaining performance. This attention to distributional properties of macro data is surprisingly rare, and should encourage more work on understanding the sources of these features.</li>
</ul></li>
<li>Karadi, Schoenle, and Wursten <a href="https://sites.google.com/site/pkaradi696/KaradiSchoenleWursten.pdf">Measuring Price Selection in Microdata: It’s Not There</a>
<ul>
<li>A venerable result in sticky price models, going back to Golosov and Lucas, is that “menu costs” of price changes ought to result in very limited actual real response of output to monetary impulses because even though costs keep prices fixed most of the time, any product that is seriously mispriced will be selected to have its price changed, so real effects should be minimal. This paper tests that theory directly using price microdata and shows that in response to identified monetary shocks, the prices that change do not appear to be those which are out of line, suggesting a much smaller selection effect than in baseline menu cost models. I liked this paper, beyond the importance of its empirical results, as a model for combining micro and macro data: to claim a microeconomic mechanism responds to an aggregate shock, your results are much more credible when you actually measure variation in that shock and the micro response to it, rather than only using macro or only using micro variation.</li>
<li>See also: Wolf <a href="http://economics.mit.edu/files/22576">The Missing Intercept: A Demand Equivalence Approach</a> describing how <em>both</em> causal variation at the micro (cross-sectional) level and at the macro (time series) level are necessary to identify aggregate responses. This kind of hybrid approach is a welcome change which takes into account both the value of “<a href="https://doi.org/10.1257/jep.32.3.59">identified moments</a>” using microeconomic causal inference tools in macro with the reality that if you want to credibly measure aggregate causal effects, you need random variation at the aggregate level also.</li>
</ul></li>
<li>Hall, Payne, Sargent, and Szöke <a href="https://people.brandeis.edu/~ghall/papers/Yield_Curve_May_10_2021.pdf">Hicks-Arrow Prices for US Federal Debt 1791-1930</a>
<ul>
<li>A time series of risk and term structure adjusted U.S. interest rates going way back, estimated using a Bayesian hierarchical term structure model, which allows handling the variety of bond issuance terms and missingness that make comparing over time using models for modern yield curves quite difficult.</li>
<li>See also: <a href="https://turing.ml/">Turing</a>, the probabilistic programming language used for these results, which combines modern MCMC sampling algorithms with the full power of Julia’s Automatic Differentiation stack to allow fitting even complicated structural models with elements not standard in more statistics-specialized programming languages and benefitting from the ability of Bayes to handle inference with complicated missingness and dependence structures that become extremely challenging without it, even for simulation-based estimators.</li>
</ul></li>
<li>Callaway and Sant’Anna <a href="https://doi.org/10.1016/j.jeconom.2020.12.001">Difference-in-Differences with multiple time periods</a>
<ul>
<li>The diff-in-differdämmerung struck hard this year, with methods for handling DiD (particularly but not only with variation in treatment timing) up in the air and new papers coming out at an increasing pace. In trying to summarize at least <a href="https://donskerclass.github.io/CausalEconometrics/DifferenceinDifferences.html">a bit of this literature</a> for <a href="https://donskerclass.github.io/CausalEconometrics.html">a new class</a>, I found this paper, and others by Pedro Sant’Anna and collaborators, crystal clear about the sources of the issues and how to resolve them, with the bonus of well-documented <a href="https://bcallaway11.github.io/did/">software</a> and extensive examples.</li>
<li>See also: too many papers on DiD to list.</li>
</ul></li>
<li>Farrell, Liang, and Misra <a href="https://arxiv.org/abs/2010.14694">Deep Learning for Individual Heterogeneity: An Automatic Inference Framework</a>
<ul>
<li>Derives influence functions and doubly robust estimators for conditional lossbased estimation allowing, e.g., nonparametric dependence of coefficients on highdimensional inputs in Generalized Linear Models. Results are flexible enough to be widely applicable, and simple enough to be easy to implement and interpret.</li>
<li>See also: Hines, Dukes, Diaz-Ordaz, Vansteelandt <a href="https://arxiv.org/abs/2107.00681">Demystifying statistical learning based on efficient influence functions</a> for an overview of this increasingly essential but always-confusing topic.</li>
</ul></li>
<li>Foster and Syrgkanis <a href="https://arxiv.org/abs/1901.09036">Orthogonal Statistical Learning</a>
<ul>
<li>A very general theory extending the “Double Machine Learning” approach to loss function based estimation where instead of a root-n estimable regular parameter, you may have a more complex object like a function (e.g., a conditional treatment effect, a policy, etc) which you want to make robust to high dimensional nuisance parameters.</li>
<li>See also: I went back and reread the published version of the <a href="https://doi.org/10.1111/ectj.12097">original “Double ML” paper</a> to write up teaching notes, which was helpful for really thinking through the results.</li>
</ul></li>
<li>Rambachan and Shephard <a href="https://asheshrambachan.github.io/assets/files/arns_commontimeseries_causal.pdf">When do common time series estimands have nonparametric causal meaning?</a>
<ul>
<li>Potential outcomes for time series are a lot harder than you would think at first, because repeated intervention necessarily vastly expands the space of possible relationships between treatments, and between treatments and outcomes. This paper lays out the issues and proposes some solutions.</li>
<li>See also: I based my <a href="https://donskerclass.github.io/CausalEconometrics/TimeSeries.html">time series causal inference teaching notes</a> mostly on this paper.</li>
</ul></li>
<li>Breza, Kaur, and Shamdasani <a href="https://drive.google.com/file/d/1RiMgkKu7DJqfnqU3vIqQszD1TEREY9jf/view?usp=sharing">Labor Rationing</a>
<ul>
<li>How do economies respond to labor supply shocks? Breza et al. just go out there and run the experiment, setting up a bunch of factories and hiring away a quarter of eligible workers in half of 60 villages in Odisha. In peak season wages rise, like textbook theory, but in lean season, wages do nothing as it appears most workers are effectively unemployed.</li>
<li>See also: The authors’ other work in the same setting testing theories of wage rigidity. For example, they find strong <a href="https://drive.google.com/file/d/1Z2ZsrFZ71Upq7dvN7MXy225HT0HgkqY/view?usp=sharing">experimental support</a> for <a href="https://delong.typepad.com/files/bewleywages.pdf">Bewley’s</a> morale theory for why employers don’t just cut wages. By running experiments at the market level, they have been able to provide a lot of compelling evidence on issues that have previously been relegated to much more theoretical debate.</li>
</ul></li>
</ul>
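<p>To make the Koopman operator from the Klus, Schuster, and Muandet entry above a bit more concrete, here is a minimal sketch for the simplest possible case: a finite-state Markov chain, where the operator is just the transition matrix acting on functions of the state. The three-state chain <code>P</code> and the observables <code>f</code> and <code>g</code> are made-up illustrations, not taken from the paper, which works in the much more general RKHS setting.</p>

```python
# Hypothetical 3-state Markov chain: P[x][y] = Pr(X_{t+1} = y | X_t = x).
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]


def koopman(P, f):
    """One-step Koopman operator: (K f)(x) = E[f(X_{t+1}) | X_t = x]."""
    return [sum(p_xy * f_y for p_xy, f_y in zip(row, f)) for row in P]


def koopman_steps(P, f, tau):
    """Apply K tau times, giving E[f(X_{t+tau}) | X_t = x] for each state x."""
    for _ in range(tau):
        f = koopman(P, f)
    return f


f = [1.0, 0.0, -1.0]  # an arbitrary observable (function of the state)
g = [0.0, 2.0, 4.0]   # a second observable, used to check linearity

Kf = koopman(P, f)
Kg = koopman(P, g)

# The operator is linear in the observable, whatever the dynamics look like:
# K(2f + g) equals 2*Kf + Kg up to floating point error.
K_combo = koopman(P, [2 * a + b for a, b in zip(f, g)])
combo_of_K = [2 * a + b for a, b in zip(Kf, Kg)]

# Multi-step conditional expectations are repeated applications of K.
Kf_two_steps = koopman_steps(P, f, 2)
```

<p>The point of the sketch is only that conditional expectations of any function of the future state come from applying one linear map repeatedly, which is what makes eigendecompositions of that map informative about the dynamics.</p>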

Efficient Online Estimation of Causal Effects by Deciding What to Observe
/publication/onlinemomentselection/
Mon, 23 Aug 2021 00:00:00 -0400

Estimating Treatment Effects with Observed Confounders and Mediators
/publication/confoundersmediators/
Mon, 14 Jun 2021 00:00:00 -0400

Top Papers 2020
/post/toppapers2020/
Wed, 30 Dec 2020 00:00:00 +0000
<p>The following is a look back at my reading for 2020, identifying a totally subjective set of the top 10 papers I read this year. My reading patterns, as usual, have not been so systematic, so if your brilliant work is missing it either slipped past my attention or is living in an ever-expanding set of folders and browser tabs on my to-read list. I’ll exclude papers I refereed, for privacy purposes (a fair amount if you include conferences and also cutting out a lot of the macroeconomics from my list). Themes I focused on were Bayesian computation, the optimal policy estimation/dynamic treatment regime/offline reinforcement learning space, and survival/point process models, all more-or-less project-related and in all of which I’m sure I’m missing some foundational understanding. I spent a brief time in March mostly reading about basic epidemiology, which I am led to believe many others did as well, but didn’t take it anywhere.</p>
<p>Papers, in alphabetical order</p>
<ul>
<li>Adusumilli, Geiecke, Schilter. <a href="https://arxiv.org/abs/1904.01047">Dynamically optimal treatment allocation using reinforcement learning</a>
<ul>
<li>Approximation methods for estimating viscosity solutions of HJB equations and their resulting optimal policies from data. These methods will form a key step in taking continuous time dynamic macro models (see <a href="https://benjaminmoll.com/lectures/">Moll lecture notes</a>) to data.</li>
</ul></li>
<li>Andrews & Mikusheva <a href="https://scholar.harvard.edu/iandrews/publications/optimaldecisionrulesweakgmm">Optimal Decision Rules for Weak GMM</a>
<ul>
<li>The Generalized Method of Moments defines a semiparametric estimator implicitly, making it often unclear what the form of the nuisance parameter being ignored actually is, especially in cases of irregular identification. This paper takes a middle ground between the fully Bayesian semiparametric approach which puts a (usually Dirichlet Process) prior over the infinite dimensional nuisance space and the regular frequentist approach which ignores it entirely, showing weak convergence to a Gaussian Process, which is tractable enough to characterize and apply to obtain approximate Bayesian tests and decision rules without strong identification.</li>
</ul></li>
<li>Cevid, Michel, Bühlmann, & Meinshausen <a href="https://arxiv.org/abs/2005.14458">Distributional Random Forests : Heterogeneity Adjustment and Multivariate Distributional Regression</a>
<ul>
<li>Conditional density estimation by random forests with splits by (approximate) kernel MMD distribution tests. Produces a set of conditional weights that can be used to represent and visualize possibly multivariate conditional distributions. An <a href="https://github.com/lorismichel/drf">R package</a> is available and this really quickly became one of my go-to data exploration tools.</li>
<li>See also: Lee and Pospisil have a <a href="https://github.com/tpospisi/rfcde">related method</a> splitting by sieve <span class="math inline">\(L^2\)</span> distance tests which is more or less similar, though more tailored to low dimensional outputs.</li>
</ul></li>
<li>Gelman, Vehtari, Simpson, Margossian, Carpenter, Yao, Kennedy, Gabry, Bürkner, Modrák. <a href="http://arxiv.org/abs/2011.01808">Bayesian Workflow</a>
<ul>
<li>A comprehensive overview of what Bayesian statisticians actually do when analyzing data, as opposed to the mythology in our intro textbooks (roughly, the likelihood is given to you by God, you think real hard and come up with a prior, then you apply Bayes rule and are done). It includes all the bits of sequential model expansion and checking and computational diagnostics and compromise between simplicity, convention, and domain expertise you actually go through to build a Bayesian model from scratch. The contrarian in me would love to see more frequentist analysis of this paradigm. A lot of the checks are there to make sure you’re not fooling yourself; how well do they work in practice?</li>
<li>See also Michael Betancourt’s <a href="https://betanalpha.github.io/writing/">notebooks</a> for worked examples of this process.</li>
</ul></li>
<li>Giannone, Lenza, Primiceri <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1483826">Priors for the Long Run</a>
<ul>
<li>Exact rank constraints for cointegration are often uncertain, making pure VECM modeling a bit fraught, but standard priors on the VAR form are not strongly constraining of long run relationships, and improper treatment of initial conditions can lead to spurious inference on trends. This proposes a simple class of priors which allow “soft” constraints.</li>
</ul></li>
<li>Kallus and Uehara <a href="http://arxiv.org/abs/1908.08526">Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes</a>
<ul>
<li>Characterizes the semiparametric efficiency bound for the value of a dynamic policy and provides a doubly robust estimator combining the appropriate variants of a regression statistic and a (sequential) probability weighting statistic, allowing use of nonparametric and (with sample splitting) machine learning estimates in reinforcement learning while retaining parametric convergence rates.</li>
<li>See also companion papers on estimating the <a href="https://arxiv.org/abs/2002.04014">policy and policy gradient</a> and extending to the case of <a href="http://arxiv.org/abs/2006.03900">deterministic policies</a> (which require smoothing) among others, or watch <a href="https://www.youtube.com/watch?v=n5ZoxT_WmHo">the talk</a> for an overview.</li>
</ul></li>
<li>Sawhney & Crane <a href="https://dl.acm.org/doi/abs/10.1145/3386569.3392374">Monte Carlo Geometry Processing: A Grid-Free Approach to PDE-Based Methods on Volumetric Domains</a>
<ul>
<li>I don’t usually read papers in computer graphics, but I do care a lot about computing <a href="https://donskerclass.github.io/post/whylaplacians/">Laplacians</a> and this paper offers a clever new Monte Carlo based method that allows computation on much more complicated domains. It’s not yet obvious to me whether the method generalizes to the PDE classes I and other macroeconomists <a href="https://benjaminmoll.com/wpcontent/uploads/2019/07/PDE_macro_translated.pdf">tend to work with</a>, but even if not it should still be handy for many applications.</li>
</ul></li>
<li>Schmelzing <a href="https://www.bankofengland.co.uk/workingpaper/2020/eightcenturiesofglobalrealinterestratesrgandthesupraseculardecline13112018">Eight Centuries of Global Real Interest Rates, R-G, and the ‘Suprasecular’ Decline, 1311–2018</a>
<ul>
<li>An enormous data collection process and public good which will be informing research on interest rates for years to come. As with any such effort at turning messy historical data into aggregate series, many contestable choices go into data selection, standardization, and normalization, and I don’t think the author’s simple trend estimates of a several hundred year decline will be the last word on the statistical properties or future implications of this data, but now that it’s out there we have a basis for discussion and testing.</li>
<li>See also: lots of useful historical macro data collection (going not quite so far back) by the folks at the Bonn <a href="http://www.macrohistory.net/">Macrohistory Lab</a>.</li>
</ul></li>
<li>Wolf <a href="https://www.aeaweb.org/articles?id=10.1257/mac.20180328">SVAR ( Mis ) Identification and the Real Effects of Monetary Policy</a>
<ul>
<li>A nice practical application of Bayesian model checking, applying SVAR methods to simulated macro data when the (usually a bit suspect) identifying restrictions need not hold exactly. It finds that early sign-restricted BVARs with uniform (Haar) priors tend to be biased toward finding monetary neutrality, and do not in fact provide noteworthy evidence contradicting the implied shock responses of typical central bank monetary DSGEs. Of course, such models have many other problems and not being contradicted by one test is not dispositive, but macro debates would be elevated if people would check to make sure that their contradictory evidence is in fact contradictory (respectively, supportive).</li>
</ul></li>
<li>Wang and Blei <a href="http://arxiv.org/abs/1905.10859">Variational Bayes under Model Misspecification</a>
<ul>
<li>Describes what (mean field) variational Bayes ends up targeting, at least in cases where a Bernstein-von Mises approximation works well enough. Also covers the much more nontrivial case with latent variables.</li>
<li>I will judiciously refrain from comment on other recent works by this pair (discussion <a href="https://casualinfer.libsyn.com/fairnessinmachinelearningwithsherriroseepisode03">1</a>, <a href="https://casualinfer.libsyn.com/episode15drbetsyogburn">2</a>) except to say that dimensionality reduction in causal inference deserves more study and this <a href="https://drive.google.com/file/d/1aN1cK_UEffkBT34a2aNtrZKIJFw_xibX/view?usp=sharing">manifold learning approach</a> to create a nonparametric version of interactive fixed effects estimation looks like a useful supplement to the standard panel data toolbox.</li>
</ul></li>
</ul>

Some issues with Bayesian epistemology
/post/someissueswithbayesianepistemology/
Sat, 05 Sep 2020 00:00:00 +0000
/post/someissueswithbayesianepistemology/
<p>In this post, I’d like to lay out a few questions and concerns I have about Bayesianism and Bayesian decision theory as a <em>normative</em> theory of inductive inference. As a positive theory, of what people do, psychology is full of demonstrations of cases where people do not use Bayesian reasoning (the entire “heuristics and biases” area), which is interesting but not my target. There are no new ideas here, just a summary of some old concerns which merit more consideration, and not even necessarily the most important ones, which are better covered elsewhere.</p>
<p>My main concerns are, effectively, computational. As I understand computer science, the processing of information <em>requires real resources</em>, (mostly time, but also energy, space, etc) and so any theory of reasoning which <em>mandates</em> isomorphism between statements for which computation is required to demonstrate equivalence is effectively ignoring real costs that are unavoidable and so must have some impact on decisions. Further, as I understand it, there is no way to get around this by simply adding this cost as a component of the decision problem.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> The problem here is that determination of this cost and reasoning over it is also computationally nontrivial, and so the costs of this determination must be taken into account. But determining these is also costly, ad infinitum. It may be the case that there is some way around this infinite regress problem via means of some kind of fixed point argument, though it is not clear that the limit of such an argument would retain the properties of Bayesian reasoning.</p>
<p>The question of these processing costs becomes more interesting to the extent that they are quantitatively nontrivial. As somebody who spends hours running and debugging MCMC samplers and does a lot of reading about Bayesian computation, my takeaway from this literature is that the limits are fundamental. In particular, there are classes of distributions such that the Bayesian update step is hard, for a variety of hardness classes. This includes many distributions where the update step is NP-complete, so that our best understanding of P vs NP suggests that the time to perform the update can be exponential in the size of the problem (sampling from spin glass models is an archetypal example, though really any unrestricted distribution over long strings of discrete bits will do). I suppose a kind of trivial example of this is the case with prior mass 1, in which case the hardness reduces to the hardness of the deterministic computation problem, and so encompasses every standard problem in computer science. More than just exponential time (which can mean use of time longer than the length of the known history of the universe for problems of sizes faced practically by human beings every day, like drawing inferences from the state of a high-resolution image), some integration problems may even be uncomputable in the Turing sense, and so not just wildly impractical but impossible to implement on any physical substrate (at least if the Church-Turing hypothesis is correct). Amusingly, this extends to the problem above of determining the costs of practical transformations, as determining whether a problem is computable in finite time is itself the classic example of a problem which is not computable.</p>
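<p>To make the exponential cost concrete, here is a minimal sketch of an exact Bayesian update over a string of discrete bits by brute-force enumeration. The model (<code>chain_log_weight</code>, an Ising-style chain with made-up coupling and field values) and the function names are assumptions chosen for illustration only. A chain this simple could be handled much faster by dynamic programming; the point is that the generic recipe, which is all one has for an unrestricted distribution, touches all 2<sup>n</sup> states.</p>

```python
import itertools
import math


def chain_log_weight(bits, coupling=0.5, field=0.2):
    """Unnormalized log-probability of a toy Ising-style chain (hypothetical model)."""
    spins = [2 * b - 1 for b in bits]  # map {0, 1} to {-1, +1}
    pair_term = sum(s * t for s, t in zip(spins, spins[1:]))
    return coupling * pair_term + field * sum(spins)


def exact_marginals(n_bits, log_weight):
    """Exact marginals P(bit_i = 1) by enumerating every one of the 2**n_bits states."""
    states = list(itertools.product((0, 1), repeat=n_bits))
    weights = [math.exp(log_weight(s)) for s in states]
    normalizer = sum(weights)  # the normalizing constant in Bayes rule
    marginals = [
        sum(w for s, w in zip(states, weights) if s[i] == 1) / normalizer
        for i in range(n_bits)
    ]
    return marginals, len(states)


# 10 bits already means 1024 terms; each extra bit doubles the work.
marginals, n_states = exact_marginals(10, chain_log_weight)
```

<p>At 10 bits this is instant, but the same loop at a few hundred bits would need more terms than there are atoms in the observable universe, which is the sense in which the limits above are not a matter of better hardware.</p>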
<p>So, exact Bayesianism for all conceivable problems is physically impossible, which makes it slightly less compelling as a normative goal. What about approximations? This will obviously depend on what counts as a reasonable approximation; if one accepts the topology in which all decisions are equivalent, then sure, “approximate” Bayesianism is possible. If one gets to stronger senses of approximation, such as requiring computation to within a constant factor, for cases where this makes sense, there are inapproximability results suggesting that for many problems, one still has exponential costs. Alternately, one could think about approximation in the limit of infinite time or information; this then gets into the literature on Bayesian asymptotics, though I guess with the goal exactly reversed. Rather than attempt to show Bayes converges to a fixed truth in the limit, one would try to show that a feasible decision procedure converges to Bayes in the limit. For the former goal, impossibility results are available in the general case, with positive results, like Schwartz’s theorem and its quantitative extensions (<a href="http://www.math.leidenuniv.nl/~avdvaart/BNP/">notes</a> and <a href="https://www.cambridge.org/core/books/fundamentalsofnonparametricbayesianinference/C96325101025D308C9F31F4470DEA2E8">monograph</a>) relying on compactness conditions that are more or less unsurprising given what is known on minimax lower bounds from information theory on what cannot be learned in a frequentist sense. For the other direction (whatever that might mean), I don’t know what results have been shown, though I expect, given the computational limitations in worst-case prior-likelihood settings, that no universally applicable procedure is available.</p>
<p>How about if we restrict our demands from Bayesianism for any prior and likelihood to Bayesian methods for some reasonable prior or class of priors? In small enough restricted cases, this seems obviously feasible: we can all do Beta-Bernoulli updating at minimal cost, which is great if the only information we ever receive is a single yes-no bit. If we want Bayesianism to be a general theory of logical decision making, it probably has to go beyond that. Some people like the idea of <a href="https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_inductive_inference">Solomonoff induction</a>, which proposes Bayesian inference with a “universal prior” over all <em>computable</em> distributions, avoiding the noncomputability problem in some sense. This proposes a prior mass on all programs exponentially decreasing in their length expressed in bits in the Kolmogorov complexity sense. Aside from the problem that it runs into computational hardness results for determining Kolmogorov complexity and so is not itself computable, running into the above issues again, there are some additional questions.</p>
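<p>For contrast with the hard cases, the Beta-Bernoulli update mentioned above really is as cheap as it sounds: with a Beta(α, β) prior on the success probability, each observed bit just adds one to a counter. A minimal sketch, where the class and function names are my own illustrative choices:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief about the success probability of a Bernoulli outcome."""
    alpha: float
    beta: float

    def update(self, outcome):
        """Conjugate update after observing one bit: 1 (yes) or 0 (no)."""
        if outcome not in (0, 1):
            raise ValueError("outcome must be 0 or 1")
        return BetaPosterior(self.alpha + outcome, self.beta + (1 - outcome))

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)


def update_all(prior, outcomes):
    """Apply the one-bit update sequentially; cost is linear in the number of bits."""
    posterior = prior
    for y in outcomes:
        posterior = posterior.update(y)
    return posterior


# Uniform prior, then three successes and one failure.
posterior = update_all(BetaPosterior(1.0, 1.0), [1, 1, 0, 1])
```

<p>The whole posterior is carried by two numbers, which is exactly what fails to generalize once the information received is richer than a stream of yes-no bits.</p>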
<p>This exponentially decreasing tail condition seems to embed the space of all programs into a hyperrectangle obeying summability conditions sufficient to satisfy Schwartz’s theorem for frequentist consistency of Bayesian inference. Hyperrectangle priors are fairly well studied in Bayesian nonparametrics: lower bounds are provided by the <a href="https://www.stat.berkeley.edu/~binyu/summer08/yu.assoua.pdf">Assouad’s lemma</a> construction and upper bounds are known and in fact reasonably small: by Brown and Low’s <a href="https://projecteuclid.org/euclid.aos/1032181159">equivalence results</a>, they are equivalent to estimation of a Hölder-smooth function, for which an appropriately integrated Brownian motion prior provides near-minimax rates. This seems to be saying that universal induction as a frequentist problem is slightly easier than one of the easier single problems in Bayesian nonparametrics. This seems… a little strange, maybe. One way to look at this is to accept, and say that the infinities contemplated by nonparametric inference are the absurd thing, or to marvel that a simple Gaussian Process regression is at least as hard as understanding all laws deriving the behavior of all activity in the universe and be grateful that it only takes cubic time. The other alternative is to suggest that this implies that the prior, while covering the entire space in principle, is satisfying a tightness condition that is so restrictive that it effectively restricts you to a topologically meager set of programs, ruling out in some sense the vast majority of the entire space (this sense is that of <a href="https://en.wikipedia.org/wiki/Baire_category_theorem">Baire category</a>) in the same way that any two Gaussian process priors with distinct covariance functions are mutually singular. In this sense, it is an incredibly strong restriction and hard to justify ex ante, certainly at least as contestable as justifying an ex ante fixed smoothing parameter for your GP prior. 
(If you’ve ever run one of these, you know that’s a dig: people make so many ugly blurry maps.)</p>
<p>Alternatives might arise in fixed tractable inference procedures, or the combination of tractable procedures and models, though all of these have quite a few limitations. MCMC has the same hardness problems as above if you ask for “eventual” convergence, and fairly odd properties if run up to a fixed duration (including nondeterministic outcomes and a high probability that those outcomes exhibit various results often called logical fallacies or biases, which I suppose is not surprising since common definitions of biases or fallacies appear to essentially require Bayesian reasoning to begin with). Variational inference likewise has these issues: even with the variational approximation, note that optimization to reach the modified objective may still be costly or even arbitrarily hard in some cases. Various neuroscientists seem to have taken up <a href="https://en.wikipedia.org/wiki/Free_energy_principle">some forms of variational inference</a> as a descriptive model of brain activity. Without expertise in neuroscience, I will leave well enough alone and say it seems like something that merits further empirical exploration. But as somebody who runs variational inference in practice, with mixed and sometimes surprising results, and computational improvements that don’t always suggest that the issue of cost is resolved, it also doesn’t seem like a full solution. I’m happy my model takes an hour rather than two days to run; I’m not sure this makes the method a compelling basis for decision-making.</p>
<p>I was going to extend this to say something about Bayes-Nash equilibrium, but my problems with that concept are largely distinct, coming from the “equilibrium” rather than the “Bayes”,<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> and I think I’ve conveyed my basic concerns. I don’t know that I have a compelling alternative, except that it may be the case that while an acceptable and actually feasible theory of decision making may have internal states of some kind, I see no reason that one has to have “beliefs” of any kind, at least as objects which reduce in some way to classical truth values over statements. One can simply have actions, which may on occasion correspond to binary decisions over sets that could in principle be assigned a truth value, though usually they won’t. This seems to have connections to the idea of lazy evaluation in nonparametric Bayes, which permits computations consistent with Bayes rule over a high-dimensional space to be retrieved via query without maintaining the full set of possible responses to such queries in memory. But this is only possible in a tractable way while still permitting the results to follow Bayesian inference for fairly limited classes of problems involving conjugacy. More generally, a theory which fully incorporates computational costs will likely have to await further developments in characterizing these costs, a still not fully solved problem in computer science.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>This is something like what theories of “rational inattention” do. However, information processing costs in these theories are taken over a channel between information which still has the representation as a random variable on both sides. The agent is assumed to be on one side of this channel and so is effectively still dealing with information in a fully probabilistic form over which the optimization criterion still requires reasoning to be Bayesian. That is to say, rational inattention is a theory of the information available to an agent, not a theory of the reasoning and decision-making process given available information.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>Roughly, even a computationally unlimited Bayesian agent could not reason itself to Bayes-Nash equilibrium, unless it had the “right” priors. Where these priors come from is left unspoken (except that in the model they are “true”), which is a practical problem that drives a lot of differences between applied computational mechanism design, which is forced to answer this question, and the theory we teach our grad students.<a href="#fnref2" class="footnoteback">↩</a></p></li>
</ol>
</div>

Posterior Samplers for Turing.jl
/post/posteriorsamplersforturingjl/
Sun, 28 Jun 2020 00:00:00 +0000
/post/posteriorsamplersforturingjl/
<p>Prompted by a question on the Slack for <a href="https://turing.ml/" target="_blank">Turing.jl</a> about when to use which Bayesian sampling algorithms for which kinds of problems, I compiled a quick off-the-cuff summary of my opinions on specific samplers and how and when to use them. Take these with a grain of salt, as I have more experience with some than with others, and in any case the nice thing about a framework like Turing is that you can switch out samplers easily and test for yourself which is best for your application.</p>
<p>To get a good visual sense of how different samplers explore a parameter space, the animations <a href="https://chifeng.github.io/mcmcdemo/" target="_blank">page by Chi Feng</a> is a great resource.</p>
<p>The following list covers mainly the samplers included by default in Turing. There’s a lot of work in Bayesian computation with different algorithms or implementations of these algorithms which could lead to different conclusions.</p>
<ol>
<li><p>Metropolis Hastings (MH): Explores the space randomly. Extremely simple, extremely slow, but can “work” in most models. Mainly worth a try if everything else fails.</p></li>
<li><p>HMC/NUTS: Gradient-based exploration, meaning the parameter space needs to be differentiable. It’s fast if that’s true, and so almost always the right choice if you can make your model differentiable (and sometimes so much better that it’s worth making your model differentiable even if your initial model isn’t in order to use it, e.g. by marginalizing out discrete parameters). There are relatively minor differences between NUTS and the default HMC algorithm.</p></li>
<li><p>Gibbs sampling: A “coordinate-ascent” like sampler which samples in blocks from conditional distributions. It used to be popular along with factorizable models where conditional distributions could be updated in closed form due to conjugacy. It’s still useful if you can do this, but slow for general models. Its main use now is for combining samplers, for example HMC for the differentiable parameters and something else for the nondifferentiable parameters.</p></li>
<li><p>SMC/“Particle Filtering”: A method based on importance sampling, reweighting draws from an initial draw and repeatedly updating. It is designed to work well if the proposal distribution and updates are close to the targets. The number of particles should be large for reasonable accuracy. Turing’s implementation does this parameter by parameter starting at the prior and updating, which is close to what you want for the most common intended use, state space models with sequential structure, which is the main use case where I would even consider this. That said, tuning the proposals is really important, and more customizable SMC methods are useful in many cases where one has a computationally tractable approximate posterior you want to update to be closer to an exact posterior. This tends to be model-specific and not a good use case for generic PPLs, though.</p></li>
<li><p>Particle Gibbs (PG), or “Conditional SMC”: like SMC, but modified to be compatible with Metropolis sampling steps. Its main use I can see is as a step in a Gibbs sampler, where it can be used for a discrete parameter, with HMC for the other parts. The number of particles doesn’t have to be overwhelmingly large, due to sampling, but it can cause problems if the number is too small.</p></li>
<li><p>Stochastic Gradient methods (SGLD/SGHMC/SGNHT): Gradient-based methods that subsample the data to get less costly but less accurate gradients for an approximation of deterministic gradient-based methods (SGHMC approximates HMC, SGLD approximates Langevin dynamics, which also uses gradients but is simpler and usually slightly worse than HMC). These are designed for large data applications where going through a huge data set each iteration may be infeasible. They are popular for Bayesian neural network applications, where optimization methods also rely on data subsampling.</p></li>
<li><p>Variational Inference: Not a sampler per se. It comes up with a parametric model for the posterior shape and then optimizes the fit to the posterior according to a computationally feasible criterion (i.e., one which doesn’t require computing the normalizing constant in Bayes rule), allowing you to optimize instead of sampling. In general, this has no guarantee of reaching the true posterior, no matter how long you run it, but if you want a slightly wrong answer very fast it can be a good choice. It’s also popular for Bayesian neural networks, and other “big” models like high dimensional topic models.</p></li>
</ol>
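<p>To give a sense of what “explores the space randomly” means for the first entry in the list, here is a minimal random-walk Metropolis sketch in plain Python. It is an illustration of the textbook algorithm under my own naming and defaults (<code>random_walk_metropolis</code>, <code>step_size</code>, a standard normal target), not Turing’s implementation or API.</p>

```python
import math
import random


def random_walk_metropolis(log_density, x0, n_samples, step_size=1.0, seed=0):
    """Draw n_samples from a 1-D target given its unnormalized log density."""
    rng = random.Random(seed)
    x = x0
    logp = log_density(x)
    samples = []
    n_accepted = 0
    for _ in range(n_samples):
        # Propose a symmetric Gaussian step around the current point.
        proposal = x + rng.gauss(0.0, step_size)
        logp_proposal = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        accept_prob = math.exp(min(0.0, logp_proposal - logp))
        if rng.random() < accept_prob:
            x, logp = proposal, logp_proposal
            n_accepted += 1
        samples.append(x)
    return samples, n_accepted / n_samples


def standard_normal_log_density(x):
    return -0.5 * x * x  # normalizing constant is not needed


samples, acceptance_rate = random_walk_metropolis(
    standard_normal_log_density, x0=0.0, n_samples=20000
)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

<p>Nothing here uses gradients, which is why it can “work” in most models, and also why it is slow: in higher dimensions a blind proposal of any useful size is rarely accepted, which is the gap HMC and NUTS close when the model is differentiable.</p>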

On Online Learning for Economic Forecasts
/post/ononlinelearningforeconomicforecasts/
Tue, 29 Oct 2019 00:00:00 +0000
/post/ononlinelearningforeconomicforecasts/
<p>Jérémy Fouliard, Michael Howell, and Hélène Rey have just released <a href="http://conference.nber.org/conf_papers/f130922.pdf">an update of their working paper</a> applying methods from the field of Online Learning to forecasting of financial crises, demonstrating impressive performance in a difficult forecasting domain using some techniques that appear to be unappreciated in econometrics. Francis Diebold provides <a href="https://fxdiebold.blogspot.com/2019/10/machinelearningforfinancialcrises.html">discussion</a> and <a href="https://fxdiebold.blogspot.com/2019/10/onlinelearningvstvpforecast.html">perspectives</a>. This work is interesting to me as I spent much of the earlier part of this year designing and running a <a href="/Forecasting.html">course on economic forecasting</a> which attempted to offer a variety of perspectives beyond the traditional econometric approach, including that of Online Learning.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> This perspective has been widely applied by machine learning practitioners and businesses that employ them, particularly major web companies like <a href="https://ai.google/research/pubs/pub41159">Google</a>, <a href="https://vowpalwabbit.org/">Yahoo, and Microsoft</a>, but has not seen widespread takeup by more traditional economic forecasting consumers and practitioners like central banks and government institutions.</p>
<p>The essence of the online learning approach has less to do with particular algorithms (though there are many) than with reconsidering the choice of <a href="Forecasting/Evaluation.html">evaluation framework</a>. Consider a quantity to be forecast <span class="math inline">\(y_{t+h}\)</span>, like an indicator equal to 1 in the presence of a financial crisis. A forecasting rule <span class="math inline">\(f(.)\)</span> depending on currently available data <span class="math inline">\(\mathcal{Y}_t\)</span> produces a forecast <span class="math inline">\(\widehat{y}_{t+h}=f(\mathcal{Y}_t)\)</span> which can be evaluated ex post according to a loss function <span class="math inline">\(\ell(y_{t+h},\widehat{y}_{t+h})\)</span> which measures how close the forecast was to being correct. Since we don’t know the true value <span class="math inline">\(y_{t+h}\)</span> until it is observed, to make a forecast we must instead come up with a criterion which compares rules. Traditional econometric forecasting looks at measures of statistical <em>risk</em>,</p>
<p><span class="math display">\[E[\ell(y_{t+h},\widehat{y}_{t+h})]\]</span></p>
<p>where expectation is taken with respect to a (usually unknown) true probability distribution. Online learning, in contrast, aims to provide estimators which have low <em>regret</em> over sequences of outcomes <span class="math inline">\(\{y_{t+h}\}_{t=1}^{T}\)</span> relative to a comparison class <span class="math inline">\(\mathcal{F}\)</span> of possible rules,
<span class="math display">\[\text{Regret}(\{\widehat{y}_{t+h}\}_{t=1}^{T})=\sum_{t=1}^{T} \ell(y_{t+h},\widehat{y}_{t+h})-\underset{f\in\mathcal{F}}{\inf}\sum_{t=1}^{T}\ell(y_{t+h},f(\mathcal{Y}_{t}))\]</span></p>
<p>This criterion looks a little odd from the perspective of traditional forecasting rules: <a href="https://fxdiebold.blogspot.com/2017/02/predictivelossvspredictiveregret.html">Diebold has expressed skepticism</a>. First, regret is a relative rather than absolute standard; to even be defined you need to take a stand on rules you might compare to, which can look somewhat arbitrary. If you choose a class of rules that predict poorly, a low regret procedure will not do well in an absolute sense. Second, there is no expectation or probability, just a sequence of outcomes. Diebold frames this as ex ante vs ex post, as the regret cannot be computed until <em>after</em> the data is observed, while risk can be computed without seeing the data. But this does not accord with how regret is applied in the theoretical literature. Risk can be computed only with respect to a probability measure, which has to come from somewhere. One can build a model and ask that this be the “true” probability measure describing the sequence generating the data, but this is also unknown. To get ex ante results for a procedure, one needs to take a stand on a model or class of models. Then one can evaluate results either <em>uniformly</em> over models in the class (this is the classic <a href="Forecasting/StatisticalApproach.html">statistical approach</a>, used implicitly in much of the forecasting literature, like Diebold’s <a href="https://www.sas.upenn.edu/~fdiebold/Teaching221/econ221Penn.html">textbook</a>) or <em>on average</em> over models, where the distribution over which one averages is called a <em>prior distribution</em> and leads to <a href="Forecasting/Bayes.html">Bayesian forecasting</a>. In the online learning context, in contrast, one usually seeks guarantees which apply for <em>any</em> sequence of outcomes, as opposed to over a distribution. So the results are still ex ante, with the difference being whether one needs to take a stance on a model class or a comparison class. 
There are reasons why one might prefer either approach. For one, <a href="https://itzhakgilboa.weebly.com/uploads/8/3/6/3/8363317/gilboa_notes_for_introduction_to_decision_theory.pdf">standard decision theory</a> requires use of probability in “rational” decision making. But the probabilistic framework is often extremely restrictive in terms of the guarantees it provides on the type of situations in which a procedure will perform well. In general, one must have a model which is correctly specified, or at least not too badly misspecified.</p>
<p>Especially for areas where the economics is still in dispute, the confidence that one has that the models available to us encompass all likely outcomes maybe shouldn’t be so high. This is the content of the Queen’s question to which the title of the FHR paper refers: many or most economists before the financial crisis were using models which did not foresee a particularly high probability of such an event. For that reason, a procedure which allows us to perform reasonably over <em>any</em> sequence of events, not just those likely with respect to a particular model class, may be particularly desirable; a procedure with a low regret guarantee will do so, and be known to do so <em>ex ante</em>, as long as there is some comparator which performed well <em>ex post</em>. Ideally, we would like to remove that latter caveat, but as economists like to say, there is <a href="https://en.wikipedia.org/wiki/No_free_lunch_theorem">no free lunch</a>. One can instead do analysis based on risk without assuming one has an approximately correct model; this is the content of <a href="https://books.google.com/books?hl=en&lr=&id=EqgACAAAQBAJ&oi=fnd&pg=PR7&dq=Vapnik+statistical+learning+theory&ots=g3KhtaV29&sig=5p6V7MW49xnzKUQGoAf7gRJZow0#v=onepage&q=Vapnik%20statistical%20learning%20theory&f=false">statistical learning theory</a>. But this usually involves introducing a comparison class of models <span class="math inline">\(\mathcal{F}\)</span> to study a relative criterion, the <em>oracle risk</em> <span class="math inline">\(E[\ell(y_{t+h},\widehat{y}_{t+h})]-\underset{f\in\mathcal{F}}{\inf}E\ell(y_{t+h},f(\mathcal{Y}_t))\)</span>, or variants thereof.<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> This requires both a comparison class and some restrictions on distributions to get uniformity; Vapnik considered the i.i.d. 
case, which is unsuitable for most time series forecasting applications; extensions need some version of stationarity and/or <a href="https://papers.nips.cc/paper/3489rademachercomplexityboundsfornoniidprocesses">weak</a> <a href="https://projecteuclid.org/download/pdf_1/euclid.aop/1176988849">dependence</a>, or strong conditions on the class of nonstationary processes allowed, which can be problematic when one does not know what kind of distribution shifts are likely to occur.</p>
<p>This brings us to the content of the forecast procedures used: FHR apply classic Prediction with Expert Advice algorithms, like versions of Exponential Weights (closely related to the “Hedge” algorithm of <a href="http://rob.schapire.net/papers/FreundSc95.pdf">Freund and Schapire 1997</a>) and Online Gradient Descent (<a href="https://www.aaai.org/Papers/ICML/2003/ICML03120.pdf">Zinkevich 2003</a>), which take a set of forecasts and form a convex combination of them with weights that update each round of predictions. Diebold <a href="https://fxdiebold.blogspot.com/2019/10/onlinelearningvstvpforecast.html">notes</a> that these are essentially versions of <a href="Forecasting/ModelCombination.html">model averaging procedures</a> which allow for time-varying weights, which are frequently studied by econometricians, complaining that “ML types seem to think they invented everything”. To this I have two responses. First, on a credit attribution level, the online learning perspective originates in the studies of sequential decision theory and game theory from people like Wald and Blackwell, squarely in the economics tradition, and the techniques became ubiquitous in the ML field through <a href="http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf">“Prediction, Learning, and Games”</a>, by Cesa-Bianchi and Lugosi, the latter of whom is in an Economics department. So if one wants to claim credit for these ideas for economics, go right ahead. Second, there are noteworthy distinctions between these ideas and statistical approaches to forecast combination. In particular, the uniformity over sequences of the regret criterion ensures that not only does it permit changes over time, these can be completely arbitrary and do not have to accord with a particular model of the way in which the shift occurs. 
So while the approaches can be analyzed in terms of statistical properties, and may correspond to known algorithms for a particular model, the reason they are used is to ensure guarantees for arbitrary sequences, a property which is not shared by general statistical approaches. In fact, a classic result in online model combination (cf. Section 2.2 in <a href="https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf">Shalev-Shwartz</a>) shows that some approaches with reasonable risk properties, like picking the forecast with the best performance up to the current period, can give unbounded regret for particularly bad sequences. The fact that a combination procedure adapts to these “poorly behaved” sequences is more important than the fact that it gives time-varying weights per se.</p>
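<p>The contrast between picking the best forecast so far and reweighting all forecasts can be seen in a small simulation. The sketch below is a standard textbook-style construction, not the FHR procedure or data: two hypothetical experts whose losses alternate so that whoever currently leads does worst next round. Follow the Leader is compared with Exponential Weights using the usual fixed learning rate <span class="math inline">\(\eta=\sqrt{8\ln N/T}\)</span>, for which the regret of the weighted-average loss is at most <span class="math inline">\(\sqrt{T\ln N/2}\)</span> when losses lie in [0, 1]. All function names are my own.</p>

```python
import math


def adversarial_sequence(n_rounds):
    """Losses for two experts: a small head start, then strict alternation."""
    losses = [(0.5, 0.0)]
    for t in range(1, n_rounds):
        losses.append((0.0, 1.0) if t % 2 == 1 else (1.0, 0.0))
    return losses


def best_expert_loss(losses):
    """Cumulative loss of the best single expert in hindsight."""
    return min(sum(column) for column in zip(*losses))


def follow_the_leader(losses):
    """Each round, use the expert with the lowest cumulative loss so far."""
    cumulative = [0.0] * len(losses[0])
    total = 0.0
    for round_losses in losses:
        leader = min(range(len(cumulative)), key=lambda i: cumulative[i])
        total += round_losses[leader]
        cumulative = [c + l for c, l in zip(cumulative, round_losses)]
    return total


def exponential_weights(losses, eta):
    """Weighted-average loss with weights proportional to exp(-eta * cumulative loss)."""
    cumulative = [0.0] * len(losses[0])
    total = 0.0
    for round_losses in losses:
        shift = min(cumulative)  # subtract the minimum for numerical stability
        weights = [math.exp(-eta * (c - shift)) for c in cumulative]
        normalizer = sum(weights)
        total += sum(w / normalizer * l for w, l in zip(weights, round_losses))
        cumulative = [c + l for c, l in zip(cumulative, round_losses)]
    return total


T = 1000
losses = adversarial_sequence(T)
eta = math.sqrt(8.0 * math.log(2) / T)
benchmark = best_expert_loss(losses)
ftl_regret = follow_the_leader(losses) - benchmark
hedge_regret = exponential_weights(losses, eta) - benchmark
hedge_bound = math.sqrt(T * math.log(2) / 2.0)
```

<p>On this sequence Follow the Leader always backs the expert that is about to fail, so its regret grows linearly in <span class="math inline">\(T\)</span>, while the Exponential Weights regret stays within the square-root bound. The learning rate here assumes the horizon <span class="math inline">\(T\)</span> is known in advance; versions that adapt the rate over time exist but are not shown.</p>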
<p>For these reasons, I think Online Learning approaches at minimum deserve more use in economic forecasting and I am pleased to see the promising results of FHR, as well as the growing application of minimax regret criteria in other areas of economics like <a href="https://arxiv.org/abs/1909.06853">inference and policy</a> under partial identification and areas like <a href="http://yingniguo.com/wpcontent/uploads/2019/09/slidesregulation.pdf">mechanism design</a> where providing a well-specified distribution over outcomes can be challenging.</p>
<p>There are still many issues that need more exploration, and there are important limitations. One thing existing online methods do not handle well is fully unbounded data; the worst case over all sequences leads to completely uninformative bounds, even for regret. For this reason, forecasting indicators is a good place to start. Whether it is even possible to extend these methods to data with unknown trends is still an open question, which may limit their suitability for many types of economic data. Tuning parameter selection is a topic of active research, with plenty of work on adapting these to the interval length and data features. Methods which perform well by regret criteria but also adapt to the case in which one does have stationary data and so could do well with a model-based algorithm are also a potentially promising direction. If one has real confidence in a model, it makes sense to rely on it, in which case statistical approaches are fine. But for many applications where the science is less settled and one might plausibly see data that doesn’t look like any model we have written down, we should keep this in our toolbox.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>For a better overview of the field than I can provide, see the survey of <a href="https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf">Shalev-Shwartz</a>, the monograph of <a href="https://ocobook.cs.princeton.edu/OCObook.pdf">Hazan</a>, or courses by <a href="http://www.mit.edu/~rakhlin/6.883/">Rakhlin</a> or <a href="http://www.mit.edu/~rakhlin/courses/stat928/">Rakhlin and Sridharan</a>.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p>Econometricians are used to thinking of this from the perspective of misspecification a la <a href="https://www.jstor.org/stable/1912526">Hal White</a> in which one compares to risk under the “pseudo-true” parameter value of the best predictor in a class. An alternative, popular in machine learning, is to use a data-dependent comparator, the <em>empirical risk</em>, and prove bounds on the generalization gap. Here again we are effectively using the performance of a model or algorithm class for a relative measure.<a href="#fnref2" class="footnoteback">↩</a></p></li>
</ol>
</div>

An Empirical Heterogeneous Agents Models Reading List
/post/anempiricalheterogeneousagentsmodelsreadinglist/
Tue, 20 Nov 2018 00:00:00 +0000
/post/anempiricalheterogeneousagentsmodelsreadinglist/
<p>Inspired by a request by <a href="https://khakieconomics.github.io/" target="_blank">Jim Savage</a> asking for examples of recent work using heterogeneous agent models, I’ve put together a far from comprehensive list of papers demonstrating the range of work being done using these tools to understand a variety of issues at the intersection of macroeconomics and microeconomic data. While the field has a ways to go in terms of econometric modeling, the best recent work involves much more substantial use of data to discipline results and compare alternative hypotheses. While the days of putting together a model with some half-baked mechanism, “calibrating” a few parameters to whatever values some old person used in a paper that got published, and showing a table of randomly chosen model and data moments to compare for which a J test would soundly reject equality but which for some reason you call a “good fit” are, um, not quite over, recent work now commonly involves actually writing down models which can encompass multiple competing mechanisms, collecting microeconomic data which directly speaks to those mechanisms, and using that data and model to quantitatively evaluate the results. Some of the work is even capable of putting standard errors on the estimates!</p>
<p>The following short list gives a taste of where the field has been moving, and I have some hope that it will continue to move further in this direction.</p>
<h1 id="incomedistributionandtheeconomy">Income Distribution and the Economy</h1>
<p>Handbook Chapter bringing up to date and empirically evaluating the research efforts descending from foundational work of <a href="http://perseus.iies.su.se/~pkrus/ref_pub/250034.pdf" target="_blank">Krusell and Smith 1998</a>:</p>
<p>Krueger, Mitman, and Perri “<a href="https://www.dropbox.com/s/y9yv228pnaaie4i/HandbookKMP.pdf?raw=1" target="_blank">Macroeconomics and Household Heterogeneity</a>” 2016</p>
<h1 id="incomeandwealthdistributionmodeling">Income and Wealth Distribution Modeling</h1>
<p>Gabaix, Lasry, Lions, Moll “<a href="https://scholar.harvard.edu/files/xgabaix/files/dynamics_of_inequality.pdf" target="_blank">The Dynamics of Inequality</a>” <em>ECTA</em> 2016</p>
<p>Hubmer, Krusell, and Smith “<a href="https://economics.yale.edu/sites/default/files/files/Graduate/Uploads/HubmerKrusellSmith_Wealth2017.pdf" target="_blank">The Historical Evolution of the Wealth Distribution: A Quantitative-Theoretic Investigation</a>” 2017</p>
<h1 id="monetary">Monetary</h1>
<p>Most influential recent paper:</p>
<p>Kaplan, Moll, and Violante “<a href="http://violante.mycpanel.princeton.edu/Workingpapers/HANK_revision_MASTER.pdf" target="_blank">Monetary Policy According to HANK</a>” <em>AER</em> 2018</p>
<p><em>See also</em>:</p>
<p>Auclert “<a href="http://web.stanford.edu/~aauclert/mp_redistribution.pdf" target="_blank">Monetary Policy and the Redistribution Channel</a>” 2018
and <a href="https://aauclert.people.stanford.edu/research" target="_blank">several other papers by Auclert</a></p>
<p>Gornemann, Kuester and Nakajima “<a href="https://drive.google.com/file/d/0BxRm9kW6_YBqWFRrajRLWGJ0SHc/view" target="_blank">Doves for the Rich, Hawks for the Poor? Distributional Consequences of Monetary Policy</a>” <em>ECTA forthcoming</em></p>
<h1 id="development">Development</h1>
<p>The <a href="http://townsendthai.mit.edu/" target="_blank">Townsend Thai Project</a>, which collects extremely detailed spending, income and credit data from rural Thai villages, has inspired a large number of <a href="http://townsendthai.mit.edu/papers/" target="_blank">papers using the data</a> along with quantitative heterogeneous agent models to understand credit in rural economies. See as an example</p>
<p>Buera, Kaboski, and Shin “<a href="https://www3.nd.edu/~jkaboski/BKS_MacroMicro.pdf" target="_blank">The Macroeconomics of Microfinance</a>” 2017</p>
<h1 id="additionaltopics">Additional Topics</h1>
<p>Literally anything by <a href="https://voices.uchicago.edu/vavra/research/" target="_blank">Joe Vavra</a> is several standard deviations in quality above the rest of the field in terms of careful, detailed empirical modeling. See, e.g., his work on <a href="https://cpbusw2.wpmucdn.com/voices.uchicago.edu/dist/7/914/files/2018/03/housing8162017riidnybody23xw55t.pdf" target="_blank">House Prices and Consumer Spending</a> or <a href="https://cpbusw2.wpmucdn.com/voices.uchicago.edu/dist/7/914/files/2018/03/econometrica_body2kbvfl4.pdf" target="_blank">Durables Consumption over the Business Cycle</a> and his work using BLS micro pricing data with <a href="https://cpbusw2.wpmucdn.com/voices.uchicago.edu/dist/7/914/files/2018/03/qje_final_online2asbwxj.pdf" target="_blank">heterogeneous firm models of price dispersion</a>.</p>
<p><a href="https://sites.google.com/site/kyleherkenhoff/research" target="_blank">Kyle Herkenhoff</a> is another researcher doing detailed empirical work in this field, with a focus on consumer credit.</p>

Solution of Rational Expectations Models with Function Valued States
/publication/functionvaluedstates/
Mon, 15 Jan 2018 00:00:00 0500
/publication/functionvaluedstates/
<p>Previous versions of this paper circulated under the title <em>On the Solution and Application of Rational Expectations Models with Function-Valued States</em>.</p>

Top Papers 2017
/post/toppapers2017/
Thu, 07 Dec 2017 00:00:00 +0000
/post/toppapers2017/
<p>
Inspired by Paul Goldsmith-Pinkham and following on <a href="https://www.bloomberg.com/view/articles/20171204/thebestbooksandresearchoneconomicsin2017">Noah Smith</a> and others in an end-of-year tradition, here is a not-quite-ordered list of the top 10-ish papers I read in 2017. I read too many arXiv preprints and older papers to choose ones based on actual publication date, so these are chosen from the “Read in 2017” folder of my reference manager, which tells me that I have somehow read 176 papers (so far) in 2017. There was a lot of chaff in this set, and many more good works still sitting in my “to read” folder, but I did manage to find a few gems, which I can list and briefly describe, in sparsely-permuted alphabetical order.
</p>
<ol style="liststyletype: decimal">
<li>
Arellano, Blundell, & Bonhomme <a href="http://www.ucl.ac.uk/~uctp39a/ABB_Ecta_May_2017.pdf">Earnings and Consumption Dynamics: A Nonlinear Panel Data Framework</a> Econometrica 2017
</li>
</ol>
<p>
This paper solves a problem which one would think would have been tackled a long time ago, but turned out to require several modern tools. What is the reduced form implied by intertemporally optimizing dynamic decision models of the kind underlying quantitative heterogeneous agent macro models, and can it be identified and estimated from micro panel data? They show that the form is a nonparametric Hidden Markov Model, and given long enough panels and some standard completeness conditions the distributions can be nonparametrically identified and estimated using conditional distribution estimation methods based on sieve quantile regressions. This seems like a good start to taking seriously the implications of these models and reformulating them to match micro data.
</p>
<p>
See also: their <a href="http://www.cemfi.es/~arellano/AR_Survey_Revised_2.pdf">survey</a> in the Annual Review of Economics showing how most dynamic models used in macro correspond to their framework.
</p>
<ol start="2" style="liststyletype: decimal">
<li>
Susan Athey & Stefan Wager <a href="http://arxiv.org/abs/1702.02896">Efficient Policy Learning</a>
</li>
</ol>
<p>
Part of a growing literature on learning optimal (economic) policies from data by directly estimating the policy to maximize welfare, rather than learning model parameters which are only ex post used in policy exercises without proper accounting for uncertainty. This paper focuses on the binary policy case, in the presence of unconfounded observational data on program results: find a policy rule <span class="math inline"><span class="math inline">\(\pi(X)\)</span></span> out of some class of policies which apply a program or not to an individual with covariates X. This paper takes a minimax approach, using a doubly robust estimator and showing approximately optimal bounds on regret (relative to the unknown best policy in the class) over possible nonparametric structural parameters, using some novel bounds. These bounds rely strongly on the binary structure, so extending to more complicated policy problems may take some work, but the approach seems highly promising.
</p>
<p>
See also: <a href="https://arxiv.org/abs/1704.06431">Luedtke and Chambaz</a> who claim to achieve faster (<span class="math inline"><span class="math inline">\(\frac{1}{n}\)</span></span> instead of <span class="math inline"><span class="math inline">\(\frac{1}{\sqrt{n}}\)</span></span>) rates in the same problem. This appears to have motivated an update to the original version of the Athey-Wager paper showing that the <span class="math inline"><span class="math inline">\(\frac{1}{\sqrt{n}}\)</span></span> rate and their bound are (worst-case) optimal in the shrinking signal regime, where the size of the treatment effect function is of the same order as the noise and the measurement issue is of first-order importance for policy, unlike in the fixed signal size regime, where statistical uncertainty has a lower-order effect.
</p>
<ol start="3" style="liststyletype: decimal">
<li>
Max Kasy <a href="https://scholar.harvard.edu/kasy/publications/optimaltaxationmachinelearning">“A Machine Learning Approach to Optimal Policy and Taxation”</a>
</li>
</ol>
<p>
In the same vein as above, but allows continuous instead of binary policies, and takes a Bayesian instead of a minimax approach, advocating nonparametric Gaussian process methods for estimating unknown policy effects, which are shown to fit naturally in many optimal policy problems, and allow straightforward Bayesian decision analysis to be used, which has advantages for composition with other types of problems.
</p>
<ol start="4" style="liststyletype: decimal">
<li>
Andrii Babii <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2962746">Honest confidence sets in nonparametric IV regression and other ill-posed models</a>
</li>
</ol>
<p>
Confidence bands for NPIV and similar Tikhonov-regularized inverse problems. It’s well known that Tikhonov methods are optimal only up to a certain level of ill-posedness, and these bounds may be a bit conservative, but results should be quite useful in a variety of settings.
</p>
<ol start="5" style="liststyletype: decimal">
<li>
Beskos, Girolami, Lan, Farrell, & Stuart <a href="http://arxiv.org/abs/1606.06351">Geometric MCMC for Infinite-Dimensional Inverse Problems</a>
</li>
</ol>
<p>
Hamiltonian Monte Carlo for nonparametric Bayes! Introduces a family of dimension-free MCMC samplers which generalize the hits of finite-dimensional MCMC, from Metropolis-Hastings to MALA to HMC, to the case of an infinite-dimensional parameter. The issue of mutual singularity of infinite-dimensional measures requires the posterior to be nonsingular with respect to the original, e.g., Gaussian Process prior (meaning, among other things, no tuning of length scale parameters), but within this restricted class, it allows models based on nonlinear transformations which make it applicable to computational inverse problems like PDE models and so greatly expands the class of feasible nonparametric Bayesian methods without relying on conjugacy or variational inference or many of the similar process-dependent or poorly understood tricks used to extend Bayes to the high-dimensional setting.
</p>
<p>
See also: Betancourt <a href="http://arxiv.org/abs/1701.02434">A Conceptual Introduction to Hamiltonian Monte Carlo</a>, a brilliant and beautifully illustrated overview of how (finite-dimensional) HMC works and how to apply it.
</p>
<ol start="6" style="liststyletype: decimal">
<li>
Chen, Christensen, Tamer <a href="http://arxiv.org/abs/1605.00499">MCMC Confidence Sets for Identified Sets</a>
</li>
</ol>
<p>
A clever idea for inference in partially identified models. They note that in many cases, even when structural parameters are nonidentified, the identified set itself is a regular parameter (or deviates from regularity in a tractable way) in the sense of inducing a (possibly nonclassical) Bernstein-von Mises theorem, so the Bayesian posterior for the identified set itself (though not the parameter) converges to a well-behaved distribution. For models defined by a (quasi-)likelihood, these sets are essentially the minima of the criterion function, and so CCT show that in the limit, the quantiles of the likelihood in MCMC samples from the (quasi-)posterior define a valid frequentist confidence interval for the identified set. As the fully Bayesian posterior will in general concentrate inside the identified set (this is what having prior information means), one can then easily extract both Bayesian and valid frequentist assessments of uncertainty from the same MCMC chain, even without identification. This approach is conservative if you want inference for the parameter itself, for which the available methods often involve a bunch of delicate tuning parameter manipulation, and does not seem applicable to many cases where the identified set is itself irregularly identified, or in many cases of weak identification, which seems dominant in the kinds of models I tend to work on, but this is why this field is so active.
</p>
<p>
See also: Too many papers to list on weak or partial identification in specific settings. Andrews & Mikusheva (2015) and Qu (2014) on DSGEs.
</p>
<ol start="7" style="liststyletype: decimal">
<li>
Rachael Meager <a href="http://economics.mit.edu/files/12292">Aggregating Distributional Treatment Effects: A Bayesian Hierarchical Analysis of the Microcredit Literature</a>
</li>
</ol>
<p>
A spearhead in a hopefully coming paradigm shift in economics toward taking seriously the issues of “tiny data” in the big-data era. Some clever mix of economically and structurally informed parametric modeling and Bayesian computation for aggregating information in quantile treatment curves across studies, with an application based on 7 data points (!) which themselves are estimated functions derived from medium-scale field experiments. This hierarchical paradigm of sharing information between flexibly modeled components for parts where data is abundant and more judiciously parameterized components where it is not, in a seamless way, seems to characterize a useful and principled approach to an omnipresent situation in economic data with applications far beyond the program evaluation context.
</p>
<ol start="8" style="liststyletype: decimal">
<li>
Ulrich Müller & Mark Watson <a href="http://www.princeton.edu/~umueller/lfcorr.pdf">Long Run Covariability</a>
</li>
</ol>
<p>
Speaking of tiny data, this is the latest in Müller and Watson’s series on long run and low frequency modeling for time series, using cosine expansions to explicitly bring out the small-sample nature of the problem of learning about long-run behavior, given that we are only getting new data points on 50 year periods approximately every 50 years. This paper extends from univariate to multivariate modeling, offering a simple alternative to cointegration-based approaches which restrict our modeling of long term relationships based on a very particular parametric structure. The new methods don’t alleviate the need for simple parametric models in this tiny data space, but they do show explicitly how the problem is that of fitting a curve to a scatterplot with 10-12 points at most, and so permit use of explicit small sample methods to describe and analyze the data.
</p>
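<p>
To make the “scatterplot with a handful of points” idea concrete, here is a minimal sketch of projecting a series onto a few low-frequency cosine weights. The weight form <code>sqrt(2) * cos(j * pi * (t - 1/2) / T)</code>, the choice <code>q = 12</code>, the function names, and the simulated random walk are all assumptions made for illustration, not the paper’s code or data.
</p>

```python
import numpy as np

def cosine_weights(T, q):
    """T x q matrix with columns sqrt(2) * cos(j * pi * (t - 1/2) / T), j = 1..q.

    This particular form of the weights is an assumption for illustration.
    """
    s = (np.arange(1, T + 1) - 0.5) / T   # midpoints of T equal subintervals of [0, 1]
    j = np.arange(1, q + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(s, j))

def low_frequency_projection(x, q=12):
    """Weighted averages of x against the first q cosine weights."""
    x = np.asarray(x, dtype=float)
    Psi = cosine_weights(len(x), q)
    return Psi.T @ x / len(x)

# Hypothetical data: a simulated random walk standing in for a long macro series.
rng = np.random.default_rng(0)
T = 280
x = np.cumsum(rng.standard_normal(T))

# However long the series, the long-run information is summarized by q numbers.
coefs = low_frequency_projection(x, q=12)
print(coefs.shape)
```

<p>
Because the cosine columns are orthogonal to each other and to a constant, the projection ignores the level of the series and reduces the long-run analysis to a small, fixed number of coordinates, which is what makes explicit small-sample methods applicable.
</p>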
<p>
See also: their survey paper on this research agenda, <a href="http://www.princeton.edu/~umueller/ULFE.pdf">“Low Frequency Econometrics”</a>
</p>
<ol start="9" style="liststyletype: decimal">
<li>
Mikkel Plagborg-Møller <a href="https://scholar.princeton.edu/sites/default/files/mikkelpm/files/irf_bayes.pdf">Bayesian Inference on Structural Impulse Response Functions</a>
</li>
</ol>
<p>
As an alternative to Bayesian VARs, which put priors on autoregression coefficients and then derive IRFs under some bizarre exact or partial exclusion restrictions, you can put priors on IRFs directly, essentially by putting priors on the moving average representation instead. This also has some nice asymptotic theory, showing that the autocovariances are the part which is actually identified, and the posterior converges to a distribution over the set of IRFs consistent with a given spectrum, with priors providing weights inside of that by marginalization to the identified set. The results here impose a finite order vector MA representation, which reduces comparability with VAR-based methods, though it doesn’t seem obviously worse.
</p>
<ol start="10" style="liststyletype: decimal">
<li>
Al-Shedivat, Wilson, Saatchi, Hu, and Xing <a href="http://arxiv.org/abs/1610.08936">Learning Scalable Deep Kernels with Recurrent Structure</a>
</li>
</ol>
<p>
While MCMC methods are now becoming practical for the small data regime in Bayesian nonparametrics, and conjugacy-based methods like Gaussian process regression allow work in the medium data regime, for even moderately large data sets, the cubic complexity of these methods makes them impractical, and so a huge literature on simplifications or approximations has developed. Restricting covariance processes allows reduction to quadratic (certain kernel approximations), linearithmic (spectral methods) or even linear time, but many of these methods cost a lot in terms of expressivity. A Gaussian process is already modeling the mean of the unobserved points as a linear function of the observed values, and approximations strongly restrict the coefficients. This paper, by, among others, some folks at CMU, offers a linear time GP approximation method which seems to offer a nice compromise. Using choice of points to approximate and some interpolation techniques, they can get the numerical approximation costs down a lot, but the method allows for quite complex kernels, including, in this case, a kernel parametrized by a (recurrent) neural network optimized along with the process, which allows quite a bit of flexibility. This kind of merger of classical nonparametric Bayes and neural network methods seems very promising, and this is just one bit of an explosion of approaches to combining the two methods, but this seems like a very practical one.
</p>
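<p>
For reference, here is a minimal sketch of plain (exact) Gaussian process regression, showing where the cubic cost comes from and in what sense the posterior mean is linear in the observed values. It is not the method of the paper: the squared-exponential kernel, the noise variance, the function names, and the toy sine data are all assumptions chosen for illustration.
</p>

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise_var=1e-2, length_scale=1.0):
    """Exact GP regression posterior mean at x_test (zero prior mean assumed)."""
    n = len(x_train)
    K = rbf_kernel(x_train, x_train, length_scale) + noise_var * np.eye(n)
    # The Cholesky factorization of the n x n matrix is the O(n^3) step.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    # The prediction is a fixed weight matrix applied to y_train: linear in the data.
    K_star = rbf_kernel(x_test, x_train, length_scale)
    return K_star @ alpha

# Toy data (hypothetical): noiseless observations of a sine curve.
x_train = np.linspace(0.0, 5.0, 30)
y_train = np.sin(x_train)
x_test = np.array([1.0, 2.5, 4.0])

mean = gp_posterior_mean(x_train, y_train, x_test)
print(mean)
```

<p>
Approximation schemes of the kind discussed above avoid factorizing the full kernel matrix, which amounts to restricting the weights that map observed values to predictions.
</p>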
<p>
See also: Many papers from the Blei lab at Columbia, with a variety of approaches for speeding up Bayes, often relying on variational inference, which is based on approximating posteriors via optimization of (normalization-constant-free) lower bounds. The Blei lab has a lot of work attempting to turn variational inference from black magic into science, though I still find it all quite mysterious. <a href="http://arxiv.org/abs/1601.00670">This tutorial</a> isn’t a bad intro for applications. The <a href="http://www.nowpublishers.com/article/Details/MAL001">classic book by Wainwright and Jordan</a>, which offers a theoretical perspective, is an imposing weight on my to-read list.
</p>
<div id="takeaways" class="section level2">
<h2>
Takeaways
</h2>
<p>
This year was big on Bayes for me, reflecting both my research interests and the movement in the profession, which is looking at principled approaches to mixing theory and data to allow the data to take precedence when it is abundant (nonparametric methods) and let theory pull some weight in the subcomponents of the problem where it is not (macro time series, for one, causal inference in observational data for another). This also reflects the long-awaited publication of the Ghosal and van der Vaart Bayesian Nonparametrics textbook, which sent me down several lines of inquiry and displaced reading papers for me for much of the summer.
</p>
</div>

Computational Methods for Economic Models with Function Valued States
/publication/thesis/
Sun, 01 May 2016 00:00:00 0400
/publication/thesis/
<p>Portions of this PhD thesis have been adapted into the paper <em>Solution of Rational Expectations Models with Function Valued States</em>. The thesis contains additional results including a very different algorithm for the noncompact case, expanded analysis of the economic geography model, and additional numerical applications.</p>