In the rapidly developing realm of data analysis on graphs and manifolds, as well as in a variety of fields of applied and maybe pure mathematics, a primary concern is characterizing the geometric structure of a space and describing the properties of the objects that live there.
A typical example of a question this field tries to answer is manifold estimation or unfolding: given a bunch of points, sampled from a (sub)manifold in a higher-dimensional space, construct a reasonable estimate of that lower-dimensional shape on which the points live.1 Though methods for the general case are rather new,2 when we have a linear subspace this is simply the familiar problem of Principal Components Analysis or Factor Analysis, dating back to the 1920s.
Alternately, we may only be interested in some limited set of geometric features of the space, such as its connected components: this is (one way to look at) the problem of clustering. If we are more ambitious, we might want to associate some type of data to the points of a space, which we may or may not know ex ante. Depending on the type of data, and on the underlying space, this goes by various names. When the space is known, it can be manifold regression, graph regression or smoothing, density estimation on manifolds or graphs, or graph signal processing. When the structure is unknown, but assumed to be approximately linear, one has reduced rank regression, principal components regression, or some category of factor model. To the extent that the assignment of the data bears some relationship to the underlying space (for example, the assignment is continuous or differentiable, or some analogue thereof), estimating this relationship must also take into account some of the geometric features of the space.
In an area well-explored in differential geometry but only beginning to be explored in geometric data analysis, the object to be related to the space of interest may itself incorporate some structure which is related to the structure of the underlying space. For example, it may live on, or itself be, some kind of fiber bundle, such as a vector bundle or tangent bundle, which describes flows over a manifold (or graph) along with their directions and speeds. So, one may consider (archetypally) wind on the surface of the earth, or messages over a network, or workers changing jobs as moving around in the space of possible occupations.3 These spaces themselves have a manifold structure which can be estimated or exploited in characterizing the data. One can likewise imagine (if you’re the kind of person who does that sort of thing) extending this kind of analysis to more general structures.
Given the rather general nature of these spaces, one has many options for describing their geometric properties. A graph is defined in terms of its nodes and edges and a manifold in terms of its charts and transition maps, but these are hard to work with and we would like an object from which we can easily compute whatever properties we need of the space. Some space of objects isomorphic to (or, in a pinch, surjective onto) the space of all (relevant) manifolds or graphs seems like the ideal choice, and if one can easily extract summaries, in some sense, even better.
What are Laplacians?
While not universal, for a large portion of the people working in this field, the object of choice is overwhelmingly the Laplacian, denoted \(\Delta\) or \(\nabla \cdot \nabla\) or \(\nabla^2\), a differential (or, for graphs, difference) operator over functions defined on the space. This is usually defined as the divergence of the gradient: for a Euclidean space \(\mathbb{R}^n\), this is simply the sum of the second partial derivatives in each direction, \(\Delta=\sum_{i=1}^n \partial_i^2\). Over a graph, if D is the diagonal matrix with the number of edges at each node (its degree) on the diagonal, with the nodes in some fixed but arbitrary order, and A is the adjacency matrix, with a 1 in position (i,j) if there is an edge between nodes i and j and a 0 otherwise (including on the diagonal), the Laplacian is simply \(L:=D-A\). While in the graph case it is easy to see that this object contains all the information in the structure of the graph, why it should take the form of a linear operator, and why this particular linear operator, always seemed a bit mysterious to me. So, I thought it would be worthwhile to lay out my thoughts on different ways of looking at this object that might explain how it’s used and why it takes pride of place in structured data analysis.
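To make the graph version concrete, here is a minimal sketch (in Python with NumPy; the language choice and the small path-graph example are mine, not anything from the sources above) that builds \(L = D - A\) and checks the identity \(f^\top L f = \sum_{(i,j)\in E}(f_i - f_j)^2\), one way to see that the Laplacian measures how much a function varies across edges:

```python
import numpy as np

# A small example graph: a path on four nodes, 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

# Adjacency matrix A and degree matrix D.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))

# The graph Laplacian.
L = D - A

# Sanity checks: L is symmetric, its rows sum to zero (constant functions are in
# its null space), and its quadratic form sums squared differences over edges,
# which is one way to see how it measures the "roughness" of a function f.
f = np.array([1.0, 2.0, 0.0, 5.0])
assert np.allclose(L, L.T)
assert np.allclose(L.sum(axis=1), 0.0)
assert np.isclose(f @ L @ f, sum((f[i] - f[j]) ** 2 for i, j in edges))
```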
One issue is that there are many different definitions of the Laplacian, because there are many different definitions of a derivative, gradient, and divergence, depending on the space in which they act and other desiderata. Given some definition of a gradient, the divergence is usually defined as its adjoint, with the inner product determined by the space, and so many cases are identical up to the choice of space and derivative. For example, the graph case corresponds to taking differences over connected vertices (a worked graph version of this adjoint construction is sketched after the list below). Over manifolds, where the derivative can no longer be represented as an element of the space, there are at least two common formulations:
- On Riemannian manifolds, the Laplace-Beltrami operator (or connection Laplacian) is defined in terms of the Levi-Civita connection, the metric-compatible covariant derivative on the tangent bundle, or equivalently as the trace of the Hessian (itself defined in terms of covariant derivatives).
- Using instead the exterior derivative \(d\) and codifferential \(\delta\), which act on the exterior algebra of differential forms of a space, one can construct the Laplace-de Rham operator (or Hodge Laplacian) on differential forms as \(d\delta + \delta d\).
- If the graph Laplacian is seen as generalizing the Laplace-Beltrami operator to graphs, the combinatorial Laplacian can be seen as generalizing the Laplace-de Rham operator to simplicial complexes, higher-order structures over graphs, with the exterior derivative and codifferential replaced by coboundary and boundary maps, respectively, taking differences among faces connected by a coface or among cofaces connected by a face.
- Malliavin calculus induces a gradient on functionals of stochastic processes, the Malliavin derivative, and a corresponding adjoint, the Skorohod integral, familiar in the case of adapted processes as the Itō integral. Successive application yields a stochastic analogue of the Laplacian, the Ornstein-Uhlenbeck operator, so named because it is the infinitesimal generator of a standard Ornstein-Uhlenbeck process, the “continuous AR(1)”.
- For stochastic processes defined over more general manifolds, one may generalize the “usual” Malliavin derivative and OU operator analogously to how one generalizes the Euclidean derivative to manifolds, though apparently some difficulties arise.
- The Casimir operator plays the role of the Laplacian on a Lie algebra, and its eigenelements play a role in harmonic analysis of these spaces similar to the role played by eigenfunctions of the Laplacian in standard harmonic analysis.
- The hypoelliptic Laplacian does… something. Honestly I was hopelessly lost ten minutes into this presentation.
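To make the adjoint-of-the-gradient construction mentioned above concrete in the simplest (graph) case, here is the standard bookkeeping, in notation of my own choosing. Orient each edge arbitrarily and let \(B\) be the \(|E|\times|V|\) incidence matrix, with \(B_{e,i}=-1\) and \(B_{e,j}=+1\) for an edge \(e\) running from \(i\) to \(j\). The graph gradient of a function \(f\) on the vertices is the edge function \((df)_e = f_j - f_i\), i.e. \(d = B\); with the standard inner products on vertex and edge functions, the divergence is the adjoint \(\delta = B^{\top}\), and
\[
\Delta_0 \;=\; \delta d \;=\; B^{\top} B \;=\; D - A ,
\]
independently of the chosen orientation. Replacing \(d\) and its adjoint by the coboundary and boundary maps of a simplicial complex gives the Hodge Laplacian \(\Delta_k = d_k^{*} d_k + d_{k-1} d_{k-1}^{*}\) on \(k\)-cochains, of which the graph Laplacian is the \(k=0\) case (where the second term vanishes).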
Applications: How the Laplacian is Used
While Laplacians can be described for many spaces, data usually takes the form of a collection of points which may or may not be known to lie in (or near) the space of interest, which often must be estimated. To determine this space or functions on it, many procedures start by estimating the Laplacian or some of its properties, in particular its eigenvalues and eigenfunctions, and possibly some functions constructed from these, and these objects are taken to define the space of interest.
The way this works is as follows: from a set of points, if they do not already have a known graph structure, one can construct a (possibly weighted) graph by connecting nearby points (in some metric, usually Euclidean). One then takes the graph Laplacian of this graph and calculates its eigenvectors. Under some standard conditions, if more and more points are drawn from a compact submanifold in Euclidean space, the graph Laplacian converges to the Laplace-Beltrami operator of the manifold and the eigenvectors converge to its eigenfunctions. The proofs of this are remarkably straightforward, essentially copying convergence results for kernel density estimators.
When the manifold to be considered has intrinsic dimension smaller than that of the ambient space, it can be well described by only the first few eigenvectors (those corresponding to the smallest eigenvalues), which define a coordinate system over the manifold. In the linear case, this is the principle of multi-dimensional scaling; in the nonlinear case, it is the idea behind Laplacian Eigenmaps and Diffusion Maps as manifold estimation methods. In defining a coordinate system on the manifold or graph, these vectors can also be used to parameterize distance in the data set, and so give a sense of which parts of the data are closely connected.
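A minimal end-to-end sketch of this pipeline, under illustrative choices of my own (points sampled from a circle, a dense Gaussian-weight graph with a hand-picked bandwidth; real implementations typically use a sparse nearest-neighbor graph and a normalized Laplacian):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points from a simple one-dimensional submanifold of R^2: the unit circle.
theta = np.sort(rng.uniform(0, 2 * np.pi, size=300))
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Connect nearby points with Gaussian weights; eps is a bandwidth tuning choice.
eps = 0.1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / eps)
np.fill_diagonal(W, 0.0)

# Weighted graph Laplacian L = D - W.  As more points are drawn (and eps shrinks
# at a suitable rate), rescaled versions of L approximate the Laplace-Beltrami
# operator of the circle.
D = np.diag(W.sum(axis=1))
L = D - W

# Eigendecomposition: eigenvalues come back in ascending order, and the first
# eigenvector is constant (eigenvalue zero) for a connected graph.
eigvals, eigvecs = np.linalg.eigh(L)

# Laplacian Eigenmaps embedding: skip the constant eigenvector and use the next
# few as intrinsic coordinates.  Here the two-dimensional embedding should
# approximately recover the circle, i.e. the angle parameterization, up to
# rotation and scale.
embedding = eigvecs[:, 1:3]
```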
One of the most popular uses of this measure is to detect related communities or clusters in a data set. Spectral clustering takes the eigenvectors corresponding to the smallest eigenvalues (for two groups, the eigenvector for the second-smallest eigenvalue) of the (possibly weighted or perturbed, depending on the estimation method) symmetrized graph Laplacian \(D^{-1/2} L D^{-1/2}\) and partitions the points based on them (via any standard clustering algorithm, usually k-means). If the graph has multiple connected components, this recovers them; there are also consistency guarantees for recovering communities which are not completely disconnected, such as under the stochastic block model, or with respect to a variety of clustering measures.
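A bare-bones sketch of the recipe (the function name, defaults, and simple k-means loop are illustrative rather than any library's API; it could be applied, for example, to the weight matrix W constructed above):

```python
import numpy as np

def spectral_clusters(W, k=2, n_iter=50, seed=0):
    """Toy spectral clustering of a weighted adjacency matrix W: a sketch of the
    standard recipe, not any particular package's implementation."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Symmetrized Laplacian D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]                      # eigenvectors for the k smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)  # row-normalize
    # A bare-bones k-means on the rows of U.
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels
```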
Distance measures also give a notion of “local” which is intrinsic to the structure of the data and possibly lower dimensional than the ambient space, which makes them useful for describing what it means for a function on the data to be continuous or smooth: it should not vary too much over small distances. Since the Laplacian is a differential operator, it acts on differentiable functions, and its eigenfunctions are differentiable and can be taken to define what it means to be smooth over a graph or manifold. The sequence of eigenfunctions then provides a Fourier-type basis which can be used to approximate arbitrary smooth functions over the space on which the data concentrates. One may also decompose these functions into local components at multiple scales to generate versions of wavelet analysis. On a known manifold, such a construction generates needlets, on a fixed graph one can produce graph wavelets, and using the estimates of the manifold provided by diffusion maps one can generate diffusion wavelets. These can then be used like conventional wavelets for smooth function and density estimation and compression.
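As a sketch of the simplest Fourier-type use of this basis, low-pass smoothing of a noisy signal on a graph keeps only the components along the low-frequency eigenvectors (the function and the hard truncation rule here are mine; wavelet constructions replace the truncation with localized, multi-scale filters):

```python
import numpy as np

def laplacian_smooth(L, y, n_modes=10):
    """Graph-Fourier smoothing (a sketch): expand a noisy signal y, one value per
    node, in the eigenbasis of the Laplacian L and keep only the n_modes
    lowest-frequency components."""
    _, eigvecs = np.linalg.eigh(L)    # eigenvalues ascending, so first columns are smoothest
    U = eigvecs[:, :n_modes]          # low-frequency basis functions
    coeffs = U.T @ y                  # graph Fourier coefficients
    return U @ coeffs                 # low-pass reconstruction
```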
So why does it work?
One reason the Laplacian appears to be useful for data analysis is that it appears to be useful for tons of things: it is a core object in a number of mathematical subjects and captures the relationship between them. Each perspective provides a different justification for the Laplacian, and also suggests directions for generalization or improvement.
- Functional Analytic
One way to see the Laplacian is that it is a particular example of a linear operator on a function space, with some nice properties (it is self-adjoint) and some more peculiar properties (it is unbounded, acting continuously only on twice-differentiable functions). This small domain might also be an advantage, since it defines a subset of the class of all functions which is small4 but dense. This space, in the Euclidean case, is a classical Sobolev space. More usefully, the spectral decomposition of this operator, by self-adjointness, generates an orthonormal eigenbasis whose elements can provide uniformly accurate finite approximations of functions in the space. In the 1-dimensional periodic case this is the Fourier basis, and the eigenvalues order the basis functions by frequency. This can be seen by noting that the second derivative of a sine or cosine is again a sine or cosine, up to sign and scale. This fact can be taken as a definition of the Fourier transform, and provides one means of generalizing the transform to more general spaces like graphs or manifolds.5 Given the huge set of things one can do with a Fourier transform (filtering, wavelet analysis, sieve estimation, fast algorithms), the utility here is obvious.
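To spell out the periodic case: on the circle, \(\Delta e^{ikx} = -k^2 e^{ikx}\), so the complex exponentials (equivalently, sines and cosines) are exactly the eigenfunctions, the eigenvalue records squared frequency (up to sign conventions), and the Fourier coefficients of \(f\) are just its coordinates \(\langle f, e^{ikx}\rangle\) in this eigenbasis. The same recipe is what is meant by a graph Fourier transform: with eigenpairs \(L u_\ell = \lambda_\ell u_\ell\), \(0 = \lambda_1 \le \lambda_2 \le \dots\) (the graph Laplacian \(D-A\) corresponds to \(-\Delta\), so its eigenvalues are nonnegative), one defines \(\hat{f}(\ell) := \langle f, u_\ell\rangle\) and treats \(\lambda_\ell\) as a squared frequency.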
The question which this approach raises is what is special about the Laplacian, as opposed to some other operator which could generate an ordered orthonormal basis of eigenfunctions. Here I don’t have the greatest intuition: certainly the use of a differential operator encodes smoothness in a way which seems natural, the idea being that “local changes are not too large” translates into the second derivative not being excessively large (on average over the space).
Another heuristic justification for this kind of choice is that the derivative gives local information and so is well-suited to applications based on data: one will never obtain the value of a function except at a finite number of points, so if one wants global information, it is necessary to impose a structure which allows interpolating between points. However, this principle at best shows that a differential structure is sufficient, rather than necessary: the ability to interpolate or estimate functions from points is controlled more by the size of the function space (according to some complexity measure) than by smoothness per se, though the function class must be able to eventually distinguish functions based on evaluation at a finite but growing number of points.
One aspect of this space beyond low complexity which could be useful for data analysis is that it forms a reproducing kernel Hilbert space. In principle this indicates that estimation, interpolation, and inverse problem solution can take advantage of the explosion of computationally convenient procedures for spaces of this type based on Kimeldorf and Wahba’s representer theorem and Vapnik’s kernel trick that form the key ideas behind smoothing splines, support vector machines, and related kernel methods. In practice, these tools are mainly useful if one can easily calculate the reproducing kernel, which is why practitioners tend to rely on ad hoc choices like Gaussian, exponential, or polynomial kernels rather than the kernel induced by the Sobolev space. While it is possible to find this object for a specific space by solving a set of PDEs (Wahba does this for classical Sobolev spaces on the interval and on the sphere in her classic book on splines), it is again easier to apply methods which induce an easily computable kernel function, corresponding to tractable deformed versions of the Laplacian.
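As an illustration of the last point, one tractable Laplacian-induced kernel on a graph is the regularized Laplacian kernel \(K = (I + \gamma L)^{-1}\); here is a sketch of kernel ridge regression with it for filling in a function observed at only some nodes (the function name, arguments, and defaults are illustrative, not any package's API):

```python
import numpy as np

def laplacian_kernel_ridge(L, y_observed, observed_idx, gamma=1.0, ridge=1e-3):
    """A sketch of regression on a graph with a Laplacian-induced kernel.

    K = (I + gamma * L)^{-1} is one standard choice of kernel whose associated
    norm penalizes functions that vary a lot across edges; kernel ridge
    regression with it (via the representer theorem) predicts values at
    unobserved nodes from the observed ones.
    """
    n = L.shape[0]
    K = np.linalg.inv(np.eye(n) + gamma * L)           # kernel between all pairs of nodes
    K_oo = K[np.ix_(observed_idx, observed_idx)]       # observed-observed block
    alpha = np.linalg.solve(K_oo + ridge * np.eye(len(observed_idx)), y_observed)
    return K[:, observed_idx] @ alpha                  # predictions at every node
```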
One property of the Fourier transform which is not in general preserved, at least for inhomogeneous spaces, is translation invariance. This is what makes Fourier analysis useful in the time-series context, preserving the features of a stationary series and allowing consistent inference from a single observed series. It might reasonably be argued, though, that if the space does not have homogeneity properties of some sort, one should not expect to be able to extrapolate outside of the range of observation, at least not without some kind of known or locally learnable global structure.6 Since the eigenfunctions of the Laplacian are defined globally, they do impose some structure on functions over areas where data is not observed, including places which are structurally different from the area of observation. On the other hand, since they are defined in terms of smoothness, these global features are less informative far away from the area of observation, and I suspect that proper uncertainty quantification would therefore reflect higher uncertainty in these regions. This is likely to be particularly true of wavelet methods, which have substantially better localization properties.