In the rapidly developing realm of data analysis on graphs and manifolds, as well as in a variety of fields of applied and maybe pure mathematics, a primary concern is characterizing the geometric structure of a space and describing the properties of the objects that live there.
A typical example of a question this field tries to answer is manifold estimation or unfolding: given a bunch of points, sampled from a (sub)manifold in a higher-dimensional space, construct a reasonable estimate of that lower-dimensional shape on which the points live.1 Though methods for the general case are rather new,2 when we have a linear subspace this is simply the familiar problem of Principal Components Analysis or Factor Analysis, dating back to the 1920s.
Alternately, we may only be interested in some limited set of geometric features of the space, such as its connected components: this is (one way to look at) the problem of clustering. If we are more ambitious, we might want to associate some type of data to the points of a space, which we may or may not know ex ante. Depending on the type of data, and on the underlying space, this goes by various names. When the space is known, it can be manifold regression, graph regression or smoothing, density estimation on manifolds or graphs, or graph signal processing. When the structure is unknown, but assumed to be approximately linear, one has reduced rank regression, principal components regression, or some category of factor model. To the extent that the assignment of the data bears some relationship to the underlying space (for example, the assignment is continuous or differentiable, or some analogue thereof), estimating this relationship must also take into account some of the geometric features of the space.
In an area well-explored in differential geometry but only beginning to be explored in geometric data analysis, the object to be related to the space of interest may itself incorporate some structure which is related to the structure of the underlying space. For example, it may live on, or itself be, some kind of fiber bundle, such as a vector bundle or tangent bundle, which describes flows over a manifold (or graph) along with their directions and speeds. So, one may consider (archetypally) wind on the surface of the earth, or messages over a network, or workers changing jobs as moving around in the space of possible occupations.3 These spaces themselves have a manifold structure which can be estimated or exploited in characterizing the data. One can likewise imagine (if you’re the kind of person who does that sort of thing) extending this kind of analysis to more general structures.
Given the rather general nature of these spaces, one has many options for describing their geometric properties. A graph is defined in terms of its nodes and edges and a manifold in terms of its charts and transition maps, but these are hard to work with and we would like an object from which we can easily compute whatever properties we need of the space. Some space of objects isomorphic to (or, in a pinch, surjective onto) the space of all (relevant) manifolds or graphs seems like the ideal choice, and if one can easily extract summaries, in some sense, even better.
What are Laplacians?
While not universal, for a large portion of the people working in this field, the object of choice is overwhelmingly the Laplacian, denoted \(\Delta\) or \(\nabla \cdot \nabla\) or \(\nabla^2\), a differential (or, for graphs, difference) operator over functions defined on the space. This is usually defined as the divergence of the gradient: for a Euclidean space \(\mathbb{R}^n\), this is simply the sum of the second partial derivatives in each direction, \(\Delta=\sum_{i=1}^n \partial_i^2\). Over a graph, if D is the diagonal matrix with the number of edges at each node (its degree) on the diagonal, with the nodes in some fixed but arbitrary order, and A is the adjacency matrix, with a 1 in position (i,j) if there is an edge between nodes i and j and a 0 otherwise (including on the diagonal), the Laplacian is simply \(L:=D-A\). While in the graph case it is easy to see that this object contains all the information in the structure of the graph, why it should take the form of a linear operator, and why this particular linear operator, always seemed a bit mysterious to me. So, I thought it would be worthwhile to lay out my thoughts on different ways of looking at this object that might explain how it’s used and why it takes pride of place in structured data analysis.
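To make the graph version concrete, here is a minimal sketch (in Python with NumPy; the language choice and the small path-graph example are mine, not anything from the sources above) that builds \(L = D - A\) and checks the identity \(f^\top L f = \sum_{(i,j)\in E}(f_i - f_j)^2\), one way to see that the Laplacian measures how much a function varies across edges:

```python
import numpy as np

# A small example graph: a path on four nodes, 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

# Adjacency matrix A and degree matrix D.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))

# The graph Laplacian.
L = D - A

# Sanity checks: L is symmetric, its rows sum to zero (constant functions are in
# its null space), and its quadratic form sums squared differences over edges,
# which is one way to see how it measures the "roughness" of a function f.
f = np.array([1.0, 2.0, 0.0, 5.0])
assert np.allclose(L, L.T)
assert np.allclose(L.sum(axis=1), 0.0)
assert np.isclose(f @ L @ f, sum((f[i] - f[j]) ** 2 for i, j in edges))
```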
One issue is that there are many different definitions of the Laplacian, because there are many different definitions of a derivative, gradient, and divergence, depending on the space in which they act and other desiderata. Given some definition of a gradient, the divergence is usually defined as its adjoint, with the inner product determined by the space, and so many cases are identical up to the choice of space and derivative. For example, the graph case corresponds to taking differences over connected vertices (a worked graph version of this adjoint construction is sketched after the list below). Over manifolds, where the derivative can no longer be represented as an element of the space, there are at least two common formulations:
- On Riemannian manifolds, the Laplace-Beltrami operator (or connection Laplacian) is defined in terms of the Levi-Civita connection, the metric-compatible covariant derivative on the tangent bundle, or equivalently as the trace of the Hessian (itself defined in terms of covariant derivatives).
- Using instead the exterior derivative \(d\) and codifferential \(\delta\), which act on the exterior algebra of differential forms of a space, one can construct the Laplace-de Rham operator (or Hodge Laplacian) on differential forms as \(d\delta + \delta d\).
- If the graph Laplacian is seen as generalizing the Laplace-Beltrami operator to graphs, the combinatorial Laplacian can be seen as generalizing the Laplace-de Rham operator to simplicial complexes, higher-order structures over graphs, with the exterior derivative and codifferential replaced by coboundary and boundary maps, respectively, taking differences among faces connected by a coface or among cofaces connected by a face.
- Malliavin calculus induces a gradient on functionals of stochastic processes, the Malliavin derivative, and a corresponding adjoint, the Skorohod integral, familiar in the case of adapted processes as the Itō integral. Successive application yields a stochastic analogue of the Laplacian, the Ornstein-Uhlenbeck operator, so named because it is the infinitesimal generator of a standard Ornstein-Uhlenbeck process, the “continuous AR(1)”.
- For stochastic processes defined over more general manifolds, one may generalize the “usual” Malliavin derivative and OU operator analogously to how one generalizes the Euclidean derivative to manifolds, though apparently some difficulties arise.
- The Casimir operator plays the role of the Laplacian on a Lie algebra, and its eigenelements play a role in harmonic analysis of these spaces similar to the role played by eigenfunctions of the Laplacian in standard harmonic analysis.
- The hypoelliptic Laplacian does… something. Honestly I was hopelessly lost ten minutes into this presentation.
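To make the adjoint-of-the-gradient construction mentioned above concrete in the simplest (graph) case, here is the standard bookkeeping, in notation of my own choosing. Orient each edge arbitrarily and let \(B\) be the \(|E|\times|V|\) incidence matrix, with \(B_{e,i}=-1\) and \(B_{e,j}=+1\) for an edge \(e\) running from \(i\) to \(j\). The graph gradient of a function \(f\) on the vertices is the edge function \((df)_e = f_j - f_i\), i.e. \(d = B\); with the standard inner products on vertex and edge functions, the divergence is the adjoint \(\delta = B^{\top}\), and
\[
\Delta_0 \;=\; \delta d \;=\; B^{\top} B \;=\; D - A ,
\]
independently of the chosen orientation. Replacing \(d\) and its adjoint by the coboundary and boundary maps of a simplicial complex gives the Hodge Laplacian \(\Delta_k = d_k^{*} d_k + d_{k-1} d_{k-1}^{*}\) on \(k\)-cochains, of which the graph Laplacian is the \(k=0\) case (where the second term vanishes).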
Applications: How the Laplacian is Used
While Laplacians can be described for many spaces, data usually takes the form of a collection of points which may or may not be known to lie in (or near) the space of interest, which often must be estimated. To determine this space or functions on it, many procedures start by estimating the Laplacian or some of its properties, in particular its eigenvalues and eigenfunctions, and possibly some functions constructed from these, and these objects are taken to define the space of interest.
The way this works is as follows: from a set of points, if they do not already have a known graph structure, one can construct a (possibly weighted) graph by connecting nearby points (in some metric, usually Euclidean). One then takes the graph Laplacian of this graph and calculates its eigenvectors. Under some standard conditions, if more and more points are drawn from a compact submanifold in Euclidean space, the graph Laplacian converges to the Laplace-Beltrami operator of the manifold and the eigenvectors converge to its eigenfunctions. The proofs of this are remarkably straightforward, essentially copying convergence results for kernel density estimators.
When the manifold to be considered has intrinsic dimension smaller than that of the ambient space, it can be well described by only the first few eigenvectors (those corresponding to the smallest eigenvalues), which define a coordinate system over the manifold. In the linear case, this is the principle of multi-dimensional scaling; in the nonlinear case, it is the idea behind Laplacian Eigenmaps and Diffusion Maps as manifold estimation methods. In defining a coordinate system on the manifold or graph, these vectors can also be used to parameterize distance in the data set, and so give a sense of which parts of the data are closely connected.
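A minimal end-to-end sketch of this pipeline, under illustrative choices of my own (points sampled from a circle, a dense Gaussian-weight graph with a hand-picked bandwidth; real implementations typically use a sparse nearest-neighbor graph and a normalized Laplacian):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points from a simple one-dimensional submanifold of R^2: the unit circle.
theta = np.sort(rng.uniform(0, 2 * np.pi, size=300))
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Connect nearby points with Gaussian weights; eps is a bandwidth tuning choice.
eps = 0.1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / eps)
np.fill_diagonal(W, 0.0)

# Weighted graph Laplacian L = D - W.  As more points are drawn (and eps shrinks
# at a suitable rate), rescaled versions of L approximate the Laplace-Beltrami
# operator of the circle.
D = np.diag(W.sum(axis=1))
L = D - W

# Eigendecomposition: eigenvalues come back in ascending order, and the first
# eigenvector is constant (eigenvalue zero) for a connected graph.
eigvals, eigvecs = np.linalg.eigh(L)

# Laplacian Eigenmaps embedding: skip the constant eigenvector and use the next
# few as intrinsic coordinates.  Here the two-dimensional embedding should
# approximately recover the circle, i.e. the angle parameterization, up to
# rotation and scale.
embedding = eigvecs[:, 1:3]
```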
One of the most popular uses of this measure is to detect related communities or clusters in a data set. Spectral clustering takes the eigenvectors corresponding to the smallest eigenvalues (for two groups, the eigenvector for the second-smallest eigenvalue) of the (possibly weighted or perturbed, depending on the estimation method) symmetrized graph Laplacian \(D^{-1/2} L D^{-1/2}\) and partitions the points based on them (via any standard clustering algorithm, usually k-means). If the graph has multiple connected components, this recovers them; there are also consistency guarantees for recovering communities which are not completely disconnected, such as under the stochastic block model, or with respect to a variety of clustering measures.
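A bare-bones sketch of the recipe (the function name, defaults, and simple k-means loop are illustrative rather than any library's API; it could be applied, for example, to the weight matrix W constructed above):

```python
import numpy as np

def spectral_clusters(W, k=2, n_iter=50, seed=0):
    """Toy spectral clustering of a weighted adjacency matrix W: a sketch of the
    standard recipe, not any particular package's implementation."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Symmetrized Laplacian D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]                      # eigenvectors for the k smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)  # row-normalize
    # A bare-bones k-means on the rows of U.
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels
```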
Distance measures also give a notion of “local” which is intrinsic to the structure of the data and possibly lower dimensional than the ambient space, which makes them useful for describing what it means for a function on the data to be continuous or smooth: it should not vary too much over small distances. Since the Laplacian is a differential operator, it acts on differentiable functions, and its eigenfunctions are differentiable and can be taken to define what it means to be smooth over a graph or manifold. The sequence of eigenfunctions then provides a Fourier-type basis which can be used to approximate arbitrary smooth functions over the space on which the data concentrates. One may also decompose these functions into local components at multiple scales to generate versions of wavelet analysis. On a known manifold, such a construction generates needlets, on a fixed graph one can produce graph wavelets, and using the estimates of the manifold provided by diffusion maps one can generate diffusion wavelets. These can then be used like conventional wavelets for smooth function and density estimation and compression.
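As a sketch of the simplest Fourier-type use of this basis, low-pass smoothing of a noisy signal on a graph keeps only the components along the low-frequency eigenvectors (the function and the hard truncation rule here are mine; wavelet constructions replace the truncation with localized, multi-scale filters):

```python
import numpy as np

def laplacian_smooth(L, y, n_modes=10):
    """Graph-Fourier smoothing (a sketch): expand a noisy signal y, one value per
    node, in the eigenbasis of the Laplacian L and keep only the n_modes
    lowest-frequency components."""
    _, eigvecs = np.linalg.eigh(L)    # eigenvalues ascending, so first columns are smoothest
    U = eigvecs[:, :n_modes]          # low-frequency basis functions
    coeffs = U.T @ y                  # graph Fourier coefficients
    return U @ coeffs                 # low-pass reconstruction
```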
So why does it work?
One reason the Laplacian appears to be useful for data analysis is that it appears to be useful for tons of things: it is a core object in a number of mathematical subjects and captures the relationship between them. Each perspective provides a different justification for the Laplacian, and also suggests directions for generalization or improvement.
- Functional Analytic
One way to see the Laplacian is that it is a particular example of a linear operator on a function space, with some nice properties (it is self-adjoint) and some more peculiar properties (it is unbounded, acting continuously only on twice-differentiable functions). This small domain might also be an advantage, since it defines a subset of the class of all functions which is small4 but dense. This space, in the Euclidean case, is a classical Sobolev space. More usefully, the spectral decomposition of this operator, by self-adjointness, generates an orthonormal eigenbasis whose elements can provide uniformly accurate finite approximations of functions in the space. In the 1-dimensional periodic case this is the Fourier basis, and the eigenvalues order the basis functions by frequency. This can be seen by noting that the second derivative of a sine or cosine is again a sine or cosine, up to sign and scale. This fact can be taken as a definition of the Fourier transform, and provides one means of generalizing the transform to more general spaces like graphs or manifolds.5 Given the huge set of things one can do with a Fourier transform (filtering, wavelet analysis, sieve estimation, fast algorithms), the utility here is obvious.
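To spell out the periodic case: on the circle, \(\Delta e^{ikx} = -k^2 e^{ikx}\), so the complex exponentials (equivalently, sines and cosines) are exactly the eigenfunctions, the eigenvalue records squared frequency (up to sign conventions), and the Fourier coefficients of \(f\) are just its coordinates \(\langle f, e^{ikx}\rangle\) in this eigenbasis. The same recipe is what is meant by a graph Fourier transform: with eigenpairs \(L u_\ell = \lambda_\ell u_\ell\), \(0 = \lambda_1 \le \lambda_2 \le \dots\) (the graph Laplacian \(D-A\) corresponds to \(-\Delta\), so its eigenvalues are nonnegative), one defines \(\hat{f}(\ell) := \langle f, u_\ell\rangle\) and treats \(\lambda_\ell\) as a squared frequency.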
The question which this approach raises is what is special about the Laplacian, as opposed to some other operator which could generate an ordered orthonormal basis of eigenfunctions. Here I don’t have the greatest intuition: certainly the use of a differential operator encodes smoothness in a way which seems natural, the idea being that “local changes are not too large” translates into the second derivative not being excessively large (on average over the space).
Another heuristic justification for this kind of choice is that the derivative gives local information and so is well-suited to applications based on data: one will never obtain the value of a function except at a finite number of points, so if one wants global information, it is necessary to impose a structure which allows interpolating between points. However, this principle at best shows that a differential structure is sufficient, rather than necessary: the ability to interpolate or estimate functions from points is controlled more by the size of the function space (according to some complexity measure) than by smoothness per se, though the function class must be able to eventually distinguish functions based on evaluation at a finite but growing number of points.
One aspect of this space beyond low complexity which could be useful for data analysis is that it forms a reproducing kernel Hilbert space. In principle this indicates that estimation, interpolation, and inverse problem solution can take advantage of the explosion of computationally convenient procedures for spaces of this type based on Kimeldorf and Wahba’s representer theorem and Vapnik’s kernel trick that form the key ideas behind smoothing splines, support vector machines, and related kernel methods. In practice, these tools are mainly useful if one can easily calculate the reproducing kernel, which is why practitioners tend to rely on ad hoc choices like Gaussian, exponential, or polynomial kernels rather than the kernel induced by the Sobolev space. While it is possible to find this object for a specific space by solving a set of PDEs (Wahba does this for classical Sobolev spaces on the interval and on the sphere in her classic book on splines), it is again easier to apply methods which induce an easily computable kernel function, corresponding to tractable deformed versions of the Laplacian.
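As an illustration of the last point, one tractable Laplacian-induced kernel on a graph is the regularized Laplacian kernel \(K = (I + \gamma L)^{-1}\); here is a sketch of kernel ridge regression with it for filling in a function observed at only some nodes (the function name, arguments, and defaults are illustrative, not any package's API):

```python
import numpy as np

def laplacian_kernel_ridge(L, y_observed, observed_idx, gamma=1.0, ridge=1e-3):
    """A sketch of regression on a graph with a Laplacian-induced kernel.

    K = (I + gamma * L)^{-1} is one standard choice of kernel whose associated
    norm penalizes functions that vary a lot across edges; kernel ridge
    regression with it (via the representer theorem) predicts values at
    unobserved nodes from the observed ones.
    """
    n = L.shape[0]
    K = np.linalg.inv(np.eye(n) + gamma * L)           # kernel between all pairs of nodes
    K_oo = K[np.ix_(observed_idx, observed_idx)]       # observed-observed block
    alpha = np.linalg.solve(K_oo + ridge * np.eye(len(observed_idx)), y_observed)
    return K[:, observed_idx] @ alpha                  # predictions at every node
```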
One property of the Fourier transform which is not in general preserved, at least for inhomogeneous spaces, is translation invariance. This is what makes Fourier analysis useful in the time-series context, preserving the features of a stationary series and allowing consistent inference from a single observed series. It might reasonably be argued, though, that if the space does not have homogeneity properties of some sort, one should not expect to be able to extrapolate outside of the range of observation, at least not without some kind of known or locally learnable global structure.6 Since the eigenfunctions of the Laplacian are defined globally, they do impose some structure on functions over areas where data is not observed, including places which are structurally different from the area of observation. On the other hand, since they are defined in terms of smoothness, these global features are less informative far away from the area of observation, and I suspect that proper uncertainty quantification would therefore reflect higher uncertainty in these regions. This is likely to be particularly true of wavelet methods, which have substantially better localization properties.