We introduce the concept of Hypoelliptic Diffusion Maps (HDM), a framework generalizing Diffusion Maps in the context of manifold learning and dimensionality reduction. Standard non-linear dimensionality reduction methods (e.g., LLE, ISOMAP, Laplacian Eigenmaps, Diffusion Maps) focus on mining massive data sets using weighted affinity graphs; Orientable Diffusion Maps and Vector Diffusion Maps enrich these graphs by also attaching local geometric information to each node. HDM likewise considers a scenario in which each node carries additional structure, which is now itself of interest to investigate. In effect, HDM augments the original data set with the attached structures and provides tools for studying and organizing the augmented ensemble. The goal is to obtain information both on the individual structures attached to the nodes and on the relationships between structures attached to nearby nodes, so as to study the underlying manifold from which the nodes are sampled. In this paper, we analyze HDM on tangent bundles, revealing its intimate connection with sub-Riemannian geometry and a family of hypoelliptic differential operators. In a later paper, we shall consider more general fibre bundles.
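As background for what HDM generalizes, the following is a minimal sketch of the classical diffusion-maps embedding on a plain point cloud with no attached structures. The function name, the bandwidth eps, and the normalisation details are illustrative choices, not the paper's construction.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2, t=1):
    """Classical diffusion maps on a point cloud X of shape (n, dim) -- the
    baseline construction that HDM augments with per-node structures."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / eps)                       # Gaussian affinity graph
    d = W.sum(axis=1)
    # symmetric conjugate of the random-walk operator D^{-1} W
    S = W / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort in descending order
    psi = evecs / np.sqrt(d)[:, None]           # right eigenvectors of D^{-1} W
    # drop the trivial top eigenvector; scale coordinates by eigenvalue^t
    return psi[:, 1:n_coords + 1] * evals[1:n_coords + 1] ** t
```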
Kernel-based non-linear dimensionality reduction methods, such as Locally Linear Embedding (LLE) and Laplacian Eigenmaps, rely heavily upon pairwise distances or similarity scores, with which one can construct and study a weighted graph associated with the dataset. When each individual data object carries additional structural details, however, the correspondence relations between these structures provide extra information that can be leveraged when studying the dataset through the graph. Based on this observation, we generalize Diffusion Maps (DM) in manifold learning and introduce the framework of Horizontal Diffusion Maps (HDM). We model a dataset with pairwise structural correspondences as a fibre bundle equipped with a connection. We demonstrate the advantage of incorporating such additional information and study the asymptotic behavior of HDM on general fibre bundles. In a broader context, HDM reveals the sub-Riemannian structure of high-dimensional datasets, and provides a nonparametric learning framework for datasets with structural correspondences.
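To make the idea of leveraging pairwise structural correspondences concrete, the sketch below assembles a block affinity matrix whose (i, j) block couples the scalar weight between nodes i and j with the correspondence matrix between their attached structures, in the spirit of Vector Diffusion Maps. It is only a stand-in: the actual HDM kernel on a fibre bundle with a connection is more general than this construction.

```python
import numpy as np

def coupled_affinity(W, C):
    """Assemble a block affinity matrix whose (i, j) block is W[i, j] * C[i][j].

    W : (n, n) scalar affinities between base points (e.g. Gaussian weights).
    C : n x n nested list of (d, d) correspondence matrices between the
        structures attached to nodes i and j (ideally C[i][j] == C[j][i].T).
    Illustrative only; the (n*d, n*d) result can be normalised and
    eigendecomposed in the same way as an ordinary affinity matrix.
    """
    n = W.shape[0]
    d = C[0][0].shape[0]
    S = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(n):
            S[i * d:(i + 1) * d, j * d:(j + 1) * d] = W[i, j] * C[i][j]
    return S
```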
Diffusion maps is a manifold learning algorithm widely used for dimensionality reduction. Using a sample from a distribution, it approximates the eigenvalues and eigenfunctions of associated Laplace-Beltrami operators. Theoretical bounds on the approximation error, however, are generally much weaker than the rates seen in practice. This paper uses new approaches to improve the error bounds in the model case where the distribution is supported on a hypertorus. For the data sampling (variance) component of the error, we make spatially localised compact embedding estimates on certain Hardy spaces; we study the deterministic (bias) component as a perturbation of the PDE associated with the Laplace-Beltrami operator, and apply relevant spectral stability results. Using these approaches, we match long-standing pointwise error bounds for both the spectral data and the norm convergence of the operator discretisation. We also introduce an alternative normalisation for diffusion maps based on Sinkhorn weights. This normalisation approximates a Langevin diffusion on the sample and yields a symmetric operator approximation. We prove that it has better convergence compared with the standard normalisation on flat domains, and present a highly efficient algorithm to compute the Sinkhorn weights.
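For readers who want to experiment, here is a minimal sketch of the symmetric Sinkhorn normalisation idea: scale a Gaussian kernel matrix by weights so that it becomes (approximately) doubly stochastic, and then work with the resulting symmetric operator. The damped fixed-point iteration below is purely illustrative and is not the efficient algorithm referred to in the abstract.

```python
import numpy as np

def sinkhorn_weights(K, n_iter=500, tol=1e-12):
    """Find d > 0 with diag(d) @ K @ diag(d) approximately doubly stochastic."""
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (K @ d))      # damped symmetric Sinkhorn step
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# toy usage on a flat domain: build a Gaussian kernel, balance it, and
# eigendecompose the resulting symmetric operator
X = np.random.default_rng(0).uniform(size=(300, 2))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2 / 0.05)
w = sinkhorn_weights(K)
A = w[:, None] * K * w[None, :]           # symmetric, rows/columns sum to ~1
evals, evecs = np.linalg.eigh(A)
```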
We define and study natural $\mathrm{SU}(2)$-structures, in the sense of Conti-Salamon, on the total space $\mathcal{S}$ of the tangent sphere bundle of any given oriented Riemannian 3-manifold $M$. We make use of a fundamental exterior differential system of Riemannian geometry. Essentially, two types of structure arise: the contact-hypo and the non-contact, and for each we study the conditions for being hypo, nearly-hypo or double-hypo. We discover new double-hypo structures on $S^3\times S^2$, of which the well-known Sasaki-Einstein structures are a particular case. Examples from hyperbolic geometry also appear. In the search for the associated metrics, we establish a theorem, applicable to all $\mathrm{SU}(2)$-structures in general, which is useful for determining the metric explicitly. Within our application to tangent sphere bundles, we discover a whole new class of metrics specific to 3-dimensional geometry. The evolution equations of Conti-Salamon are considered, leading to a new integrable $\mathrm{SU}(3)$-structure on $\mathcal{S}\times\mathbb{R}_+$ associated to any flat $M$.
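For the reader's convenience, we recall the terminology in Conti-Salamon's notation as it is usually stated in the literature: an $\mathrm{SU}(2)$-structure on a 5-manifold is determined by differential forms $(\alpha,\omega_1,\omega_2,\omega_3)$, and the structure is called hypo when
\[
  d\omega_1 = 0, \qquad d(\alpha\wedge\omega_2) = 0, \qquad d(\alpha\wedge\omega_3) = 0 .
\]
The nearly-hypo and double-hypo conditions studied in the paper are variants of these closedness equations.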
Let $(X,Y)$ be a random variable consisting of an observed feature vector $X\in\mathcal{X}$ and an unobserved class label $Y\in\{1,2,\dots,L\}$ with unknown joint distribution. In addition, let $\mathcal{D}$ be a training data set consisting of $n$ completely observed independent copies of $(X,Y)$. Usual classification procedures provide point predictors (classifiers) $\widehat{Y}(X,\mathcal{D})$ of $Y$ or estimate the conditional distribution of $Y$ given $X$. In order to quantify the certainty of classifying $X$ we propose to construct for each $\theta=1,2,\dots,L$ a p-value $\pi_{\theta}(X,\mathcal{D})$ for the null hypothesis that $Y=\theta$, treating $Y$ temporarily as a fixed parameter. In other words, the point predictor $\widehat{Y}(X,\mathcal{D})$ is replaced with a prediction region for $Y$ with a certain confidence. We argue that (i) this approach is advantageous over traditional approaches and (ii) any reasonable classifier can be modified to yield nonparametric p-values. We discuss issues such as optimality, single use and multiple use validity, as well as computational and graphical aspects.
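As an illustration of point (ii), the sketch below turns an arbitrary per-class score (for instance, a classifier's estimated class probability) into rank-based p-values and a prediction region. It is a generic stand-in in the spirit of the abstract, not necessarily the paper's exact construction.

```python
import numpy as np

def class_pvalues(score, X_train, y_train, x_new, labels):
    """Rank-based p-values for each candidate label theta.

    score(x, theta) should be larger the more typical x is of class theta
    (e.g. a classifier's estimated probability of class theta at x).
    """
    pvals = {}
    for theta in labels:
        s_class = np.array([score(x, theta)
                            for x, y in zip(X_train, y_train) if y == theta])
        s_new = score(x_new, theta)
        # treat x_new as one extra draw from class theta and rank its score
        pvals[theta] = (1 + np.sum(s_class <= s_new)) / (len(s_class) + 1)
    return pvals

def prediction_region(pvals, alpha=0.05):
    # all labels that are not rejected at level alpha
    return {theta for theta, p in pvals.items() if p > alpha}
```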
The lasso and related sparsity-inducing algorithms have been the target of substantial theoretical and applied research. Correspondingly, many results are known about their behavior for a fixed or optimally chosen tuning parameter specified up to unknown constants. In practice, however, this oracle tuning parameter is inaccessible, so one must use the data to select it. Common statistical practice is to use a variant of cross-validation for this task. However, little is known about the theoretical properties of the resulting predictions with such data-dependent methods. We consider the high-dimensional setting with random design wherein the number of predictors $p$ grows with the number of observations $n$. Under typical assumptions on the data generating process, similar to those in the literature, we recover oracle rates up to a log factor when choosing the tuning parameter with cross-validation. Under weaker conditions, when the true model is not necessarily linear, we show that the lasso remains risk consistent relative to its linear oracle. We also generalize these results to the group lasso and square-root lasso and investigate the predictive and model selection performance of cross-validation via simulation.
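For concreteness, here is a minimal simulation of the setting discussed, with the lasso penalty chosen by cross-validation via scikit-learn's LassoCV. The data-generating process and all parameter values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5                     # high-dimensional: p > n, sparse truth
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0                            # only the first s coefficients are active
y = X @ beta + 0.5 * rng.standard_normal(n)

# choose the penalty level by 10-fold cross-validation over a grid of alphas
model = LassoCV(cv=10).fit(X, y)
print("selected alpha:", model.alpha_)
print("indices of nonzero estimated coefficients:", np.flatnonzero(model.coef_))
```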