Balancing Geometry and Density: Path Distances on High-Dimensional Data

102 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Daniel McKenzie

تاريخ النشر 2020

مجال البحث الاحصاء الرياضي الهندسة المعلوماتية

والبحث باللغة English

تأليف Anna Little - Daniel McKenzie - James Murphy

التعلم الالي بنى وهياكل البيانات والخوارزميات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

New geometric and computational analyses of power-weighted shortest-path distances (PWSPDs) are presented. By illuminating the way these metrics balance density and geometry in the underlying data, we clarify their key parameters and discuss how they may be chosen in practice. Comparisons are made with related data-driven metrics, which illustrate the broader role of density in kernel-based unsupervised and semi-supervised machine learning. Computationally, we relate PWSPDs on complete weighted graphs to their analogues on weighted nearest neighbor graphs, providing high probability guarantees on their equivalence that are near-optimal. Connections with percolation theory are developed to establish estimates on the bias and variance of PWSPDs in the finite sample setting. The theoretical results are bolstered by illustrative experiments, demonstrating the versatility of PWSPDs for a wide range of data settings. Throughout the paper, our results require only that the underlying data is sampled from a low-dimensional manifold, and depend crucially on the intrinsic dimension of this manifold, rather than its ambient dimension.

قيم البحث

73 - Sandhya Saisubramanian , Sainyam Galhotra , Shlomo Zilberstein 2019

Graph clustering groups entities -- the vertices of a graph -- based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support syste ms hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of interest that signify interpretability to an end-user, by optimizing interpretability in addition to common clustering objectives. We propose a $beta$-interpretable clustering algorithm that ensures that at least $beta$ fraction of nodes in each cluster share the same feature value. The tunable parameter $beta$ is user-specified. We also present a more efficient algorithm for scenarios with $beta!=!1$ and analyze the theoretical guarantees of the two algorithms. Finally, we empirically demonstrate the benefits of our approaches in generating interpretable clusters using four real-world datasets. The interpretability of the clusters is complemented by generating simple explanations denoting the feature values of the nodes in the clusters, using frequent pattern mining.

التعلم الالي بنى وهياكل البيانات والخوارزميات التعلم الآلي

On Learning High Dimensional Structured Single Index Models

106 - Nikhil Rao , Ravi Ganti , Laura Balzano 2016

Single Index Models (SIMs) are simple yet flexible semi-parametric models for machine learning, where the response variable is modeled as a monotonic function of a linear combination of features. Estimation in this context requires learning both the feature weights and the nonlinear function that relates features to observations. While methods have been described to learn SIMs in the low dimensional regime, a method that can efficiently learn SIMs in high dimensions, and under general structural assumptions, has not been forthcoming. In this paper, we propose computationally efficient algorithms for SIM inference in high dimensions with structural constraints. Our general approach specializes to sparsity, group sparsity, and low-rank assumptions among others. Experiments show that the proposed method enjoys superior predictive performance when compared to generalized linear models, and achieves results comparable to or better than single layer feedforward neural networks with significantly less computational cost.

التعلم الالي الذكاء الاصطناعي التعلم الآلي

Post-Processing of High-Dimensional Data

178 - Mike Espig , Wolfgang Hackbusch , Alexander Litvinenko 2019

Scientific computations or measurements may result in huge volumes of data. Often these can be thought of representing a real-valued function on a high-dimensional domain, and can be conceptually arranged in the format of a tensor of high degree in s ome truncated or lossy compressed format. We look at some common post-processing tasks which are not obvious in the compressed format, as such huge data sets can not be stored in their entirety, and the value of an element is not readily accessible through simple look-up. The tasks we consider are finding the location of maximum or minimum, or minimum and maximum of a function of the data, or finding the indices of all elements in some interval --- i.e. level sets, the number of elements with a value in such a level set, the probability of an element being in a particular level set, and the mean and variance of the total collection. The algorithms to be described are fixed point iterations of particular functions of the tensor, which will then exhibit the desired result. For this, the data is considered as an element of a high degree tensor space, although in an abstract sense, the algorithms are independent of the representation of the data as a tensor. All that we require is that the data can be considered as an element of an associative, commutative algebra with an inner product. Such an algebra is isomorphic to a commutative sub-algebra of the usual matrix algebra, allowing the use of matrix algorithms to accomplish the mentioned tasks. We allow the actual computational representation to be a lossy compression, and we allow the algebra operations to be performed in an approximate fashion, so as to maintain a high compression level. One such example which we address explicitly is the representation of data as a tensor with compression in the form of a low-rank representation.

التحليل العددي بنى وهياكل البيانات والخوارزميات التحليل العددي

High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm

178 - Wenlong Mou , Yi-An Ma , Martin J. Wainwright 2019

We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with log-concave and smooth densities. The higher-order dynamics allow for more flexible discretization schemes, and we deve lop a specific method that combines splitting with more accurate integration. For a broad class of $d$-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most $varepsilon > 0$ in Wasserstein distance from the target distribution in $Oleft(frac{d^{1/4}}{ varepsilon^{1/2}} right)$ steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with $alpha$-th order smoothness, we prove that the mixing time scales as $O left(frac{d^{1/4}}{varepsilon^{1/2}} + frac{d^{1/2}}{varepsilon^{1/(alpha - 1)}} right)$.

التعلم الالي بنى وهياكل البيانات والخوارزميات التعلم الآلي

On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity

195 - Khang Le , Huy Nguyen , Tung Pham 2021

We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalence forms of the multimarginal POT problem in terms of the multima rginal optimal transport problem via novel extensions of cost tensor. The first equivalence form is derived under the assumptions that the total masses of each measure are sufficiently close while the second equivalence form does not require any conditions on these masses but at the price of more sophisticated extended cost tensor. Our proof techniques for obtaining these equivalence forms rely on novel procedures of moving mass in graph theory to push transportation plan into appropriate regions. Finally, based on the equivalence forms, we develop optimization algorithm, named ApproxMPOT algorithm, that builds upon the Sinkhorn algorithm for solving the entropic regularized multimarginal optimal transport. We demonstrate that the ApproxMPOT algorithm can approximate the optimal value of multimarginal POT problem with a computational complexity upper bound of the order $tilde{mathcal{O}}(m^3(n+1)^{m}/ varepsilon^2)$ where $varepsilon > 0$ stands for the desired tolerance.

التعلم الالي بنى وهياكل البيانات والخوارزميات التعلم الآلي