Universal Regular Conditional Distributions

70 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Anastasis Kratsios

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Anastasis Kratsios

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We introduce a general framework for approximating regular conditional distributions (RCDs). Our approximations of these RCDs are implemented by a new class of geometric deep learning models with inputs in $mathbb{R}^d$ and outputs in the Wasserstein-$1$ space $mathcal{P}_1(mathbb{R}^D)$. We find that the models built using our framework can approximate any continuous functions from $mathbb{R}^d$ to $mathcal{P}_1(mathbb{R}^D)$ uniformly on compacts, and quantitative rates are obtained. We identify two methods for avoiding the curse of dimensionality; i.e.: the number of parameters determining the approximating neural network depends only polynomially on the involved dimension and the approximation error. The first solution describes functions in $C(mathbb{R}^d,mathcal{P}_1(mathbb{R}^D))$ which can be efficiently approximated on any compact subset of $mathbb{R}^d$. Conversely, the second approach describes sets in $mathbb{R}^d$, on which any function in $C(mathbb{R}^d,mathcal{P}_1(mathbb{R}^D))$ can be efficiently approximated. Our framework is used to obtain an affirmative answer to the open conjecture of Bishop (1994); namely: mixture density networks are universal regular conditional distributions. The predictive performance of the proposed models is evaluated against comparable learning models on various probabilistic predictions tasks in the context of ELMs, model uncertainty, and heteroscedastic regression. All the results are obtained for more general input and output spaces and thus apply to geometric deep learning contexts.

قيم البحث

302 - Ryan R. Strauss , Junier B. Oliva 2021

Modeling distributions of covariates, or density estimation, is a core challenge in unsupervised learning. However, the majority of work only considers the joint distribution, which has limited utility in practical situations. A more general and usef ul problem is arbitrary conditional density estimation, which aims to model any possible conditional distribution over a set of covariates, reflecting the more realistic setting of inference based on prior knowledge. We propose a novel method, Arbitrary Conditioning with Energy (ACE), that can simultaneously estimate the distribution $p(mathbf{x}_u mid mathbf{x}_o)$ for all possible subsets of unobserved features $mathbf{x}_u$ and observed features $mathbf{x}_o$. ACE is designed to avoid unnecessary bias and complexity -- we specify densities with a highly expressive energy function and reduce the problem to only learning one-dimensional conditionals (from which more complex distributions can be recovered during inference). This results in an approach that is both simpler and higher-performing than prior methods. We show that ACE achieves state-of-the-art for arbitrary conditional likelihood estimation and data imputation on standard benchmarks.

التعلم الآلي

Non-Euclidean Universal Approximation

116 - Anastasis Kratsios , Eugene Bilokopytov 2020

Modifications to a neural networks input and output layers are often required to accommodate the specificities of most practical learning tasks. However, the impact of such changes on architectures approximation capabilities is largely not understood . We present general conditions describing feature and readout maps that preserve an architectures ability to approximate any continuous functions uniformly on compacts. As an application, we show that if an architecture is capable of universal approximation, then modifying its final layer to produce binary values creates a new architecture capable of deterministically approximating any classifier. In particular, we obtain guarantees for deep CNNs and deep feed-forward networks. Our results also have consequences within the scope of geometric deep learning. Specifically, when the input and output spaces are Cartan-Hadamard manifolds, we obtain geometrically meaningful feature and readout maps satisfying our criteria. Consequently, commonly used non-Euclidean regression models between spaces of symmetric positive definite matrices are extended to universal DNNs. The same result allows us to show that the hyperbolic feed-forward networks, used for hierarchical learning, are universal. Our result is also used to show that the common practice of randomizing all but the last two layers of a DNN produces a universal family of functions with probability one. We also provide conditions on a DNNs first (resp. last) few layers connections and activation function which guarantee that these layers can have a width equal to the input (resp. output) spaces dimension while not negatively affecting the architectures approximation capabilities.

التعلم الآلي الحوسبة العصبية والتطورية الهندسة التفاضلية

A Universal Kernel for Learning Regular Languages

631 - Leonid Kontorovich 2007

We give a universal kernel that renders all the regular languages linearly separable. We are not able to compute this kernel efficiently and conjecture that it is intractable, but we do have an efficient $eps$-approximation.

التعلم الآلي الرياضيات المتقطعة

Learning from Conditional Distributions via Dual Embeddings

161 - Bo Dai , Niao He , Yunpeng Pan 2016

Many machine learning tasks, such as learning with invariance and policy evaluation in reinforcement learning, can be characterized as problems of learning from conditional distributions. In such problems, each sample $x$ itself is associated with a conditional distribution $p(z|x)$ represented by samples ${z_i}_{i=1}^M$, and the goal is to learn a function $f$ that links these conditional distributions to target values $y$. These learning problems become very challenging when we only have limited samples or in the extreme case only one sample from each conditional distribution. Commonly used approaches either assume that $z$ is independent of $x$, or require an overwhelmingly large samples from each conditional distribution. To address these challenges, we propose a novel approach which employs a new min-max reformulation of the learning from conditional distribution problem. With such new reformulation, we only need to deal with the joint distribution $p(z,x)$. We also design an efficient learning algorithm, Embedding-SGD, and establish theoretical sample complexity for such problems. Finally, our numerical experiments on both synthetic and real-world datasets show that the proposed approach can significantly improve over the existing algorithms.

التعلم الآلي التحسين والتحكم التعلم الالي

DARB: A Density-Aware Regular-Block Pruning for Deep Neural Networks

144 - Ao Ren , Tao Zhang , Yuhao Wang 2019

The rapidly growing parameter volume of deep neural networks (DNNs) hinders the artificial intelligence applications on resource constrained devices, such as mobile and wearable devices. Neural network pruning, as one of the mainstream model compress ion techniques, is under extensive study to reduce the number of parameters and computations. In contrast to irregular pruning that incurs high index storage and decoding overhead, structured pruning techniques have been proposed as the promising solutions. However, prior studies on structured pruning tackle the problem mainly from the perspective of facilitating hardware implementation, without analyzing the characteristics of sparse neural networks. The neglect on the study of sparse neural networks causes inefficient trade-off between regularity and pruning ratio. Consequently, the potential of structurally pruning neural networks is not sufficiently mined. In this work, we examine the structural characteristics of the irregularly pruned weight matrices, such as the diverse redundancy of different rows, the sensitivity of different rows to pruning, and the positional characteristics of retained weights. By leveraging the gained insights as a guidance, we first propose the novel block-max weight masking (BMWM) method, which can effectively retain the salient weights while imposing high regularity to the weight matrix. As a further optimization, we propose a density-adaptive regular-block (DARB) pruning that outperforms prior structured pruning work with high pruning ratio and decoding efficiency. Our experimental results show that DARB can achieve 13$times$ to 25$times$ pruning ratio, which are 2.8$times$ to 4.3$times$ improvements than the state-of-the-art counterparts on multiple neural network models and tasks. Moreover, DARB can achieve 14.3$times$ decoding efficiency than block pruning with higher pruning ratio.

التعلم الآلي الحوسبة العصبية والتطورية