Modifications to a neural network's input and output layers are often required to accommodate the specificities of most practical learning tasks. However, the impact of such changes on an architecture's approximation capabilities is largely not understood. We present general conditions describing feature and readout maps that preserve an architecture's ability to approximate any continuous function uniformly on compacts. As an application, we show that if an architecture is capable of universal approximation, then modifying its final layer to produce binary values creates a new architecture capable of deterministically approximating any classifier. In particular, we obtain guarantees for deep CNNs and deep feed-forward networks. Our results also have consequences within the scope of geometric deep learning. Specifically, when the input and output spaces are Cartan-Hadamard manifolds, we obtain geometrically meaningful feature and readout maps satisfying our criteria. Consequently, commonly used non-Euclidean regression models between spaces of symmetric positive definite matrices are extended to universal DNNs. The same result allows us to show that the hyperbolic feed-forward networks used for hierarchical learning are universal. Our result is also used to show that the common practice of randomizing all but the last two layers of a DNN produces a universal family of functions with probability one. We also provide conditions on a DNN's first (resp. last) few layers' connections and activation function which guarantee that these layers can have a width equal to the input (resp. output) space's dimension without negatively affecting the architecture's approximation capabilities.
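To make the composition described above concrete, here is a minimal sketch, not the paper's construction: a generic feed-forward model sandwiched between an illustrative feature map and a binarizing readout map. All names (`feature_map`, `binary_readout`, `mlp`) and the random weights are assumptions for illustration only.

```python
# Sketch: (readout) o (universal architecture) o (feature map), with a {0,1}-valued readout.
import numpy as np

rng = np.random.default_rng(0)

def feature_map(x):
    """Illustrative continuous pre-processing of the inputs."""
    return np.concatenate([x, x**2], axis=-1)

def mlp(x, weights, biases, act=np.tanh):
    """Plain feed-forward network; stands in for any universal architecture."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = act(h @ W + b)
    return h @ weights[-1] + biases[-1]

def binary_readout(y, threshold=0.0):
    """Post-composition turning real-valued outputs into {0,1} class labels."""
    return (y > threshold).astype(int)

# Random weights, just to make the composition runnable end-to-end.
dims = [4, 16, 16, 1]  # feature dim -> hidden -> hidden -> output
weights = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

x = rng.standard_normal((5, 2))  # five inputs in R^2
labels = binary_readout(mlp(feature_map(x), weights, biases))
print(labels.ravel())
```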
Most $L^p$-type universal approximation theorems guarantee that a given machine learning model class $\mathscr{F}\subseteq C(\mathbb{R}^d,\mathbb{R}^D)$ is dense in $L^p_{\mu}(\mathbb{R}^d,\mathbb{R}^D)$ for any suitable finite Borel measure $\mu$ on $\mathbb{R}^d$. Unfortunately, this means that the model's approximation quality can rapidly degenerate outside some compact subset of $\mathbb{R}^d$, as any such measure is largely concentrated on some bounded subset of $\mathbb{R}^d$. This paper proposes a generic solution to this approximation-theoretic problem by introducing a canonical transformation which upgrades $\mathscr{F}$'s approximation property in the following sense. The transformed model class, denoted by $\mathscr{F}\text{-tope}$, is shown to be dense in $L^p_{\mu,\text{strict}}(\mathbb{R}^d,\mathbb{R}^D)$, a topological space whose elements are locally $p$-integrable functions and whose topology is much finer than the usual norm topology on $L^p_{\mu}(\mathbb{R}^d,\mathbb{R}^D)$; here $\mu$ is any suitable $\sigma$-finite Borel measure on $\mathbb{R}^d$. Next, we show that if $\mathscr{F}$ is any family of analytic functions then there is always a strict gap between $\mathscr{F}\text{-tope}$'s expressibility and that of $\mathscr{F}$, since we find that $\mathscr{F}$ can never be dense in $L^p_{\mu,\text{strict}}(\mathbb{R}^d,\mathbb{R}^D)$. In the general case, where $\mathscr{F}$ may contain non-analytic functions, we provide an abstract form of these results guaranteeing that there always exists some function space in which $\mathscr{F}\text{-tope}$ is dense but $\mathscr{F}$ is not, while the converse is never possible. Applications to feedforward networks, convolutional neural networks, and polynomial bases are explored.
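To make the degeneration claim concrete, here is a standard illustration (not taken from the paper): for any finite Borel measure $\mu$, any $f\in L^p_{\mu}(\mathbb{R}^d,\mathbb{R}^D)$, and any compact $K\subseteq\mathbb{R}^d$, the truncation $g_K:=f\cdot\mathbf{1}_K$ satisfies
$$\lVert f-g_K\rVert_{L^p_{\mu}}^p=\int_{\mathbb{R}^d\setminus K}\lVert f(x)\rVert^p\,\mu(dx),$$
which can be made arbitrarily small by taking $K$ to be a large enough closed ball (dominated convergence), even though $g_K$ vanishes identically outside $K$. A small $L^p_{\mu}$-error therefore places no constraint on a model's behaviour off a compact set, which is the gap the $\mathscr{F}\text{-tope}$ construction is designed to close.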
We introduce a general framework for approximating regular conditional distributions (RCDs). Our approximations of these RCDs are implemented by a new class of geometric deep learning models with inputs in $\mathbb{R}^d$ and outputs in the Wasserstein-$1$ space $\mathcal{P}_1(\mathbb{R}^D)$. We find that the models built using our framework can approximate any continuous function from $\mathbb{R}^d$ to $\mathcal{P}_1(\mathbb{R}^D)$ uniformly on compacts, and quantitative rates are obtained. We identify two methods for avoiding the curse of dimensionality, i.e., ensuring that the number of parameters determining the approximating neural network depends only polynomially on the involved dimension and the approximation error. The first solution describes functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ which can be efficiently approximated on any compact subset of $\mathbb{R}^d$. Conversely, the second approach describes sets in $\mathbb{R}^d$ on which any function in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ can be efficiently approximated. Our framework is used to obtain an affirmative answer to the open conjecture of Bishop (1994); namely, mixture density networks are universal regular conditional distributions. The predictive performance of the proposed models is evaluated against comparable learning models on various probabilistic prediction tasks in the context of ELMs, model uncertainty, and heteroscedastic regression. All the results are obtained for more general input and output spaces and thus apply to geometric deep learning contexts.
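As a point of reference for the Bishop (1994) conjecture, here is a minimal mixture density network sketch, illustrative rather than the paper's model: a feed-forward map from $x\in\mathbb{R}^d$ to the parameters of a Gaussian mixture, read as an approximation of the conditional law of $Y$ given $X=x$. The dimensions, weights, and parameterization below are assumptions chosen only to make the sketch runnable.

```python
# Sketch of a mixture density network: x -> (mixture weights, means, scales).
import numpy as np

rng = np.random.default_rng(0)
d, D, K, H = 3, 1, 5, 32  # input dim, output dim, mixture components, hidden width

W1 = rng.standard_normal((d, H)) / np.sqrt(d)
b1 = np.zeros(H)
W2 = rng.standard_normal((H, K * (2 * D + 1))) / np.sqrt(H)
b2 = np.zeros(K * (2 * D + 1))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mdn(x):
    """Map x to the parameters of a K-component Gaussian mixture over R^D."""
    h = np.tanh(x @ W1 + b1)
    out = h @ W2 + b2
    logits, means, log_scales = np.split(out, [K, K + K * D], axis=-1)
    weights = softmax(logits)                      # mixture probabilities
    means = means.reshape(-1, K, D)                # component means in R^D
    scales = np.exp(log_scales).reshape(-1, K, D)  # positive component scales
    return weights, means, scales

x = rng.standard_normal((4, d))
w, m, s = mdn(x)
print(w.shape, m.shape, s.shape)  # (4, 5) (4, 5, 1) (4, 5, 1)
```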
We prove two universal approximation theorems for a range of dropout neural networks. These are feed-forward neural networks in which each edge is given a random $\{0,1\}$-valued filter, and which have two modes of operation: in the first, each edge output is multiplied by its random filter, resulting in a random output, while in the second, each edge output is multiplied by the expectation of its filter, leading to a deterministic output. It is common to use the random mode during training and the deterministic mode during testing and prediction. Both theorems are of the following form: given a function to approximate and a threshold $\varepsilon>0$, there exists a dropout network that is $\varepsilon$-close in probability and in $L^q$. The first theorem applies to dropout networks in the random mode. It assumes little on the activation function, applies to a wide class of networks, and can even be applied to approximation schemes other than neural networks. The core is an algebraic property showing that deterministic networks can be exactly matched in expectation by random networks. The second theorem makes stronger assumptions and gives a stronger result: given a function to approximate, it provides the existence of a network that approximates in both modes simultaneously. Proof components are a recursive replacement of edges by independent copies, and a special first-layer replacement that couples the resulting larger network to the input. The functions to be approximated are assumed to be elements of general normed spaces, and the approximations are measured in the corresponding norms. The networks are constructed explicitly. Because of the different methods of proof, the two results give independent insight into the approximation properties of random dropout networks. With this, we establish that dropout neural networks broadly satisfy a universal-approximation property.
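A minimal sketch of the two modes of operation described above, purely for illustration: each edge (weight entry) carries an independent Bernoulli($p$) filter; the random mode samples the filters, while the deterministic mode replaces each filter by its expectation $p$. The layer shape, activation, and retention probability are assumptions.

```python
# One layer of a dropout network with per-edge {0,1} filters and two modes.
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, b, p, mode="random", act=np.tanh):
    if mode == "random":
        F = rng.binomial(1, p, size=W.shape)  # one {0,1}-valued filter per edge
        W_eff = F * W                         # each edge output scaled by its filter
    else:                                     # deterministic (prediction) mode
        W_eff = p * W                         # each filter replaced by E[F] = p
    return act(x @ W_eff + b)

d_in, d_out, p = 4, 3, 0.8
W = rng.standard_normal((d_in, d_out))
b = np.zeros(d_out)
x = rng.standard_normal((2, d_in))

y_random = dropout_layer(x, W, b, p, mode="random")            # stochastic output
y_deterministic = dropout_layer(x, W, b, p, mode="deterministic")
print(y_random.shape, y_deterministic.shape)
```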
Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model's latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a naive constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general.
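For orientation, here is a minimal Deep Sets sketch of the form $f(X)=\rho\big(\sum_{i}\phi(x_i)\big)$; the width of the summed representation is the latent dimension whose size the above result concerns. The particular encoders, weights, and dimensions are illustrative assumptions.

```python
# Sketch of Deep Sets: pool element-wise embeddings by summation, then decode.
import numpy as np

rng = np.random.default_rng(0)
d, latent_dim = 2, 8  # per-element input dim, latent (pooled) dim

W_phi = rng.standard_normal((d, latent_dim)) / np.sqrt(d)
W_rho = rng.standard_normal((latent_dim, 1)) / np.sqrt(latent_dim)

def phi(x):   # element-wise encoder
    return np.tanh(x @ W_phi)

def rho(z):   # decoder applied to the pooled latent representation
    return z @ W_rho

def deep_sets(X):
    """Permutation-invariant set function: sum element embeddings, then decode."""
    return rho(phi(X).sum(axis=0))

X = rng.standard_normal((5, d))                          # a set of 5 elements in R^2
print(np.allclose(deep_sets(X), deep_sets(X[::-1])))     # True: invariant to permutation
```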
We propose an efficient algorithm to visualise symmetries in neural networks. Typically, models are defined with respect to a parameter space, where non-equal parameters can produce the same input-output map. Our proposed method, GENNI, allows us to efficiently identify parameters that are functionally equivalent and then visualise the subspace of the resulting equivalence class. By doing so, we are now able to better explore questions surrounding identifiability, with applications to optimisation and generalisability, for commonly used or newly developed neural network architectures.
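As an illustration of the kind of symmetry GENNI is meant to visualise (this example is a standard fact, not GENNI itself): in a ReLU network, rescaling a hidden layer's incoming weights by $c>0$ and its outgoing weights by $1/c$ leaves the input-output map unchanged, so the two parameter settings are functionally equivalent.

```python
# Two different parameter vectors that define exactly the same ReLU network.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

W1 = rng.standard_normal((3, 4))   # input -> hidden weights
W2 = rng.standard_normal((4, 1))   # hidden -> output weights
c = 2.5                            # positive rescaling factor

net = lambda x, A, B: relu(x @ A) @ B

x = rng.standard_normal((6, 3))
y_original = net(x, W1, W2)
y_rescaled = net(x, c * W1, W2 / c)         # different parameters, same function
print(np.allclose(y_original, y_rescaled))  # True
```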