No Arabic abstract
An associative memory is a framework of content-addressable memory that stores a collection of message vectors (or a dataset) over a neural network while enabling a neurally feasible mechanism to recover any message in the dataset from its noisy version. Designing an associative memory requires addressing two main tasks: 1) learning phase: given a dataset, learn a concise representation of the dataset in the form of a graphical model (or a neural network), 2) recall phase: given a noisy version of a message vector from the dataset, output the correct message vector via a neurally feasible algorithm over the network learnt during the learning phase. This paper studies the problem of designing a class of neural associative memories which learns a network representation for a large dataset that ensures correction against a large number of adversarial errors during the recall phase. Specifically, the associative memories designed in this paper can store dataset containing $exp(n)$ $n$-length message vectors over a network with $O(n)$ nodes and can tolerate $Omega(frac{n}{{rm polylog} n})$ adversarial errors. This paper carries out this memory design by mapping the learning phase and recall phase to the tasks of dictionary learning with a square dictionary and iterative error correction in an expander code, respectively.
Humans can quickly associate stimuli to solve problems in novel contexts. Our novel neural network model learns state representations of facts that can be composed to perform such associative inference. To this end, we augment the LSTM model with an associative memory, dubbed Fast Weight Memory (FWM). Through differentiable operations at every step of a given input sequence, the LSTM updates and maintains compositional associations stored in the rapidly changing FWM weights. Our model is trained end-to-end by gradient descent and yields excellent performance on compositional language reasoning problems, meta-reinforcement-learning for POMDPs, and small-scale word-level language modelling.
As artificial intelligence is increasingly affecting all parts of society and life, there is growing recognition that human interpretability of machine learning models is important. It is often argued that accuracy or other similar generalization performance metrics must be sacrificed in order to gain interpretability. Such arguments, however, fail to acknowledge that the overall decision-making system is composed of two entities: the learned model and a human who fuses together model outputs with his or her own information. As such, the relevant performance criteria should be for the entire system, not just for the machine learning component. In this work, we characterize the performance of such two-node tandem data fusion systems using the theory of distributed detection. In doing so, we work in the population setting and model interpretable learned models as multi-level quantizers. We prove that under our abstraction, the overall system of a human with an interpretable classifier outperforms one with a black box classifier.
This paper introduces a new nonlinear dictionary learning method for histograms in the probability simplex. The method leverages optimal transport theory, in the sense that our aim is to reconstruct histograms using so-called displacement interpolations (a.k.a. Wasserstein barycenters) between dictionary atoms; such atoms are themselves synthetic histograms in the probability simplex. Our method simultaneously estimates such atoms, and, for each datapoint, the vector of weights that can optimally reconstruct it as an optimal transport barycenter of such atoms. Our method is computationally tractable thanks to the addition of an entropic regularization to the usual optimal transportation problem, leading to an approximation scheme that is efficient, parallel and simple to differentiate. Both atoms and weights are learned using a gradient-based descent method. Gradients are obtained by automatic differentiation of the generalized Sinkhorn iterations that yield barycenters with entropic smoothing. Because of its formulation relying on Wasserstein barycenters instead of the usual matrix product between dictionary and codes, our method allows for nonlinear relationships between atoms and the reconstruction of input data. We illustrate its application in several different image processing settings.
In this paper, we present a sparse neural network decoder (SNND) of polar codes based on belief propagation (BP) and deep learning. At first, the conventional factor graph of polar BP decoding is converted to the bipartite Tanner graph similar to low-density parity-check (LDPC) codes. Then the Tanner graph is unfolded and translated into the graphical representation of deep neural network (DNN). The complex sum-product algorithm (SPA) is modified to min-sum (MS) approximation with low complexity. We dramatically reduce the number of weight by using single weight to parameterize the networks. Optimized by the training techniques of deep learning, proposed SNND achieves comparative decoding performance of SPA and obtains about $0.5$ dB gain over MS decoding on ($128,64$) and ($256,128$) codes. Moreover, $60 %$ complexity reduction is achieved and the decoding latency is significantly lower than the conventional polar BP.
Rectified linear units, or ReLUs, have become the preferred activation function for artificial neural networks. In this paper we consider two basic learning problems assuming that the underlying data follow a generative model based on a ReLU-network -- a neural network with ReLU activations. As a primarily theoretical study, we limit ourselves to a single-layer network. The first problem we study corresponds to dictionary-learning in the presence of nonlinearity (modeled by the ReLU functions). Given a set of observation vectors $mathbf{y}^i in mathbb{R}^d, i =1, 2, dots , n$, we aim to recover $dtimes k$ matrix $A$ and the latent vectors ${mathbf{c}^i} subset mathbb{R}^k$ under the model $mathbf{y}^i = mathrm{ReLU}(Amathbf{c}^i +mathbf{b})$, where $mathbf{b}in mathbb{R}^d$ is a random bias. We show that it is possible to recover the column space of $A$ within an error of $O(d)$ (in Frobenius norm) under certain conditions on the probability distribution of $mathbf{b}$. The second problem we consider is that of robust recovery of the signal in the presence of outliers, i.e., large but sparse noise. In this setting we are interested in recovering the latent vector $mathbf{c}$ from its noisy nonlinear sketches of the form $mathbf{v} = mathrm{ReLU}(Amathbf{c}) + mathbf{e}+mathbf{w}$, where $mathbf{e} in mathbb{R}^d$ denotes the outliers with sparsity $s$ and $mathbf{w} in mathbb{R}^d$ denote the dense but small noise. This line of work has recently been studied (Soltanolkotabi, 2017) without the presence of outliers. For this problem, we show that a generalized LASSO algorithm is able to recover the signal $mathbf{c} in mathbb{R}^k$ within an $ell_2$ error of $O(sqrt{frac{(k+s)log d}{d}})$ when $A$ is a random Gaussian matrix.