No Arabic abstract
In the dictionary learning (or sparse coding) problem, we are given a collection of signals (vectors in $mathbb{R}^d$), and the goal is to find a basis in which the signals have a sparse (approximate) representation. The problem has received a lot of attention in signal processing, learning, and theoretical computer science. The problem is formalized as factorizing a matrix $X (d times n)$ (whose columns are the signals) as $X = AY$, where $A$ has a prescribed number $m$ of columns (typically $m ll n$), and $Y$ has columns that are $k$-sparse (typically $k ll d$). Most of the known theoretical results involve assuming that the columns of the unknown $A$ have certain incoherence properties, and that the coefficient matrix $Y$ has random (or partly random) structure. The goal of our work is to understand what can be said in the absence of such assumptions. Can we still find $A$ and $Y$ such that $X approx AY$? We show that this is possible, if we allow violating the bounds on $m$ and $k$ by appropriate factors that depend on $k$ and the desired approximation. Our results rely on an algorithm for what we call the threshold correlation problem, which turns out to be related to hypercontractive norms of matrices. We also show that our algorithmic ideas apply to a setting in which some of the columns of $X$ are outliers, thus giving similar guarantees even in this challenging setting.
In this paper, we develop a parameter estimation method for factorially parametrized models such as Factorial Gaussian Mixture Model and Factorial Hidden Markov Model. Our contributions are two-fold. First, we show that the emission matrix of the standard Factorial Model is unidentifiable even if the true assignment matrix is known. Secondly, we address the issue of identifiability by making a one component sharing assumption and derive a parameter learning algorithm for this case. Our approach is based on a dictionary learning problem of the form $X = O R$, where the goal is to learn the dictionary $O$ given the data matrix $X$. We argue that due to the specific structure of the activation matrix $R$ in the shared component factorial mixture model, and an incoherence assumption on the shared component, it is possible to extract the columns of the $O$ matrix without the need for alternating between the estimation of $O$ and $R$.
A dynamical neural network consists of a set of interconnected neurons that interact over time continuously. It can exhibit computational properties in the sense that the dynamical systems evolution and/or limit points in the associated state space can correspond to numerical solutions to certain mathematical optimization or learning problems. Such a computational system is particularly attractive in that it can be mapped to a massively parallel computer architecture for power and throughput efficiency, especially if each neuron can rely solely on local information (i.e., local memory). Deriving gradients from the dynamical networks various states while conforming to this last constraint, however, is challenging. We show that by combining ideas of top-down feedback and contrastive learning, a dynamical network for solving the l1-minimizing dictionary learning problem can be constructed, and the true gradients for learning are provably computable by individual neurons. Using spiking neurons to construct our dynamical network, we present a learning process, its rigorous mathematical analysis, and numerical results on several dictionary learning problems.
Federated learning (FL) has emerged as a prominent distributed learning paradigm. FL entails some pressing needs for developing novel parameter estimation approaches with theoretical guarantees of convergence, which are also communication efficient, differentially private and Byzantine resilient in the heterogeneous data distribution settings. Quantization-based SGD solvers have been widely adopted in FL and the recently proposed SIGNSGD with majority vote shows a promising direction. However, no existing methods enjoy all the aforementioned properties. In this paper, we propose an intuitively-simple yet theoretically-sound method based on SIGNSGD to bridge the gap. We present Stochastic-Sign SGD which utilizes novel stochastic-sign based gradient compressors enabling the aforementioned properties in a unified framework. We also present an error-feedback variant of the proposed Stochastic-Sign SGD which further improves the learning performance in FL. We test the proposed method with extensive experiments using deep neural networks on the MNIST dataset and the CIFAR-10 dataset. The experimental results corroborate the effectiveness of the proposed method.
Machine learning has shown much promise in helping improve the quality of medical, legal, and economic decision-making. In these applications, machine learning models must satisfy two important criteria: (i) they must be causal, since the goal is typically to predict individual treatment effects, and (ii) they must be interpretable, so that human decision makers can validate and trust the model predictions. There has recently been much progress along each direction independently, yet the state-of-the-art approaches are fundamentally incompatible. We propose a framework for learning causal interpretable models---from observational data---that can be used to predict individual treatment effects. Our framework can be used with any algorithm for learning interpretable models. Furthermore, we prove an error bound on the treatment effects predicted by our model. Finally, in an experiment on real-world data, we show that the models trained using our framework significantly outperform a number of baselines.
In over two decades of research, the field of dictionary learning has gathered a large collection of successful applications, and theoretical guarantees for model recovery are known only whenever optimization is carried out in the same model class as that of the underlying dictionary. This work characterizes the surprising phenomenon that dictionary recovery can be facilitated by searching over the space of larger over-realized models. This observation is general and independent of the specific dictionary learning algorithm used. We thoroughly demonstrate this observation in practice and provide an analysis of this phenomenon by tying recovery measures to generalization bounds. In particular, we show that model recovery can be upper-bounded by the empirical risk, a model-dependent quantity and the generalization gap, reflecting our empirical findings. We further show that an efficient and provably correct distillation approach can be employed to recover the correct atoms from the over-realized model. As a result, our meta-algorithm provides dictionary estimates with consistently better recovery of the ground-truth model.