No Arabic abstract
What is the optimal number of independent observations from which a sparse Gaussian Graphical Model can be correctly recovered? Information-theoretic arguments provide a lower bound on the minimum number of samples necessary to perfectly identify the support of any multivariate normal distribution as a function of model parameters. For a model defined on a sparse graph with $p$ nodes, a maximum degree $d$ and minimum normalized edge strength $kappa$, this necessary number of samples scales at least as $d log p/kappa^2$. The sample complexity requirements of existing methods for perfect graph reconstruction exhibit dependency on additional parameters that do not enter in the lower bound. The question of whether the lower bound is tight and achievable by a polynomial time algorithm remains open. In this paper, we constructively answer this question and propose an algorithm, termed DICE, whose sample complexity matches the information-theoretic lower bound up to a universal constant factor. We also propose a related algorithm SLICE that has a slightly higher sample complexity, but can be implemented as a mixed integer quadratic program which makes it attractive in practice. Importantly, SLICE retains a critical advantage of DICE in that its sample complexity only depends on quantities present in the information theoretic lower bound. We anticipate that this result will stimulate future search of computationally efficient sample-optimal algorithms.
Graphical models are useful tools for describing structured high-dimensional probability distributions. Development of efficient algorithms for learning graphical models with least amount of data remains an active research topic. Reconstruction of graphical models that describe the statistics of discrete variables is a particularly challenging problem, for which the maximum likelihood approach is intractable. In this work, we provide the first sample-efficient method based on the Interaction Screening framework that allows one to provably learn fully general discrete factor models with node-specific discrete alphabets and multi-body interactions, specified in an arbitrary basis. We identify a single condition related to model parametrization that leads to rigorous guarantees on the recovery of model structure and parameters in any error norm, and is readily verifiable for a large class of models. Importantly, our bounds make explicit distinction between parameters that are proper to the model and priors used as an input to the algorithm. Finally, we show that the Interaction Screening framework includes all models previously considered in the literature as special cases, and for which our analysis shows a systematic improvement in sample complexity.
Graphical model selection in Markov random fields is a fundamental problem in statistics and machine learning. Two particularly prominent models, the Ising model and Gaussian model, have largely developed in parallel using different (though often related) techniques, and several practical algorithms with rigorous sample complexity bounds have been established for each. In this paper, we adapt a recently proposed algorithm of Klivans and Meka (FOCS, 2017), based on the method of multiplicative weight updates, from the Ising model to the Gaussian model, via non-trivial modifications to both the algorithm and its analysis. The algorithm enjoys a sample complexity bound that is qualitatively similar to others in the literature, has a low runtime $O(mp^2)$ in the case of $m$ samples and $p$ nodes, and can trivially be implemented in an online manner.
We address the question of characterizing and finding optimal representations for supervised learning. Traditionally, this question has been tackled using the Information Bottleneck, which compresses the inputs while retaining information about the targets, in a decoder-agnostic fashion. In machine learning, however, our goal is not compression but rather generalization, which is intimately linked to the predictive family or decoder of interest (e.g. linear classifier). We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.
Safely deploying machine learning models to the real world is often a challenging process. Models trained with data obtained from a specific geographic location tend to fail when queried with data obtained elsewhere, agents trained in a simulation can struggle to adapt when deployed in the real world or novel environments, and neural networks that are fit to a subset of the population might carry some selection bias into their decision process. In this work, we describe the problem of data shift from a novel information-theoretic perspective by (i) identifying and describing the different sources of error, (ii) comparing some of the most promising objectives explored in the recent domain generalization, and fair classification literature. From our theoretical analysis and empirical evaluation, we conclude that the model selection procedure needs to be guided by careful considerations regarding the observed data, the factors used for correction, and the structure of the data-generating process.
The overall predictive uncertainty of a trained predictor can be decomposed into separate contributions due to epistemic and aleatoric uncertainty. Under a Bayesian formulation, assuming a well-specified model, the two contributions can be exactly expressed (for the log-loss) or bounded (for more general losses) in terms of information-theoretic quantities (Xu and Raginsky, 2020). This paper addresses the study of epistemic uncertainty within an information-theoretic framework in the broader setting of Bayesian meta-learning. A general hierarchical Bayesian model is assumed in which hyperparameters determine the per-task priors of the model parameters. Exact characterizations (for the log-loss) and bounds (for more general losses) are derived for the epistemic uncertainty - quantified by the minimum excess meta-risk (MEMR)- of optimal meta-learning rules. This characterization is leveraged to bring insights into the dependence of the epistemic uncertainty on the number of tasks and on the amount of per-task training data. Experiments are presented that compare the proposed information-theoretic bounds, evaluated via neural mutual information estimators, with the performance of a novel approximate fully Bayesian meta-learning strategy termed Langevin-Stein Bayesian Meta-Learning (LS-BML).