No Arabic abstract
This paper presents a new approach, called perturb-max, for high-dimensional statistical inference that is based on applying random perturbations followed by optimization. This framework injects randomness to maximum a-posteriori (MAP) predictors by randomly perturbing the potential function for the input. A classic result from extreme value statistics asserts that perturb-max operations generate unbiased samples from the Gibbs distribution using high-dimensional perturbations. Unfortunately, the computational cost of generating so many high-dimensional random variables can be prohibitive. However, when the perturbations are of low dimension, sampling the perturb-max prediction is as efficient as MAP optimization. This paper shows that the expected value of perturb-max inference with low dimensional perturbations can be used sequentially to generate unbiased samples from the Gibbs distribution. Furthermore the expected value of the maximal perturbations is a natural bound on the entropy of such perturb-max models. A measure concentration result for perturb-max values shows that the deviation of their sampled average from its expectation decays exponentially in the number of samples, allowing effective approximation of the expectation.
We consider the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, we introduce natural notions of universality and we show a local equivalence among them. Our analysis is naturally expressed via information geometry, and represents a conceptually and computationally useful analysis. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-Renyi maximal correlation, the canonical correlation and principle component analyses of Hotelling and Pearson, Tishbys information bottleneck, Wyners common information, Ky Fan $k$-norms, and Brieman and Friedmans alternating conditional expectations algorithm. We further illustrate how this framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and the associated neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.
We introduce a new algorithm for reinforcement learning called Maximum aposteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings while achieving similar or better final performance.
Channel and frequency offset estimation is a classic topic with a large body of prior work using mainly maximum likelihood (ML) approach together with Cramer-Rao Lower bounds (CRLB) analysis. We provide the maximum a posteriori (MAP) estimation solution which is particularly useful for for tracking where previous estimation can be used as prior knowledge. Unlike the ML cases, the corresponding Bayesian Cramer-Rao Lower bound (BCRLB) shows clear relation with parameters and a low complexity algorithm achieves the BCRLB in almost all SNR range. We allow the time invariant channel within a packet to have arbitrary correlation and mean. The estimation is based on pilot/training signals. An unexpected result is that the joint MAP estimation is equivalent to an individual MAP estimation of the frequency offset first, again different from the ML results. We provide insight on the pilot/training signal design based on the BCRLB. Unlike past algorithms that trade performance and/or complexity for the accommodation of time varying channels, the MAP solution provides a different route for dealing with time variation. Within a short enough (segment of) packet where the channel and CFO are approximately time invariant, the low complexity algorithm can be employed. Similar to belief propagation, the estimation of the previous (segment of) packet can serve as the prior knowledge for the next (segment of) packet.
Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, Levy and Gaussian distributions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise.
The design of methods for inference from time sequences has traditionally relied on statistical models that describe the relation between a latent desired sequence and the observed one. A broad family of model-based algorithms have been derived to carry out inference at controllable complexity using recursive computations over the factor graph representing the underlying distribution. An alternative model-agnostic approach utilizes machine learning (ML) methods. Here we propose a framework that combines model-based algorithms and data-driven ML tools for stationary time sequences. In the proposed approach, neural networks are developed to separately learn specific components of a factor graph describing the distribution of the time sequence, rather than the complete inference task. By exploiting stationary properties of this distribution, the resulting approach can be applied to sequences of varying temporal duration. Learned factor graph can be realized using compact neural networks that are trainable using small training sets, or alternatively, be used to improve upon existing deep inference systems. We present an inference algorithm based on learned stationary factor graphs, which learns to implement the sum-product scheme from labeled data, and can be applied to sequences of different lengths. Our experimental results demonstrate the ability of the proposed learned factor graphs to learn to carry out accurate inference from small training sets for sleep stage detection using the Sleep-EDF dataset, as well as for symbol detection in digital communications with unknown channels.