
Likelihood Ratio Exponential Families

Added by Rob Brekelmans
Publication date: 2020
Language: English





The exponential family is well known in machine learning and statistical physics as the maximum entropy distribution subject to a set of observed constraints, while the geometric mixture path is common in MCMC methods such as annealed importance sampling. Linking these two ideas, recent work has interpreted the geometric mixture path as an exponential family of distributions to analyze the thermodynamic variational objective (TVO). We extend these likelihood ratio exponential families to include solutions to rate-distortion (RD) optimization, the information bottleneck (IB) method, and recent rate-distortion-classification approaches which combine RD and IB. This provides a common mathematical framework for understanding these methods via the conjugate duality of exponential families and hypothesis testing. Further, we collect existing results to provide a variational representation of intermediate RD or TVO distributions as minimizing an expectation of KL divergences. This solution also corresponds to a size-power tradeoff using the likelihood ratio test and the Neyman-Pearson lemma. In thermodynamic integration bounds such as the TVO, we identify the intermediate distribution whose expected sufficient statistics match the log partition function.
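
As a rough numerical illustration of this framework (a sketch under stated assumptions, not code from the paper), the snippet below treats the geometric path between a normalized base $q(z)$ and an unnormalized target $\tilde{p}(z)$ as a one-parameter exponential family whose sufficient statistic is the log likelihood ratio $T(z) = \log \tilde{p}(z) - \log q(z)$, and recovers the log partition function by thermodynamic integration over a grid of mixing parameters; the 1-D Gaussian densities, sample size, and grid are arbitrary choices made for the example.

```python
import numpy as np

# One-parameter "likelihood ratio" exponential family along the geometric path
#   pi_beta(z) ∝ q(z)^(1 - beta) * p_tilde(z)^beta,
# with sufficient statistic T(z) = log p_tilde(z) - log q(z).  Thermodynamic
# integration gives log Z = ∫_0^1 E_{pi_beta}[T] d(beta), which TVO-style bounds
# approximate with Riemann sums; here we use a trapezoidal sum.

rng = np.random.default_rng(0)

def log_q(z):
    # normalized base distribution N(0, 1)
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

mu, sigma = 2.0, 0.7
def log_p_tilde(z):
    # unnormalized target exp(-(z - mu)^2 / (2 sigma^2)), so Z = sigma * sqrt(2 pi)
    return -0.5 * ((z - mu) / sigma) ** 2

z = rng.normal(size=200_000)          # samples from q
T = log_p_tilde(z) - log_q(z)         # sufficient statistic (log likelihood ratio)

betas = np.linspace(0.0, 1.0, 51)
expectations = []
for beta in betas:
    # self-normalized importance weights from q to pi_beta: w ∝ (p_tilde / q)^beta
    log_w = beta * T
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    expectations.append(np.sum(w * T))   # estimate of E_{pi_beta}[T]

e = np.array(expectations)
log_Z_ti = np.sum(0.5 * (e[1:] + e[:-1]) * np.diff(betas))
print(f"thermodynamic integration: {log_Z_ti:.4f}")
print(f"exact log Z              : {np.log(sigma * np.sqrt(2.0 * np.pi)):.4f}")
```
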



Related Research

We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying a price marginally smaller than the average loss $1/2$ of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under model misspecification.
Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we have a fixed number $p$ of variables, twice the log-likelihood ratio (LLR) $2\Lambda$ is distributed as a $\chi^2_k$ variable in the limit of large sample sizes $n$; here, $k$ is the number of variables being tested. In this paper, we prove that when $p$ is not negligible compared to $n$, Wilks' theorem does not hold and the chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that $n$ and $p$ grow large in such a way that $p/n \rightarrow \kappa$ for some constant $\kappa < 1/2$. We prove that for a class of logistic models, the LLR converges to a rescaled chi-square, namely, $2\Lambda \stackrel{\mathrm{d}}{\rightarrow} \alpha(\kappa)\chi_k^2$, where the scaling factor $\alpha(\kappa)$ is greater than one as soon as the dimensionality ratio $\kappa$ is positive. Hence, the LLR is larger than classically assumed. For instance, when $\kappa = 0.3$, $\alpha(\kappa) \approx 1.5$. In general, we show how to compute the scaling factor by solving a nonlinear system of two equations in two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, non-asymptotic random matrix theory, and convex geometry. We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to some other regression models such as the probit regression model.
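
The inflation described above is easy to observe numerically. The following sketch (not the authors' code; the sample sizes, repetition count, and plain Newton solver are illustrative assumptions) simulates logistic regression under a global null with $\kappa = p/n = 0.3$ and compares the average of $2\Lambda / k$ to the classical Wilks value of $1$ and to $\alpha(0.3) \approx 1.5$.

```python
import numpy as np

def logistic_mle_loglik(X, y, n_iter=50, tol=1e-9):
    """Maximum log-likelihood of an unpenalized logistic regression via Newton-Raphson."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(hess + 1e-10 * np.eye(p), grad)   # tiny jitter for stability
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))       # Bernoulli-logit log-likelihood

rng = np.random.default_rng(1)
n, p, k, reps = 400, 120, 5, 200           # kappa = p / n = 0.3
llr = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.integers(0, 2, size=n).astype(float)    # global null: y independent of X
    ll_full = logistic_mle_loglik(X, y)
    ll_reduced = logistic_mle_loglik(X[:, k:], y)   # drop the k tested variables
    llr[r] = 2.0 * (ll_full - ll_reduced)

# Classical Wilks predicts mean(2 Lambda) ≈ k; the high-dimensional correction
# predicts it should instead sit near alpha(0.3) * k ≈ 1.5 k.
print("mean(2 Lambda) / k =", llr.mean() / k)
```
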
Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. In this paper, we propose the fair empirical risk minimization via exponential Rényi mutual information (FERMI) framework. FERMI is built on a stochastic estimator for exponential Rényi mutual information (ERMI), an information divergence measuring the degree of dependence of predictions on sensitive attributes. Theoretically, we show that ERMI upper bounds existing popular fairness violation metrics, so controlling ERMI provides guarantees on other commonly used violations, such as $L_\infty$. We derive an unbiased estimator for ERMI, which we use to derive the FERMI algorithm. We prove that FERMI converges for demographic parity, equalized odds, and equal opportunity notions of fairness in stochastic optimization. Empirically, we show that FERMI is amenable to large-scale problems with multiple (non-binary) sensitive attributes and non-binary targets. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, and equal opportunity. These benefits are especially significant for non-binary classification with large sensitive sets and small batch sizes, showcasing the effectiveness of the FERMI objective and the developed stochastic algorithm for solving it.
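
For intuition about the quantity FERMI penalizes, here is a hedged plug-in sketch of ERMI for discrete predictions and sensitive attributes, read as the chi-square divergence between the joint distribution of $(\hat{Y}, S)$ and the product of its marginals; this is not the paper's unbiased stochastic estimator or the FERMI algorithm, and the function name and toy data below are invented for illustration.

```python
import numpy as np

def ermi_plugin(y_pred, s):
    """Plug-in estimate of exponential Renyi mutual information (order 2),
    i.e. the chi-square divergence between the empirical joint distribution of
    (prediction, sensitive attribute) and the product of its marginals.
    Hypothetical illustration of the penalized quantity, not FERMI's estimator."""
    y_vals, y_idx = np.unique(y_pred, return_inverse=True)
    s_vals, s_idx = np.unique(s, return_inverse=True)
    joint = np.zeros((len(y_vals), len(s_vals)))
    np.add.at(joint, (y_idx, s_idx), 1.0)          # empirical joint counts
    joint /= joint.sum()
    p_y = joint.sum(axis=1, keepdims=True)         # marginal over predictions
    p_s = joint.sum(axis=0, keepdims=True)         # marginal over sensitive attribute
    return float((joint**2 / (p_y * p_s)).sum() - 1.0)

# Independent prediction/attribute -> ERMI near 0; dependent -> strictly positive.
rng = np.random.default_rng(0)
s = rng.integers(0, 3, size=50_000)                # non-binary sensitive attribute
y_indep = rng.integers(0, 2, size=50_000)
y_dep = (s == 0).astype(int)                       # prediction leaks the attribute
print(ermi_plugin(y_indep, s), ermi_plugin(y_dep, s))
```
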
We address the problem of learning continuous exponential family distributions with unbounded support. While a lot of progress has been made on learning Gaussian graphical models, we still lack scalable algorithms for reconstructing general continuous exponential families that model higher-order moments of the data beyond the mean and the covariance. Here, we introduce a computationally efficient method for learning continuous graphical models based on the Interaction Screening approach. Through a series of numerical experiments, we show that our estimator maintains similar requirements in terms of accuracy and sample complexity compared to alternative approaches such as maximization of conditional likelihood, while considerably improving upon the algorithms' run-time.
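
For reference, the sketch below shows the Interaction Screening idea in its best-known binary (Ising) form, where each node's couplings minimize an exponentiated local-energy objective; the paper's contribution extends this style of estimator to continuous exponential families, so the plain gradient-descent loop, the crude l1 shrinkage, and the toy three-spin chain here are illustrative assumptions rather than the authors' method.

```python
import numpy as np

def interaction_screening_node(X, u, lr=0.1, n_steps=500, lam=0.01):
    """Estimate couplings of node u from +/-1 samples X of shape (n, d) by
    minimizing the Interaction Screening objective
        S_u(theta) = mean_k exp( - x_u^(k) * <theta, x_{-u}^(k)> )."""
    n, d = X.shape
    mask = np.arange(d) != u
    theta = np.zeros(d - 1)
    for _ in range(n_steps):
        margin = X[:, u] * (X[:, mask] @ theta)            # x_u * <theta, x_{-u}>
        w = np.exp(-margin)                                # per-sample ISO weight
        grad = -(X[:, mask] * (X[:, u] * w)[:, None]).mean(axis=0)
        grad += lam * np.sign(theta)                       # crude l1 shrinkage
        theta -= lr * grad
    return theta

# Toy check: a 3-spin chain with couplings on edges 0-1 and 1-2, sampled exactly
# from the 8 possible configurations (no Gibbs sampling needed at this size).
rng = np.random.default_rng(0)
configs = np.array([[s0, s1, s2] for s0 in (-1, 1) for s1 in (-1, 1) for s2 in (-1, 1)])
J01, J12 = 0.8, -0.6
energy = -(J01 * configs[:, 0] * configs[:, 1] + J12 * configs[:, 1] * configs[:, 2])
p = np.exp(-energy); p /= p.sum()
X = configs[rng.choice(8, size=20_000, p=p)]
print(interaction_screening_node(X, 1))    # should land close to [J01, J12]
```
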
It is well known that modern deep neural networks are powerful enough to memorize datasets even when the labels have been randomized. Recently, Vershynin (2020) settled a long-standing question by Baum (1988), proving that deep threshold networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points. In this work, we improve the dependence on $\delta$ from exponential to almost linear, proving that $\widetilde{\mathcal{O}}(\frac{1}{\delta}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta}+n)$ weights are sufficient. Our construction uses Gaussian random weights only in the first layer, while all the subsequent layers use binary or integer weights. We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
