Optimal Rates for Learning Hidden Tree Structures

Published by: Konstantinos Nikolakakis

Publication date: 2019

Research field: Mathematical Statistics, Informatics Engineering

Language: English

Authors: Konstantinos E. Nikolakakis - Dionysios S. Kalogerias - Anand D. Sarwate

We provide high probability finite sample complexity guarantees for hidden non-parametric structure learning of tree-shaped graphical models, whose hidden and observable nodes are discrete random variables with either finite or countable alphabets. We study a fundamental quantity called the (noisy) information threshold, which arises naturally from the error analysis of the Chow-Liu algorithm and, as we discuss, provides explicit necessary and sufficient conditions on sample complexity, by effectively summarizing the difficulty of the tree-structure learning problem. Specifically, we show that the finite sample complexity of the Chow-Liu algorithm for ensuring exact structure recovery from noisy data is inversely proportional to the information threshold squared (provided it is positive), and scales almost logarithmically relative to the number of nodes over a given probability of failure. Conversely, we show that, if the number of samples is less than an absolute constant times the inverse of information threshold squared, then no algorithm can recover the hidden tree structure with probability greater than one half. As a consequence, our upper and lower bounds match with respect to the information threshold, indicating that it is a fundamental quantity for the problem of learning hidden tree-structured models. Further, the Chow-Liu algorithm with noisy data as input achieves the optimal rate with respect to the information threshold. Lastly, as a byproduct of our analysis, we resolve the problem of tree structure learning in the presence of non-identically distributed observation noise, providing conditions for convergence of the Chow-Liu algorithm under this setting, as well.
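To make the object of study concrete, here is a minimal sketch of the classical Chow-Liu procedure run on noisy observations: estimate the mutual information of every pair of observed variables, then keep a maximum-weight spanning tree. The function names, the plug-in estimator, and the toy data generator (a hidden binary chain whose nodes are each flipped independently) are assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from collections import Counter
from itertools import combinations


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum of p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def chow_liu_edges(samples):
    """Return the edges (i, j), i < j, of a maximum-weight spanning tree
    whose weights are pairwise empirical mutual informations."""
    p = len(samples[0])
    columns = list(zip(*samples))
    weights = sorted(
        ((empirical_mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(p), 2)),
        reverse=True,
    )
    parent = list(range(p))  # union-find forest for Kruskal's algorithm

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = set()
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            edges.add((i, j))
    return edges


def sample_noisy_chain(n, p=4, flip=0.1, q=0.05, seed=0):
    """Hypothetical toy data: hidden binary chain 0-1-...-(p-1), where each
    observed node is the hidden node flipped independently with probability q."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        hidden = [rng.randint(0, 1)]
        for _ in range(p - 1):
            hidden.append(hidden[-1] ^ (rng.random() < flip))
        data.append(tuple(h ^ (rng.random() < q) for h in hidden))
    return data


recovered = chow_liu_edges(sample_noisy_chain(5000))
```

In this toy setting the gap between the mutual information of adjacent and non-adjacent observed pairs is large, so a few thousand samples are enough for the chain to be recovered; the abstract's information threshold is the quantity that governs how many samples are needed when that gap is small.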

60 - Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Anand D. Sarwate 2018

We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (flipping each bit independently with some constant probability $q \in [0,1/2)$). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantifies how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case ($q=0$). In fact, for any tree with $p$ vertices and probability of incorrect recovery $\delta>0$, the sufficient number of samples remains logarithmic as in the noiseless case, i.e., $\mathcal{O}(\log(p/\delta))$, while the dependence on $q$ is $\mathcal{O}\big(1/(1-2q)^{4}\big)$, for both aforementioned tasks. We also present a new equivalent of Isserlis' theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation.
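A short worked calculation shows how a binary symmetric channel attenuates pairwise correlations. The notation is assumed for illustration and does not come from the abstract: hidden spins $X_i \in \{-1,+1\}$, independent noise signs $N_i \in \{-1,+1\}$ with $P(N_i=-1)=q$ that are also independent of the hidden spins, and observations $Y_i = N_i X_i$.

```latex
% Assumed notation: X_i hidden spin, N_i independent noise sign, Y_i = N_i X_i.
\begin{align*}
\mathbb{E}[N_i] &= (1-q) - q = 1-2q, \\
\mathbb{E}[Y_i Y_j] &= \mathbb{E}[N_i]\,\mathbb{E}[N_j]\,\mathbb{E}[X_i X_j]
                    = (1-2q)^2\,\mathbb{E}[X_i X_j], \qquad i \neq j.
\end{align*}
```

Every observed correlation is the hidden one shrunk by $(1-2q)^2$. A sample complexity that scales with the inverse square of a correlation-type quantity would therefore pick up a factor of $1/(1-2q)^4$, which is at least consistent with the dependence on $q$ quoted above; the paper's own argument may proceed differently.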

Machine learning, Information theory

Model Selection with Near Optimal Rates for Reinforcement Learning with General Model Classes

89 - Avishek Ghosh, Sayak Ray Chowdhury, Kannan Ramchandran 2021

We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs to a family of models $\mathcal{P}^*$ with finite metric entropy. In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \ldots \subset \mathcal{P}_M$. We propose and analyze a novel algorithm, namely Adaptive Reinforcement Learning (General) (ARL-GEN), that adapts to the smallest such family where the true transition kernel $P^*$ lies. ARL-GEN uses the Upper Confidence Reinforcement Learning (UCRL) algorithm with value targeted regression as a blackbox and puts a model selection module at the beginning of each epoch. Under a mild separability assumption on the model classes, we show that ARL-GEN obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2+\sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the total number of steps, $d_{\mathcal{E}}^*$ is the Eluder dimension and $\mathbb{M}^*$ is the metric entropy corresponding to $\mathcal{P}^*$. Note that this regret scaling matches that of an oracle that knows $\mathcal{P}^*$ in advance. We show that the cost of model selection for ARL-GEN is an additive term in the regret having a weak dependence on $T$. Subsequently, we remove the separability assumption and consider the setup of linear mixture MDPs, where the transition kernel $P^*$ has a linear function approximation. With this low rank structure, we propose novel adaptive algorithms for model selection, and obtain (order-wise) regret identical to that of an oracle with knowledge of the true model class.

Machine learning, Information theory

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

100 - Gen Li, Changxiao Cai, Yuxin Chen 2021

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. Take a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$: to yield an entrywise $\varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory for Q-learning proves that a sample size on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$ is sufficient, which, however, fails to match the existing minimax lower bound. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? In this work, we settle these questions by (1) demonstrating that the sample complexity of Q-learning is at most on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to some log factor) for any $0<\varepsilon<1$, and (2) developing a matching lower bound to confirm the sharpness of our result. Our findings unveil both the effectiveness and limitation of Q-learning: its sample complexity matches that of speedy Q-learning without requiring extra computation and storage, albeit still being considerably higher than the minimax lower bound.
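The synchronous setting described above can be sketched on a toy problem: in every iteration one fresh next state is drawn for each state-action pair, and all entries of the Q-table are updated from the same frozen copy. The two-state MDP, the step size schedule, and the function names below are illustrative assumptions and are not taken from the paper.

```python
import random

GAMMA = 0.5
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical toy MDP: P[(s, a)] is the next-state distribution,
# R[(s, a)] the deterministic reward.
P = {(0, 0): (0.9, 0.1), (0, 1): (0.2, 0.8),
     (1, 0): (0.5, 0.5), (1, 1): (0.1, 0.9)}
R = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.2}


def optimal_q(iterations=200):
    """Reference Q* from value iteration, which uses the model P directly."""
    q = {sa: 0.0 for sa in P}
    for _ in range(iterations):
        v = [max(q[(s, a)] for a in ACTIONS) for s in STATES]
        q = {(s, a): R[(s, a)]
             + GAMMA * sum(P[(s, a)][s2] * v[s2] for s2 in STATES)
             for (s, a) in P}
    return q


def synchronous_q_learning(num_iterations=20000, seed=0):
    """Model-free estimate of Q*: one generative-model sample per
    state-action pair in every iteration."""
    rng = random.Random(seed)
    q = {sa: 0.0 for sa in P}
    for t in range(1, num_iterations + 1):
        eta = 1.0 / (1.0 + (1.0 - GAMMA) * t)  # assumed step size schedule
        # Freeze the greedy values so all pairs use the same old Q-table.
        v = [max(q[(s, a)] for a in ACTIONS) for s in STATES]
        for (s, a) in P:
            s_next = 0 if rng.random() < P[(s, a)][0] else 1
            target = R[(s, a)] + GAMMA * v[s_next]
            q[(s, a)] = (1.0 - eta) * q[(s, a)] + eta * target
    return q


q_star = optimal_q()
q_hat = synchronous_q_learning()
max_error = max(abs(q_hat[sa] - q_star[sa]) for sa in P)
```

Here `max_error` is the entrywise error that the abstract's $\varepsilon$ refers to, and the total number of samples used is `num_iterations` times the number of state-action pairs.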

Machine learning, Information theory

Optimal Combination of Linear and Spectral Estimators for Generalized Linear Models

127 - Marco Mondelli, Christos Thrampoulidis, Ramji Venkataramanan 2020

We study the problem of recovering an unknown signal $\boldsymbol{x}$ given measurements obtained from a generalized linear model with a Gaussian sensing matrix. Two popular solutions are based on a linear estimator $\hat{\boldsymbol{x}}^{\rm L}$ and a spectral estimator $\hat{\boldsymbol{x}}^{\rm s}$. The former is a data-dependent linear combination of the columns of the measurement matrix, and its analysis is quite simple. The latter is the principal eigenvector of a data-dependent matrix, and a recent line of work has studied its performance. In this paper, we show how to optimally combine $\hat{\boldsymbol{x}}^{\rm L}$ and $\hat{\boldsymbol{x}}^{\rm s}$. At the heart of our analysis is the exact characterization of the joint empirical distribution of $(\boldsymbol{x}, \hat{\boldsymbol{x}}^{\rm L}, \hat{\boldsymbol{x}}^{\rm s})$ in the high-dimensional limit. This allows us to compute the Bayes-optimal combination of $\hat{\boldsymbol{x}}^{\rm L}$ and $\hat{\boldsymbol{x}}^{\rm s}$, given the limiting distribution of the signal $\boldsymbol{x}$. When the distribution of the signal is Gaussian, the Bayes-optimal combination has the form $\theta\hat{\boldsymbol{x}}^{\rm L}+\hat{\boldsymbol{x}}^{\rm s}$ and we derive the optimal combination coefficient. In order to establish the limiting distribution of $(\boldsymbol{x}, \hat{\boldsymbol{x}}^{\rm L}, \hat{\boldsymbol{x}}^{\rm s})$, we design and analyze an Approximate Message Passing (AMP) algorithm whose iterates give $\hat{\boldsymbol{x}}^{\rm L}$ and approach $\hat{\boldsymbol{x}}^{\rm s}$. Numerical simulations demonstrate the improvement of the proposed combination with respect to the two methods considered separately.

Machine learning, Information theory

Spectral Method and Regularized MLE Are Both Optimal for Top-$K$ Ranking

92 - Yuxin Chen, Jianqing Fan, Cong Ma 2017

This paper is concerned with the problem of top-$K$ ranking from pairwise comparisons. Given a collection of $n$ items and a few pairwise comparisons across them, one wishes to identify the set of $K$ items that receive the highest ranks. To tackle this problem, we adopt the logistic parametric model --- the Bradley-Terry-Luce model, where each item is assigned a latent preference score, and where the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made significant progress towards characterizing the performance (e.g. the mean square error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand regarding top-$K$ ranking remains unsettled. We demonstrate that under a natural random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity --- the number of paired comparisons needed to ensure exact top-$K$ identification, for the fixed dynamic range regime. This is accomplished via optimal control of the entrywise error of the score estimates. We complement our theoretical studies by numerical experiments, confirming that both methods yield low entrywise errors for estimating the underlying scores. Our theory is established via a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan $\sin\Theta$ theorem for symmetric matrices. This also allows us to close the gap between the $\ell_2$ error upper bound for the spectral method and the minimax lower limit.
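As a rough illustration of a spectral approach to top-$K$ ranking under the Bradley-Terry-Luce model, the sketch below builds a random walk over items that moves from item $i$ to item $j$ in proportion to how often $j$ beat $i$, and ranks items by the walk's stationary distribution (in the style of Rank Centrality). Whether this matches the exact variant analysed in the paper is an assumption, as are the function names, the all-pairs sampling scheme, and the example scores.

```python
import random
from itertools import combinations


def simulate_comparisons(scores, num_per_pair, seed=0):
    """wins[i][j] = number of times item i beat item j, with
    P(i beats j) = scores[i] / (scores[i] + scores[j]) (BTL model)."""
    rng = random.Random(seed)
    n = len(scores)
    wins = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        p_i_wins = scores[i] / (scores[i] + scores[j])
        for _ in range(num_per_pair):
            if rng.random() < p_i_wins:
                wins[i][j] += 1
            else:
                wins[j][i] += 1
    return wins


def spectral_scores(wins, num_power_iterations=2000):
    """Stationary distribution of the comparison-driven random walk."""
    n = len(wins)
    d = n  # normaliser; any value >= the maximum degree keeps rows valid
    transition = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = wins[i][j] + wins[j][i]
            if i != j and total > 0:
                # Move from i to j in proportion to how often j beat i.
                transition[i][j] = wins[j][i] / total / d
        transition[i][i] = 1.0 - sum(transition[i])  # self-loop mass
    pi = [1.0 / n] * n
    for _ in range(num_power_iterations):  # power iteration: pi <- pi P
        pi = [sum(pi[i] * transition[i][j] for i in range(n))
              for j in range(n)]
    return pi


def top_k_items(score_estimates, k):
    """Indices of the k largest estimated scores."""
    order = sorted(range(len(score_estimates)),
                   key=lambda i: score_estimates[i], reverse=True)
    return set(order[:k])


wins = simulate_comparisons([1.0, 2.0, 4.0, 8.0, 16.0], num_per_pair=200)
estimated_top_2 = top_k_items(spectral_scores(wins), 2)
```

With exact win probabilities in place of empirical frequencies, the stationary distribution is proportional to the latent scores by detailed balance, which is why ranking by it is sensible; the sample complexity question in the abstract is how few comparisons suffice for the empirical version to get the top-$K$ set exactly right.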

Machine learning, Information theory