No Arabic abstract
We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =omega(k/log k)$, it is possible to consistently estimate in total variation the emph{profile} of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $Theta(1/log k)$. Our estimator is based on Wolfowitzs minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques.
We present some new results on the joint distribution of an arbitrary subset of the ordered eigenvalues of complex Wishart, double Wishart, and Gaussian hermitian random matrices of finite dimensions, using a tensor pseudo-determinant operator. Specifically, we derive compact expressions for the joint probability distribution function of the eigenvalues and the expectation of functions of the eigenvalues, including joint moments, for the case of both ordered and unordered eigenvalues.
Estimating the rank of a corrupted data matrix is an important task in data science, most notably for choosing the number of components in principal component analysis. Significant progress on this task has been made using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, such as Poisson or binomial, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, focusing on a Poisson random matrix with independent entries, we propose a simple procedure termed textit{biwhitening} that makes it possible to estimate the rank of the underlying data matrix (i.e., the Poisson parameter matrix) without any prior knowledge on its structure. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson distribution, we extend our biwhitening approach to other discrete distributions, such as the generalized Poisson, binomial, multinomial, and negative binomial. We conduct numerical experiments that corroborate our theoretical findings, and demonstrate our approach on real single-cell RNA sequencing (scRNA-seq) data, where we show that our results agree with a slightly overdispersed generalized Poisson model.
In this paper, we study the classical problem of estimating the proportion of a finite population. First, we consider a fixed sample size method and derive an explicit sample size formula which ensures a mixed criterion of absolute and relative errors. Second, we consider an inverse sampling scheme such that the sampling is continue until the number of units having a certain attribute reaches a threshold value or the whole population is examined. We have established a simple method to determine the threshold so that a prescribed relative precision is guaranteed. Finally, we develop a multistage sampling scheme for constructing fixed-width confidence interval for the proportion of a finite population. Powerful computational techniques are introduced to make it possible that the fixed-width confidence interval ensures prescribed level of coverage probability.
In this paper we prove the concavity of the $k$-trace functions, $Amapsto (text{Tr}_k[exp(H+ln A)])^{1/k}$, on the convex cone of all positive definite matrices. $text{Tr}_k[A]$ denotes the $k_{mathrm{th}}$ elementary symmetric polynomial of the eigenvalues of $A$. As an application, we use the concavity of these $k$-trace functions to derive tail bounds and expectation estimates on the sum of the $k$ largest (or smallest) eigenvalues of a sum of random matrices.
We present a simple randomized procedure for the prediction of a binary sequence. The algorithm uses ideas from recent developments of the theory of the prediction of individual sequences. We show that if the sequence is a realization of a stationary and ergodic random process then the average number of mistakes converges, almost surely, to that of the optimum, given by the Bayes predictor. The desirable finite-sample properties of the predictor are illustrated by its performance for Markov processes. In such cases the predictor exhibits near optimal behavior even without knowing the order of the Markov process. Prediction with side information is also considered.