ترغب بنشر مسار تعليمي؟ اضغط هنا

Biwhitening Reveals the Rank of a Count Matrix

80   0   0.0 ( 0 )
 نشر من قبل Boris Landa
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Estimating the rank of a corrupted data matrix is an important task in data science, most notably for choosing the number of components in principal component analysis. Significant progress on this task has been made using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, such as Poisson or binomial, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, focusing on a Poisson random matrix with independent entries, we propose a simple procedure termed textit{biwhitening} that makes it possible to estimate the rank of the underlying data matrix (i.e., the Poisson parameter matrix) without any prior knowledge on its structure. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson distribution, we extend our biwhitening approach to other discrete distributions, such as the generalized Poisson, binomial, multinomial, and negative binomial. We conduct numerical experiments that corroborate our theoretical findings, and demonstrate our approach on real single-cell RNA sequencing (scRNA-seq) data, where we show that our results agree with a slightly overdispersed generalized Poisson model.



قيم البحث

اقرأ أيضاً

In many applications it is useful to replace the Moore-Penrose pseudoinverse (MPP) by a different generalized inverse with more favorable properties. We may want, for example, to have many zero entries, but without giving up too much of the stability of the MPP. One way to quantify stability is by how much the Frobenius norm of a generalized inverse exceeds that of the MPP. In this paper we derive finite-size concentration bounds for the Frobenius norm of $ell^p$-minimal general inverses of iid Gaussian matrices, with $1 leq p leq 2$. For $p = 1$ we prove exponential concentration of the Frobenius norm of the sparse pseudoinverse; for $p = 2$, we get a similar concentration bound for the MPP. Our proof is based on the convex Gaussian min-max theorem, but unlike previous applications which give asymptotic results, we derive finite-size bounds.
217 - Yi-Kai Liu 2011
We study the problem of reconstructing an unknown matrix M of rank r and dimension d using O(rd poly log d) Pauli measurements. This has applications in quantum state tomography, and is a non-commutative analogue of a well-known problem in compressed sensing: recovering a sparse vector from a few of its Fourier coefficients. We show that almost all sets of O(rd log^6 d) Pauli measurements satisfy the rank-r restricted isometry property (RIP). This implies that M can be recovered from a fixed (universal) set of Pauli measurements, using nuclear-norm minimization (e.g., the matrix Lasso), with nearly-optimal bounds on the error. A similar result holds for any class of measurements that use an orthonormal operator basis whose elements have small operator norm. Our proof uses Dudleys inequality for Gaussian processes, together with bounds on covering numbers obtained via entropy duality.
We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the compositi on of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =omega(k/log k)$, it is possible to consistently estimate in total variation the emph{profile} of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $Theta(1/log k)$. Our estimator is based on Wolfowitzs minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques.
70 - De Huang 2018
In this paper we prove the concavity of the $k$-trace functions, $Amapsto (text{Tr}_k[exp(H+ln A)])^{1/k}$, on the convex cone of all positive definite matrices. $text{Tr}_k[A]$ denotes the $k_{mathrm{th}}$ elementary symmetric polynomial of the eige nvalues of $A$. As an application, we use the concavity of these $k$-trace functions to derive tail bounds and expectation estimates on the sum of the $k$ largest (or smallest) eigenvalues of a sum of random matrices.
212 - Jean Lafond 2014
The task of reconstructing a matrix given a sample of observedentries is known as the matrix completion problem. It arises ina wide range of problems, including recommender systems, collaborativefiltering, dimensionality reduction, image processing, quantum physics or multi-class classificationto name a few. Most works have focused on recovering an unknown real-valued low-rankmatrix from randomly sub-sampling its entries.Here, we investigate the case where the observations take a finite number of values, corresponding for examples to ratings in recommender systems or labels in multi-class classification.We also consider a general sampling scheme (not necessarily uniform) over the matrix entries.The performance of a nuclear-norm penalized estimator is analyzed theoretically.More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions.In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tacklepotentially high dimensional settings.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا