Chebyshev polynomials, moment matching, and optimal estimation of the unseen


Abstract in English

We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, i.e., the minimal sample size to achieve an additive error of $\epsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^2\frac{1}{\epsilon}$, which improves the state-of-the-art result of $\frac{k}{\epsilon^2 \log k}$ in \cite{VV13}. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in $O(n+\log^2 k)$ time and attains the sample complexity within a factor of six asymptotically. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated on a variety of synthetic and real datasets.
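To make the size of the improvement concrete, the two sample-complexity rates quoted in the abstract can be compared numerically. The following is a minimal sketch, not part of the paper: the function names are hypothetical, all universal constants are dropped (the abstract states the rates only up to constant factors), and natural logarithms are assumed.

```python
import math


def rate_chebyshev(k: float, eps: float) -> float:
    """Rate (k / log k) * log^2(1/eps) from the abstract, constants dropped."""
    return k / math.log(k) * math.log(1.0 / eps) ** 2


def rate_vv13(k: float, eps: float) -> float:
    """Earlier rate k / (eps^2 * log k) attributed to VV13, constants dropped."""
    return k / (eps ** 2 * math.log(k))


def improvement_factor(eps: float) -> float:
    """Ratio of the two rates; the k / log k factor cancels, so it depends only on eps."""
    return (1.0 / eps ** 2) / math.log(1.0 / eps) ** 2


if __name__ == "__main__":
    k = 1e6  # illustrative value: minimum non-zero mass is at least 1/k
    for eps in (0.2, 0.1, 0.01):
        print(
            f"eps={eps}: new rate ~ {rate_chebyshev(k, eps):.3g}, "
            f"earlier rate ~ {rate_vv13(k, eps):.3g}, "
            f"ratio ~ {improvement_factor(eps):.3g}"
        )
```

Because the ratio equals $\frac{1}{\epsilon^2} / \log^2\frac{1}{\epsilon}$, the gap between the two rates widens as the target accuracy $\epsilon$ shrinks, independently of $k$. The absolute numbers are only indicative, since the hidden constants are not specified here.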
