We establish exponential bounds for the hypergeometric distribution which include a finite sampling correction factor but are otherwise analogous to bounds for the binomial distribution due to Leon and Perron (2003) and Talagrand (1994). We also establish a convex ordering for sampling without replacement from populations of real numbers between zero and one: the extreme case is a population consisting entirely of zeros and ones, so that the sample sum follows a hypergeometric distribution and yields the upper bound.
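To illustrate the flavor of such results, the sketch below compares an exact hypergeometric tail probability with a Serfling (1974)-style exponential bound carrying the finite-sampling correction factor $1-(n-1)/N$. This is a hedged stand-in: the constants in the paper's Leon-Perron- and Talagrand-type bounds differ, and the population, sample size, and deviation level are illustrative choices.

```python
# Illustrative only: a Serfling-style bound with finite-sampling correction,
# not the paper's sharper Leon-Perron/Talagrand-type constants.
import numpy as np
from scipy.stats import hypergeom

N, K, n = 1000, 300, 100          # population size, successes, sample size
p = K / N
t = 0.08                          # deviation of the sample mean
k = int(np.ceil(n * (p + t)))     # threshold count

exact_tail = hypergeom.sf(k - 1, N, K, n)        # P(X >= k)
correction = 1.0 - (n - 1) / N                   # finite-sampling factor
serfling_bound = np.exp(-2.0 * n * t**2 / correction)

print(f"exact tail     : {exact_tail:.3e}")
print(f"Serfling bound : {serfling_bound:.3e}")
```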
We study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp PAC-Bayesian risk bounds for aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We then apply these results to derive sparsity oracle inequalities.
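To make the exponentially weighted aggregate concrete, here is a minimal sketch for fixed-design regression. The dictionary of candidate functions, the noise level, the uniform prior, and the temperature $\beta$ (set to $4\sigma^2$, the threshold appearing in standard PAC-Bayesian analyses) are illustrative choices, not the paper's exact setup.

```python
# Minimal exponential-weights aggregation sketch for fixed-design regression.
# Dictionary, noise level, prior, and temperature are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0, 1, n)                            # deterministic design points
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # noisy observations

# A small dictionary of candidate functions evaluated at the design points.
dictionary = np.stack([np.sin(2 * np.pi * x),
                       np.cos(2 * np.pi * x),
                       x,
                       np.ones_like(x)])

beta = 4 * 0.3**2                             # temperature beta >= 4*sigma^2
risks = np.mean((y - dictionary)**2, axis=1)  # empirical squared losses
logw = -n * risks / beta                      # Gibbs weights w.r.t. uniform prior
logw -= logw.max()                            # stabilize before exponentiating
w = np.exp(logw)
w /= w.sum()

aggregate = w @ dictionary                    # exponentially weighted aggregate
print("weights:", np.round(w, 3))
```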
In this paper, we introduce a new three-parameter distribution obtained by combining a re-parametrization of the so-called EGNB2 distribution with the transmuted exponential distribution. This combination modifies the transmuted exponential distribution through an additional parameter, adding substantial flexibility to the mode and affecting the skewness and kurtosis of the tail. We explore some mathematical properties of this distribution, including the hazard rate function, moments, the moment generating function, the quantile function, various entropy measures, and (reversed) residual life functions. We investigate estimation of the parameters by the method of maximum likelihood. The distribution, along with other existing distributions, is fitted to two environmental data sets, and its superior performance is assessed by goodness-of-fit tests. As a result, some environmental measures associated with these data are obtained, such as the return level and the mean deviation about this level.
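The paper's three-parameter density is not reproduced here. As a hedged illustration of the maximum-likelihood step only, the sketch below fits the two-parameter transmuted exponential building block, whose density under the quadratic rank transmutation map is $f(x) = \theta e^{-\theta x}(1 - \lambda + 2\lambda e^{-\theta x})$ with $|\lambda| \le 1$; the simulated data and starting values are placeholders.

```python
# Maximum-likelihood fit of the transmuted exponential distribution,
# f(x) = theta * exp(-theta*x) * (1 - lam + 2*lam*exp(-theta*x)),
# a building block of the paper's model; the data below are simulated.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=500)   # placeholder data set

def negloglik(params):
    theta, lam = params
    s = np.exp(-theta * data)
    pdf = theta * s * (1.0 - lam + 2.0 * lam * s)
    if np.any(pdf <= 0):
        return np.inf
    return -np.sum(np.log(pdf))

res = minimize(negloglik, x0=[1.0, 0.0], method="L-BFGS-B",
               bounds=[(1e-6, None), (-1.0, 1.0)])
theta_hat, lam_hat = res.x
print(f"theta_hat = {theta_hat:.3f}, lambda_hat = {lam_hat:.3f}")
```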
We analyse the reconstruction error of principal component analysis (PCA) and prove non-asymptotic upper bounds for the corresponding excess risk. These bounds unify and improve existing upper bounds from the literature. In particular, they give oracle inequalities under mild eigenvalue conditions. The bounds reveal that the excess risk differs significantly from the usually considered subspace distances based on canonical angles. Our approach relies on the analysis of empirical spectral projectors combined with concentration inequalities for weighted empirical covariance operators and empirical eigenvalues.
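For concreteness, the excess reconstruction risk of the empirical rank-$k$ projector $\hat P_k$ over the oracle $P_k$ is $\operatorname{tr}(P_k\Sigma) - \operatorname{tr}(\hat P_k\Sigma) \ge 0$. The following sketch computes it by simulation; the dimension, sample size, and polynomially decaying spectrum are illustrative, and this is a numerical check of the quantity being bounded, not of the paper's bounds themselves.

```python
# Simulated excess reconstruction risk of PCA: compare the empirical
# rank-k projector with the oracle projector on the true covariance.
# Dimension, eigenvalue decay, and sample size are illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 20, 500, 3
eigvals = 1.0 / np.arange(1, d + 1)**2            # decaying spectrum
Sigma = np.diag(eigvals)

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n
_, U = np.linalg.eigh(Sigma_hat)                  # ascending eigenvalues
U_k = U[:, ::-1][:, :k]                           # top-k empirical eigenvectors
P_hat = U_k @ U_k.T                               # empirical projector

oracle_risk = eigvals[k:].sum()                   # tr(Sigma) minus top-k sum
empirical_risk = np.trace(Sigma) - np.trace(P_hat @ Sigma)
print(f"excess risk R(P_hat_k) - R(P_k) = {empirical_risk - oracle_risk:.4e}")
```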
We consider the problem of finding confidence intervals for the risk of forecasting the future of a stationary, ergodic stochastic process, using a model estimated from the past of the process. We show that a bootstrap procedure provides valid confidence intervals for the risk, when the data source is sufficiently mixing, and the loss function and the estimator are suitably smooth. Autoregressive (AR(d)) models estimated by least squares obey the necessary regularity conditions, even when misspecified, and simulations show that the finite-sample coverage of our bounds quickly converges to the theoretical, asymptotic level. As an intermediate step, we derive sufficient conditions for asymptotic independence between empirical distribution functions formed by splitting a realization of a stochastic process; this result is of independent interest.
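As a hedged illustration of this kind of procedure, the sketch below builds a percentile interval for the one-step-ahead squared-error risk of a least-squares AR(1) fit using a circular block bootstrap. The model order, block length, and number of replicates are illustrative, and the resampling scheme is a simplified stand-in for the paper's bootstrap, not its exact construction.

```python
# Circular block bootstrap CI for the one-step-ahead squared-error risk of a
# least-squares AR(1) fit; a simplified stand-in for the paper's procedure.
import numpy as np

rng = np.random.default_rng(3)

def simulate_ar1(n, phi=0.6, sigma=1.0):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0, sigma)
    return x

def ar1_risk(x):
    """Least-squares AR(1) coefficient and in-sample one-step risk."""
    y, z = x[1:], x[:-1]
    phi_hat = (z @ y) / (z @ z)
    return np.mean((y - phi_hat * z)**2)

x = simulate_ar1(500)
n, block, B = len(x), 25, 500
risks = np.empty(B)
for b in range(B):
    starts = rng.integers(0, n, size=n // block)
    idx = (starts[:, None] + np.arange(block)) % n   # circular blocks
    risks[b] = ar1_risk(x[idx.ravel()])

lo, hi = np.percentile(risks, [2.5, 97.5])
print(f"point estimate: {ar1_risk(x):.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```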
Historically, to bound the mean for small sample sizes, practitioners have had to choose between using methods with unrealistic assumptions about the unknown distribution (e.g., Gaussianity) and methods like Hoeffding's inequality that use weaker assumptions but produce much looser (wider) intervals. Anderson (1969) proposed a mean confidence interval strictly better than or equal to Hoeffding's whose only assumption is that the distribution's support is contained in an interval $[a,b]$. For the first time since then, we present a new family of bounds that compares favorably to Anderson's. We prove that each bound in the family has \emph{guaranteed coverage}, i.e., it holds with probability at least $1-\alpha$ for all distributions on an interval $[a,b]$. Furthermore, one of the bounds is tighter than or equal to Anderson's for all samples. In simulations, we show that for many distributions the gain over Anderson's bound is substantial.
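For intuition, the sketch below contrasts Hoeffding's upper confidence bound with an Anderson (1969)-style bound, built here by writing the mean as $b - \int_a^b F(x)\,dx$ and lower-bounding $F$ with a one-sided DKW band on the empirical CDF. The support $[a,b]$, level $\alpha$, and simulated data are illustrative, and this construction is one standard reading of Anderson's bound, not the new family proposed in the paper.

```python
# Compare Hoeffding's upper confidence bound on the mean with an
# Anderson (1969)-style bound built from a one-sided DKW band on the
# empirical CDF. The support [a, b], alpha, and data are illustrative.
import numpy as np

rng = np.random.default_rng(4)
a, b, alpha = 0.0, 1.0, 0.05
x = rng.beta(2, 5, size=30)                  # sample with support in [a, b]
n = len(x)

# Hoeffding: sample mean + (b - a) * sqrt(log(1/alpha) / (2n)).
hoeffding_ub = x.mean() + (b - a) * np.sqrt(np.log(1 / alpha) / (2 * n))

# Anderson-style: mean = b - integral_a^b F(x) dx; lower-bound F by the
# one-sided DKW band max(F_n - eps, 0) and integrate the step function.
eps = np.sqrt(np.log(1 / alpha) / (2 * n))
xs = np.sort(x)
heights = np.maximum(np.arange(1, n + 1) / n - eps, 0.0)
widths = np.diff(np.append(xs, b))           # step widths up to endpoint b
anderson_ub = b - np.sum(heights * widths)

print(f"sample mean  : {x.mean():.4f}")
print(f"Hoeffding UB : {hoeffding_ub:.4f}")
print(f"Anderson UB  : {anderson_ub:.4f}")
```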