No Arabic abstract
This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized by a known, observed dissimilarity measure over spatial indices. Observations are partitioned into clusters with the use of an unsupervised clustering algorithm applied to the dissimilarity measure. Once the partition into clusters is learned, a cluster-based inference procedure is applied to a statistical hypothesis testing procedure. The procedure proposed in the paper allows the number of clusters to depend on the data, which gives researchers a principled method for choosing an appropriate clustering level. The paper gives conditions under which the proposed procedure asymptotically attains correct size. A simulation study shows that the proposed procedure attains near nominal size in finite samples in a variety of statistical testing problems with dependent data.
For high-dimensional inference problems, statisticians have a number of competing interests. On the one hand, procedures should provide accurate estimation, reliable structure learning, and valid uncertainty quantification. On the other hand, procedures should be computationally efficient and able to scale to very high dimensions. In this note, I show that a very simple data-dependent measure can achieve all of these desirable properties simultaneously, along with some robustness to the error distribution, in sparse sequence models.
In this paper, we survey some recent results on statistical inference (parametric and nonparametric statistical estimation, hypotheses testing) about the spectrum of stationary models with tapered data, as well as, a question concerning robustness of inferences, carried out on a linear stationary process contaminated by a small trend. We also discuss some question concerning tapered Toeplitz matrices and operators, central limit theorems for tapered Toeplitz type quadratic functionals, and tapered Fejer-type kernels and singular integrals. These are the main tools for obtaining the corresponding results, and also are of interest in themselves. The processes considered will be discrete-time and continuous-time Gaussian, linear or Levy-driven linear processes with memory.
This paper considers distributed statistical inference for general symmetric statistics %that encompasses the U-statistics and the M-estimators in the context of massive data where the data can be stored at multiple platforms in different locations. In order to facilitate effective computation and to avoid expensive communication among different platforms, we formulate distributed statistics which can be conducted over smaller data blocks. The statistical properties of the distributed statistics are investigated in terms of the mean square error of estimation and asymptotic distributions with respect to the number of data blocks. In addition, we propose two distributed bootstrap algorithms which are computationally effective and are able to capture the underlying distribution of the distributed statistics. Numerical simulation and real data applications of the proposed approaches are provided to demonstrate the empirical performance.
We consider X 1 ,. .. , X n a sample of data on the circle S 1 , whose distribution is a twocomponent mixture. Denoting R and Q two rotations on S 1 , the density of the X i s is assumed to be g(x) = pf (R --1 x) + (1 -- p)f (Q --1 x), where p $in$ (0, 1) and f is an unknown density on the circle. In this paper we estimate both the parametric part $theta$ = (p, R, Q) and the nonparametric part f. The specific problems of identifiability on the circle are studied. A consistent estimator of $theta$ is introduced and its asymptotic normality is proved. We propose a Fourier-based estimator of f with a penalized criterion to choose the resolution level. We show that our adaptive estimator is optimal from the oracle and minimax points of view when the density belongs to a Sobolev ball. Our method is illustrated by numerical simulations.
This paper considers the maximum generalized empirical likelihood (GEL) estimation and inference on parameters identified by high dimensional moment restrictions with weakly dependent data when the dimensions of the moment restrictions and the parameters diverge along with the sample size. The consistency with rates and the asymptotic normality of the GEL estimator are obtained by properly restricting the growth rates of the dimensions of the parameters and the moment restrictions, as well as the degree of data dependence. It is shown that even in the high dimensional time series setting, the GEL ratio can still behave like a chi-square random variable asymptotically. A consistent test for the over-identification is proposed. A penalized GEL method is also provided for estimation under sparsity setting.