We consider a sample $X_1, \ldots, X_n$ of data on the circle $S^1$ whose distribution is a two-component mixture. Denoting by $R$ and $Q$ two rotations on $S^1$, the density of the $X_i$ is assumed to be $g(x) = p f(R^{-1} x) + (1 - p) f(Q^{-1} x)$, where $p \in (0, 1)$ and $f$ is an unknown density on the circle. In this paper we estimate both the parametric part $\theta = (p, R, Q)$ and the nonparametric part $f$. The specific problems of identifiability on the circle are studied. A consistent estimator of $\theta$ is introduced and its asymptotic normality is proved. We propose a Fourier-based estimator of $f$ with a penalized criterion to choose the resolution level. We show that our adaptive estimator is optimal from the oracle and minimax points of view when the density belongs to a Sobolev ball. Our method is illustrated by numerical simulations.
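To make the mixture model above concrete, the following minimal sketch evaluates the density g when points on the circle are identified with angles in [0, 2π) and the rotations R and Q with angles alpha and beta, so that applying the inverse rotation amounts to subtracting the angle. The wrapped Cauchy density used for f, the parameter values, and all function names are illustrative assumptions; in the abstract f is unknown and no particular family is specified.

```python
import math

def wrapped_cauchy(x, rho=0.5):
    """Assumed example for f: wrapped Cauchy density centred at 0 (0 <= rho < 1)."""
    return (1 - rho ** 2) / (2 * math.pi * (1 + rho ** 2 - 2 * rho * math.cos(x)))

def mixture_density(x, p, alpha, beta, f=wrapped_cauchy):
    """g(x) = p f(R^{-1} x) + (1 - p) f(Q^{-1} x), with R, Q rotations by alpha, beta."""
    return p * f(x - alpha) + (1 - p) * f(x - beta)

def integrate_over_circle(g, m=2000):
    """Riemann sum of a 2*pi-periodic function over one period."""
    h = 2 * math.pi / m
    return h * sum(g(k * h) for k in range(m))

# Hypothetical parameter values, chosen only for illustration.
total_mass = integrate_over_circle(lambda x: mixture_density(x, 0.3, 0.5, 2.0))
```

Because f is 2π-periodic, no explicit reduction modulo 2π is needed after subtracting the rotation angle, and the mixture integrates to one whenever f does.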
We study semiparametric efficiency bounds and efficient estimation of parameters defined through general moment restrictions with missing data. Identification relies on auxiliary data containing information about the distribution of the missing variables conditional on proxy variables that are observed in both the primary and the auxiliary database, when such distribution is common to the two data sets. The auxiliary sample can be independent of the primary sample, or can be a subset of it. For both cases, we derive bounds when the probability of missing data given the proxy variables is unknown, or known, or belongs to a correctly specified parametric family. We find that the conditional probability is not ancillary when the two samples are independent. For all cases, we discuss efficient semiparametric estimators. An estimator based on a conditional expectation projection is shown to require milder regularity conditions than one based on inverse probability weighting.
This paper considers distributed statistical inference for general symmetric statistics in the context of massive data where the data can be stored at multiple platforms in different locations. In order to facilitate effective computation and to avoid expensive communication among different platforms, we formulate distributed statistics which can be conducted over smaller data blocks. The statistical properties of the distributed statistics are investigated in terms of the mean square error of estimation and asymptotic distributions with respect to the number of data blocks. In addition, we propose two distributed bootstrap algorithms which are computationally effective and are able to capture the underlying distribution of the distributed statistics. Numerical simulations and real data applications of the proposed approaches are provided to demonstrate the empirical performance.
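The idea of computing a statistic over smaller data blocks can be sketched as follows. This is only an illustration under stated assumptions: the blocks are formed by a simple round-robin split, the block-level results are combined by a size-weighted average, and the sample variance (a U-statistic) stands in for the general symmetric statistic. The function names are hypothetical and the aggregation rule is not taken from the abstract.

```python
import statistics

def split_into_blocks(data, k):
    """Round-robin split of the data into k blocks (an assumed partition scheme)."""
    return [data[i::k] for i in range(k)]

def distributed_statistic(data, k, statistic=statistics.variance):
    """Size-weighted average of a symmetric statistic computed block by block."""
    blocks = split_into_blocks(data, k)
    if any(len(b) < 2 for b in blocks):
        raise ValueError("each block needs at least two observations")
    n = len(data)
    # Each block could be processed on a separate machine; only one number
    # per block has to be communicated for the final aggregation.
    return sum(len(b) / n * statistic(b) for b in blocks)

# Hypothetical data, chosen only for illustration.
data = [float(i % 17) for i in range(1000)]
full_sample = statistics.variance(data)
distributed = distributed_statistic(data, k=10)
```

With a single block the distributed version coincides with the full-sample statistic; with more blocks it trades a small loss of accuracy for lower computation and communication costs.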
This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized by a known, observed dissimilarity measure over spatial indices. Observations are partitioned into clusters with the use of an unsupervised clustering algorithm applied to the dissimilarity measure. Once the partition into clusters is learned, a cluster-based inference procedure is applied to a statistical hypothesis testing procedure. The procedure proposed in the paper allows the number of clusters to depend on the data, which gives researchers a principled method for choosing an appropriate clustering level. The paper gives conditions under which the proposed procedure asymptotically attains correct size. A simulation study shows that the proposed procedure attains near nominal size in finite samples in a variety of statistical testing problems with dependent data.
There exist a number of tests for assessing the nonparametric heteroscedastic location-scale assumption. Here we consider a goodness-of-fit test for the more general hypothesis of the validity of this model under a parametric functional transformation on the response variable. Specifically we consider testing for independence between the regressors and the errors in a model where the transformed response is just a location/scale shift of the error. Our criteria use the familiar factorization property of the joint characteristic function of the covariates and the errors under independence. The difficulty is that the errors are unobserved and hence one needs to employ properly estimated residuals in their place. We study the limit distribution of the test statistics under the null hypothesis as well as under alternatives, and also suggest a resampling procedure in order to approximate the critical values of the tests. This resampling is subsequently employed in a series of Monte Carlo experiments that illustrate the finite-sample properties of the new test. We also investigate the performance of related test statistics for normality and symmetry of errors, and apply our methods to real data sets.
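The factorization property mentioned above says that, under independence, the joint characteristic function of a covariate and the error equals the product of the two marginal characteristic functions. A minimal sketch of a statistic built on this property is given below. It assumes a scalar covariate, residuals that are already available, and a plain average over a finite grid in place of a weighted integral; the function names and these choices are illustrative assumptions, not the criteria of the paper.

```python
import cmath

def ecf_joint(x, e, t1, t2):
    """Empirical joint characteristic function of the pairs (x_i, e_i) at (t1, t2)."""
    n = len(x)
    return sum(cmath.exp(1j * (t1 * xi + t2 * ei)) for xi, ei in zip(x, e)) / n

def independence_statistic(x, e, grid):
    """n times the grid average of |joint ECF - product of marginal ECFs|^2."""
    n = len(x)
    total = 0.0
    for t1 in grid:
        for t2 in grid:
            joint = ecf_joint(x, e, t1, t2)
            # Marginal ECFs are the joint ECF with the other argument set to 0.
            product = ecf_joint(x, e, t1, 0.0) * ecf_joint(x, e, 0.0, t2)
            total += abs(joint - product) ** 2
    return n * total / len(grid) ** 2

# Hypothetical inputs, chosen only for illustration.
grid = [-1.0, -0.5, 0.5, 1.0]
stat_constant = independence_statistic([0.1, 0.5, 0.9], [1.0, 1.0, 1.0], grid)
stat_dependent = independence_statistic([0.0, 1.0], [0.0, 1.0], grid)
```

Large values of the statistic point away from independence. In practice the critical values would have to be approximated, for instance by a resampling scheme such as the one the abstract refers to.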
In this paper, we survey some recent results on statistical inference (parametric and nonparametric statistical estimation, hypothesis testing) about the spectrum of stationary models with tapered data, as well as a question concerning the robustness of inferences carried out on a linear stationary process contaminated by a small trend. We also discuss some questions concerning tapered Toeplitz matrices and operators, central limit theorems for tapered Toeplitz-type quadratic functionals, and tapered Fejér-type kernels and singular integrals. These are the main tools for obtaining the corresponding results and are also of interest in themselves. The processes considered will be discrete-time and continuous-time Gaussian, linear, or Lévy-driven linear processes with memory.