Coarse structural nested mean models are used to estimate treatment effects from longitudinal observational data. These models lead to a large class of estimators, and estimates and standard errors may differ considerably within this class. We prove that, under additional assumptions, there exists an explicit solution for the optimal estimator within the class. Moreover, we show that even if the additional assumptions do not hold, this optimal estimator is doubly robust: it is consistent and asymptotically normal not only if the model for treatment initiation is correct, but also if a certain outcome-regression model is correct. We compare the optimal estimator with some naive choices within the class in a simulation study. Furthermore, we apply the optimal and naive estimators to study how the CD4 count increase due to one year of antiretroviral treatment (ART) depends on the time between HIV infection and ART initiation in recently infected patients. In both the simulation study and the application, the optimal estimators yield substantial gains in precision.
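The double robustness described in this abstract can be illustrated in a much simpler point-treatment setting. The sketch below is not the coarse structural nested mean model estimator itself, but a standard augmented inverse-probability-weighted (AIPW) estimator of E[Y(1)]; the data-generating process and all names are invented for illustration. It shows the estimator staying consistent when either the outcome model or the treatment model is misspecified, provided the other is correct.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
W = rng.uniform(size=n)
g = 0.2 + 0.6 * W                       # true propensity P(A=1|W)
A = rng.binomial(1, g)
Y = 2 * W + A + rng.normal(0, 0.5, n)   # true E[Y(1)] = E[2W + 1] = 2

def aipw(Y, A, Q1, g_hat):
    """Augmented IPW estimate of E[Y(1)] given an outcome model Q1
    and a propensity model g_hat (either may be misspecified)."""
    return np.mean(Q1 + A / g_hat * (Y - Q1))

Q1_true = 2 * W + 1
# Wrong outcome model (identically zero), correct propensity:
est_wrong_Q = aipw(Y, A, np.zeros(n), g)
# Correct outcome model, wrong propensity (constant 0.5):
est_wrong_g = aipw(Y, A, Q1_true, np.full(n, 0.5))
```

Both estimates are close to the truth 2, even though each run misspecifies one of the two nuisance models.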
The cross-classified sampling design consists of drawing samples from a two-dimensional population, independently in each dimension. Such a design is commonly used in consumer price index surveys and was recently applied to draw a sample of babies in the French ELFE survey, by crossing a sample of maternity units and a sample of days. We develop a general theory of estimation for this sampling design. We consider the Horvitz-Thompson estimator of a total and show that the cross-classified design will usually result in a loss of efficiency compared with the widespread two-stage design. We obtain the asymptotic distribution of the Horvitz-Thompson estimator and several unbiased variance estimators. To address the problem of possibly negative values, we propose simplified non-negative variance estimators and study their bias under a superpopulation model. The proposed estimators are compared for totals and ratios on simulated data. An application to real data from the ELFE survey is also presented, and we make some recommendations. Supplementary materials are available online.
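A toy sketch of the Horvitz-Thompson estimator under cross-classified sampling: because the two samples are drawn independently, the inclusion probability of a cell is the product of the row and column inclusion probabilities. The population sizes, sample sizes, and the use of simple random sampling without replacement in each dimension are illustrative assumptions; the paper's setting is more general.

```python
import numpy as np

rng = np.random.default_rng(1)
N_r, N_c = 20, 15            # toy population: 20 maternity units x 15 days
n_r, n_c = 8, 6              # independent SRSWOR samples in each dimension
y = rng.uniform(size=(N_r, N_c))   # study variable for each (unit, day) cell
total = y.sum()

pi_r, pi_c = n_r / N_r, n_c / N_c  # first-order inclusion probabilities

def ht_cross_classified(y, rows, cols):
    # Under independence of the two samples, pi_{ij} = pi_r * pi_c,
    # so the Horvitz-Thompson estimator is the expanded sampled-cell sum.
    return y[np.ix_(rows, cols)].sum() / (pi_r * pi_c)

# Monte Carlo check of design-unbiasedness:
ests = []
for _ in range(5000):
    rows = rng.choice(N_r, n_r, replace=False)
    cols = rng.choice(N_c, n_c, replace=False)
    ests.append(ht_cross_classified(y, rows, cols))
```

The replication average of the estimates is close to the true total, consistent with the unbiasedness of the Horvitz-Thompson estimator under this design.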
This paper focuses on the time series generated by the event counts of stationary Hawkes processes. When the exact locations of points are not observed, but only counts over time intervals of fixed size, existing estimation methods are not applicable. We first establish a strong mixing condition with polynomial decay rate for Hawkes processes, derived from their Poisson cluster structure. This allows us to propose a spectral approach to the estimation of Hawkes processes, based on Whittle's method, which provides consistent and asymptotically normal estimates under common regularity conditions on their reproduction kernels. Simulated datasets and a case study illustrate the performance of the estimation, notably of the Hawkes reproduction mean and kernel when time intervals are relatively large.
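The Poisson cluster (branching) structure mentioned above gives a simple way to simulate a Hawkes process and aggregate it to the interval counts that form the observed series. A minimal sketch, assuming an exponential reproduction kernel phi(t) = alpha * beta * exp(-beta * t) with parameter values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, alpha, beta, T = 1.0, 0.5, 2.0, 2000.0  # immigrant rate, branching ratio, decay

# Branching simulation: immigrants arrive as a Poisson(mu) process on [0, T];
# each point spawns Poisson(alpha) offspring at Exp(beta) delays, recursively.
immigrants = rng.uniform(0, T, rng.poisson(mu * T))
points, queue = [], list(immigrants)
while queue:
    t = queue.pop()
    points.append(t)
    for d in rng.exponential(1 / beta, rng.poisson(alpha)):
        if t + d < T:
            queue.append(t + d)
points = np.sort(points)

# Aggregate to counts over intervals of fixed size delta (the observed data).
delta = 1.0
counts, _ = np.histogram(points, bins=np.arange(0, T + delta, delta))
# Stationary mean rate is mu / (1 - alpha), so E[count per bin] ~ 2 here.
```

Estimation would then proceed from `counts` alone, e.g. by matching its periodogram to the model spectral density via Whittle's method.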
We propose a Bayesian approach, called the posterior spectral embedding, for estimating the latent positions in random dot product graphs, and prove its optimality. Unlike the classical spectral-based adjacency/Laplacian spectral embedding, the posterior spectral embedding is a fully likelihood-based graph estimation method that takes advantage of the Bernoulli likelihood information in the observed adjacency matrix. We develop a minimax lower bound for estimating the latent positions, and show that the posterior spectral embedding achieves this bound: it both results in a minimax-optimal posterior contraction rate and yields a point estimator attaining the minimax risk asymptotically. The convergence results are subsequently applied to clustering in stochastic block models, strengthening an existing result concerning the number of mis-clustered vertices. We also study a spectral-based Gaussian spectral embedding as a natural Bayesian analogue of the adjacency spectral embedding, but the resulting posterior contraction rate is sub-optimal by an extra logarithmic factor. The practical performance of the proposed methodology is illustrated through extensive synthetic examples and the analysis of a Wikipedia graph dataset.
This paper studies the generalization of the targeted minimum loss-based estimation (TMLE) framework to the estimation of effects of time-varying interventions in settings where interventions, covariates, and outcomes can occur at subject-specific time points on an arbitrarily fine time scale. TMLE is a general template for constructing asymptotically linear substitution estimators for smooth low-dimensional parameters in infinite-dimensional models. Existing longitudinal TMLE methods were developed for data observed on a discrete time grid. We consider a continuous-time counting process model where intensity measures track the monitoring of subjects, and focus on a low-dimensional target parameter defined as the intervention-specific mean outcome at the end of follow-up. To construct our TMLE algorithm for the given statistical estimation problem, we derive an expression for the efficient influence curve and represent the target parameter as a functional of intensities and conditional expectations. The high-dimensional nuisance parameters of our model are estimated and updated iteratively, with separate targeting steps for the intensities and conditional expectations involved. The resulting estimator solves the efficient influence curve equation. We state a general efficiency theorem and describe a highly adaptive lasso estimator for the nuisance parameters that allows us to establish asymptotic linearity and efficiency of our estimator under minimal conditions on the underlying statistical model.
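The targeting idea behind TMLE can be illustrated in a drastically simplified point-treatment setting, far from the continuous-time construction of this paper: an initial (here deliberately crude) outcome regression is fluctuated along a one-parameter logistic submodel using the "clever covariate" A/g(W), so that the updated fit solves the efficient influence curve equation and the plug-in estimate inherits its good properties. All model choices below are illustrative assumptions.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
n = 100_000
W = rng.uniform(size=n)
g = 0.25 + 0.5 * W                        # true treatment probability, known here
A = rng.binomial(1, g)
Y = rng.binomial(1, expit(-1 + W + A))    # binary outcome

# Target: psi = E[Y(1)] = E[expit(W)]. Initial outcome regression is a
# deliberately crude constant; targeting must do all the work.
Q0 = np.full(n, 0.5)
H = A / g                                 # clever covariate for E[Y(1)]

# Targeting step: logistic fluctuation with offset logit(Q0) and covariate H,
# solved for the fluctuation parameter eps by Newton iterations.
offset = np.log(Q0 / (1 - Q0))
eps = 0.0
for _ in range(50):
    p = expit(offset + eps * H)
    eps += np.sum(H * (Y - p)) / np.sum(H**2 * p * (1 - p))

# Updated outcome regression evaluated at A = 1, then plug in (substitution).
Q1_star = expit(offset + eps / g)
psi = Q1_star.mean()
```

By construction the fitted fluctuation zeroes the score sum H * (Y - Q*), which is exactly the efficient-influence-curve equation for this toy problem; with a correct g, the plug-in `psi` is consistent despite the crude initial fit.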
We consider high-dimensional measurement errors with high-frequency data. Our focus is on optimally recovering the covariance matrix of the random errors. In this problem, not all components of the random vector are observed at the same time, and the measurement errors are latent variables, leading to major challenges beyond high data dimensionality. We propose a new covariance matrix estimator in this context with appropriate localization and thresholding. By developing a new technical device integrating the high-frequency data feature with the conventional notion of $\alpha$-mixing, our analysis successfully accommodates the challenging serial dependence in the measurement errors. Our theoretical analysis establishes the minimax optimal convergence rates associated with two commonly used loss functions, and we identify cases in which the proposed localized estimator with thresholding achieves these rates. Because the variances and covariances can be small in practice, we conduct a second-order theoretical analysis that further disentangles the dominating bias in the estimator, and we propose a bias-corrected estimator to ensure good finite-sample performance. We illustrate the promising empirical performance of the proposed estimator with extensive simulation studies and a real data analysis.