Accurate stochastic simulations of hourly precipitation are needed for impact studies at local spatial scales. Statistically, hourly precipitation data pose a difficult challenge. They are non-negative, skewed, and heavy-tailed, contain many zeros (dry hours), and exhibit complex temporal structures (e.g., long persistence of dry episodes). Inspired by frailty-contagion approaches used in finance and insurance, we propose a multi-site precipitation simulator that, given appropriate regional atmospheric variables, can simultaneously handle dry events and heavy rainfall periods. One advantage of our model is the conceptual simplicity of its dynamical structure. In particular, the temporal variability is represented by a common factor based on a few classical atmospheric covariates such as temperature and pressure. Our inference approach is tested on simulated data and applied to measurements made in the northern part of French Brittany.
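To make the common-factor idea concrete, here is a minimal sketch of a two-part (occurrence/intensity) multi-site simulator in which a single covariate-driven factor modulates both the probability of a wet hour and the rainfall amounts. The logistic and gamma links, the covariate set, and all parameter values are illustrative assumptions, not the authors' exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

T, S = 1000, 3                        # hours, sites
X = rng.standard_normal((T, 2))       # standardized covariates (e.g., temperature, pressure)
beta = np.array([0.8, -0.5])          # assumed covariate loadings
Z = X @ beta                          # common factor shared by all sites

alpha = np.array([-1.0, -0.8, -1.2])  # site-specific dry/wet offsets
p_wet = 1.0 / (1.0 + np.exp(-(alpha + Z[:, None])))  # occurrence probabilities
wet = rng.random((T, S)) < p_wet

# Skewed, heavy-tailed amounts on wet hours; the factor also scales intensity
amounts = rng.gamma(shape=0.7, scale=np.exp(0.3 * Z)[:, None], size=(T, S))
rain = np.where(wet, amounts, 0.0)
```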
We focus on the problem of generalizing a causal effect estimated on a randomized controlled trial (RCT) to a target population described by a set of covariates from observational data. Available methods, such as inverse propensity weighting, are not designed to handle missing values, which are nevertheless common in both data sources. Beyond coupling the assumptions for causal-effect identifiability with those on the missing-values mechanism, and defining appropriate estimation strategies, one difficulty is the specific structure of the data: two sources, with treatment and outcome available only in the RCT. We propose and compare three multiple imputation strategies (separate imputation, joint imputation with fixed effect, joint imputation without source information), as well as a technique based on estimators that handle missing values directly, without imputing them. These methods are assessed in an extensive simulation study, which shows the empirical superiority of fixed-effect multiple imputation followed by any complete-data generalization estimator. This work is motivated by the analysis of a large registry of over 20,000 major trauma patients and an RCT studying the effect of tranexamic acid administration on mortality. The analysis illustrates how the handling of missing values can affect the conclusions about the effect generalized from the RCT to the target population.
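For reference, the complete-data baseline that these strategies feed into can be sketched as follows: an inverse propensity of sampling weighting (IPSW) estimator that reweights RCT units by their odds of belonging to the target sample. The function and variable names are ours, and the logistic sampling model is an illustrative choice, not the authors' exact estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipsw_ate(X_rct, A, Y, X_obs):
    """Complete-data IPSW sketch: reweight RCT units by their odds of
    belonging to the target (observational) sample, then take a weighted
    difference of treated and control outcome means (Hajek-style)."""
    X = np.vstack([X_rct, X_obs])
    s = np.r_[np.ones(len(X_rct)), np.zeros(len(X_obs))]   # 1 = RCT member
    pi = LogisticRegression().fit(X, s).predict_proba(X_rct)[:, 1]
    w = (1.0 - pi) / pi                 # odds of target-sample membership
    treated = np.sum(w * A * Y) / np.sum(w * A)
    control = np.sum(w * (1 - A) * Y) / np.sum(w * (1 - A))
    return treated - control
```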
We introduce a class of semiparametric time series models by assuming a quasi-likelihood approach driven by a latent factor process. More specifically, given the latent process, we specify only the conditional mean and variance of the time series and obtain a quasi-likelihood function for estimating the parameters related to the mean. The proposed methodology has three remarkable features: (i) no parametric form is assumed for the conditional distribution of the time series given the latent process; (ii) it can model non-negative, count, bounded/binary, and real-valued time series; (iii) the dispersion parameter is not assumed to be known. Further, we obtain explicit expressions for the marginal moments and the autocorrelation function of the time series process, so that a method of moments can be employed to estimate the dispersion parameter and the parameters related to the latent process. Simulation results assessing the proposed estimation procedure are presented. Real-data analyses of unemployment rate and precipitation time series illustrate the practical potential of our methodology.
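A toy instance of the class helps fix ideas: a count series whose conditional mean follows a latent log-AR(1) factor, with the marginal variance decomposed by the law of total variance exactly as a method of moments would exploit. The Poisson draw below is only a data-generating convenience; the model itself specifies no conditional distribution, and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent AR(1) factor driving the conditional mean of the observed series
T, phi, sigma = 5000, 0.7, 0.3
z = np.zeros(T)
for t in range(1, T):
    z[t] = phi * z[t - 1] + sigma * rng.standard_normal()
mu = np.exp(1.0 + z)                  # conditional mean given the latent process
y = rng.poisson(mu)                   # any distribution with this mean would do

# Moment relations a method of moments can exploit:
print(y.mean(), mu.mean())            # both estimate E[Y]
print(y.var(), mu.mean() + mu.var())  # law of total variance: Var(Y) = E[mu] + Var(mu)
```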
Count data are collected in many scientific and engineering tasks, including image processing, single-cell RNA sequencing, and ecological studies. Such data sets often contain missing values, for example because some ecological sites cannot be reached in a certain year. In addition, in many instances, side information is also available, for example covariates about ecological sites or species. Low-rank methods are popular for denoising and imputing count data, and benefit from a substantial theoretical background. Extensions accounting for covariates have been proposed, but to the best of our knowledge their theoretical and empirical properties have not been thoroughly studied, and little software is available for practitioners. We propose a complete methodology called LORI (Low-Rank Interaction), including a Poisson model, an algorithm, and automatic selection of the regularization parameter, to analyze count tables with covariates. We also derive an upper bound on the estimation error. We provide a simulation study with synthetic data, revealing empirically that LORI improves on state-of-the-art methods in terms of estimation and imputation of missing values. We illustrate how the method can be interpreted through visual displays with the analysis of a well-known plant abundance data set, and show that the LORI outputs are consistent with known results. Finally, we demonstrate the relevance of the methodology by analyzing a waterbird abundance table from the French national agency for wildlife and hunting management (ONCFS). The method is available in the R package lori on the Comprehensive R Archive Network (CRAN).
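As a rough illustration of the kind of algorithm involved, the sketch below fits a nuclear-norm-penalized Poisson log-linear model by proximal gradient with singular value thresholding, dropping missing entries from the loss. It omits the covariate main effects and the automatic regularization-parameter selection that LORI provides, so it should be read as a schematic rather than the package's algorithm.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def poisson_lowrank(Y, lam, step=1e-2, iters=500):
    """Nuclear-norm-penalized Poisson model, log E[Y_ij] = Theta_ij,
    fitted by proximal gradient. NaN entries of Y are treated as missing
    and excluded from the loss; imputations are exp(Theta) on those cells."""
    mask = ~np.isnan(Y)
    Yf = np.nan_to_num(Y)
    Theta = np.zeros_like(Yf)
    for _ in range(iters):
        grad = mask * (np.exp(Theta) - Yf)   # Poisson negative log-lik gradient
        Theta = svt(Theta - step * grad, step * lam)
    return Theta
```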
Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation (MI). Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of MI may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing MI, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it to existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible.
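The core of the compatible-imputation idea can be sketched as a rejection sampler: draw a candidate covariate from the imputation (proposal) model, then accept it with probability proportional to the substantive-model density of the observed outcome given that candidate. The Gaussian substantive model with a squared term below, and all names, are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def compatible_draw(y, propose_x, sigma, beta0, beta1, beta2):
    """One rejection-sampling imputation draw in the spirit of
    substantive-model-compatible FCS. Assumed substantive model:
    y = beta0 + beta1*x + beta2*x**2 + N(0, sigma^2). propose_x() draws a
    candidate from the imputation model; acceptance probability is the
    outcome density at the candidate, scaled by its maximum."""
    while True:
        x = propose_x()
        mu = beta0 + beta1 * x + beta2 * x ** 2
        if rng.random() < np.exp(-0.5 * ((y - mu) / sigma) ** 2):
            return x

# e.g., impute a missing normal covariate under an assumed quadratic model
x_imp = compatible_draw(y=2.5, propose_x=rng.standard_normal,
                        sigma=1.0, beta0=0.0, beta1=1.0, beta2=0.5)
```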
The penalized Cox proportional hazards model is a popular analytical approach for survival data with a large number of covariates. Such problems are especially challenging when covariates vary over follow-up time (i.e., the covariates are time-dependent). The standard R packages for fully penalized Cox models cannot currently incorporate time-dependent covariates. To address this gap, we implement a variant of the gradient descent algorithm (proximal gradient descent) for fitting penalized Cox models. We apply our implementation to real and simulated data sets.
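A minimal sketch of the approach, under our own assumptions about the interface: start-stop (counting-process) rows encode the time-dependent covariates, the gradient of the negative log partial likelihood (Breslow handling of ties) is accumulated over event times, and an L1 penalty is applied through its proximal operator (soft-thresholding). Real implementations vectorize the risk-set computation and tune the step size.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: proximal operator of the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cox_prox_grad(start, stop, event, X, lam, step=1e-3, iters=2000):
    """Lasso-penalized Cox model with time-dependent covariates given as
    counting-process (start, stop] rows, fitted by proximal gradient
    descent. O(n^2) risk-set loops; a sketch, not a tuned implementation."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        w = np.exp(X @ beta)                        # relative hazards
        grad = np.zeros(p)
        for t in np.unique(stop[event == 1]):
            at_risk = (start < t) & (t <= stop)     # intervals covering t
            dead = at_risk & (stop == t) & (event == 1)
            xbar = (w[at_risk, None] * X[at_risk]).sum(0) / w[at_risk].sum()
            grad += dead.sum() * xbar - X[dead].sum(0)
        beta = soft(beta - step * grad / n, step * lam)
    return beta
```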