Inverse probability of treatment weighting (IPTW) is a popular propensity score (PS)-based approach to estimating causal effects in observational studies at risk of confounding bias. A major issue when estimating the PS is the presence of partially observed covariates. Multiple imputation (MI) is a natural approach to handling missing data on covariates, but its use in the PS context raises three important questions: (i) should we apply Rubin's rules to the IPTW treatment effect estimates or to the PS estimates themselves? (ii) does the outcome have to be included in the imputation model? (iii) how should we estimate the variance of the IPTW estimator after MI? We performed a simulation study focusing on the effect of a binary treatment on a binary outcome with three confounders (two of them partially observed). We used MI with chained equations to create complete datasets and compared three ways of combining the results: combining the treatment effect estimates (MIte); combining the PSs across the imputed datasets (MIps); or combining the PS parameters and estimating the PS from the average covariates across the imputed datasets (MIpar). We also compared the performance of these methods to complete case (CC) analysis and the missingness pattern (MP) approach, which uses a different PS model for each pattern of missingness, and studied empirically the consistency of the three MI estimators. Under a missing at random (MAR) mechanism, CC and MP analyses were biased in most cases when estimating the marginal treatment effect, whereas the MI approaches performed well in reducing bias as long as the outcome was included in the imputation model. However, only MIte was unbiased in all the studied scenarios, and Rubin's rules provided good variance estimates for MIte.
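The MIte strategy described above (estimate the IPTW effect in each completed dataset, then pool with Rubin's rules) can be sketched in a few lines of pure Python. Everything here is a toy stand-in: hot-deck imputation and a stratum-frequency PS replace the chained-equations MI and modelled PS used in the study, and the hot-deck step deliberately omits the outcome, which the study's results warn against.

```python
import random
from statistics import mean, variance

random.seed(0)

# Toy data: binary confounder X (partially missing), treatment T, outcome Y.
n = 2000
data = []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0
    t = 1 if random.random() < 0.3 + 0.4 * x else 0            # X affects treatment
    y = 1 if random.random() < 0.2 + 0.2 * t + 0.3 * x else 0  # true risk difference 0.2
    x_obs = x if random.random() < 0.7 else None               # ~30% of X missing
    data.append((x_obs, t, y))

def iptw_risk_difference(rows):
    """IPTW estimate with the PS taken as the treated proportion per X stratum."""
    ps = {x: mean(t for xx, t, _ in rows if xx == x) for x in (0, 1)}
    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}          # per arm: weighted y, total weight
    for x, t, y in rows:
        w = 1 / ps[x] if t == 1 else 1 / (1 - ps[x])
        sums[t][0] += w * y
        sums[t][1] += w
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]

# MIte: impute M times, estimate the IPTW effect per completed dataset, pool.
observed_x = [x for x, _, _ in data if x is not None]
M = 20
estimates = []
for _ in range(M):
    completed = [(x if x is not None else random.choice(observed_x), t, y)
                 for x, t, y in data]
    estimates.append(iptw_risk_difference(completed))

pooled = mean(estimates)           # Rubin's rules: point estimate
between_var = variance(estimates)  # between-imputation variance component B
print(round(pooled, 3))
```

Because the hot-deck draws ignore T and Y, the pooled estimate drifts from the true 0.2 toward the crude (confounded) effect, illustrating why the imputation model should include the outcome.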
Propensity score matching (PSM) has been widely used to mitigate confounding in observational studies, although complications arise when the covariates used to estimate the PS are only partially observed. Multiple imputation (MI) is a potential solution for handling missing covariates in the estimation of the PS. Unfortunately, it is not clear how best to apply MI strategies in the context of PSM. We conducted a simulation study to compare the performance of popular non-MI missing data methods and various MI-based strategies under different missing data mechanisms (MDMs). We found that commonly applied missing data methods resulted in biased and inefficient estimates, and we observed large variation in performance across the MI-based strategies. Based on our findings, we recommend 1) deriving the PS after applying MI (referred to as MI-derPassive); 2) conducting PSM within each imputed dataset and then averaging the treatment effects to arrive at one summarized finding (INT-within) for mild MDMs, and averaging the PSs across the multiply imputed datasets before obtaining one treatment effect using PSM (INT-across) for more complex MDMs; 3) a bootstrap-based variance estimator to account for the uncertainty of PS estimation, matching, and imputation; and 4) inclusion of key auxiliary variables in the imputation model.
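The INT-within and INT-across strategies can be contrasted in a small pure-Python sketch. The ingredients are deliberately crude stand-ins for the methods compared in the study: hot-deck imputation instead of model-based MI, and a decile-binned PS instead of a fitted logistic model.

```python
import random
import math
from statistics import mean

random.seed(1)

# Toy data: continuous confounder x (partially missing), treatment t, outcome y.
n = 1000
rows = []
for _ in range(n):
    x = random.random()
    t = 1 if random.random() < 1 / (1 + math.exp(-(2 * x - 1))) else 0
    y = 1.0 * t + x + random.gauss(0, 0.5)       # true treatment effect = 1.0
    rows.append((x if random.random() < 0.7 else None, t, y))

ts = [t for _, t, _ in rows]
ys = [y for _, _, y in rows]

def fit_ps(xs):
    """Crude PS: treated proportion within deciles of x."""
    order = sorted(range(n), key=lambda i: xs[i])
    ps = [0.0] * n
    for b in range(10):
        idx = order[b * n // 10:(b + 1) * n // 10]
        p = mean(ts[i] for i in idx)
        for i in idx:
            ps[i] = p
    return ps

def psm_effect(ps):
    """1:1 nearest-neighbour matching on the PS, with replacement."""
    controls = [(ps[i], ys[i]) for i in range(n) if ts[i] == 0]
    diffs = [ys[i] - min(controls, key=lambda c: abs(c[0] - ps[i]))[1]
             for i in range(n) if ts[i] == 1]
    return mean(diffs)

observed = [x for x, _, _ in rows if x is not None]
M = 10
within_effects, ps_sum = [], [0.0] * n
for _ in range(M):
    xs = [x if x is not None else random.choice(observed) for x, _, _ in rows]
    ps = fit_ps(xs)
    within_effects.append(psm_effect(ps))
    ps_sum = [a + b for a, b in zip(ps_sum, ps)]

int_within = mean(within_effects)                 # match per dataset, pool effects
int_across = psm_effect([s / M for s in ps_sum])  # pool PSs first, match once
print(round(int_within, 2), round(int_across, 2))
```

The two strategies differ only in where the averaging happens: INT-within pools M matched-analysis effects, while INT-across pools the M propensity scores per unit and runs a single matched analysis.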
Selective inference (post-selection inference) is a methodology that has attracted much attention in recent years in the fields of statistics and machine learning. Naive inference based on data that were also used for model selection tends to overestimate effects, so selective inference conditions on the event that the model was selected. In this paper, we develop selective inference in propensity score analysis with a semiparametric approach, which has become a standard tool in causal inference. Specifically, for the most basic causal inference model, in which the causal effect can be written as a linear sum of confounding variables, we conduct Lasso-type variable selection by adding an $\ell_1$ penalty term to the loss function that yields a semiparametric estimator. Confidence intervals are then given for the coefficients of the selected confounding variables, conditional on the event of variable selection, with asymptotic guarantees. An important property of this method is that it does not require modeling of nonparametric regression functions for the outcome variables, as is usually the case in semiparametric propensity score analysis.
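The $\ell_1$-penalised selection step can be illustrated with a small proximal-gradient (ISTA) solver. This is only the variable-selection half of the story, applied to a toy parametric logistic PS model rather than the paper's semiparametric loss; the selective confidence intervals, which condition on the selection event, are not implemented here.

```python
import math
import random

random.seed(2)

# Toy PS model: 5 candidate confounders, only the first two drive treatment.
n, p = 600, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
true_beta = [1.5, -1.0, 0.0, 0.0, 0.0]
T = [1 if random.random() < 1 / (1 + math.exp(-sum(b * v for b, v in zip(true_beta, row))))
     else 0 for row in X]

def lasso_logistic(X, y, lam, steps=500, lr=0.2):
    """Proximal gradient (ISTA) for l1-penalised logistic regression."""
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for row, yi in zip(X, y):
            mu = 1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, row))))
            for j in range(p):
                grad[j] += (mu - yi) * row[j] / n
        for j in range(p):
            z = beta[j] - lr * grad[j]
            beta[j] = math.copysign(max(abs(z) - lr * lam, 0.0), z)  # soft-threshold
    return beta

beta = lasso_logistic(X, T, lam=0.08)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(selected, [round(b, 2) for b in beta])
```

The soft-thresholding step sets weak coefficients exactly to zero, which is what produces a discrete selection event to condition on; naive intervals computed on the same data for the surviving coefficients would be anti-conservative.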
Field-normalization of citations is a bibliometric standard. Despite the observed differences in citation counts between fields, the question remains how strongly fields influence citation rates beyond the effect of attributes or factors possibly influencing citations (FICs). In this study, we considered several FICs, such as the number of pages and the number of co-authors, and asked whether there is a separate field effect besides these other effects. To answer this question, we applied inverse probability of treatment weighting (IPW). Using Web of Science data (a sample of 308,231 articles), we investigated whether mean differences in citation rates among subject categories remain even when the subject categories are made comparable in the field-related attributes (e.g., comparable numbers of co-authors and pages) by IPW. In a diagnostic step of our statistical analyses, we included the propensity scores as covariates in regression analyses to examine whether the differences between the fields in FICs vanish. The results revealed that the differences did not completely vanish but were strongly reduced. We obtained similar results when we calculated mean differences between the fields after IPW, representing the causal or unconfounded field effects on citations. However, field differences in citation rates remain. The results indicate that field-normalization is a prerequisite for citation analysis and cannot be replaced by adjusting for any set of FICs.
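The IPW logic used here treats field membership like a "treatment" and reweights papers so that fields become comparable on the FICs. A minimal sketch with two hypothetical fields and a single FIC (co-author count), nothing like the real Web of Science analysis in scale or model:

```python
import random
from statistics import mean

random.seed(3)

# Toy data: two fields; field B papers tend to have more co-authors (a FIC),
# citations increase with co-authors, and the direct field effect is +2.
papers = []
for _ in range(4000):
    field = random.choice("AB")
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    coauthors = max(d1, d2) if field == "B" else d1          # B skews higher
    citations = random.expovariate(1 / (2 + coauthors)) + (2 if field == "B" else 0)
    papers.append((field, coauthors, citations))

# PS: probability of being in field B given the FIC (stratum proportions).
ps = {c: mean(1 if f == "B" else 0 for f, cc, _ in papers if cc == c)
      for c in range(1, 7)}

# IPW: weight each paper by the inverse probability of its field given the FIC.
wsum = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
for f, c, cit in papers:
    w = 1 / ps[c] if f == "B" else 1 / (1 - ps[c])
    wsum[f][0] += w * cit
    wsum[f][1] += w

crude = (mean(c for f, _, c in papers if f == "B")
         - mean(c for f, _, c in papers if f == "A"))
adjusted = wsum["B"][0] / wsum["B"][1] - wsum["A"][0] / wsum["A"][1]
print(round(crude, 2), round(adjusted, 2))
```

After weighting, the crude citation gap shrinks toward the direct field effect, mirroring the paper's finding that FIC adjustment reduces but does not eliminate field differences.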
Researchers often impute continuous variables under an assumption of normality, yet many incomplete variables are skewed. We find that imputing skewed continuous variables under a normal model can lead to bias; the bias is usually mild for popular estimands such as means, standard deviations, and linear regression coefficients, but it can be severe for more shape-dependent estimands such as percentiles or the coefficient of skewness. We test several methods for adapting a normal imputation model to accommodate skewness, including methods that transform, truncate, or censor (round) normally imputed values, as well as methods that impute values from a quadratic or truncated regression. None of these modifications reliably reduces the biases of the normal model, and some can make the biases much worse. We conclude that, if one has to impute a skewed variable under a normal model, it is usually safest to do so without modifications -- unless one is more interested in estimating percentiles and shape than in estimating means, variances, and regressions. In the conclusion, we briefly discuss promising developments in the area of continuous imputation models that do not assume normality.
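The mean-versus-percentile contrast is easy to reproduce. A minimal sketch, using single (not multiple) imputation of a lognormal variable under a normal model: the mean of the imputed data is nearly unbiased, but the 5th percentile is badly biased, here even negative, a value a lognormal variable cannot take.

```python
import random
from statistics import mean, stdev

random.seed(4)

# A skewed (lognormal) variable with half its values missing completely at random.
full = [random.lognormvariate(0, 1) for _ in range(20000)]
observed = [v for v in full if random.random() < 0.5]

# Normal-model imputation: draw missing values from N(mean, sd) of the observed.
m, s = mean(observed), stdev(observed)
imputed = observed + [random.gauss(m, s) for _ in range(len(full) - len(observed))]

def pct(values, q):
    """Empirical q-quantile."""
    v = sorted(values)
    return v[int(q * len(v))]

print("mean:    ", round(mean(full), 2), "vs", round(mean(imputed), 2))
print("5th pct: ", round(pct(full, 0.05), 2), "vs", round(pct(imputed, 0.05), 2))
```

Because the normal model matches the first two moments of the observed data but not its shape, moment-based estimands survive while tail and shape estimands do not.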
Understanding how treatment effects vary with individual characteristics is critical in the contexts of personalized medicine, personalized advertising, and policy design. When the characteristics of practical interest are only a subset of the full covariates, non-parametric estimation is often desirable, but few methods are available because of the computational difficulty. Existing non-parametric methods, such as inverse probability weighting, have limitations that hinder their use in many practical settings where the propensity scores are close to 0 or 1. We propose propensity score regression (PSR), which allows non-parametric estimation of heterogeneous treatment effects in a wide range of settings. PSR consists of two non-parametric regressions in turn: it first regresses the outcome on the propensity scores together with the characteristics of interest to obtain an intermediate estimate, and then regresses the intermediate estimates on the characteristics of interest only. By including the propensity scores as regressors in a non-parametric manner, PSR substantially eases the computational difficulty while remaining (locally) insensitive to any value of the propensity scores. We present several appealing properties of PSR, including consistency and asymptotic normality, and in particular the existence of an explicit variance estimator, from which the analytical behaviour of PSR and its precision can be assessed. Simulation studies indicate that PSR outperforms existing methods in various settings with extreme values of the propensity scores. We apply our method to the National 2009 H1N1 Flu Survey (NHFS) data to investigate the effects of seasonal influenza vaccination and having paid sick leave across different age groups.
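The two-stage structure can be sketched with Nadaraya-Watson kernel regressions. This toy version treats the PS as known and uses fixed bandwidths, so it illustrates only the regress-on-(PS, z)-then-on-z idea, not the paper's estimator or its variance formula:

```python
import math
import random
from statistics import mean

random.seed(5)

def sigmoid(u):
    return 1 / (1 + math.exp(-u))

# z: characteristic of interest; x2: extra confounder absorbed into the PS.
# For illustration the PS is treated as known; in practice it is estimated.
n = 2000
Z, PS, T, Y = [], [], [], []
for _ in range(n):
    z, x2 = random.uniform(-1, 1), random.gauss(0, 1)
    e = sigmoid(z + x2)
    t = 1 if random.random() < e else 0
    y = (1 + z) * t + z + x2 + random.gauss(0, 0.5)   # true effect tau(z) = 1 + z
    Z.append(z); PS.append(e); T.append(t); Y.append(y)

def kreg(pairs, ys, p0, z0, hp=0.1, hz=0.2):
    """Nadaraya-Watson regression of y on (ps, z), evaluated at (p0, z0)."""
    ws = [math.exp(-((p - p0) / hp) ** 2 - ((z - z0) / hz) ** 2) for p, z in pairs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

arm = {a: ([(PS[j], Z[j]) for j in range(n) if T[j] == a],
           [Y[j] for j in range(n) if T[j] == a]) for a in (0, 1)}

# Stage 1: regress Y on (PS, z) in each arm; difference = intermediate estimate.
idx = random.sample(range(n), 300)      # subsample to keep the demo fast
inter = [kreg(*arm[1], PS[i], Z[i]) - kreg(*arm[0], PS[i], Z[i]) for i in idx]

# Stage 2: regress the intermediate estimates on z only.
def tau_hat(z0, h=0.2):
    ws = [math.exp(-((Z[i] - z0) / h) ** 2) for i in idx]
    return sum(w * d for w, d in zip(ws, inter)) / sum(ws)

print({z0: round(tau_hat(z0), 2) for z0 in (-0.5, 0.0, 0.5)})
```

No inverse weights appear anywhere, which is why extreme propensity scores do not blow up the estimator: the PS enters only as a regressor in stage 1.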