Causal discovery algorithms estimate causal graphs from observational data. This can provide a valuable complement to analyses focussing on the causal relation between individual treatment-outcome pairs. Constraint-based causal discovery algorithms rely on conditional independence testing when building the graph. Until recently, these algorithms have been unable to handle missing values. In this paper, we investigate two alternative solutions: test-wise deletion and multiple imputation. We establish necessary and sufficient conditions for the recoverability of causal structures under test-wise deletion, and argue that multiple imputation is more challenging in the context of causal discovery than for estimation. We conduct an extensive comparison by simulating from benchmark causal graphs. As one might expect, we find that test-wise deletion and multiple imputation both clearly outperform list-wise deletion and single imputation. Crucially, our results further suggest that multiple imputation is especially useful in settings with a small number of either Gaussian or discrete variables, but when the dataset contains a mix of both, neither method is uniformly best. The methods we compare include random forest imputation and a hybrid procedure combining test-wise deletion and multiple imputation. An application to data from the IDEFICS cohort study on diet- and lifestyle-related diseases in European children serves as an illustrative example.
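The difference between test-wise and list-wise deletion can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes Gaussian data and a Fisher z partial-correlation test, and the function names (`fisher_z_testwise`, `listwise_complete_rows`, `simulate_chain`) and the simulated chain X -> Y -> Z with an unrelated, partly missing variable W are hypothetical choices made for this example.

```python
import math
import numpy as np

def fisher_z_testwise(data, i, j, cond=()):
    """Fisher z test of X_i independent of X_j given X_cond, with test-wise deletion.

    Only rows with a missing value in the variables involved in THIS test
    are dropped. Assumes more than len(cond) + 3 usable rows remain.
    """
    cols = [i, j, *cond]
    sub = data[:, cols]
    sub = sub[~np.isnan(sub).any(axis=1)]  # test-wise deletion
    n_used = sub.shape[0]
    # Partial correlation of the first two columns from the inverse correlation matrix
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # guard against log of zero
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))
    stat = math.sqrt(n_used - len(cond) - 3) * abs(z)
    p_value = math.erfc(stat / math.sqrt(2.0))  # two-sided standard normal p-value
    return p_value, n_used

def listwise_complete_rows(data):
    """List-wise deletion: keep only rows with no missing value in ANY variable."""
    return data[~np.isnan(data).any(axis=1)]

def simulate_chain(n=2000, seed=0):
    """Hypothetical example data: chain X -> Y -> Z plus an unrelated W with missing values."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(size=n)
    z = y + rng.normal(size=n)
    w = rng.normal(size=n)
    w[rng.random(n) < 0.5] = np.nan  # roughly half of W is missing completely at random
    return np.column_stack([x, y, z, w])
```

In this setup, a test of X and Z given Y via `fisher_z_testwise(data, 0, 2, cond=(1,))` uses all rows, because W is not part of the test, whereas list-wise deletion would discard every row in which W is missing before any test is run.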
In this guide, we present how to perform constraint-based causal discovery using three popular software packages: pcalg (with add-ons tpc and micd), bnlearn, and TETRAD. We focus on how these packages can be used with observational data and in the presence of mixed data (i.e., data where some variables are continuous, while others are categorical), a known time ordering between variables, and missing data. Throughout, we point out the relative strengths and limitations of each package, as well as give practical recommendations. We hope this guide helps anyone who is interested in performing constraint-based causal discovery on their data.
In a comprehensive cohort study of two competing treatments (say, A and B), clinically eligible individuals are first asked to enroll in a randomized trial and, if they refuse, are then asked to enroll in a parallel observational study in which they can choose treatment according to their own preference. We consider estimation of two estimands: (1) comprehensive cohort causal effect -- the difference in mean potential outcomes had all patients in the comprehensive cohort received treatment A vs. treatment B and (2) randomized trial causal effect -- the difference in mean potential outcomes had all patients enrolled in the randomized trial received treatment A vs. treatment B. For each estimand, we consider inference under various sets of unconfoundedness assumptions and construct semiparametric efficient and robust estimators. These estimators depend on nuisance functions, which we estimate, for illustrative purposes, using generalized additive models. Using the theory of sample splitting, we establish the asymptotic properties of our proposed estimators. We also illustrate our methodology using data from the Bypass Angioplasty Revascularization Investigation (BARI) randomized trial and observational registry to evaluate the effect of percutaneous transluminal coronary balloon angioplasty versus coronary artery bypass grafting on 5-year mortality. To evaluate the finite sample performance of our estimators, we use the BARI dataset as the basis of a realistic simulation study.
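In symbols, the two estimands can be written as below. The notation is an assumption introduced here for illustration and is not taken from the paper: $Y(a)$ denotes the potential outcome under treatment $a \in \{A, B\}$, $R = 1$ indicates enrollment in the randomized trial, and expectations are taken over the comprehensive cohort population.

```latex
\begin{align*}
\Delta_{\mathrm{CC}} &= E[Y(A)] - E[Y(B)], \\
\Delta_{\mathrm{RT}} &= E[Y(A) \mid R = 1] - E[Y(B) \mid R = 1].
\end{align*}
```

Here $\Delta_{\mathrm{CC}}$ is the comprehensive cohort causal effect and $\Delta_{\mathrm{RT}}$ is the randomized trial causal effect. The two coincide when the treatment effect does not differ between those who agree to randomization and those who refuse.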
Modern RNA sequencing technologies provide gene expression measurements from single cells that promise refined insights on regulatory relationships among genes. Directed graphical models are well-suited to explore such (cause-effect) relationships. However, statistical analyses of single cell data are complicated by the fact that the data often show zero-inflated expression patterns. To address this challenge, we propose directed graphical models that are based on Hurdle conditional distributions parametrized in terms of polynomials in parent variables and their 0/1 indicators of being zero or nonzero. While directed graphs for Gaussian models are only identifiable up to an equivalence class in general, we show that, under a natural and weak assumption, the exact directed acyclic graph of our zero-inflated models can be identified. We propose methods for graph recovery, apply our model to real single-cell RNA-seq data on T helper cells, and show simulated experiments that validate the identifiability and graph estimation methods in practice.
The problem of missing values in multivariable time series is a key challenge in many applications such as clinical data mining. Although many imputation methods have proven effective in various applications, few of them are designed to accommodate clinical multivariable time series. In this work, we propose a multiple imputation model that captures both cross-sectional information and temporal correlations. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to handle the variance of predictive confidence of Gaussian process models. The proposed model is compared with several state-of-the-art imputation algorithms on both real-world and synthetic datasets. Experiments show that our best model can provide more accurate imputation than the benchmarks on all of our datasets.
Due to complex experimental settings, missing values are common in biomedical data. To handle this issue, many methods have been proposed, from ignoring incomplete instances to various data imputation approaches. With the recent rise of deep neural networks, the field of missing data imputation has shifted towards modelling the data distribution. This paper presents an approach based on Monte Carlo dropout within (Variational) Autoencoders, which not only adapts very well to the distribution of the data but also allows generation of new data, adapted to each specific instance. The evaluation shows that the imputation error and predictive similarity can be improved with the proposed approach.