Due to concerns about parametric model misspecification, there is interest in using machine learning to adjust for confounding when evaluating the causal effect of an exposure on an outcome. Unfortunately, exposure effect estimators that rely on machine learning predictions are generally subject to so-called plug-in bias, which can render naive p-values and confidence intervals invalid. Progress has been made via proposals like targeted maximum likelihood estimation and more recently double machine learning, which rely on learning the conditional mean of both the outcome and exposure. Valid inference can then be obtained so long as both predictions converge (sufficiently fast) to the truth. Focusing on partially linear regression models, we show that a specific implementation of the machine learning techniques can yield exposure effect estimators that have small bias even when one of the first-stage predictions does not converge to the truth. The resulting tests and confidence intervals are doubly robust. We also show that the proposed estimators may fail to be regular when only one nuisance parameter is consistently estimated; nevertheless, we observe in simulation studies that our proposal leads to reduced bias and improved confidence interval coverage in moderate samples.
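To make the double robustness idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a partially linear model Y = theta*A + g(L) + noise and solves the estimating equation sum_i (A_i - m(L_i)) (Y_i - theta*A_i - g(L_i)) = 0 for theta, where m(L) stands for E[A | L]. The simulated data, the function names, and the use of known (rather than machine-learned) nuisance functions are all assumptions made for illustration.

```python
import random


def dr_plm_estimate(y, a, m_hat, g_hat):
    """Solve sum_i (a_i - m_i) * (y_i - theta * a_i - g_i) = 0 for theta."""
    num = sum((ai - mi) * (yi - gi) for yi, ai, mi, gi in zip(y, a, m_hat, g_hat))
    den = sum((ai - mi) * ai for ai, mi in zip(a, m_hat))
    return num / den


def simulate_plm(n=20000, theta=2.0, seed=0):
    """Hypothetical data: E[A | L] = L^2 and g(L) = L^2 + L (assumed for illustration)."""
    rng = random.Random(seed)
    l = [rng.gauss(0.0, 1.0) for _ in range(n)]
    a = [li ** 2 + rng.gauss(0.0, 1.0) for li in l]
    y = [theta * ai + li ** 2 + li + rng.gauss(0.0, 1.0) for ai, li in zip(a, l)]
    return y, a, l


def compare_nuisance_choices():
    """Estimate theta with correct and deliberately wrong nuisance functions."""
    y, a, l = simulate_plm()
    m_true = [li ** 2 for li in l]       # correct exposure mean
    g_true = [li ** 2 + li for li in l]  # correct outcome nuisance
    zeros = [0.0] * len(l)               # a deliberately misspecified nuisance
    return {
        "both_correct": dr_plm_estimate(y, a, m_true, g_true),
        "exposure_model_wrong": dr_plm_estimate(y, a, zeros, g_true),
        "outcome_model_wrong": dr_plm_estimate(y, a, m_true, zeros),
        "both_wrong": dr_plm_estimate(y, a, zeros, zeros),
    }


if __name__ == "__main__":
    for name, value in compare_nuisance_choices().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting the estimate stays close to the true theta = 2 when either nuisance function is correct, and drifts away only when both are wrong. The sketch says nothing about the regularity issues or the machine-learning rates discussed in the abstract.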
While model selection is a well-studied topic in parametric and nonparametric regression or density estimation, selection of possibly high-dimensional nuisance parameters in semiparametric problems is far less developed. In this paper, we propose a selective machine learning framework for making inferences about a finite-dimensional functional defined on a semiparametric model, when the latter admits a doubly robust estimating function and several candidate machine learning algorithms are available for estimating the nuisance parameters. We introduce two new selection criteria for bias reduction in estimating the functional of interest, each based on a novel definition of pseudo-risk for the functional that embodies the double robustness property and thus is used to select the pair of learners that is nearest to fulfilling this property. We establish an oracle property for a multi-fold cross-validation version of the new selection criteria which states that our empirical criteria perform nearly as well as an oracle with a priori knowledge of the pseudo-risk for each pair of candidate learners. We also describe a smooth approximation to the selection criteria which allows for valid post-selection inference. Finally, we apply the approach to model selection of a semiparametric estimator of average treatment effect given an ensemble of candidate machine learners to account for confounding in an observational study.
Frequentist inference has a well-established supporting theory for doubly robust causal inference based on the potential outcomes framework, which is realized via outcome regression (OR) and propensity score (PS) models. The Bayesian counterpart, however, is not obvious as the PS model loses its balancing property in joint modeling. In this paper, we propose a natural and formal Bayesian solution by bridging loss-type Bayesian inference with a utility function derived from the notion of a pseudo-population via the change of measure. Consistency of the posterior distribution is shown with correctly specified and misspecified OR models. Simulation studies suggest that our proposed method can estimate the true causal effect more efficiently and achieve the frequentist coverage if either the OR model is correctly specified or fit with a flexible function of the confounders, compared to the previous Bayesian approach via the Bayesian bootstrap. Finally, we apply this novel Bayesian method to assess the impact of speed cameras on the reduction of car collisions in England.
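For orientation, the frequentist doubly robust estimator that combines OR and PS models is often written in augmented inverse probability weighted (AIPW) form. The sketch below illustrates that standard frequentist estimator and is not the Bayesian method proposed in the abstract. The data-generating process, the function names, and the use of known nuisance functions instead of fitted models are assumptions made for illustration.

```python
import math
import random


def aipw_ate(y, a, e_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the average treatment effect E[Y(1) - Y(0)]."""
    total = 0.0
    for yi, ai, ei, m1, m0 in zip(y, a, e_hat, mu1_hat, mu0_hat):
        total += (m1 - m0
                  + ai * (yi - m1) / ei
                  - (1 - ai) * (yi - m0) / (1.0 - ei))
    return total / len(y)


def simulate_observational(n=20000, seed=0):
    """Hypothetical data: logistic PS in L, Y = A + 2L + noise, so the true ATE is 1."""
    rng = random.Random(seed)
    l = [rng.gauss(0.0, 1.0) for _ in range(n)]
    e = [1.0 / (1.0 + math.exp(-li)) for li in l]
    a = [1 if rng.random() < ei else 0 for ei in e]
    y = [ai + 2.0 * li + rng.gauss(0.0, 1.0) for ai, li in zip(a, l)]
    return y, a, l, e


def compare_model_choices():
    """Evaluate AIPW with correct and deliberately wrong OR and PS inputs."""
    y, a, l, e_true = simulate_observational()
    n = len(y)
    mu1_true = [1.0 + 2.0 * li for li in l]  # correct OR under treatment
    mu0_true = [2.0 * li for li in l]        # correct OR under control
    e_wrong = [0.5] * n                      # misspecified PS (ignores L)
    mu_wrong = [0.0] * n                     # misspecified OR (ignores L and A)
    return {
        "or_correct_ps_wrong": aipw_ate(y, a, e_wrong, mu1_true, mu0_true),
        "ps_correct_or_wrong": aipw_ate(y, a, e_true, mu_wrong, mu_wrong),
        "both_wrong": aipw_ate(y, a, e_wrong, mu_wrong, mu_wrong),
    }


if __name__ == "__main__":
    for name, value in compare_model_choices().items():
        print(f"{name}: {value:.3f}")
```

With either the OR or the PS input correct, the estimate should land near the true effect of 1 in this toy example. With both misspecified, the confounding by L is left unadjusted and the estimate is visibly biased.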
A large number of statistical models are doubly-intractable: the likelihood normalising term, which is a function of the model parameters, is intractable, as well as the marginal likelihood (model evidence). This means that standard inference techniques to sample from the posterior, such as Markov chain Monte Carlo (MCMC), cannot be used. Examples include, but are not confined to, massive Gaussian Markov random fields, autologistic models and Exponential random graph models. A number of approximate schemes based on MCMC techniques, Approximate Bayesian computation (ABC) or analytic approximations to the posterior have been suggested, and these are reviewed here. Exact MCMC schemes, which can be applied to a subset of doubly-intractable distributions, have also been developed and are described in this paper. As yet, no general method exists which can be applied to all classes of models with doubly-intractable posteriors. In addition, taking inspiration from the Physics literature, we study an alternative method based on representing the intractable likelihood as an infinite series. Unbiased estimates of the likelihood can then be obtained by finite time stochastic truncation of the series via Russian Roulette sampling, although the estimates are not necessarily positive. Results from the Quantum Chromodynamics literature are exploited to allow the use of possibly negative estimates in a pseudo-marginal MCMC scheme such that expectations with respect to the posterior distribution are preserved. The methodology is reviewed on well-known examples such as the parameters in Ising models, the posterior for Fisher-Bingham distributions on the $d$-Sphere and a large-scale Gaussian Markov Random Field model describing the Ozone Column data. This leads to a critical assessment of the strengths and weaknesses of the methodology with pointers to ongoing research.
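The Russian Roulette truncation mentioned above can be illustrated on a toy series. This is a minimal sketch assuming a constant continuation probability q and a simple geometric series, not the scheme used in the paper. Each term k is reweighted by 1/q^k, the inverse of the probability that the sampler reaches it, so the randomly truncated sum is unbiased for the infinite sum. The function names and the choice of series are hypothetical.

```python
import random


def russian_roulette(term, q, rng):
    """Return one unbiased estimate of sum_{k>=0} term(k).

    After each term the sum continues with probability q and stops otherwise.
    Term k is reached with probability q**k, so it is weighted by 1 / q**k.
    """
    total = 0.0
    weight = 1.0  # inverse probability of reaching the current term
    k = 0
    while True:
        total += weight * term(k)
        if rng.random() >= q:  # stop with probability 1 - q
            return total
        weight /= q
        k += 1


def mean_estimate(n_draws=100000, r=0.5, q=0.7, seed=1):
    """Average many roulette estimates of the geometric series sum_k r**k = 1 / (1 - r)."""
    rng = random.Random(seed)
    draws = [russian_roulette(lambda k: r ** k, q, rng) for _ in range(n_draws)]
    return sum(draws) / n_draws


if __name__ == "__main__":
    print(f"roulette mean: {mean_estimate():.3f}, exact: {1.0 / (1.0 - 0.5):.3f}")
```

Every individual estimate uses only finitely many terms, yet the average over many draws should approach the exact value 2. The toy series has positive terms, so its estimates are positive. A series with terms of mixed sign can produce negative estimates, which is the situation the sign-correction results in the abstract address.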
This paper investigates the problem of making inference about a parametric model for the regression of an outcome variable $Y$ on covariates $(V,L)$ when data are fused from two separate sources, one of which contains information only on $(V, Y)$ while the other contains information only on covariates. This data fusion setting may be viewed as an extreme form of missing data in which the probability of observing complete data $(V,L,Y)$ on any given subject is zero. We have developed a large class of semiparametric estimators of the regression coefficients in fused data, which includes doubly robust (DR) estimators. The proposed method is DR in that it is consistent and asymptotically normal if, in addition to the model of interest, we correctly specify a model for either the data source process under an ignorability assumption, or the distribution of unobserved covariates. We evaluate the performance of our various estimators via an extensive simulation study, and apply the proposed methods to investigate the relationship between net asset value and total expenditure among U.S. households in 1998, while controlling for potential confounders including income and other demographic variables.
This paper derives time-uniform confidence sequences (CS) for causal effects in experimental and observational settings. A confidence sequence for a target parameter $\psi$ is a sequence of confidence intervals $(C_t)_{t=1}^\infty$ such that every one of these intervals simultaneously captures $\psi$ with high probability. Such CSs provide valid statistical inference for $\psi$ at arbitrary stopping times, unlike classical fixed-time confidence intervals which require the sample size to be fixed in advance. Existing methods for constructing CSs focus on the nonasymptotic regime where certain assumptions (such as known bounds on the random variables) are imposed, while doubly robust estimators of causal effects rely on (asymptotic) semiparametric theory. We use sequenti