
Combining matching and linear regression: Introducing a mathematical framework and software for simulations, diagnostics and calibration

Published by: Alireza Mahani
Publication date: 2015
Research field: Mathematical Statistics
Paper language: English


Combining matching and regression for causal inference provides double robustness in removing treatment-effect estimation bias due to confounding variables. In most real-world applications, however, treatment and control populations are not large enough for matching to achieve perfect or near-perfect balance on all confounding variables and their nonlinear/interaction functions, leading to trade-offs. Furthermore, in small samples variance contributes as much to total error as bias does, and must therefore be factored into methodological decisions. In this paper, we develop a mathematical framework for quantifying the combined impact of matching and linear regression on the bias and variance of treatment-effect estimation. The framework includes expressions for bias and variance in a misspecified linear regression, theorems on the impact of matching on bias and variance, and a constrained bias-estimation approach for quantifying misspecification bias and combining it with variance to arrive at total error. Methodological decisions can thus be based on minimizing this total error, given the practitioner's assumption/belief about an intuitive parameter, which we call 'omitted R-squared'. The proposed methodology excludes the outcome variable from the analysis, thereby avoiding overfit creep and making it suitable for observational study designs. All core functions for bias and variance calculation, as well as diagnostic tools for bias-variance trade-off analysis, matching calibration, and power analysis, are made available to researchers and practitioners through an open-source R library, MatchLinReg.
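The misspecification bias the abstract refers to can be illustrated with the classic omitted-variable setup. The following sketch (illustrative only; not the paper's MatchLinReg code, and all variable names are hypothetical) simulates a confounder U that drives both treatment T and outcome y, and shows that regressing y on T alone inflates the estimated treatment effect by gamma * Cov(T, U) / Var(T), while adjusting for U removes the bias.

```python
# Omitted-variable bias in a misspecified linear regression (toy sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
tau, gamma = 2.0, 3.0             # true treatment effect, confounder effect

U = rng.normal(size=n)            # confounder
T = 0.8 * U + rng.normal(size=n)  # treatment depends on the confounder
y = tau * T + gamma * U + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

tau_misspecified = ols(T[:, None], y)[1]           # U omitted -> biased
tau_adjusted = ols(np.column_stack([T, U]), y)[1]  # U included -> unbiased

# Omitted-variable bias formula: gamma * Cov(T, U) / Var(T)
analytic_bias = gamma * np.cov(T, U)[0, 1] / np.var(T)
print(tau_misspecified, tau_adjusted, tau + analytic_bias)
```

In the paper's framing, matching reduces the Cov(T, U) term by balancing the confounder across treatment and control before regression, at the cost of discarding observations and hence increasing variance.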


Read also

We introduce a new approach to a linear-circular regression problem that relates multiple linear predictors to a circular response. We follow a modeling approach based on a wrapped normal distribution that describes angular variables and angular distributions, and advance it for linear-circular regression analysis. Some previous works model a circular variable as the projection of a bivariate Gaussian random vector onto the unit circle, and the statistical inference of the resulting model involves complicated sampling steps. The proposed model treats circular responses as the result of the modulo operation on unobserved linear responses. The resulting model is a mixture of multiple linear-linear regression models. We present two EM algorithms for maximum likelihood estimation of the mixture model, one for a parametric model and another for a non-parametric model. The estimation algorithms provide a good trade-off between computation and estimation accuracy, which was numerically shown using five numerical examples. The proposed approach was applied to the problem of estimating wind directions, which typically exhibit complex patterns with large variation and circularity.
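The modulo view of circular responses can be sketched in a few lines. This toy example (illustrative only; the setup and names are assumptions, not the paper's algorithm) generates y = (b*x + e) mod 2π and evaluates the mixture likelihood that sums a normal density over candidate wrapping counts k, recovering the slope by a simple grid search rather than the paper's EM algorithms.

```python
# Linear-circular regression as a mixture over unobserved wrapping counts.
import numpy as np

rng = np.random.default_rng(1)
n, b_true, sigma = 2000, 1.3, 0.3
x = rng.uniform(0.0, 5.0, size=n)
y = (b_true * x + rng.normal(0.0, sigma, size=n)) % (2 * np.pi)

def wrapped_loglik(b, x, y, sigma, K=3):
    # Sum the normal density over unwrappings y + 2*pi*k, k = -K..K,
    # centered at the linear predictor b*x.
    ks = 2 * np.pi * np.arange(-K, K + 1)
    resid = y[:, None] + ks[None, :] - (b * x)[:, None]
    dens = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

grid = np.linspace(0.5, 2.0, 151)
b_hat = grid[np.argmax([wrapped_loglik(b, x, y, sigma) for b in grid])]
print(b_hat)
```

An EM algorithm replaces the grid search by alternating between posterior weights over the wrapping counts (E-step) and a weighted least-squares update of the slope (M-step).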
Modern applications require methods that are computationally feasible on large datasets but also preserve statistical efficiency. Frequently, these two concerns are seen as contradictory: approximation methods that enable computation are assumed to degrade statistical performance relative to exact methods. In applied mathematics, where much of the current theoretical work on approximation resides, the inputs are considered to be observed exactly. The prevailing philosophy is that while the exact problem is, regrettably, unsolvable, any approximation error should be as small as possible. However, from a statistical perspective, an approximate or regularized solution may be preferable to the exact one. Regularization formalizes a trade-off between fidelity to the data and adherence to prior knowledge about the data-generating process, such as smoothness or sparsity. The resulting estimator tends to be more useful, interpretable, and suitable as an input to other methods. In this paper, we propose new methodology for estimation and prediction under a linear model, borrowing insights from the approximation literature. We explore these procedures from a statistical perspective and find that in many cases they improve both computational and statistical performance.
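The claim that a regularized solution can beat the exact one is easy to demonstrate. In this sketch (illustrative only; not the paper's methodology), an ill-conditioned design makes the exact least-squares solution amplify noise, while a ridge-regularized solution trades a little bias for a large variance reduction in estimating beta.

```python
# Exact least squares vs. ridge on an ill-conditioned design (toy sketch).
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 20
# Nearly collinear columns -> severely ill-conditioned X.
base = rng.normal(size=(n, 1))
X = base + 0.01 * rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

beta_exact = np.linalg.lstsq(X, y, rcond=None)[0]   # exact solution
lam = 1.0                                           # ridge penalty
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

err_exact = np.linalg.norm(beta_exact - beta)
err_ridge = np.linalg.norm(beta_ridge - beta)
print(err_exact, err_ridge)
```

Here the exact solver faithfully fits noise along the near-null directions of X, whereas the penalty shrinks those directions; the same tension between fidelity and prior structure is what the abstract's trade-off formalizes.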
Can two separate case-control studies, one about Hepatitis and the other about Fibrosis, for example, be combined? It would be hugely beneficial if two or more separately conducted case-control studies, even ones run for entirely unrelated purposes, could be merged in a unified analysis that yields better statistical properties, e.g., more accurate estimation of parameters. In this paper, we show that, when using the popular logistic regression model, the combined/integrative analysis produces a more accurate estimate of the slope parameters than a single case-control study. It is known that, in a single logistic case-control study, the intercept is not identifiable, contrary to prospective studies. In combined case-control studies, however, the intercepts are proved to be identifiable under mild conditions. The resulting maximum likelihood estimates of the intercepts and slopes are proved to be consistent and asymptotically normal, with asymptotic variances achieving the semiparametric efficiency lower bound.
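A minimal sketch of the pooling idea (illustrative only; this is not the paper's estimator, and the sampling scheme and names are assumptions): two case-control samples are drawn from populations sharing the same slope, then a logistic regression with a shared slope and study-specific intercepts is fit by Newton-Raphson. Under case-control sampling the slope remains consistently estimable even though each study's intercept absorbs its sampling fractions.

```python
# Pooled logistic regression over two simulated case-control studies.
import numpy as np

rng = np.random.default_rng(3)

def case_control_sample(n_cases, n_controls, a=-2.0, b=1.0, pool=200_000):
    """Draw a retrospective sample: fix case/control counts, not prevalence."""
    x = rng.normal(size=pool)
    d = rng.random(pool) < 1.0 / (1.0 + np.exp(-(a + b * x)))
    cases = rng.choice(np.where(d)[0], n_cases, replace=False)
    controls = rng.choice(np.where(~d)[0], n_controls, replace=False)
    idx = np.concatenate([cases, controls])
    return x[idx], d[idx].astype(float)

x1, d1 = case_control_sample(400, 400)   # "study 1"
x2, d2 = case_control_sample(400, 400)   # "study 2"

# Design: study-specific intercept columns plus a shared covariate column.
s1 = np.concatenate([np.ones_like(x1), np.zeros_like(x2)])
X = np.column_stack([s1, 1 - s1, np.concatenate([x1, x2])])
y = np.concatenate([d1, d2])

theta = np.zeros(3)
for _ in range(25):                      # Newton-Raphson for the logistic MLE
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    W = p * (1 - p)
    theta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))

print(theta)                             # theta[2] estimates the shared slope
```

The paper's deeper point, that the intercepts themselves become identifiable when studies are combined under mild conditions, requires the semiparametric machinery of the full analysis and is not captured by this toy fit.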
There are many scenarios, such as electronic health records, where the outcome is much more difficult to collect than the covariates. In this paper, we consider the linear regression problem with such a data structure in the high-dimensional setting. Our goal is to investigate when and how the unlabeled data can be exploited to improve the estimation and inference of the regression parameters in linear models, especially in light of the fact that such linear models may be misspecified in data analysis. In particular, we address the following two important questions. (1) Can we use the labeled data as well as the unlabeled data to construct a semi-supervised estimator whose convergence rate is faster than that of supervised estimators? (2) Can we construct confidence intervals or hypothesis tests that are guaranteed to be more efficient or powerful than those based on supervised estimators? To address the first question, we establish the minimax lower bound for parameter estimation in the semi-supervised setting. We show that the upper bound from the supervised estimators that only use the labeled data cannot attain this lower bound. We close this gap by proposing a new semi-supervised estimator which attains the lower bound. To address the second question, based on our proposed semi-supervised estimator, we propose two additional estimators for semi-supervised inference, the efficient estimator and the safe estimator. The former is fully efficient if the unknown conditional mean function is estimated consistently, but may not be more efficient than the supervised approach otherwise. The latter usually does not aim to provide fully efficient inference, but is guaranteed to be no worse than the supervised approach, no matter whether the linear model is correctly specified or the conditional mean function is consistently estimated.
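The flavor of the "safe" guarantee can be shown in a toy problem (illustrative only; this is not the paper's estimator): when estimating E[y] from a few labels, correct the labeled-sample mean with a regression-based term that uses the covariate mean from the much larger unlabeled pool. Because the correction has mean zero, the estimator stays valid even though the linear working model is misspecified, and it reduces variance whenever x predicts y.

```python
# Supervised vs. semi-supervised mean estimation under model misspecification.
import numpy as np

rng = np.random.default_rng(4)

def one_trial(n_lab=50, n_unlab=5000):
    x = rng.normal(size=n_lab + n_unlab)
    y = 2.0 * x + 0.5 * x**2 + rng.normal(size=x.size)  # linear fit is misspecified
    xl, yl = x[:n_lab], y[:n_lab]                       # only these are labeled
    slope = np.cov(xl, yl)[0, 1] / np.var(xl)
    supervised = yl.mean()
    # Zero-mean correction using the covariate mean of ALL observations.
    semi = yl.mean() + slope * (x.mean() - xl.mean())
    return supervised, semi

trials = np.array([one_trial() for _ in range(2000)])
var_sup, var_semi = trials.var(axis=0)
print(var_sup, var_semi)
```

The variance gain comes only from the part of y the linear fit explains; the quadratic term it misses inflates neither the bias of the corrected estimator nor its validity, which is the essence of the "no worse than supervised" guarantee described above.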
