
Coping with Selection Effects: A Primer on Regression with Truncated Data

Posted by: Adam Mantz
Publication date: 2019
Research field: Physics
Paper language: English
Author: Adam B. Mantz





The finite sensitivity of instruments or detection methods means that data sets in many areas of astronomy, for example cosmological or exoplanet surveys, are necessarily systematically incomplete. Such data sets, where the population being investigated is of unknown size and only partially represented in the data, are called truncated in the statistical literature. Truncation can be accounted for through a relatively straightforward modification to the model being fitted in many circumstances, provided that the model can be extended to describe the population of undetected sources. Here I examine the problem of regression using truncated data in general terms, and use a simple example to show the impact of selecting a subset of potential data on the dependent variable, on the independent variable, and on a second dependent variable that is correlated with the variable of interest. Special circumstances in which selection effects are ignorable are noted. I also comment on computational strategies for performing regression with truncated data, as an extension of methods that have become popular for the non-truncated case, and provide some general recommendations.
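To make the modification concrete, the sketch below (my own minimal illustration with assumed parameter values, not code from the paper) fits a Gaussian linear regression to data truncated by a hard detection threshold on the dependent variable. Each detected point's likelihood is renormalised by its probability of detection at its x, which is the essential change relative to the non-truncated fit.

```python
# Minimal sketch (not from the paper): maximum-likelihood linear regression
# when the dependent variable is truncated at a hard detection threshold.
# Assumes y ~ Normal(a + b*x, sigma) and only objects with y > y_thr enter
# the sample; each detected point's density is renormalised by the
# probability of detection at its x.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Simulate a population and apply the selection cut (illustrative values).
a_true, b_true, sigma_true, y_thr = 1.0, 2.0, 0.5, 3.0
x_all = rng.uniform(0.0, 2.0, size=2000)
y_all = a_true + b_true * x_all + rng.normal(0.0, sigma_true, size=x_all.size)
detected = y_all > y_thr
x, y = x_all[detected], y_all[detected]

def neg_log_like(params):
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)
    mu = a + b * x
    # Density of each observed y, divided by P(detection | x) = P(y > y_thr | x).
    log_pdf = stats.norm.logpdf(y, loc=mu, scale=sigma)
    log_pdet = stats.norm.logsf(y_thr, loc=mu, scale=sigma)
    return -np.sum(log_pdf - log_pdet)

fit = optimize.minimize(neg_log_like, x0=[0.0, 1.0, 0.0], method="Nelder-Mead")
a_hat, b_hat, sigma_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}, sigma = {sigma_hat:.2f}")
```

Dropping the log_pdet term recovers the naive estimator, which is biased in this setup because the threshold preferentially removes low-y points at small x.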




Read also

We study the causal interpretation of regressions on multiple dependent treatments and flexible controls. Such regressions are often used to analyze randomized control trials with multiple intervention arms, and to estimate institutional quality (e.g. teacher value-added) with observational data. We show that, unlike with a single binary treatment, these regressions do not generally estimate convex averages of causal effects, even when the treatments are conditionally randomly assigned and the controls fully address omitted variables bias. We discuss different solutions to this issue and propose a new class of efficient estimators of weighted average treatment effects.
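As a point of reference only (this is not the estimator proposed in the abstract), a regression of the kind being analysed, with two mutually exclusive treatment indicators and a control, might look like the following hypothetical sketch using statsmodels; the data-generating values are placeholders.

```python
# Minimal sketch of the kind of regression being discussed (not the paper's
# proposed estimator): an outcome regressed on two mutually exclusive treatment
# indicators plus a control. The abstract's point is that, with flexible controls,
# the coefficients on the treatment dummies need not be convex averages of
# unit-level causal effects.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1000
df = pd.DataFrame({
    "x": rng.normal(size=n),              # control variable
    "arm": rng.integers(0, 3, size=n),    # 0 = control group, 1 and 2 = treatment arms
})
# Heterogeneous treatment effects that depend on the control (illustrative).
effect = np.where(df["arm"] == 1, 1.0 + df["x"],
                  np.where(df["arm"] == 2, 2.0 - df["x"], 0.0))
df["y"] = effect + 0.5 * df["x"] + rng.normal(size=n)

fit = smf.ols("y ~ C(arm) + x", data=df).fit()
print(fit.params)   # C(arm)[T.1], C(arm)[T.2] are the two treatment coefficients
```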
Statistical studies of astronomical data sets, in particular of cataloged properties for discrete objects, are central to astrophysics. One cannot model those objects' population properties or incidences without a quantitative understanding of the conditions under which these objects ended up in a catalog or sample: the sample's selection function. As systematic and didactic introductions to this topic are scarce in the astrophysical literature, we aim to provide one, addressing generically the following questions: What is a selection function? What arguments $\vec{q}$ should a selection function depend on? Over what domain must a selection function be defined? What approximations and simplifications can be made? And how is a selection function used in modelling? We argue that volume-complete samples, with the volume drastically curtailed by the faintest objects, reflect a highly sub-optimal selection function that needlessly reduces the number of bright and usually rare objects in the sample. We illustrate these points with a worked example, deriving the space density of white dwarfs (WD) in the Galactic neighbourhood as a function of their luminosity and Gaia color, $\Phi_0(M_G, B-R)$ in [mag$^{-2}$ pc$^{-3}$]. We construct a sample of $10^5$ presumed WDs through straightforward selection cuts on the Gaia EDR3 catalog, in magnitude, color, parallax, and astrometric fidelity, $\vec{q}=(m_G, B-R, \varpi, p_{af})$. We then combine a simple model for $\Phi_0$ with the effective survey volume derived from this selection function $S_C(\vec{q})$ to derive a detailed and robust estimate of $\Phi_0(M_G, B-R)$. The resulting white dwarf luminosity-color function $\Phi_0(M_G, B-R)$ differs dramatically from the initial number density distribution in the luminosity-color plane: by orders of magnitude in density and by four magnitudes in the location of the density peak.
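As a simplified illustration of the effective-volume idea (an assumption-laden toy, not the paper's Gaia EDR3 pipeline), the sketch below estimates a one-dimensional luminosity function by dividing the counts in each absolute-magnitude bin by the volume over which a source of that magnitude would pass an apparent-magnitude cut; the magnitude limit, bins, and distance cap are placeholders.

```python
# Minimal sketch (not the paper's pipeline): estimating a luminosity function
# Phi(M) = N(M) / V_eff(M) for a magnitude-limited sample, where the selection
# is an apparent-magnitude cut m < m_lim. From the distance modulus, a source of
# absolute magnitude M passes the cut out to d_max = 10**(0.2*(m_lim - M) + 1) pc,
# so an (idealised) all-sky survey has effective volume (4/3)*pi*d_max**3.

import numpy as np

m_lim = 20.0                             # apparent-magnitude limit (assumed)
M_edges = np.arange(8.0, 16.5, 0.5)      # absolute-magnitude bins (assumed)

def effective_volume(M, m_lim=m_lim, d_cap=500.0):
    """Volume (pc^3) within which a source of absolute magnitude M is detected,
    optionally capped at a parallax-style distance limit d_cap (pc)."""
    d_max = np.minimum(10.0 ** (0.2 * (m_lim - M) + 1.0), d_cap)
    return 4.0 / 3.0 * np.pi * d_max ** 3

def luminosity_function(M_obs):
    """Number density per magnitude: bin counts divided by each bin's effective volume."""
    counts, _ = np.histogram(M_obs, bins=M_edges)
    M_mid = 0.5 * (M_edges[:-1] + M_edges[1:])
    bin_width = np.diff(M_edges)
    return M_mid, counts / (effective_volume(M_mid) * bin_width)
```

Because each magnitude bin uses its own maximum volume, bright objects are not discarded to satisfy a single volume-complete cut, which is the point the abstract makes against volume-complete samples.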
Jake VanderPlas, 2014
This paper presents a brief, semi-technical comparison of the essential features of the frequentist and Bayesian approaches to statistical inference, with several illustrative examples implemented in Python. The differences between frequentism and Bayesianism fundamentally stem from differing definitions of probability, a philosophical divide which leads to distinct approaches to the solution of statistical problems as well as contrasting ways of asking and answering questions about unknown parameters. After an example-driven discussion of these differences, we briefly compare several leading Python statistical packages which implement frequentist inference using classical methods and Bayesian inference using Markov Chain Monte Carlo.
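A minimal sketch in the same spirit (my own toy example, not taken from the paper): estimating the mean of Gaussian data with a frequentist confidence interval and a flat-prior Bayesian credible interval, which coincide numerically in this simple model even though their interpretations differ.

```python
# Toy comparison: frequentist confidence interval vs. Bayesian credible interval
# for the mean of Gaussian data with known sigma and a flat prior on the mean.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma = 1.0
data = rng.normal(5.0, sigma, size=50)

# Frequentist: MLE and 95% confidence interval for the mean.
mu_hat = data.mean()
se = sigma / np.sqrt(data.size)
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)

# Bayesian: with a flat prior on mu, the posterior is Normal(mu_hat, se).
posterior = stats.norm(loc=mu_hat, scale=se)
credible = posterior.interval(0.95)

print(f"MLE {mu_hat:.3f}, 95% CI {ci}, 95% credible {credible}")
```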
Linear Mixed Effects (LME) models have been widely applied to clustered data analysis in many areas, including marketing research, clinical trials, and biomedical studies. Inference can be conducted using a maximum likelihood approach if Normal distributions are assumed for the random effects. However, in many applications in economics, business, and medicine, it is often essential to impose constraints on the regression parameters after taking their real-world interpretations into account. Therefore, in this paper we extend the classical (unconstrained) LME models to allow for sign constraints on their overall coefficients. We propose to assume a symmetric doubly truncated Normal (SDTN) distribution on the random effects instead of the unconstrained Normal distribution that is common in the classical literature. With this change, the difficulty increases dramatically, as the exact distribution of the dependent variable becomes analytically intractable. We then develop likelihood-based approaches to estimate the unknown model parameters, utilizing an approximation of this exact distribution. Simulation studies show that the proposed constrained model not only improves the real-world interpretability of results, but also achieves satisfactory model fits compared to the existing unconstrained model.
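The following toy sketch (an assumption-laden reading of the setup, not the authors' estimation procedure) shows how a symmetric doubly truncated Normal on the random slopes can enforce a sign constraint on each cluster's overall coefficient by construction; the truncation bounds and parameter values are my own placeholders.

```python
# Toy sketch: simulating a random-slope model in which every cluster's slope
# beta + u_i must stay non-negative. Drawing u_i from a symmetric doubly
# truncated Normal on [-beta, beta] guarantees the sign constraint (assumed
# bounds; the paper's exact parameterisation may differ).

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
beta = 1.5          # positive overall slope (assumed)
tau = 1.0           # scale of the random slopes (assumed)
n_clusters, n_obs = 30, 20

# scipy's truncnorm takes bounds in standardised units: (bound - loc) / scale.
a, b = -beta / tau, beta / tau
u = stats.truncnorm.rvs(a, b, loc=0.0, scale=tau, size=n_clusters, random_state=rng)

x = rng.uniform(0, 1, size=(n_clusters, n_obs))
y = (beta + u)[:, None] * x + rng.normal(0, 0.3, size=x.shape)

assert np.all(beta + u >= 0)   # every cluster-specific slope respects the sign constraint
```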
A number of experiments are set to measure the 21-cm signal of neutral hydrogen from the Epoch of Reionization (EoR). The common denominator of these experiments is the large data sets produced, contaminated by various instrumental effects, ionospheric distortions, RFI, and strong Galactic and extragalactic foregrounds. In this paper, the first in a series, we present the Data Model that will be the basis of the signal analysis for the LOFAR (Low Frequency Array) EoR Key Science Project (LOFAR EoR KSP). Using this data model we simulate realistic visibility data sets over a wide frequency band, properly taking into account all currently known instrumental corruptions (e.g. direction-dependent gains, complex gains, polarization effects, noise, etc.). We then apply primary calibration errors to the data in a statistical sense, assuming that the calibration errors are random Gaussian variates at a level consistent with our current knowledge based on observations with LOFAR Core Station 1. Our aim is to demonstrate how the systematics of an interferometric measurement affect the quality of the calibrated data, how errors correlate and propagate, and, in the long run, how this can lead to new calibration strategies. We present results of these simulations and of the inversion process and extraction procedure. We also discuss some general properties of the coherency matrix and Jones formalism that might prove useful in solving the calibration problem of aperture synthesis arrays. We conclude that even in the presence of realistic noise and instrumental errors, the statistical signature of the EoR signal can be detected by LOFAR with relatively small errors. A detailed study of the statistical properties of our data model and more complex instrumental models will be considered in the future.
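As a schematic of the kind of corruption being simulated (a toy scalar version, not the LOFAR EoR KSP data model), the sketch below applies random Gaussian complex gain errors and thermal noise to ideal visibilities via the measurement equation V_obs(p,q) = g_p conj(g_q) V_true(p,q) + n; the station count, gain error, and noise level are assumptions.

```python
# Toy sketch (not the LOFAR-EoR pipeline): corrupting ideal visibilities with
# per-station complex gain errors and thermal noise, in the scalar version of
# the measurement equation V_obs(p,q) = g_p * conj(g_q) * V_true(p,q) + n.
# The gain perturbations are random Gaussian variates, echoing the statistical
# treatment of calibration errors described above.

import numpy as np

rng = np.random.default_rng(3)
n_stations = 48
gain_err, noise_rms = 0.05, 0.1    # fractional gain error and noise level (assumed)

# Ideal visibilities for every station pair (toy, featureless sky signal).
v_true = np.ones((n_stations, n_stations), dtype=complex)

# Per-station complex gains: unity plus small Gaussian perturbations.
g = 1.0 + gain_err * (rng.normal(size=n_stations) + 1j * rng.normal(size=n_stations))

# Apply the gains baseline by baseline and add complex Gaussian noise.
v_obs = np.outer(g, g.conj()) * v_true
v_obs += noise_rms * (rng.normal(size=v_obs.shape) + 1j * rng.normal(size=v_obs.shape))
```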