ترغب بنشر مسار تعليمي؟ اضغط هنا

Simultaneous SNP identification in association studies with missing data

218   0   0.0 ( 0 )
 نشر من قبل George Casella
 تاريخ النشر 2012
  مجال البحث الاحصاء الرياضي
والبحث باللغة English




اسأل ChatGPT حول البحث

Association testing aims to discover the underlying relationship between genotypes (usually Single Nucleotide Polymorphisms, or SNPs) and phenotypes (attributes, or traits). The typically large data sets used in association testing often contain missing values. Standard statistical methods either impute the missing values using relatively simple assumptions, or delete them, or both, which can generate biased results. Here we describe the Bayesian hierarchical model BAMD (Bayesian Association with Missing Data). BAMD is a Gibbs sampler, in which missing values are multiply imputed based upon all of the available information in the data set. We estimate the parameters and prove that updating one SNP at each iteration preserves the ergodic property of the Markov chain, and at the same time improves computational speed. We also implement a model selection option in BAMD, which enables potential detection of SNP interactions. Simulations show that unbiased estimates of SNP effects are recovered with missing genotype data. Also, we validate associations between SNPs and a carbon isotope discrimination phenotype that were previously reported using a family based method, and discover an additional SNP associated with the trait. BAMD is available as an R-package from http://cran.r-project.org/package=BAMD


قيم البحث

اقرأ أيضاً

We provide a view on high-dimensional statistical inference for genome-wide association studies (GWAS). It is in part a review but covers also new developments for meta analysis with multiple studies and novel software in terms of an R-package hierin f. Inference and assessment of significance is based on very high-dimensional multivariate (generalized) linear models: in contrast to often used marginal approaches, this provides a step towards more causal-oriented inference.
98 - Chen Xu , Yao Xie 2021
We develop a distribution-free, unsupervised anomaly detection method called ECAD, which wraps around any regression algorithm and sequentially detects anomalies. Rooted in conformal prediction, ECAD does not require data exchangeability but approxim ately controls the Type-I error when data are normal. Computationally, it involves no data-splitting and efficiently trains ensemble predictors to increase statistical power. We demonstrate the superior performance of ECAD on detecting anomalous spatio-temporal traffic flow.
Background: All-in-one station-based health monitoring devices are implemented in elder homes in Hong Kong to support the monitoring of vital signs of the elderly. During a pilot study, it was discovered that the systolic blood pressure was incorrect ly measured during multiple weeks. A real-time solution was needed to identify future data quality issues as soon as possible. Methods: Control charts are an effective tool for real-time monitoring and signaling issues (changes) in data. In this study, as in other healthcare applications, many observations are missing. Few methods are available for monitoring data with missing observations. A data quality monitoring method is developed to signal issues with the accuracy of the collected data quickly. This method has the ability to deal with missing observations. A Hotellings T-squared control chart is selected as the basis for our proposed method. Findings: The proposed method is retrospectively validated on a case study with a known measurement error in the systolic blood pressure measurements. The method is able to adequately detect this data quality problem. The proposed method was integrated into a personalized telehealth monitoring system and prospectively implemented in a second case study. It was found that the proposed scheme supports the control of data quality. Conclusions: Data quality is an important issue and control charts are useful for real-time monitoring of data quality. However, these charts must be adjusted to account for missing data that often occur in healthcare context.
109 - Junyan Liu , Sandeep Kumar , 2018
The autoregressive (AR) model is a widely used model to understand time series data. Traditionally, the innovation noise of the AR is modeled as Gaussian. However, many time series applications, for example, financial time series data, are non-Gaussi an, therefore, the AR model with more general heavy-tailed innovations is preferred. Another issue that frequently occurs in time series is missing values, due to system data record failure or unexpected data loss. Although there are numerous works about Gaussian AR time series with missing values, as far as we know, there does not exist any work addressing the issue of missing data for the heavy-tailed AR model. In this paper, we consider this issue for the first time, and propose an efficient framework for parameter estimation from incomplete heavy-tailed time series based on a stochastic approximation expectation maximization (SAEM) coupled with a Markov Chain Monte Carlo (MCMC) procedure. The proposed algorithm is computationally cheap and easy to implement. The convergence of the proposed algorithm to a stationary point of the observed data likelihood is rigorously proved. Extensive simulations and real datasets analyses demonstrate the efficacy of the proposed framework.
During the semiconductor manufacturing process, predicting the yield of the semiconductor is an important problem. Early detection of defective product production in the manufacturing process can save huge production cost. The data generated from the semiconductor manufacturing process have characteristics of highly non-normal distributions, complicated missing patterns and high missing rate, which complicate the prediction of the yield. We propose Dirichlet process - naive Bayes model (DPNB), a classification method based on the mixtures of Dirichlet process and naive Bayes model. Since the DPNB is based on the mixtures of Dirichlet process and learns the joint distribution of all variables involved, it can handle highly non-normal data and can make predictions for the test dataset with any missing patterns. The DPNB also performs well for high missing rates since it uses all information of observed components. Experiments on various real datasets including semiconductor manufacturing data show that the DPNB has better performance than MICE and MissForest in terms of predicting missing values as percentage of missing values increases.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا