A severe case of scientific misconduct was discovered in a paper from 2005 that allegedly showed harmful effects (DNA breakage) of non-thermal mobile phone electromagnetic field exposure on human and rat cells. Here we describe how the fraudulent data were identified. The variation of the reported biological data is shown to be below the theoretical lower limits implied by the underlying multinomial distributions. Another reason for doubt was the highly significant non-uniform distribution of last digits, a known indicator of data fabrication. The Medical University Vienna, where the research was conducted, was informed about these findings and came to the conclusion that the data in this and another, related paper by the same group were fabricated, and that both papers should be retracted.
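A minimal sketch of the last-digit check described above, assuming entirely made-up measurement values (none of the numbers or thresholds come from the paper): a chi-square goodness-of-fit test compares the observed last-digit counts with the uniform distribution expected under honest measurement noise.

```python
# Hedged sketch of a last-digit test; the data below are simulated, not the
# values from the 2005 paper.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
reported = np.round(rng.normal(12.5, 0.8, size=200), 2)   # hypothetical readings

# Last reported digit of each value (here, the second decimal place).
last_digits = np.round(reported * 100).astype(int) % 10
counts = np.bincount(last_digits, minlength=10)

# Under genuine measurement noise the last digits should be roughly uniform;
# a very small p-value flags a suspicious digit distribution.
stat, p = chisquare(counts)
print(f"chi-square = {stat:.1f}, p = {p:.3f}")
```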
In this paper the TileShuffle method is evaluated as a search method for candidate lncRNAs at 8q24.2. The method is run on three microarrays, all of which contained the same sample and repeated copies of tiled probes. This allows the coherence of the selection method within and between microarrays to be estimated by Monte Carlo simulations on the repeated probes.
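The sketch below illustrates the kind of Monte Carlo coherence check this evaluation relies on, under simulated selection calls (it is not TileShuffle output and the agreement rates are invented): the observed concordance between two copies of the same probes is compared against a permutation null.

```python
# Hedged illustration: concordance of candidate calls between repeated probe
# copies versus a Monte Carlo (permutation) null. All data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n_probes = 5000

# Hypothetical boolean calls ("probe selected as candidate") for two copies
# of the same tiled probes on one microarray.
selected_a = rng.random(n_probes) < 0.05
selected_b = selected_a.copy()
flip = rng.random(n_probes) < 0.02            # a little disagreement between copies
selected_b[flip] = ~selected_b[flip]

observed = np.mean(selected_a == selected_b)  # observed concordance

# Permutation null: shuffle one set of calls to break probe-wise coherence.
null = np.array([np.mean(selected_a == rng.permutation(selected_b))
                 for _ in range(1000)])
p_value = np.mean(null >= observed)
print(f"concordance = {observed:.3f}, Monte Carlo p = {p_value:.4f}")
```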
Linear mixed models (LMMs) are a powerful and established tool for studying genotype-phenotype relationships. A limiting assumption of LMMs is that the residuals are Gaussian distributed, a requirement that rarely holds in practice. Violations of this assumption can lead to false conclusions and losses in power, and hence it is common practice to pre-process the phenotypic values to make them Gaussian, for instance by applying logarithmic or other non-linear transformations. Unfortunately, different phenotypes require different specific transformations, and choosing a good transformation is in general challenging and subjective. Here, we present an extension of the LMM that estimates an optimal transformation from the observed data. In extensive simulations and applications to real data from human, mouse and yeast, we show that using such optimal transformations leads to increased power in genome-wide association studies and higher accuracy in heritability estimates and phenotype predictions.
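As a rough, simplified stand-in for this idea (not the authors' model), the sketch below chooses a Box-Cox transformation by maximum likelihood before testing a genotype-phenotype association with a plain linear regression; the paper's extension instead learns a flexible transformation jointly with the linear mixed model.

```python
# Hedged sketch: pick a phenotype transformation by maximum likelihood, then
# test association on the transformed scale. Box-Cox and ordinary regression
# are simplifying assumptions; data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
genotype = rng.binomial(2, 0.3, size=n)              # hypothetical SNP coded 0/1/2
latent = 0.4 * genotype + rng.normal(0, 1, size=n)   # Gaussian on the latent scale
phenotype = np.exp(latent)                           # observed scale is skewed

transformed, lam = stats.boxcox(phenotype)           # MLE of the transformation parameter
print(f"estimated Box-Cox lambda = {lam:.2f}")       # ~0 recovers the log transform

for y, label in [(phenotype, "raw"), (transformed, "transformed")]:
    slope, _, r, p, _ = stats.linregress(genotype, y)
    print(f"{label} scale: association p = {p:.2e}")
```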
Many studies in information science have looked at the growth of science. In this study, we re-examine the question of the growth of science. To do this we (i) use current data up to publication year 2012 and (ii) analyse them across all disciplines and also separately for the natural sciences and for the medical and health sciences. Furthermore, the data are analysed with an advanced statistical technique - segmented regression analysis - which can identify specific segments with similar growth rates in the history of science. The study is based on two different sets of bibliometric data: (1) the number of publications held as source items in the Web of Science (WoS, Thomson Reuters) per publication year and (2) the number of cited references in the publications of the source items per cited reference year. We have looked at the rate at which science has grown since the mid-1600s. In our analysis of cited references we identified three growth phases in the development of science, each of which led to a growth rate roughly triple that of the previous phase: from less than 1% up to the middle of the 18th century, to 2 to 3% up to the period between the two world wars, and 8 to 9% up to 2012.
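A toy version of segmented regression on simulated publication counts, assuming a single breakpoint found by grid search (the study fits several segments to the real WoS series; none of the numbers below are its results):

```python
# Hedged sketch: one-breakpoint segmented regression on log counts.
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1900, 2013)

# Simulated series: ~3% annual growth switching to ~8% after 1945.
log_counts = np.where(years < 1945,
                      np.log(1.03) * (years - 1900),
                      np.log(1.03) * 45 + np.log(1.08) * (years - 1945))
log_counts = log_counts + rng.normal(0, 0.05, size=years.size)

def sse_with_break(b):
    """Residual sum of squares of two straight-line fits split at year b."""
    total = 0.0
    for mask in (years < b, years >= b):
        coef = np.polyfit(years[mask], log_counts[mask], 1)
        total += np.sum((np.polyval(coef, years[mask]) - log_counts[mask]) ** 2)
    return total

candidates = years[5:-5]                  # keep a few points in each segment
best = min(candidates, key=sse_with_break)
slope_pre = np.polyfit(years[years < best], log_counts[years < best], 1)[0]
slope_post = np.polyfit(years[years >= best], log_counts[years >= best], 1)[0]
print(f"breakpoint ~{best}: growth {np.expm1(slope_pre):.1%} -> {np.expm1(slope_post):.1%}")
```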
In a mouse intercross with more than 500 animals and genome-wide gene expression data on six tissues, we identified a high proportion (18%) of sample mix-ups in the genotype data. Local expression quantitative trait loci (eQTL; genetic loci influencing gene expression) with extremely large effects were used to form a classifier to predict an individual's eQTL genotype based on expression data alone. By considering multiple eQTL and their related transcripts, we identified numerous individuals whose predicted eQTL genotypes (based on their expression data) did not match their observed genotypes, and then went on to identify other individuals whose genotypes did match the predicted eQTL genotypes. The concordance of predictions across six tissues indicated that the problem was due to mix-ups in the genotypes (though we further identified a small number of sample mix-ups in each of the six panels of gene expression microarrays). Consideration of the plate positions of the DNA samples indicated a number of off-by-one and off-by-two errors, likely the result of pipetting errors. Such sample mix-ups can be a problem in any genetic study, but eQTL data allow us to identify, and even correct, such problems. Our methods have been implemented in an R package, R/lineup.
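A conceptual Python sketch of the matching step, assuming simulated genotypes and expression (the authors' actual implementation is the R package R/lineup; the classifier below is a simple rounding rule, not theirs): eQTL genotypes predicted from expression are compared with every observed genotype row, and samples whose best match is not themselves are flagged.

```python
# Hedged sketch of detecting sample mix-ups via strong local eQTL.
import numpy as np

rng = np.random.default_rng(4)
n_ind, n_eqtl = 100, 50

geno = rng.binomial(2, 0.5, size=(n_ind, n_eqtl)).astype(float)  # observed genotypes
expr = geno + rng.normal(0, 0.3, size=geno.shape)                # expression driven by large-effect eQTL

# Introduce two sample swaps in the genotype matrix to mimic mix-ups.
geno[[3, 4]] = geno[[4, 3]]
geno[[10, 11]] = geno[[11, 10]]

pred_geno = np.clip(np.round(expr), 0, 2)    # predicted eQTL genotype from expression alone

# Mismatch rate between every expression sample and every genotype sample;
# a correctly labelled sample should best match itself.
dist = np.array([[np.mean(pred_geno[i] != geno[j]) for j in range(n_ind)]
                 for i in range(n_ind)])
best_match = dist.argmin(axis=1)
flagged = np.where(best_match != np.arange(n_ind))[0]
print("samples whose genotypes do not match their own expression:", flagged)
print("their best-matching genotype rows:", best_match[flagged])
```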
Tu Xu, Junhui Wang, Yixin Fang (2014)
In medical research, continuous markers are widely employed in diagnostic tests to distinguish diseased from non-diseased subjects. The accuracy of such diagnostic tests is commonly assessed using the receiver operating characteristic (ROC) curve. To summarize an ROC curve and determine its optimal cut-point, the Youden index is popularly used. In the literature, estimation of the Youden index has been widely studied via various statistical modeling strategies for the conditional density. This paper proposes a new model-free estimation method, which directly estimates the covariate-adjusted cut-point without estimating the conditional density. Consequently, the covariate-adjusted Youden index can be estimated based on the estimated cut-point. The proposed method formulates the estimation problem in a large margin classification framework, which allows flexible modeling of the covariate-adjusted Youden index through kernel machines. The advantage of the proposed method is demonstrated in a variety of simulated experiments as well as a real application to the Pima Indians diabetes study.
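For reference, the sketch below computes the basic empirical Youden index, J = max over c of {sensitivity(c) + specificity(c) - 1}, and its cut-point on simulated marker values; the paper's covariate adjustment and large-margin estimation are not reproduced here.

```python
# Hedged sketch: empirical (non-covariate-adjusted) Youden index.
import numpy as np

rng = np.random.default_rng(5)
healthy = rng.normal(0.0, 1.0, size=300)      # marker values in non-diseased subjects
diseased = rng.normal(1.5, 1.0, size=200)     # marker values in diseased subjects

cutpoints = np.unique(np.concatenate([healthy, diseased]))
sens = np.array([(diseased >= c).mean() for c in cutpoints])   # sensitivity at each cut-point
spec = np.array([(healthy < c).mean() for c in cutpoints])     # specificity at each cut-point

j = sens + spec - 1.0
best = j.argmax()
print(f"Youden index J = {j[best]:.3f} at cut-point c = {cutpoints[best]:.3f}")
```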
The Potts model is frequently used to describe the behavior of image classes, since it allows contextual information linking neighboring pixels to be incorporated in a simple way. Its isotropic version has only one real parameter, beta, known as the smoothness parameter or inverse temperature, which regulates the homogeneity of the class map. The classes are unobserved, and estimating them is central to important image processing procedures such as image classification. Methods for estimating the classes which stem from a Bayesian approach under the Potts model require an adequate value of beta to be specified. This parameter can be estimated efficiently by solving the pseudo maximum likelihood (PML) equations in two different schemes, using either the prior or the posterior model. With only radiometric data available, the first scheme requires the computation of an initial segmentation, while the second uses both the segmentation and the radiometric data to make the estimation. In this paper, we compare these two PML estimators by computing the mean square error (MSE), bias, and sensitivity to deviations from the model assumptions. We conclude that the use of the extra data does not improve the accuracy of the PML estimator; moreover, under gross deviations from the model, this extra information introduces unpredictable distortions and bias.
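To make the pseudo-likelihood idea concrete, here is a simplified sketch of the first (prior-model) scheme for a two-class Potts model on a 4-neighbour lattice with periodic boundaries, run on a synthetic label map; it is an illustration, not the estimators compared in the paper.

```python
# Hedged sketch: pseudo maximum likelihood estimate of beta given a label map.
import numpy as np
from scipy.optimize import minimize_scalar

def neighbour_counts(labels, n_classes):
    """For each pixel, count how many of its 4 neighbours carry each label."""
    counts = np.zeros(labels.shape + (n_classes,))
    for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:   # periodic boundary for brevity
        rolled = np.roll(labels, shift, axis=axis)
        for k in range(n_classes):
            counts[..., k] += (rolled == k)
    return counts

def neg_pseudo_loglik(beta, labels, n_classes):
    """Negative log pseudo-likelihood: product of pixel-wise conditionals."""
    energies = beta * neighbour_counts(labels, n_classes)    # beta * (#neighbours in class k)
    log_norm = np.log(np.exp(energies).sum(axis=-1))
    own = np.take_along_axis(energies, labels[..., None], axis=-1)[..., 0]
    return -(own - log_norm).sum()

# Synthetic, spatially smooth label map standing in for an initial segmentation.
rng = np.random.default_rng(6)
labels = (rng.random((64, 64)) < 0.5).astype(int)
for _ in range(3):                                           # crude majority smoothing
    labels = (neighbour_counts(labels, 2)[..., 1] >= 2).astype(int)

res = minimize_scalar(neg_pseudo_loglik, bounds=(0.0, 3.0),
                      args=(labels, 2), method="bounded")
print(f"PML estimate of beta: {res.x:.2f}")
```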
We have developed a statistical method named IsoDOT to assess differential isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq data. Here isoform usage refers to relative isoform expression given the total expression of the corresponding gene. IsoDOT performs two tasks that cannot be accomplished by existing methods: testing DIE/DIU with respect to a continuous covariate, and testing DIE/DIU for one case versus one control. The latter is not an uncommon situation in practice, e.g., comparing the paternal and maternal alleles of one individual or comparing tumor and normal samples of one cancer patient. Simulation studies demonstrate the high sensitivity and specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on the mouse transcriptome and identify a group of genes whose isoform usage responds to haloperidol treatment.
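A tiny numerical illustration of the distinction drawn above between isoform expression and isoform usage (the numbers are hypothetical; IsoDOT estimates these quantities from RNA-seq reads rather than taking them as input):

```python
# Hedged illustration of isoform usage = isoform expression / gene total.
import numpy as np

# Hypothetical isoform-level expression estimates for one gene in two samples.
iso_expr = np.array([
    [120.0, 30.0, 50.0],   # sample 1 (e.g., maternal allele, or normal tissue)
    [ 60.0, 90.0, 50.0],   # sample 2 (e.g., paternal allele, or tumor)
])

usage = iso_expr / iso_expr.sum(axis=1, keepdims=True)
print(usage.round(2))
# Differential isoform expression compares the rows of iso_expr;
# differential isoform usage compares the rows of `usage`.
```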
Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and a hierarchical Bayesian model based on small area estimation (SAE). We contrast regularization with spatial smoothing and combinations of smoothing and shrinkage. All methods are tested on functional magnetic resonance imaging (fMRI) data from multiple subjects participating in two different experiments related to reading, for both predicting neural response to stimuli and decoding stimuli from responses. Interestingly, when the regularization parameters are chosen by cross-validation independently for every voxel, low/high regularization is chosen in voxels where the classification accuracy is high/low, indicating that the regularization intensity is a good tool for identifying voxels relevant to the cognitive task. Surprisingly, all the regularization methods work about equally well, suggesting that beating basic smoothing and shrinkage will take not only clever methods, but also careful modeling.
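The sketch below sets up two of the shrinkage estimators being compared, ridge regression and the elastic net, with the regularization strength chosen by cross-validation per simulated voxel; the stimulus features and responses are invented, and the SAE-based hierarchical Bayesian model and the smoothing variants are not shown.

```python
# Hedged sketch: per-voxel cross-validated ridge and elastic net on fake data.
import numpy as np
from sklearn.linear_model import RidgeCV, ElasticNetCV

rng = np.random.default_rng(7)
n_trials, n_features, n_voxels = 120, 300, 5

X = rng.normal(size=(n_trials, n_features))                        # stimulus feature matrix
true_w = rng.normal(size=(n_features, n_voxels)) * (rng.random((n_features, n_voxels)) < 0.05)
Y = X @ true_w + rng.normal(scale=2.0, size=(n_trials, n_voxels))  # noisy voxel responses

for v in range(n_voxels):
    # Regularization parameters chosen independently for every voxel.
    ridge = RidgeCV(alphas=np.logspace(-2, 4, 20)).fit(X, Y[:, v])
    enet = ElasticNetCV(l1_ratio=0.5, n_alphas=50, cv=5, max_iter=5000).fit(X, Y[:, v])
    print(f"voxel {v}: ridge alpha = {ridge.alpha_:.2f}, elastic net alpha = {enet.alpha_:.3f}")
```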
MPAgenomics, standing for multi-patients analysis (MPA) of genomic markers, is an R package devoted to (i) efficient segmentation and (ii) genomic marker selection from multi-patient copy number and SNP data profiles. It provides wrappers around commonly used packages to facilitate their repeated (sometimes difficult) use, offering an easy-to-use pipeline for beginners in R. The segmentation of successive multiple profiles (finding losses and gains) is based on a new automatic choice of influential parameters, since the default values were misleading in the original packages. Considering multiple profiles at the same time, MPAgenomics wraps efficient penalized regression methods to select relevant markers associated with a given response.
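As a conceptual stand-in for the marker-selection step of such a pipeline (MPAgenomics itself is an R package and wraps its own segmentation and selection methods; everything below, from the data to the lasso, is a hypothetical Python illustration):

```python
# Hedged sketch: penalized regression to pick markers associated with a response.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
n_patients, n_markers = 80, 2000

copy_number = rng.normal(2.0, 0.3, size=(n_patients, n_markers))   # segmented copy-number values
informative = [10, 250, 1400]                                       # hypothetical relevant markers
response = copy_number[:, informative].sum(axis=1) + rng.normal(0, 0.5, size=n_patients)

model = LassoCV(cv=5).fit(copy_number, response)
selected = np.flatnonzero(model.coef_ != 0)
print("selected markers:", selected)
```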