ترغب بنشر مسار تعليمي؟ اضغط هنا

A Critique of Differential Abundance Analysis, and Advocacy for an Alternative

112   0   0.0 ( 0 )
 نشر من قبل Thomas P Quinn
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

It is largely taken for granted that differential abundance analysis is, by default, the best first step when analyzing genomic data. We argue that this is not necessarily the case. In this article, we identify key limitations that are intrinsic to differential abundance analysis: it is (a) dependent on unverifiable assumptions, (b) an unreliable construct, and (c) overly reductionist. We formulate an alternative framework called ratio-based biomarker analysis which does not suffer from the identified limitations. Moreover, ratio-based biomarkers are highly flexible. Beyond replacing DAA, they can also be used for many other bespoke analyses, including dimension reduction and multi-omics data integration.



قيم البحث

اقرأ أيضاً

127 - Yan Zhou , Jiadi Zhu , Tiejun Tong 2018
Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcript ional responses. To remove systematic variation between different species for a fair comparison, the normalization procedure serves as a crucial pre-processing step that adjusts for the varying sample sequencing depths and other confounding technical effects. Results: In this paper, we propose a scale based normalization (SCBN) method by taking into account the available knowledge of conserved orthologous genes and hypothesis testing framework. Considering the different gene lengths and unmapped genes between different species, we formulate the problem from the perspective of hypothesis testing and search for the optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. Conclusions: Simulation studies show that the proposed method performs significantly better than the existing competitor in a wide range of settings. An RNA-seq dataset of different species is also analyzed and it coincides with the conclusion that the proposed method outperforms the existing method. For practical applications, we have also developed an R package named SCBN and the software is available at http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html.
77 - Olga Zolotareva 2020
Aggregating transcriptomics data across hospitals can increase sensitivity and robustness of differential expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequentl y employed to pool local results. However, if class labels are inhomogeneously distributed between cohorts, their accuracy may drop. Flimma (https://exbio.wzw.tum.de/flimma/) addresses this issue by implementing the state-of-the-art workflow limma voom in a privacy-preserving manner, i.e. patient data never leaves its source site. Flimma results are identical to those generated by limma voom on combined datasets even in imbalanced scenarios where meta-analysis approaches fail.
Quantitative methods for studying biodiversity have been traditionally rooted in the classical theory of finite frequency tables analysis. However, with the help of modern experimental tools, like high throughput sequencing, we now begin to unlock th e outstanding diversity of genomic data in plants and animals reflective of the long evolutionary history of our planet. This molecular data often defies the classical frequency/contingency tables assumptions and seems to require sparse tables with very large number of categories and highly unbalanced cell counts, e.g., following heavy tailed distributions (for instance, power laws). Motivated by the molecular diversity studies, we propose here a frequency-based framework for biodiversity analysis in the asymptotic regime where the number of categories grows with sample size (an infinite contingency table). Our approach is rooted in information theory and based on the Gaussian limit results for the effective number of species (the Hill numbers) and the empirical Renyi entropy and divergence. We argue that when applied to molecular biodiversity analysis our methods can properly account for the complicated data frequency patterns on one hand and the practical sample size limitations on the other. We illustrate this principle with two specific RNA sequencing examples: a comparative study of T-cell receptor populations and a validation of some preselected molecular hepatocellular carcinoma (HCC) markers.
Tiling arrays make possible a large scale exploration of the genome thanks to probes which cover the whole genome with very high density until 2 000 000 probes. Biological questions usually addressed are either the expression difference between two c onditions or the detection of transcribed regions. In this work we propose to consider simultaneously both questions as an unsupervised classification problem by modeling the joint distribution of the two conditions. In contrast to previous methods, we account for all available information on the probes as well as biological knowledge like annotation and spatial dependence between probes. Since probes are not biologically relevant units we propose a classification rule for non-connected regions covered by several probes. Applications to transcriptomic and ChIP-chip data of Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the importance of a precise modeling and the region classification.
Data from discovery proteomic and phosphoproteomic experiments typically include missing values that correspond to proteins that have not been identified in the analyzed sample. Replacing the missing values with random numbers, a process known as imp utation, avoids apparent infinite fold-change values. However, the procedure comes at a cost: Imputing a large number of missing values has the potential to significantly impact the results of the subsequent differential expression analysis. We propose a method that identifies differentially expressed proteins by ranking their observed changes with respect to the changes observed for other proteins. Missing values are taken into account by this method directly, without the need to impute them. We illustrate the performance of the new method on two distinct datasets and show that it is robust to missing values and, at the same time, provides results that are otherwise similar to those obtained with edgeR which is a state-of-art differential expression analysis method. The new method for the differential expression analysis of proteomic data is available as an easy to use Python package.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا