Increased availability of data and accessibility of computational tools in recent years have created unprecedented opportunities for scientific research driven by statistical analysis. Inherent limitations of statistics impose constraints on the reliability of conclusions drawn from data, and misuse of statistical methods is a growing concern. Significance, hypothesis testing, and the accompanying P-values are being scrutinized as representing the most widely applied and abused practices. One line of critique is that P-values are inherently unfit to fulfill their ostensible role as measures of a scientific hypothesis's credibility. It has also been suggested that, while P-values may have a role as summary measures of effect, researchers underappreciate the degree of randomness in the P-value. High variability of P-values would suggest that, having obtained a small P-value in one study, one is nevertheless likely to obtain a much larger P-value in a similarly powered replication study. Thus, the replicability of the P-value is itself questionable. To characterize P-value variability, one can use prediction intervals whose endpoints reflect the likely spread of P-values that could be obtained in a replication study. Unfortunately, the intervals currently in use, the P-intervals, are based on unrealistic implicit assumptions. Namely, P-intervals are constructed under assumptions that imply a substantial chance of encountering large effect sizes in an observational study, which leads to bias. As an alternative to P-intervals, we develop a method that gives researchers flexibility by providing them with the means to control these assumptions. Unlike the endpoints of P-intervals, the endpoints of our intervals are directly interpretable as probabilistic bounds for replication P-values and are resistant to selection bias, contingent upon approximate prior knowledge of the effect size distribution.
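As an illustration of the replication variability discussed above, the following is a minimal Monte Carlo sketch (not the authors' method). It assumes a two-sided z-test and, as standard P-intervals implicitly do, that the replication's true effect equals the effect estimated in the original study; the chosen P-value and percentiles are illustrative only.

```python
# Hedged sketch: Monte Carlo illustration of replication P-value spread.
# Assumes a two-sided z-test and that the replication's true effect equals
# the originally estimated effect (the implicit assumption the abstract
# criticizes); numbers below are illustrative, not taken from the paper.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p_original = 0.01                       # observed P-value in the original study
z_obs = norm.isf(p_original / 2)        # |z| corresponding to the two-sided P

# Replication z-statistics: same assumed true effect, unit standard error.
z_rep = rng.normal(loc=z_obs, scale=1.0, size=100_000)
p_rep = 2 * norm.sf(np.abs(z_rep))      # two-sided replication P-values

lo, hi = np.percentile(p_rep, [10, 90])
print(f"80% prediction interval for the replication P-value: [{lo:.4f}, {hi:.4f}]")
```

Even with an original P-value of 0.01, the simulated interval is wide, which is the variability the abstract points to.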
We provide accessible insight into the current replication crisis in statistical science by revisiting the old metaphor of a court trial as a hypothesis test. Inter alia, we define and diagnose harmful statistical witch-hunting in both justice and science, which extends to the replication crisis itself, where a hunt for p-values is currently underway.
Increasing accessibility of data to researchers makes it possible to conduct massive amounts of statistical testing. Rather than follow a carefully crafted set of scientific hypotheses with statistical analysis, researchers can now test many possible relations and let P-values or other statistical summaries generate hypotheses for them. The field of genetic epidemiology is an illustrative case of this paradigm shift. Driven by technological advances, testing a handful of genetic variants in relation to a health outcome has been abandoned in favor of agnostic screening of the entire genome, followed by selection of top hits, e.g., by selection of genetic variants with the smallest association P-values. At the same time, the nearly total lack of replication of claimed associations that had been shaming the field has turned into a flow of reports whose findings replicate robustly. Researchers may have adopted better statistical practices by learning from past failures, but we suggest that the steep increase in the amount of statistical testing itself is an important factor. Regardless of whether statistical significance has been reached, an increased number of tested hypotheses leads to enrichment of the smallest P-values with genuine associations. In this study, we quantify how the expected proportion of genuine signals (EPGS) among top hits changes with an increasing number of tests. When the rate of occurrence of genuine signals does not decrease too sharply to zero as more tests are performed, the smallest P-values are increasingly likely to represent genuine associations in studies with more tests.
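The enrichment effect described above can be illustrated with a small simulation sketch under simple assumptions: a fixed proportion of genuine signals, a single effect size, and a fixed number of "top hits". These choices are illustrative and are not the EPGS calculation from the study.

```python
# Hedged sketch: simulation of how the proportion of genuine signals among
# the smallest P-values can grow with the number of tests. The mixture
# proportion, effect size, and top-hit cutoff are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def prop_genuine_in_top(n_tests, pi1=0.01, effect_z=3.0, top_k=20):
    """Fraction of genuine signals among the top_k smallest P-values."""
    genuine = rng.random(n_tests) < pi1              # which tests carry a real effect
    z = rng.normal(loc=np.where(genuine, effect_z, 0.0), scale=1.0)
    p = 2 * norm.sf(np.abs(z))                       # two-sided P-values
    top = np.argsort(p)[:top_k]                      # indices of the top hits
    return genuine[top].mean()

for m in (1_000, 10_000, 100_000, 1_000_000):
    props = [prop_genuine_in_top(m) for _ in range(20)]
    print(f"{m:>9} tests: ~{np.mean(props):.2f} of top hits are genuine")
```

With the rate of genuine signals held constant, the top hits become increasingly dominated by genuine associations as the number of tests grows, consistent with the abstract's claim.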
With the advent of high-throughput sequencing technologies, the fields of immunogenomics and adaptive immune receptor repertoire research are facing both opportunities and challenges. Adaptive immune receptor repertoire sequencing (AIRR-seq) has become an increasingly important tool to characterize T and B cell responses in settings of interest. However, the majority of AIRR-seq studies conducted so far were performed in individuals of European ancestry, restricting the ability to identify variation in human adaptive immune responses across populations and limiting their applications. As AIRR-seq studies depend on the ability to assign VDJ sequence reads to the correct germline gene segments, efforts to characterize the genomic loci that encode adaptive immune receptor genes in different populations are urgently needed. The availability of comprehensive germline gene databases and further applications of AIRR-seq studies to individuals of non-European ancestry will substantially enhance our understanding of human adaptive immune responses, promote the development of effective diagnostics and treatments, and eventually advance precision medicine.
The Cambrian explosion is a grand challenge to science today and requires multidisciplinary study. The event is generally believed to be a result of genetic innovations, environmental factors, and ecological interactions, even though there are many conflicting views on the nature and timing of metazoan origins. The crux of the matter is that an entire roadmap of the evolution is missing, making it difficult to discern the transition in biological complexity and to evaluate the critical role of the Cambrian explosion in the overall evolutionary context. Here we calculate the time of the Cambrian explosion with an innovative and accurate C-value clock; our result (560 million years ago) fits the fossil record well. We argue that intrinsic mechanisms of genome evolution determined the Cambrian explosion. We have found a general formula for evaluating the genome size of different species, by which major questions of the C-value enigma can be resolved and genome size evolution can be illustrated. The Cambrian explosion is essentially a major transition in biological complexity, corresponding to a turning point in genome size evolution. The observed maximum prokaryotic complexity is a relic of the Cambrian explosion and is constrained by the maximum information storage capability in the observed universe. Our results open a new prospect for studying metazoan origins and molecular evolution.
According to the National Cancer Institute, there were 9.5 million cancer-related deaths in 2018. A challenge in improving treatment is resistance in genetically unstable cells. The purpose of this study is to evaluate unsupervised machine learning for classifying treatment-resistant phenotypes in heterogeneous tumors through analysis of single cell RNA sequencing (scRNAseq) data with a pipeline and evaluation metrics. scRNAseq quantifies mRNA in cells and characterizes cell phenotypes. One scRNAseq dataset was analyzed (tumor and non-tumor cells of different molecular subtypes and patient identifications). The pipeline consisted of data filtering, dimensionality reduction with Principal Component Analysis, projection with Uniform Manifold Approximation and Projection, clustering with nine approaches (Ward, BIRCH, Gaussian Mixture Model, DBSCAN, Spectral, Affinity Propagation, Agglomerative Clustering, Mean Shift, and K-Means), and evaluation. Seven models separated tumor from non-tumor cells and distinguished molecular subtypes, while six models classified patient identity (13 patients were present in the dataset); K-Means, Ward, and BIRCH often ranked highest, with roughly 80% accuracy on the tumor versus non-tumor task and roughly 60% for molecular subtype and patient ID. An optimized classification pipeline using the K-Means, Ward, and BIRCH models was found to be most effective for further analysis. In clinical research, where there is currently no standard protocol for scRNAseq analysis, clusters generated from this pipeline can be used to understand cancer cell behavior and malignant growth, directly affecting the success of treatment.
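A minimal sketch of a pipeline along the lines described above is given below, using scikit-learn and umap-learn on a synthetic expression matrix. The filtering threshold, number of principal components, and cluster counts are illustrative assumptions rather than the study's settings, and only three of the nine clustering approaches are shown.

```python
# Hedged sketch of a filtering -> PCA -> UMAP -> clustering pipeline.
# The expression matrix is synthetic and all parameter choices are
# illustrative, not taken from the study.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, Birch
from sklearn.metrics import adjusted_rand_score
import umap  # umap-learn

rng = np.random.default_rng(42)

# Synthetic counts: 300 "cells" x 2000 "genes", two crude populations
# standing in for tumor vs non-tumor cells.
labels_true = np.repeat([0, 1], 150)
counts = rng.poisson(lam=np.where(labels_true[:, None] == 0, 2.0, 3.5),
                     size=(300, 2000)).astype(float)

# Filtering: keep genes detected in at least 10 cells, then log-transform.
keep = (counts > 0).sum(axis=0) >= 10
X = np.log1p(counts[:, keep])

# Dimensionality reduction: PCA to 30 components, then UMAP to 2D.
pcs = PCA(n_components=30, random_state=0).fit_transform(X)
embedding = umap.UMAP(random_state=0).fit_transform(pcs)

# Clustering with three of the nine approaches mentioned in the abstract,
# evaluated against the known labels with the adjusted Rand index.
models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Ward": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "BIRCH": Birch(n_clusters=2),
}
for name, model in models.items():
    pred = model.fit_predict(embedding)
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(labels_true, pred):.2f}")
```

On real scRNAseq data, the same structure applies, with the synthetic matrix replaced by a quality-filtered count matrix and the number of clusters set to the task at hand (tumor status, molecular subtype, or patient identity).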