Historically, most statistical association methods have been designed assuming availability of SNP-level information. However, modern genetic and sequencing data present new challenges to accessing and sharing genotype-phenotype datasets, including cost management and difficulties in consolidating records across research groups. These issues make methods based on SNP-level summary statistics for a joint analysis of variants in a group particularly appealing. The most common form of combining statistics is a sum of SNP-level squared scores, possibly weighted, as in burden tests for rare variants. The overall significance of the resulting statistic is evaluated using its distribution under the null hypothesis. Here, we demonstrate that this basic approach can be substantially improved by decorrelating scores prior to their addition, resulting in remarkable power gains in situations that are most commonly encountered in practice, namely, under heterogeneity of effect sizes and diversity in pairwise LD. In these situations, the power of the traditional test, based on the added squared scores, quickly reaches a ceiling as the number of variants increases. Thus, the traditional approach does not benefit from information potentially contained in any additional SNPs, while our decorrelation by orthogonal transformation (DOT) method yields a steady gain in power. We present theoretical and computational analyses of both approaches and reveal the causes behind the sometimes dramatic difference in their respective powers. We showcase DOT by analyzing breast cancer data, in which our method strengthened levels of previously reported associations and implied the possibility of multiple new alleles that jointly confer breast cancer risk.
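The decorrelation idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`naive_statistic`, `dot_statistic`), the two-SNP toy correlation matrix, and the use of the inverse symmetric square root as the orthogonalizing transform are assumptions made for the sketch.

```python
import numpy as np

def naive_statistic(z):
    """Traditional set statistic: sum of squared SNP-level scores."""
    return float(z @ z)

def dot_statistic(z, sigma):
    """Decorrelate scores via the inverse symmetric square root of their
    null correlation matrix (LD), then sum the squares. Under H0 the
    result is chi-squared with len(z) degrees of freedom."""
    vals, vecs = np.linalg.eigh(sigma)                # sigma = V diag(vals) V^T
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T  # sigma^{-1/2}
    x = inv_sqrt @ z                                  # decorrelated scores
    return float(x @ x)

# Two SNPs in strong positive LD.
sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

print(naive_statistic(np.array([3.0, 3.0])))        # 18.0
print(dot_statistic(np.array([3.0, 3.0]), sigma))   # ~9.5: concordant, LD-shared signal is discounted
print(dot_statistic(np.array([3.0, -3.0]), sigma))  # ~180: heterogeneous effects are amplified
```

The last two lines illustrate the abstract's point: the plain sum of squares assigns both score vectors the same value (18.0), whereas after decorrelation heterogeneous effect directions, which LD makes improbable under the null, yield a far larger statistic.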
Summary: Interpretation and prioritization of candidate hits from genome-scale screening experiments represent a significant analytical challenge, particularly when it comes to understanding cancer relevance. We have developed a flexible tool that substantially refines gene set interpretability in cancer by leveraging a broad scope of prior knowledge unavailable in existing frameworks, including data on target tractabilities, tumor-type association strengths, protein complexes and protein-protein interactions, tissue and cell-type expression specificities, subcellular localizations, prognostic associations, cancer dependency maps, and information on genes of poorly defined or unknown function. Availability: oncoEnrichR is developed in R and is freely available as a stand-alone R package. A web interface to oncoEnrichR is provided through the Galaxy framework (https://oncotools.elixir.no). All code is open-source under the MIT license, with documentation, example datasets and instructions for usage available at https://github.com/sigven/oncoEnrichR/. Contact: [email protected]
Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signals within a set, can capture interplay among variants, and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger data sets are used to increase power. Results: We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two-random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and the score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured GAW14 data demonstrates that our method successfully corrects for population structure and family relatedness, while application to a 15,000-individual Crohn's disease case-control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. Availability: A Python-based library implementing our approach is available at http://mscompbio.codeplex.com
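The two-random-effects model can be sketched as follows. This is an illustrative, naive evaluation only: the names (`lmm2_loglik`, `lrt_statistic`), the toy kinship construction, and the fixed (unmaximized) variance parameters are my assumptions, and the O(n^3) solve below is precisely the cost the paper's computational speedup avoids.

```python
import numpy as np

def lmm2_loglik(y, K_set, K_conf, s2_set, s2_conf, s2_e):
    """Log-likelihood of y ~ N(0, V) with V = s2_set*K_set + s2_conf*K_conf
    + s2_e*I: one random effect for the variant-set signal and one for
    confounders (family relatedness / population structure)."""
    n = len(y)
    V = s2_set * K_set + s2_conf * K_conf + s2_e * np.eye(n)
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)  # naive O(n^3) per evaluation
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

def lrt_statistic(y, K_set, K_conf, s2_set, s2_conf, s2_e):
    """Likelihood-ratio statistic against the null s2_set = 0
    (variance parameters held fixed here, purely for illustration)."""
    alt = lmm2_loglik(y, K_set, K_conf, s2_set, s2_conf, s2_e)
    null = lmm2_loglik(y, K_set, K_conf, 0.0, s2_conf, s2_e)
    return 2.0 * (alt - null)

# Toy data: kinship-like matrices built from random genotypes.
rng = np.random.default_rng(1)
G_set, G_all = rng.normal(size=(50, 10)), rng.normal(size=(50, 200))
K_set, K_conf = G_set @ G_set.T / 10, G_all @ G_all.T / 200
y = rng.normal(size=50)
print(lrt_statistic(y, K_set, K_conf, 0.3, 0.3, 1.0))
```

In practice the variance components are estimated by (restricted) maximum likelihood before forming the test statistic; the sketch only shows where the two random effects enter the covariance.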
Linear mixed models (LMMs) are a powerful and established tool for studying genotype-phenotype relationships. A limiting assumption of LMMs is that the residuals are Gaussian distributed, a requirement that rarely holds in practice. Violations of this assumption can lead to false conclusions and losses in power, and hence it is common practice to pre-process the phenotypic values to make them Gaussian, for instance by applying logarithmic or other non-linear transformations. Unfortunately, different phenotypes require different specific transformations, and choosing a good transformation is in general challenging and subjective. Here, we present an extension of the LMM that estimates an optimal transformation from the observed data. In extensive simulations and applications to real data from human, mouse and yeast, we show that using such optimal transformations leads to increased power in genome-wide association studies and higher accuracy in heritability estimates and phenotype predictions.
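The idea of estimating the transformation from the data, rather than picking one by hand, can be illustrated with a simple one-parameter stand-in: a Box-Cox family fitted by maximum likelihood with the Jacobian term included. This is not the paper's model (which learns richer warping functions jointly with the LMM); the function names and grid are assumptions for the sketch.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform; the lam -> 0 limit is the log transform."""
    if abs(lam) < 1e-8:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def profile_loglik(y, lam):
    """Gaussian log-likelihood of the transformed phenotype, profiled
    over mean and variance, plus the Jacobian of the transformation
    (derivative y**(lam-1)), so different lam values are comparable."""
    z = boxcox(y, lam)
    n = len(y)
    jac = (lam - 1.0) * np.sum(np.log(y))
    return -0.5 * n * (np.log(2.0 * np.pi * z.var()) + 1.0) + jac

def best_lambda(y, grid=np.linspace(-2.0, 2.0, 81)):
    """Pick the transformation parameter that maximizes the likelihood."""
    return max(grid, key=lambda lam: profile_loglik(y, lam))

# Log-normal phenotype: the log transform (lam ~ 0) should be recovered.
rng = np.random.default_rng(0)
y = np.exp(rng.normal(size=500))
print(best_lambda(y))
```

The same principle (maximize the Gaussian likelihood over a family of monotone transformations, Jacobian included) is what lets the transformation be learned from data instead of chosen subjectively.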
Stratifying cancer patients based on their gene expression levels allows improved diagnosis, survival analysis and treatment planning. However, such data are extremely high-dimensional, containing expression values for over 20,000 genes per patient, while the number of samples in the datasets is low. To deal with such settings, we propose to incorporate prior biological knowledge about genes from ontologies into the machine learning system for the task of patient classification given their gene expression data. We use ontology embeddings that capture the semantic similarities between the genes to direct a Graph Convolutional Network, and thereby sparsify the network connections. We show that this approach provides an advantage for predicting clinical targets from high-dimensional, low-sample data.
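A minimal sketch of the sparsification idea: connect two genes only when their ontology embeddings are similar, and let the resulting sparse adjacency drive a graph-convolution step. The cosine-similarity measure, the threshold value, and the mean-aggregation propagation are my choices for illustration; the paper's embedding method and cut-off may differ.

```python
import numpy as np

def sparsified_adjacency(emb, threshold=0.5):
    """Gene-gene adjacency from ontology embeddings: an edge exists only
    where cosine similarity exceeds the threshold, sparsifying the graph
    that a GCN layer aggregates over."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    adj = (unit @ unit.T > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-edges; added back below
    return adj

# Toy embeddings: genes 0 and 1 are semantically close, gene 2 is not.
emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0]])
A = sparsified_adjacency(emb)

# One mean-aggregation GCN-style propagation over the sparsified graph.
X = np.array([[1.0], [2.0], [4.0]])   # toy per-gene features
A_hat = A + np.eye(3)                 # add self-loops
H = (A_hat / A_hat.sum(axis=1, keepdims=True)) @ X
print(A)
print(H.ravel())                      # gene 2 keeps its own feature
```

Only the similar pair (genes 0 and 1) exchanges information during propagation; the dissimilar gene 2 is left disconnected, which is the sparsification effect the abstract describes.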
The analysis of differential gene expression from RNA-Seq data has become a standard in several research areas, mainly involving bioinformatics. The computational analysis of these data involves many data types and file formats, and a wide variety of computational tools that can be applied alone or together as pipelines. This paper presents a review of the differential expression analysis pipeline, addressing its steps, their respective objectives, and the principal methods available at each step together with their properties, providing an organized overview of the field. In particular, this review focuses on the aspects involved in differentially expressed gene (DEG) analysis from RNA sequencing (RNA-Seq) data, considering the computational methods and their properties. In addition, a timeline of the evolution of computational methods for DEG analysis is presented and discussed, and the relationships among the main computational tools are represented as an interaction network. A discussion of the challenges and gaps in DEG analysis is also included.