While measurement advances now allow extensive surveys of gene activity (large numbers of genes across many samples), interpretation of these data is often confounded by noise -- expression counts can differ strongly across samples due to variation of both biological and experimental origin. Complementary to perturbation approaches, we extract functionally related groups of genes by analyzing the standing variation within a sampled population. To distinguish biologically meaningful patterns from uninterpretable noise, we focus on correlated variation and develop a novel density-based clustering approach that takes advantage of a percolation transition generically arising in random, uncorrelated data. We apply our approach to two contrasting RNA sequencing data sets that sample individual variation -- across single cells of fission yeast and whole animals of C. elegans worms -- and demonstrate robust applicability and versatility in revealing correlated gene clusters of diverse biological origin, including cell cycle phase, development/reproduction, tissue-specific functions, and feeding history. Our technique exploits generic features of noisy high-dimensional data and is applicable, beyond gene expression, to feature-rich data that sample population-level variability in the presence of noise.
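As a rough illustration of the percolation idea (not the authors' clustering algorithm, whose details the abstract does not give), the sketch below links two genes whenever the absolute value of their correlation reaches a threshold and reports the connected components. Lowering the threshold eventually merges everything into one giant component; groups that stay connected above that point are candidates for meaningful clusters. The function names and the toy correlation matrix are hypothetical.

```python
from itertools import combinations


def components_at_threshold(corr, threshold):
    """Connected components of the graph linking i and j when |corr[i][j]| >= threshold."""
    n = len(corr)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if abs(corr[i][j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Largest component first.
    return sorted(groups.values(), key=len, reverse=True)


def largest_fraction(corr, threshold):
    """Fraction of all nodes that belong to the largest component."""
    return len(components_at_threshold(corr, threshold)[0]) / len(corr)


# Toy (made-up) correlation matrix: genes 0-1 and genes 2-3 form two tight pairs.
toy_corr = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

for thr in (0.95, 0.5, 0.05):
    print(thr, components_at_threshold(toy_corr, thr), largest_fraction(toy_corr, thr))
```

On real data, one would scan the threshold and look for the point where the largest component abruptly spans most genes; how the authors use that transition to set cluster density is not specified here.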
The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.
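The network construction step can be sketched as follows: each sequence is a node, and two nodes are linked when they share an exact substring of at least a chosen length. Sharing a substring of length at least k is equivalent to sharing at least one k-mer, which is what the code checks. The threshold `min_len`, the function names, and the toy sequences are assumptions for illustration; the abstract does not state how "significant length" is determined, and the stochastic block model step is not shown.

```python
from itertools import combinations


def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_match_edges(seqs, min_len):
    """Edges between named sequences that share an exact match of length >= min_len."""
    kmer_sets = {name: kmers(s, min_len) for name, s in seqs.items()}
    return sorted(
        (a, b)
        for a, b in combinations(sorted(seqs), 2)
        if kmer_sets[a] & kmer_sets[b]
    )


# Toy (made-up) sequences; s1 and s2 share the block "DEFGH".
toy_seqs = {
    "s1": "ACDEFGHIK",
    "s2": "WWDEFGHYY",
    "s3": "MNPQRSTVW",
}

print(shared_match_edges(toy_seqs, 5))  # [('s1', 's2')]
print(shared_match_edges(toy_seqs, 6))  # []
```

The resulting edge list could then be passed to a community detection method of choice; which inference procedure the authors use for the block model is not described in this abstract.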
We show that the specific distribution of gene lengths observed in natural genomes might result from a growth process in which a single length scale $L(t)$ develops and grows with time as $t^{1/3}$. This length scale could be associated with the length of the longest gene in an evolving genome. The growth kinetics of the genes resembles that observed in physical systems with a conserved order parameter. We show that in the genome this conservation is guaranteed by compositional compensation, along DNA strands, of the purine-like trends introduced by genes. The mathematical model presented is a modified Bak-Sneppen model of self-organized criticality applied to a one-dimensional system of $N$ spins. The spins take discrete values, which represent gene lengths.
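A growth law of the form $L(t) \propto t^{1/3}$ appears as a straight line of slope 1/3 on a log-log plot, so the exponent can be estimated by a least-squares fit of $\log L$ against $\log t$. The sketch below shows that fit on synthetic values generated from the law itself; it does not simulate the modified Bak-Sneppen model, and the function name is hypothetical.

```python
import math


def growth_exponent(times, lengths):
    """Least-squares slope of log(length) versus log(time)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(length) for length in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic data that follows L(t) = t^(1/3) exactly (not model output).
times = [1, 8, 27, 64, 125]
lengths = [t ** (1 / 3) for t in times]

print(round(growth_exponent(times, lengths), 6))  # 0.333333
```

Applied to the longest gene length recorded at successive simulation times, a fitted slope close to 1/3 would be consistent with the conserved-order-parameter kinetics described above.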
Motivation: In this paper we present the latest release of EBIC, a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is support for big data, which makes it possible to run large genomic data mining analyses efficiently. Additional enhancements include integration with R and Bioconductor and an option to remove the influence of missing values on the final result. Results: EBIC was applied to datasets of different sizes, including a large DNA methylation dataset with 436,444 rows. For the largest dataset we observed an over 6.6-fold speedup in computation time on a cluster of 8 GPUs compared to running the method on a single GPU. This demonstrates the high scalability of the algorithm. Availability: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic . Installation and usage instructions are also available online.
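To put the reported speedup in context, parallel efficiency is commonly defined as the speedup divided by the number of devices. The short calculation below applies that standard definition to the figures quoted above (a speedup of 6.6 on 8 GPUs); the function name is hypothetical and the result is a lower bound, since the reported speedup is "over 6.6".

```python
def parallel_efficiency(speedup, n_devices):
    """Speedup per device relative to ideal linear scaling (1.0 = perfect)."""
    return speedup / n_devices


efficiency = parallel_efficiency(6.6, 8)
print(f"{efficiency:.3f}")  # 0.825
```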
Motivation: Omics data, such as transcriptomics or phosphoproteomics, are broadly used to get a snapshot of the molecular status of cells. In particular, changes in omics data can be used to estimate the activity of pathways, transcription factors and kinases based on their known regulated targets, which we call footprints. The molecular paths driving these activities can then be estimated using causal reasoning on large signaling networks. Results: We have developed FUNKI, a FUNctional toolKIt for footprint analysis. It provides a user-friendly interface for easy and fast analysis of several types of omics data, from either bulk or single-cell experiments. FUNKI also features different options to visualise the results and run post-analyses, and is mirrored as a scripted version in R. Availability: FUNKI is a free and open-source application built on R and Shiny, available on GitHub at https://github.com/saezlab/ShinyFUNKI under the GNU v3.0 license and also accessible at https://saezlab.shinyapps.io/funki/ Contact: [email protected] Supplementary information: We provide data examples within the app, as well as extensive information about the different variables to select, the results, and the different plots in the help page.
The leaves of the Coriandrum sativum plant, known as cilantro or coriander, are widely used in many cuisines around the world. However, far from being a benign culinary herb, cilantro can be polarizing---many people love it while others claim that it tastes or smells foul, often like soap or dirt. This soapy or pungent aroma is largely attributed to several aldehydes present in cilantro. Cilantro preference is suspected to have a genetic component, yet to date nothing is known about specific mechanisms. Here we present the results of a genome-wide association study among 14,604 participants of European ancestry who reported whether cilantro tasted soapy, with replication in a distinct set of 11,851 participants who declared whether they liked cilantro. We find a single nucleotide polymorphism (SNP) significantly associated with soapy-taste detection that is confirmed in the cilantro preference group. This SNP, rs72921001 (p=6.4e-9, odds ratio 0.81 per A allele), lies within a cluster of olfactory receptor genes on chromosome 11. Among these olfactory receptor genes is OR6A2, which has a high binding specificity for several of the aldehydes that give cilantro its characteristic odor. We also estimate the heritability of cilantro soapy-taste detection in our cohort, showing that the heritability tagged by common SNPs is low, about 0.087. These results confirm that there is a genetic component to cilantro taste perception and suggest that cilantro dislike may stem from genetic variants in olfactory receptors. We propose that OR6A2 may be the olfactory receptor that contributes to the detection of a soapy smell from cilantro in European populations.
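To make the per-allele odds ratio concrete, the sketch below converts it into a probability under the usual multiplicative (log-additive) reading, in which each copy of the A allele multiplies the odds of soapy-taste detection by 0.81. The baseline probability used here is a made-up value for illustration only; the abstract does not report the proportion of participants who detect a soapy taste, and the function name is hypothetical.

```python
def prob_with_alleles(baseline_prob, odds_ratio, n_alleles):
    """Probability after multiplying the baseline odds by odds_ratio once per allele copy."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    odds = baseline_odds * odds_ratio ** n_alleles
    return odds / (1 + odds)


HYPOTHETICAL_BASELINE = 0.5  # assumption, not a figure from the study

for copies in (0, 1, 2):
    p = prob_with_alleles(HYPOTHETICAL_BASELINE, 0.81, copies)
    print(copies, round(p, 4))
```

With an odds ratio below 1, each additional A allele lowers the odds of reporting a soapy taste; the size of the change in probability depends on the baseline, which is why the baseline has to be stated explicitly.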