No Arabic abstract
The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cells types in heterogeneous populations like cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and present a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.
Single cell responses are shaped by the geometry of signaling kinetic trajectories carved in a multidimensional space spanned by signaling protein abundances. It is however challenging to assay large number (>3) of signaling species in live-cell imaging which makes it difficult to probe single cell signaling kinetic trajectories in large dimensions. Flow and mass cytometry techniques can measure a large number (4 - >40) of signaling species but are unable to track single cells. Thus cytometry experiments provide detailed time stamped snapshots of single cell signaling kinetics. Is it possible to use the time stamped cytometry data to reconstruct single cell signaling trajectories? Borrowing concepts of conserved and slow variables from non-equilibrium statistical physics we develop an approach to reconstruct signaling trajectories using snapshot data by creating new variables that remain invariant or vary slowly during the signaling kinetics. We apply this approach to reconstruct trajectories using snapshot data obtained from in silico simulations and live-cell imaging measurements. The use of invariants and slow variables to reconstruct trajectories provides a radically different way to track object using snapshot data. The approach is likely to have implications for solving matching problems in a wide range of disciplines.
The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational methods have been developed for detecting doublets from scRNA-seq data. We developed an R package DoubletCollection to integrate the installation and execution of eight doublet-detection methods. DoubletCollection also provides a unified interface to perform and visualize downstream analysis after doublet detection. Here, we present a protocol of using DoubletCollection to benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.
According to the National Cancer Institute, there were 9.5 million cancer-related deaths in 2018. A challenge in improving treatment is resistance in genetically unstable cells. The purpose of this study is to evaluate unsupervised machine learning on classifying treatment-resistant phenotypes in heterogeneous tumors through analysis of single cell RNA sequencing(scRNAseq) data with a pipeline and evaluation metrics. scRNAseq quantifies mRNA in cells and characterizes cell phenotypes. One scRNAseq dataset was analyzed (tumor/non-tumor cells of different molecular subtypes and patient identifications). The pipeline consisted of data filtering, dimensionality reduction with Principal Component Analysis, projection with Uniform Manifold Approximation and Projection, clustering with nine approaches (Ward, BIRCH, Gaussian Mixture Model, DBSCAN, Spectral, Affinity Propagation, Agglomerative Clustering, Mean Shift, and K-Means), and evaluation. Seven models divided tumor versus non-tumor cells and molecular subtype while six models classified different patient identification (13 of which were presented in the dataset); K-Means, Ward, and BIRCH often ranked highest with ~80% accuracy on the tumor versus non-tumor task and ~60% for molecular subtype and patient ID. An optimized classification pipeline using K-Means, Ward, and BIRCH models was evaluated to be most effective for further analysis. In clinical research where there is currently no standard protocol for scRNAseq analysis, clusters generated from this pipeline can be used to understand cancer cell behavior and malignant growth, directly affecting the success of treatment.
The collection of immunoglobulin genes in an individuals germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the samples true set of germline V alleles. We then describe a new method for inferring each individuals germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at https://github.com/psathyrella/partis, and is run by default without affecting overall run time.
We develop a theoretical approach that uses physiochemical kinetics modelling to describe cell population dynamics upon progression of viral infection in cell culture, which results in cell apoptosis (programmed cell death) and necrosis (direct cell death). Several model parameters necessary for computer simulation were determined by reviewing and analyzing available published experimental data. By comparing experimental data to computer modelling results, we identify the parameters that are the most sensitive to the measured system properties and allow for the best data fitting. Our model allows extraction of parameters from experimental data and also has predictive power. Using the model we describe interesting time-dependent quantities that were not directly measured in the experiment, and identify correlations among the fitted parameter values. Numerical simulation of viral infection progression is done by a rate-equation approach resulting in a system of stiff equations, which are solved by using a novel variant of the stochastic ensemble modelling approach. The latter was originally developed for coupled chemical reactions.