In most of the recent immunological literature, differences across antigen receptor populations are examined via non-parametric statistical measures of species overlap and diversity borrowed from ecological studies. While this approach is robust in a wide range of situations, it seems to provide little insight into the underlying clonal size distribution and the overall mechanism differentiating the receptor populations. As a possible alternative, the current paper presents a parametric method that adjusts for data under-sampling and provides a unifying approach to the simultaneous comparison of multiple receptor groups by means of modern statistical tools of unsupervised learning. The parametric model is based on a flexible multivariate Poisson-lognormal distribution and is seen to be a natural generalization of the univariate Poisson-lognormal models used in ecological studies of biodiversity patterns. The procedure for evaluating model fit is described, along with the public domain software developed to perform the necessary diagnostics. The model-driven analysis is seen to compare favorably with traditional methods when applied to data from T-cell receptors in transgenic mouse populations.
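As a rough illustration of the univariate building block mentioned above, the following sketch evaluates a Poisson-lognormal probability mass function by numerical integration, together with a zero-truncated variant. It is not the paper's software: the function names, the trapezoid rule, the integration width, and the use of zero truncation as a stand-in for unobserved clones are all assumptions made for illustration.

```python
import math


def poisson_lognormal_pmf(n, mu, sigma, grid_points=2001, width=8.0):
    """Approximate P(N = n) for N | x ~ Poisson(exp(x)), x ~ Normal(mu, sigma^2).

    The integral over x is evaluated with a trapezoid rule on
    [mu - width*sigma, mu + width*sigma] (an illustrative choice).
    """
    lo = mu - width * sigma
    hi = mu + width * sigma
    step = (hi - lo) / (grid_points - 1)
    log_norm_const = math.log(sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(grid_points):
        x = lo + i * step
        # Log of the Poisson pmf with rate exp(x), kept in log space for stability.
        log_poisson = n * x - math.exp(x) - math.lgamma(n + 1)
        # Log of the Normal(mu, sigma^2) density at x.
        log_normal = -0.5 * ((x - mu) / sigma) ** 2 - log_norm_const
        weight = 0.5 if i in (0, grid_points - 1) else 1.0
        total += weight * math.exp(log_poisson + log_normal)
    return total * step


def zero_truncated_pmf(n, mu, sigma):
    """P(N = n | N >= 1): the distribution of observed (non-zero) clone counts."""
    if n < 1:
        return 0.0
    p_zero = poisson_lognormal_pmf(0, mu, sigma)
    return poisson_lognormal_pmf(n, mu, sigma) / (1.0 - p_zero)
```

For example, `poisson_lognormal_pmf(0, 0.0, 1.0)` gives the probability that a clone is missed entirely under the assumed parameters, which is the quantity the zero-truncated form renormalizes by.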
B cells develop high-affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e., in the same clonal family) are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called substitution profiles, are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many high-quality large B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method Substitution Profiles Using Related Families (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on an external dataset. Furthermore, we provide a command-line tool in an open-source software package (https://github.com/krdav/SPURF) implementing these ideas and providing easy prediction using our pre-fit models.
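To make the notion of a substitution profile concrete, the sketch below computes the empirical per-position amino acid frequencies for a set of aligned sequences from one clonal family. This is only the quantity that a method like SPURF aims to predict, not the SPURF model or its command-line interface; the function name and the toy sequences are hypothetical.

```python
from collections import Counter


def substitution_profile(aligned_seqs):
    """Return a list with one dict per alignment position,
    mapping each observed amino acid to its relative frequency."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile


# Toy clonal family (made-up sequences, for illustration only).
family = ["CARD", "CARE", "CTRD", "CARD"]
profile = substitution_profile(family)
```

With a single recovered sequence, every position of this empirical profile degenerates to a single residue with frequency 1, which is the limitation the abstract describes.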
The nicotinic acetylcholine receptor (nAChR) is the prototypic member of the 'Cys-loop' superfamily of ligand-gated ion channels, which mediate synaptic neurotransmission and whose other members include receptors for glycine, gamma-aminobutyric acid, and serotonin. Cryo-electron microscopy has yielded a three-dimensional structure of the nAChR in its closed state. However, the exact nature and location of the channel gate remain uncertain. Although the transmembrane pore is constricted close to its center, it is not completely occluded. Rather, the pore has a central hydrophobic zone of radius about 3 Å. Model calculations suggest that such a constriction may form a hydrophobic gate, preventing movement of ions through the channel. We present a detailed and quantitative simulation study of the hydrophobic gating model of the nicotinic receptor in order to fully evaluate this hypothesis. We demonstrate that the hydrophobic constriction of the nAChR pore indeed forms a closed gate. Potential of mean force (PMF) calculations reveal that the constriction presents a barrier of height ca. 10 kT to the permeation of sodium ions, placing an upper bound on the closed channel conductance of 0.3 pS. Thus, a 3 Å radius hydrophobic pore can form a functional barrier to the permeation of a 1 Å radius Na+ ion. Using a united-atom force field for the protein instead of an all-atom one retains the qualitative features but results in differing conductances, showing that the PMF is sensitive to the detailed molecular interactions.
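To give a sense of scale for a barrier of about 10 kT, the sketch below computes the corresponding Boltzmann factor and converts the barrier to kJ/mol. It does not reproduce the conductance estimate from the study; the temperature of 300 K and the function names are assumptions chosen for illustration.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol K)


def boltzmann_factor(barrier_kt):
    """Relative probability of finding an ion at the barrier top
    compared with the bulk, for a barrier expressed in units of kT."""
    return math.exp(-barrier_kt)


def barrier_kj_per_mol(barrier_kt, temp_k=300.0):
    """Convert a barrier in units of kT to kJ/mol at an assumed temperature."""
    return barrier_kt * GAS_CONSTANT * temp_k / 1000.0


factor = boltzmann_factor(10.0)     # roughly 4.5e-5
barrier = barrier_kj_per_mol(10.0)  # roughly 25 kJ/mol at 300 K
```

On this simple estimate, the ion density at the constriction is suppressed by more than four orders of magnitude relative to bulk.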
Quantifying interactions in DNA microarrays is of central importance for a better understanding of their functioning. Hybridization thermodynamics for nucleic acid strands in aqueous solution can be described by the so-called nearest-neighbor model, which estimates the hybridization free energy of a given sequence as a sum of dinucleotide terms. Compared with its solution counterpart, hybridization in DNA microarrays may be hindered by the presence of a solid surface and of a high density of DNA strands. We present here a study aimed at the determination of hybridization free energies in DNA microarrays. Experiments are performed on custom Agilent slides. The solution contains a single oligonucleotide. The microarray contains spots with a perfectly matching complementary sequence and other spots with one or two mismatches: in total 1006 different probe spots, each replicated 15 times per microarray. The free energy parameters are fitted directly from microarray data. The experiments demonstrate a clear correlation between hybridization free energies in the microarray and in solution. The experiments are fully consistent with the Langmuir model at low intensities, but show a clear deviation at intermediate (non-saturating) intensities. These results provide interesting new insights for the quantification of molecular interactions in DNA microarrays.
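The two models named above can be sketched as follows: a nearest-neighbor sum of dinucleotide free energies, and a Langmuir isotherm giving the hybridized fraction of a probe spot. This is a minimal illustration, not the fitting procedure of the study. The dinucleotide values in `TOY_PARAMS` are made-up placeholders, initiation and mismatch terms are omitted, and the concentration is assumed to be expressed relative to a 1 M standard state.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K)

# Placeholder dinucleotide free energies in kcal/mol (NOT measured values).
TOY_PARAMS = {"AC": -1.0, "CG": -2.0, "GT": -1.0}


def nearest_neighbor_dg(seq, dinuc_dg):
    """Sum the free energy terms of all overlapping dinucleotides in seq."""
    return sum(dinuc_dg[seq[i:i + 2]] for i in range(len(seq) - 1))


def langmuir_fraction(dg, conc, temp_k):
    """Equilibrium fraction of hybridized probes for target concentration conc.

    Uses theta = c*K / (1 + c*K) with K = exp(-dg / (R*T)).
    """
    x = conc * math.exp(-dg / (GAS_CONSTANT_KCAL * temp_k))
    return x / (1.0 + x)


dg_total = nearest_neighbor_dg("ACGT", TOY_PARAMS)
```

In the low-coverage regime (c*K much smaller than 1) the fraction reduces to approximately c*exp(-dg/(R*T)), so the logarithm of the signal is linear in the free energy; this is the regime in which the abstract reports agreement with the Langmuir model.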
The technology to generate Spatially Resolved Transcriptomics (SRT) data is rapidly being improved and applied to investigate a variety of biological tissues. The ability to interrogate how spatially localised gene expression can lend new insight into the development of different tissues is critical, but appropriate tools to analyse these data are still emerging. This chapter reviews available packages and pipelines for the analysis of different SRT datasets, with a focus on identifying spatially variable genes (SVGs) alongside other aims, and discusses the importance of and challenges in establishing a standardised ground truth in the biological data for benchmarking.
Quantitative methods for studying biodiversity have traditionally been rooted in the classical theory of finite frequency table analysis. However, with the help of modern experimental tools, like high-throughput sequencing, we now begin to unlock the outstanding diversity of genomic data in plants and animals reflective of the long evolutionary history of our planet. These molecular data often defy the classical frequency/contingency table assumptions and seem to require sparse tables with a very large number of categories and highly unbalanced cell counts, e.g., following heavy-tailed distributions (for instance, power laws). Motivated by molecular diversity studies, we propose here a frequency-based framework for biodiversity analysis in the asymptotic regime where the number of categories grows with sample size (an infinite contingency table). Our approach is rooted in information theory and based on the Gaussian limit results for the effective number of species (the Hill numbers) and the empirical Rényi entropy and divergence. We argue that when applied to molecular biodiversity analysis our methods can properly account for the complicated data frequency patterns on the one hand and the practical sample size limitations on the other. We illustrate this principle with two specific RNA sequencing examples: a comparative study of T-cell receptor populations and a validation of some preselected molecular hepatocellular carcinoma (HCC) markers.
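For reference, the Hill number of order q and the Rényi entropy mentioned above can be computed from a vector of category counts as in the sketch below. These are simple plug-in estimates based on empirical frequencies; they do not implement the Gaussian limit results of the paper, and the function names are illustrative.

```python
import math


def hill_number(counts, q):
    """Effective number of species of order q from raw category counts.

    For q != 1 this is (sum_i p_i^q)^(1/(1-q)); for q = 1 it is the
    exponential of the Shannon entropy (the limiting case).
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        shannon = -sum(p * math.log(p) for p in probs)
        return math.exp(shannon)
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))


def renyi_entropy(counts, q):
    """Renyi entropy of order q, the logarithm of the Hill number."""
    return math.log(hill_number(counts, q))
```

For a perfectly even sample such as `[5, 5, 5, 5]`, the Hill number equals the number of categories for every order q; for uneven samples it decreases as q grows, because higher orders give more weight to the dominant categories.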