We are frequently faced with a large collection of antibodies, and want to select those with highest affinity for their cognate antigen. When developing a first-line therapeutic for a novel pathogen, for instance, we might look for such antibodies in patients who have recovered. There exist effective experimental methods of accomplishing this, such as cell sorting and baiting; however, they are time-consuming and expensive. Next-generation sequencing of B cell receptor (BCR) repertoires offers an additional source of sequences that could be tapped if we had a reliable method of selecting those coding for the best antibodies. In this paper we introduce a method that uses evolutionary information from the family of related sequences that share a naive ancestor to predict the affinity of each resulting antibody for its antigen. When combined with information on the identity of the antigen, this method should provide a source of effective new antibodies. We also introduce a method for a related task: given an antibody of interest and its inferred ancestral lineage, which branches in the tree are likely to harbor key affinity-increasing mutations? These methods are implemented as part of continuing development of the partis BCR inference package, available at https://github.com/psathyrella/partis.
The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep short-read data set of B cell receptors. We find that the substitution process is conserved across individuals but varies significantly across gene segments. We investigate selection on B cell receptors using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection, which provides a more nuanced view of the constraints on framework and variable regions.
The collection of immunoglobulin genes in an individual's germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently accepted set of known V alleles is incomplete (particularly for non-European samples) and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample's true set of germline V alleles. We then describe a new method for inferring each individual's germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at https://github.com/psathyrella/partis, and is run by default without affecting overall run time.
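The closest-match problem described above can be illustrated with a toy sketch. It is not the partis method and does not use real IMGT data: the allele names and sequences (`toyV1*01` and so on), the reads, and the use of plain Hamming distance in place of a real aligner are all assumptions made for illustration. It shows how a single somatic mutation can make a read match an allele that the individual does not carry.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_allele(read, alleles):
    """Name of the allele with the fewest mismatches to the read."""
    return min(alleles, key=lambda name: hamming(read, alleles[name]))


# Hypothetical reference set (toy sequences, not real IMGT alleles).
ALLELES = {
    "toyV1*01": "ACGTACGTAC",
    "toyV1*02": "ACGTACGTAT",  # differs from *01 at the last base only
    "toyV2*01": "TTGTACGGAC",
}

# Assumption: this individual carries only toyV1*01.
TRUE_GERMLINE = {"toyV1*01"}

READS = [
    "ACGTACGTAC",  # unmutated toyV1*01
    "ACGTACGTAT",  # toyV1*01 with one somatic mutation that mimics *02
    "ACGAACGTAC",  # toyV1*01 with one somatic mutation elsewhere
]

called = {closest_allele(read, ALLELES) for read in READS}
spurious = called - TRUE_GERMLINE
print("called:", sorted(called))
print("spurious:", sorted(spurious))
```

In this toy case the second read is assigned to an allele outside the true germline set, which is the kind of spurious call that per-sample germline inference is meant to avoid.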
B cells develop high-affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same clonal family) are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called substitution profiles, are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many high-quality large B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method Substitution Profiles Using Related Families (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on an external dataset. Furthermore, we provide a command-line tool in an open-source software package (https://github.com/krdav/SPURF) implementing these ideas and providing easy prediction using our pre-fit models.
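A substitution profile, as defined above, is a per-position amino acid frequency table for one clonal family. The following is a minimal sketch of that empirical quantity only, not of the SPURF prediction model; it assumes the family's sequences are already aligned to equal length, and the function name, the `pseudocount` parameter, and the toy family are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_profile(aligned_seqs, pseudocount=0.0):
    """Per-position amino acid frequencies for one clonal family.

    aligned_seqs: equal-length amino acid strings (gaps or other
    non-standard characters are ignored when counting).
    Returns a list with one {amino_acid: frequency} dict per position.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        total += pseudocount * len(AMINO_ACIDS)
        if total == 0:
            # Column holds only gaps and no pseudocount was requested.
            profile.append({aa: 0.0 for aa in AMINO_ACIDS})
        else:
            profile.append(
                {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
            )
    return profile


# Hypothetical clonal family of four aligned sequences.
family = ["CARDY", "CARDY", "CAKDY", "CTRDY"]
profile = substitution_profile(family)
print({aa: f for aa, f in profile[2].items() if f > 0})
```

With a single sequence per family, every column of this table collapses to one observed residue, which is why a predictive model that borrows information from other families is needed.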
Naive human T cells are produced in the thymus, which atrophies abruptly and severely in response to physical or psychological stress. To understand how an instance of stress affects the size and diversity of the peripheral naive T cell pool, we derive a mean-field autonomous ODE model of T cell replenishment that allows us to track the clone abundance distribution (the mean number of different TCRs each represented by a specific number of cells). We identify equilibrium solutions that arise at different rates of T cell production, and derive analytic approximations to the dominant eigenvalues and eigenvectors of the problem linearized about these equilibria. From the forms of the eigenvalues and eigenvectors, we estimate rates at which counts of clones of different sizes converge to and depart from equilibrium values, that is, how the number of clones of different sizes adjusts to the changing rate of T cell production. Under most physiologically realistic realizations of our model, the dominant eigenvalue (representing the slowest dynamics of the clone abundance distribution) scales as a power law in the thymic output for low output levels, but saturates at higher T cell production rates. Our analysis provides a framework for quantitatively understanding how the clone abundance distributions evolve under small changes in the overall T cell production rate by the thymus.
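To make the idea of a mean-field ODE for clone counts concrete, here is a minimal sketch under strong assumptions that are not taken from the abstract: a plain birth-death-immigration form with constant rates, no regulation, a finite pool of `omega` possible clones, truncation at a maximum clone size, and forward-Euler integration. All parameter names and values are hypothetical. Writing $c_k$ for the mean number of clones of size $k$, the assumed equations are $\dot{c}_k = (\alpha/\Omega)(c_{k-1} - c_k) + r[(k-1)c_{k-1} - k c_k] + \mu[(k+1)c_{k+1} - k c_k]$.

```python
def clone_count_rhs(c, alpha, r, mu, omega):
    """Right-hand side of the assumed mean-field equations.

    c[k] is the mean number of clones holding k cells (k = 0..K).
    Fluxes between neighbouring sizes are computed explicitly so that
    the total number of clones, sum(c), is conserved.
    """
    k_max = len(c) - 1
    dc = [0.0] * (k_max + 1)
    for k in range(k_max):
        # Immigration (alpha / omega per clone) or birth (r per cell): k -> k+1.
        up = (alpha / omega + r * k) * c[k]
        dc[k] -= up
        dc[k + 1] += up
    for k in range(1, k_max + 1):
        # Death (mu per cell): k -> k-1.
        down = mu * k * c[k]
        dc[k] -= down
        dc[k - 1] += down
    return dc


def integrate(alpha, r, mu, omega, k_max=60, dt=0.005, t_end=40.0):
    """Forward-Euler integration starting from an empty peripheral pool."""
    c = [0.0] * (k_max + 1)
    c[0] = float(omega)
    for _ in range(int(round(t_end / dt))):
        dc = clone_count_rhs(c, alpha, r, mu, omega)
        c = [ck + dt * dk for ck, dk in zip(c, dc)]
    return c


# Hypothetical parameters: thymic output alpha, proliferation r, death mu.
alpha, r, mu, omega = 100.0, 0.5, 1.0, 1e4
c = integrate(alpha, r, mu, omega)
total_cells = sum(k * ck for k, ck in enumerate(c))
richness = sum(c[1:])  # mean number of clones with at least one cell
print(total_cells, richness)
```

In this simplified model the total cell count obeys $\dot{N} = \alpha + (r - \mu)N$, so for $r < \mu$ it relaxes to $\alpha/(\mu - r)$ at rate $\mu - r$. Changing `alpha` partway through the integration is one way to explore how the clone counts re-equilibrate after a change in thymic output.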
The set of T cells that express the same T cell receptor (TCR) sequence represents a T cell clone. The number of different naive T cell clones in an organism reflects the number of different TCRs arising from recombination of the V(D)J gene segments during T cell development in the thymus. TCR diversity, and more specifically the clone abundance distribution, is an important factor in immune function. Specific recombination patterns occur more frequently than others, while subsequent interactions between TCRs and self-antigens are known to trigger proliferation and sustain naive T cell survival. These processes are TCR-dependent, leading to clone-dependent thymic export and naive T cell proliferation rates. Using a mean-field approximation to the solution of a regulated birth-death-immigration model and a modification arising from sampling, we systematically quantify how TCR-dependent heterogeneities in immigration and proliferation rates affect the shape of clone abundance distributions (the number of different clones that are represented by a specific number of cells, or clone counts). By comparing predicted clone counts derived from our heterogeneous birth-death-immigration model with experimentally sampled clone abundances, we show that although heterogeneity in immigration rates causes very little change to predicted clone counts, significant heterogeneity in proliferation rates is necessary to generate the observed abundances with reasonable physiological parameter values. Our analysis provides constraints among physiological parameters that are necessary to yield predictions that qualitatively match the data. Assumptions of the model and potentially other important mechanistic factors are discussed.
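The clone counts defined above can be computed directly from a list of sampled TCR sequences. The sketch below covers only that empirical tally, not the birth-death-immigration model or its sampling correction; the function name and the toy sequence labels are hypothetical.

```python
from collections import Counter


def clone_counts(tcr_sequences):
    """Map clone size k to the number of distinct clones with exactly k cells.

    tcr_sequences: one entry per sampled cell; cells sharing the same
    TCR sequence are treated as members of the same clone.
    """
    clone_sizes = Counter(tcr_sequences)          # sequence -> cells in clone
    counts = Counter(clone_sizes.values())        # clone size -> clone count
    return dict(sorted(counts.items()))


# Hypothetical sample of nine cells drawn from five clones.
sample = ["tcr_a", "tcr_a", "tcr_a", "tcr_b", "tcr_b",
          "tcr_c", "tcr_d", "tcr_e", "tcr_e"]
counts = clone_counts(sample)
print(counts)
```

Summing `k * counts[k]` over all sizes recovers the number of sampled cells, and summing `counts[k]` gives the number of distinct clones observed in the sample.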