We consider a discrete-time Markov chain, called the fragmentation process, that describes a specific way of successively removing objects from a linear arrangement. The process arises in population genetics and describes the ancestry of the genetic material of individuals in a population experiencing recombination. We aim to determine the law of the process over time. To this end, we investigate sets of realisations of this process that agree with respect to a specific order of events and represent each such set by a rooted (binary) tree. The probability of each tree is, in turn, obtained by Möbius inversion on a suitable poset of all rooted forests that can be obtained from the tree by edge deletion; we call this poset the \textit{pruning poset}. Dependencies within the fragments make it difficult to obtain explicit expressions for the probabilities of the trees. We therefore construct an auxiliary process for every given tree, which is i.i.d. over time and which allows a pathwise construction of realisations that match the tree.
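As a rough illustration of the Möbius inversion step, the sketch below inverts a summation over subsets of deleted edges. It assumes, purely for illustration, that each forest is identified with the set of edges deleted from the tree and that forests are ordered by inclusion of these sets; the abstract does not specify the order on the pruning poset, so this may differ from the construction in the paper. The edge labels and the names `subsets`, `zeta_transform` and `mobius_inversion` are hypothetical.

```python
from itertools import combinations


def subsets(edge_set):
    """Yield every subset of edge_set as a frozenset."""
    items = sorted(edge_set)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def zeta_transform(f, ground):
    """Cumulative sums: g(S) = sum of f(T) over all T contained in S."""
    return {S: sum(f[T] for T in subsets(S)) for S in subsets(ground)}


def mobius_inversion(g, ground):
    """Recover f from g via f(S) = sum over T in S of (-1)^(|S|-|T|) g(T).

    This is the Moebius function of the Boolean lattice (an assumption here).
    """
    return {
        S: sum((-1) ** (len(S) - len(T)) * g[T] for T in subsets(S))
        for S in subsets(ground)
    }


# Hypothetical tree with three edges; f assigns an arbitrary integer weight
# to each set of deleted edges.
ground = frozenset({"e1", "e2", "e3"})
f = {S: 3 * len(S) + 1 for S in subsets(ground)}
g = zeta_transform(f, ground)
recovered = mobius_inversion(g, ground)
```

Here `recovered` coincides with `f`, which is the inversion property: quantities that are easy to compute as cumulative sums can be converted back to the individual terms.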
Moran and Wright-Fisher processes are probably the best-known models for studying the evolution of a population under various effects. Our object of study is the Simpson index, which measures the level of diversity of the population and is one of the key parameters for ecologists who study, for example, forest dynamics. Following ecological motivations, we consider the case where there are several species whose fitness and immigration parameters are random processes (and thus evolve in time). The Simpson index has no closed formula, except in the neutral (no selection) case via a backward approach, and it is difficult to evaluate even numerically when the population size is large. Our approach relies on the large population limit in the weak selection case and yields a procedure that enables us to approximate, with a controlled rate, the expectation of the Simpson index at a fixed time. Our approach is forward and valid for all times, which is the main difference from the historical approach of Kingman or Krone-Neuhauser. We also study the long-time behaviour of the Wright-Fisher process in a simplified setting, which gives a full picture of the approximation of the expectation of the Simpson index.
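To make the quantity concrete, here is a minimal sketch of the Simpson index, taken here as the sum of squared species frequencies, together with a naive Monte Carlo estimate of its expectation. The simulation assumes the simplest neutral Wright-Fisher resampling, with no selection and no immigration, so it does not implement the approximation procedure of the abstract; the function names and parameter values are hypothetical.

```python
import random


def simpson_index(counts):
    """Sum of squared species frequencies.

    Equals the probability that two individuals drawn with replacement
    belong to the same species.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("population is empty")
    return sum((c / total) ** 2 for c in counts)


def neutral_wright_fisher_step(counts, rng):
    """One neutral generation: each offspring picks its parent uniformly."""
    n = sum(counts)
    offspring = rng.choices(range(len(counts)), weights=counts, k=n)
    new_counts = [0] * len(counts)
    for species in offspring:
        new_counts[species] += 1
    return new_counts


def mean_simpson(counts, generations, replicates, seed=0):
    """Monte Carlo estimate of the expected Simpson index after some generations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        state = list(counts)
        for _ in range(generations):
            state = neutral_wright_fisher_step(state, rng)
        total += simpson_index(state)
    return total / replicates


estimate = mean_simpson([50, 50], generations=10, replicates=20)
```

Brute-force simulation of this kind becomes expensive as the population size grows, which is the difficulty the abstract points to.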
The stationary asymptotic properties of the diffusion limit of a multi-type branching process with neutral mutations are studied. For the critical and subcritical processes the interesting limits are those of quasi-stationary distributions conditioned on non-extinction. Limiting distributions for supercritical and critical processes are found to collapse onto rays aligned with stationary eigenvectors of the mutation rate matrix, in agreement with known results for discrete multi-type branching processes. For the subcritical process the quasi-stationary distribution is obtained to first order in the overall mutation rate, which is assumed to be small. The sampling distribution over allele types for a sample of given finite size is found to agree to first order in mutation rates with the analogous sampling distribution for a Wright-Fisher diffusion with constant population size.
Using graphical methods based on a `lookdown' and pruned version of the {\em ancestral selection graph}, we obtain a representation of the type distribution of the ancestor in a two-type Wright-Fisher population with mutation and selection, conditional on the overall type frequency in the old population. This extends results from Lenz, Kluth, Baake, and Wakolbinger (Theor. Pop. Biol., 103 (2015), 27-37) to the case of heavy-tailed offspring, directed by a reproduction measure $\Lambda$. The representation is in terms of the equilibrium tail probabilities of the line-counting process $L$ of the graph. We identify a strong pathwise Siegmund dual of $L$, and characterise the equilibrium tail probabilities of $L$ in terms of hitting probabilities of the dual process.
Measures of tree balance play an important role in various research areas, for example in phylogenetics. There they are used, for instance, to test whether an observed phylogenetic tree differs significantly from a tree generated by the Yule model of speciation. One of the most popular indices in this regard is the Colless index, which measures the degree of balance for rooted binary trees. While many statistical properties of the Colless index (e.g. asymptotic results for its mean and variance under different models of speciation) have already been discussed in different contexts, we focus on its extremal properties. While it is relatively straightforward to characterize trees with maximal Colless index, the analysis of the minimal value of the Colless index and the characterization of trees that achieve it are much more involved. In this note, we therefore focus on the minimal value of the Colless index for any given number of leaves. We derive both a recursive formula for this minimal value and an explicit expression, which shows a surprising connection between the Colless index and the so-called Blancmange curve, a fractal curve that is also known as the Takagi curve. Moreover, we characterize two classes of trees that have minimal Colless index, consisting of the set of so-called \emph{maximally balanced trees} and a class of trees that we call \emph{greedy from the bottom trees}. Furthermore, we derive an upper bound for the number of trees with minimal Colless index by relating these trees to trees with minimal Sackin index (another well-studied index of tree balance).
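The following sketch illustrates the objects involved. It uses the standard definition of the Colless index as the sum, over all internal nodes, of the absolute difference between the numbers of leaves in the two child subtrees. It assumes that a maximally balanced tree is one in which every internal node splits its leaves as evenly as possible, and it checks by exhaustive dynamic programming over all splits that such trees attain the minimum for small leaf numbers. The tuple encoding of trees and all function names are hypothetical; this is not the recursive or explicit formula derived in the note.

```python
from functools import lru_cache


def maximally_balanced(n):
    """Tree with n leaves in which every internal node splits as evenly as possible."""
    if n == 1:
        return "leaf"
    return (maximally_balanced((n + 1) // 2), maximally_balanced(n // 2))


def _leaves_and_colless(tree):
    """Return (number of leaves, Colless index) of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = _leaves_and_colless(left)
    n_right, c_right = _leaves_and_colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def colless(tree):
    """Colless index: sum over internal nodes of |leaves(left) - leaves(right)|."""
    return _leaves_and_colless(tree)[1]


@lru_cache(maxsize=None)
def min_colless_bruteforce(n):
    """Minimum Colless index over all rooted binary trees with n leaves.

    Valid because the index is additive over the two root subtrees, so each
    subtree can be minimised independently for every root split (a, n - a).
    """
    if n == 1:
        return 0
    return min(
        min_colless_bruteforce(a) + min_colless_bruteforce(n - a) + abs(n - 2 * a)
        for a in range(1, n // 2 + 1)
    )


# A caterpillar tree with four leaves, the most unbalanced shape.
caterpillar = ((("a", "b"), "c"), "d")
agree = all(
    colless(maximally_balanced(n)) == min_colless_bruteforce(n)
    for n in range(1, 65)
)
```

For four leaves the caterpillar has Colless index 3, while the maximally balanced tree has index 0; `agree` records whether the even-split trees match the brute-force minimum for all leaf numbers up to 64.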
Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities may be poorly evaluated by standard ABC techniques. We propose a novel approach based on a machine learning tool named random forests to conduct selection among the highly complex models covered by ABC algorithms. We thus modify the way Bayesian model selection is both understood and operated, in that we rephrase the inferential goal as a classification problem, first predicting the model that best fits the data with random forests and postponing the approximation of the posterior probability of the predicted MAP to a second stage that also relies on random forests. Compared with earlier implementations of ABC model choice, the ABC random forest approach offers several potential improvements: (i) it often has a larger discriminative power among the competing models, (ii) it is more robust against the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a gain in computation efficiency of at least a factor of fifty), and (iv) it includes an approximation of the posterior probability of the selected model. The call to random forests will undoubtedly extend the range of dataset sizes and model complexity that ABC can handle. We illustrate the power of this novel methodology by analyzing controlled experiments as well as genuine population genetics datasets. The proposed methodologies are implemented in the R package abcrf, available on CRAN.