No Arabic abstract
Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajimas D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package phasty, and R code for the reproduction of our results is available as an accompanying vignette.
Let $X_{nr}$ be the $r$th largest of a random sample of size $n$ from a distribution $F (x) = 1 - sum_{i = 0}^infty c_i x^{-alpha - i beta}$ for $alpha > 0$ and $beta > 0$. An inversion theorem is proved and used to derive an expansion for the quantile $F^{-1} (u)$ and powers of it. From this an expansion in powers of $(n^{-1}, n^{-beta/alpha})$ is given for the multivariate moments of the extremes ${X_{n, n - s_i}, 1 leq i leq k }/n^{1/alpha}$ for fixed ${bf s} = (s_1, ..., s_k)$, where $k geq 1$. Examples include the Cauchy, Student $t$, $F$, second extreme distributions and stable laws of index $alpha < 1$.
A forward diffusion equation describing the evolution of the allele frequency spectrum is presented. The influx of mutations is accounted for by imposing a suitable boundary condition. For a Wright-Fisher diffusion with or without selection and varying population size, the boundary condition is $lim_{x downarrow 0} x f(x,t)=theta rho(t)$, where $f(cdot,t)$ is the frequency spectrum of derived alleles at independent loci at time $t$ and $rho(t)$ is the relative population size at time $t$. When population size and selection intensity are independent of time, the forward equation is equivalent to the backwards diffusion usually used to derive the frequency spectrum, but the forward equation allows computation of the time dependence of the spectrum both before an equilibrium is attained and when population size and selection intensity vary with time. From the diffusion equation, we derive a set of ordinary differential equations for the moments of $f(cdot,t)$ and express the expected spectrum of a finite sample in terms of those moments. We illustrate the use of the forward equation by considering neutral and selected alleles in a highly simplified model of human history. For example, we show that approximately 30% of the expected heterozygosity of neutral loci is attributable to mutations that arose since the onset of population growth in roughly the last $150,000$ years.
Graphical models express conditional independence relationships among variables. Although methods for vector-valued data are well established, functional data graphical models remain underdeveloped. We introduce a notion of conditional independence between random functions, and construct a framework for Bayesian inference of undirected, decomposable graphs in the multivariate functional data context. This framework is based on extending Markov distributions and hyper Markov laws from random variables to random processes, providing a principled alternative to naive application of multivariate methods to discretized functional data. Markov properties facilitate the composition of likelihoods and priors according to the decomposition of a graph. Our focus is on Gaussian process graphical models using orthogonal basis expansions. We propose a hyper-inverse-Wishart-process prior for the covariance kernels of the infinite coefficient sequences of the basis expansion, establish existence, uniqueness, strong hyper Markov property, and conjugacy. Stochastic search Markov chain Monte Carlo algorithms are developed for posterior inference, assessed through simulations, and applied to a study of brain activity and alcoholism.
In this paper the method of simulated quantiles (MSQ) of Dominicy and Veredas (2013) and Dominick et al. (2013) is extended to a general multivariate framework (MMSQ) and to provide a sparse estimator of the scale matrix (sparse-MMSQ). The MSQ, like alternative likelihood-free procedures, is based on the minimisation of the distance between appropriate statistics evaluated on the true and synthetic data simulated from the postulated model. Those statistics are functions of the quantiles providing an effective way to deal with distributions that do not admit moments of any order like the $alpha$-Stable or the Tukey lambda distribution. The lack of a natural ordering represents the major challenge for the extension of the method to the multivariate framework. Here, we rely on the notion of projectional quantile recently introduced by Hallin etal. (2010) and Kong Mizera (2012). We establish consistency and asymptotic normality of the proposed estimator. The smoothly clipped absolute deviation (SCAD) $ell_1$--penalty of Fan and Li (2001) is then introduced into the MMSQ objective function in order to achieve sparse estimation of the scaling matrix which is the major responsible for the curse of dimensionality problem. We extend the asymptotic theory and we show that the sparse-MMSQ estimator enjoys the oracle properties under mild regularity conditions. The method is illustrated and its effectiveness is tested using several synthetic datasets simulated from the Elliptical Stable distribution (ESD) for which alternative methods are recognised to perform poorly. The method is then applied to build a new network-based systemic risk measurement framework. The proposed methodology to build the network relies on a new systemic risk measure and on a parametric test of statistical dominance.
Regression models describing the joint distribution of multivariate response variables conditional on covariate information have become an important aspect of contemporary regression analysis. However, a limitation of such models is that they often rely on rather simplistic assumptions, e.g. a constant dependency structure that is not allowed to vary with the covariates or the restriction to linear dependence between the responses only. We propose a general framework for multivariate conditional transformation models that overcomes these limitations and describes the entire distribution in a tractable and interpretable yet flexible way conditional on nonlinear effects of covariates. The framework can be embedded into likelihood-based inference, including results on asymptotic normality, and allows the dependence structure to vary with covariates. In addition, the framework scales well beyond bivariate response situations, which were the main focus of most earlier investigations. We illustrate the application of multivariate conditional transformation models in a trivariate analysis of childhood undernutrition and demonstrate empirically that our approach can be beneficial compared to existing benchmarks such that complex truly multivariate data-generating processes can be inferred from observations.