Quantitative methods for studying biodiversity have traditionally been rooted in the classical theory of finite frequency table analysis. However, with the help of modern experimental tools such as high-throughput sequencing, we are now beginning to unlock the outstanding diversity of genomic data in plants and animals, which reflects the long evolutionary history of our planet. These molecular data often defy the assumptions of classical frequency/contingency tables and seem to require sparse tables with a very large number of categories and highly unbalanced cell counts, e.g., counts following heavy-tailed distributions (for instance, power laws). Motivated by molecular diversity studies, we propose here a frequency-based framework for biodiversity analysis in the asymptotic regime where the number of categories grows with the sample size (an infinite contingency table). Our approach is rooted in information theory and based on Gaussian limit results for the effective number of species (the Hill numbers) and for the empirical Rényi entropy and divergence. We argue that, when applied to molecular biodiversity analysis, our methods can properly account for the complicated data frequency patterns on the one hand and the practical sample size limitations on the other. We illustrate this principle with two specific RNA sequencing examples: a comparative study of T-cell receptor populations and a validation of some preselected molecular hepatocellular carcinoma (HCC) markers.
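The two quantities named above, the Hill numbers and the empirical Rényi entropy, are linked by a simple identity: the Hill number of order alpha is the exponential of the Rényi entropy of the same order. The following is a minimal sketch of the naive plug-in estimators only, assuming a plain list of category counts; the function names and the example counts are illustrative and are not taken from the paper, and the sketch does not implement the Gaussian limit results the abstract refers to.

```python
import math


def renyi_entropy(counts, alpha):
    """Plug-in Renyi entropy (natural log) of finite order alpha >= 0.

    `counts` is a list of non-negative category counts (hypothetical input).
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    total = sum(counts)
    # Empirical frequencies; empty categories contribute nothing.
    probs = [c / total for c in counts if c > 0]
    if math.isclose(alpha, 1.0):
        # Order 1 is defined by continuity and equals the Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def hill_number(counts, alpha):
    """Effective number of species of order alpha: exp of the Renyi entropy."""
    return math.exp(renyi_entropy(counts, alpha))


if __name__ == "__main__":
    # Illustrative, made-up counts with one dominant category.
    skewed = [90, 5, 3, 1, 1]
    for order in (0, 1, 2):
        print(order, hill_number(skewed, order))
```

For perfectly even counts every order returns the number of categories, while for skewed counts the effective number of species shrinks as the order grows, because higher orders put more weight on the dominant categories.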
In most of the recent immunological literature, differences across antigen receptor populations are examined via non-parametric statistical measures of species overlap and diversity borrowed from ecological studies. While this approach is robust in a wide range of situations, it seems to provide little insight into the underlying clonal size distribution and the overall mechanism differentiating the receptor populations. As a possible alternative, the current paper presents a parametric method that adjusts for data under-sampling and provides a unifying approach to the simultaneous comparison of multiple receptor groups by means of modern statistical tools of unsupervised learning. The parametric model is based on a flexible multivariate Poisson-lognormal distribution and is seen to be a natural generalization of the univariate Poisson-lognormal models used in ecological studies of biodiversity patterns. The procedure for evaluating model fit is described, along with the public domain software developed to perform the necessary diagnostics. The model-driven analysis is seen to compare favorably with traditional methods when applied to data from T-cell receptors in transgenic mouse populations.
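To make the multivariate Poisson-lognormal model more concrete, the sketch below computes the standard first and second moments of the counts under that model: each count is Poisson given its intensity, and the log-intensities are jointly normal. It is an illustration of the distribution only, under the assumption of a user-supplied mean vector and covariance matrix for the log-intensities; the function name and parameter values are hypothetical, and this is not the paper's fitting or diagnostic procedure.

```python
import math


def pln_moments(mu, cov):
    """Mean vector and covariance matrix of counts under a multivariate
    Poisson-lognormal model.

    mu  : list of means of the log-intensities
    cov : covariance matrix of the log-intensities (list of lists)
    """
    k = len(mu)
    # E[N_i] = E[lambda_i] = exp(mu_i + sigma_ii / 2)
    mean = [math.exp(mu[i] + cov[i][i] / 2.0) for i in range(k)]
    # Cov(lambda_i, lambda_j) = E[lambda_i] E[lambda_j] (exp(sigma_ij) - 1)
    count_cov = [
        [mean[i] * mean[j] * (math.exp(cov[i][j]) - 1.0) for j in range(k)]
        for i in range(k)
    ]
    # The Poisson sampling step adds E[lambda_i] to each variance.
    for i in range(k):
        count_cov[i][i] += mean[i]
    return mean, count_cov


if __name__ == "__main__":
    # Made-up parameters for two receptor groups, for illustration only.
    mean, count_cov = pln_moments([1.0, 2.0], [[0.5, 0.2], [0.2, 0.8]])
    print(mean)
    print(count_cov)
```

Two features are visible in the output: each variance exceeds the corresponding mean (overdispersion relative to a plain Poisson model), and the sign of the covariance between counts follows the sign of the covariance between log-intensities, which is what allows the model to describe dependence across groups.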
This paper studies forward and reverse projections for the Rényi divergence of order $\alpha \in (0, \infty)$ on $\alpha$-convex sets. The forward projection on such a set is motivated by some works of Tsallis \emph{et al.} in statistical physics, and the reverse projection is motivated by robust statistics. In a recent work, van Erven and Harremoës proved a Pythagorean inequality for Rényi divergences on $\alpha$-convex sets under the assumption that the forward projection exists. Continuing this study, a sufficient condition for the existence of a forward projection is proved for probability measures on a general alphabet. For $\alpha \in (1, \infty)$, the proof relies on a new Apollonius theorem for the Hellinger divergence, and for $\alpha \in (0,1)$, the proof relies on the Banach-Alaoglu theorem from functional analysis. Further projection results are then obtained in the finite alphabet setting. These include a projection theorem on a specific $\alpha$-convex set, which is termed an \emph{$\alpha$-linear family}, generalizing a result by Csiszár for $\alpha \neq 1$. The solution to this problem yields a parametric family of probability measures which turns out to be an extension of the exponential family, and it is termed an \emph{$\alpha$-exponential family}. An orthogonality relationship between the $\alpha$-exponential and $\alpha$-linear families is established, and it is used to turn the reverse projection on an $\alpha$-exponential family into a forward projection on an $\alpha$-linear family. This paper also proves a convergence result for an iterative procedure used to calculate the forward projection on an intersection of a finite number of $\alpha$-linear families.
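For readers who want the underlying quantity in concrete form, the sketch below evaluates the Rényi divergence of order $\alpha$ between two probability vectors on a finite alphabet, using the standard definition $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\sum_i p_i^\alpha q_i^{1-\alpha}$, with the relative entropy as the $\alpha \to 1$ case. It assumes that `q` is strictly positive and that both vectors sum to one; the function names and the example vectors are illustrative, and the sketch does not implement the projection theorems or the iterative procedure described above.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(P||Q) in nats; assumes q_i > 0 for every i."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def renyi_divergence(p, q, alpha):
    """Renyi divergence of finite order alpha > 0 on a finite alphabet.

    Assumes p and q are probability vectors and q_i > 0 for every i.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if math.isclose(alpha, 1.0):
        # Order 1 is defined by continuity and equals the relative entropy.
        return kl_divergence(p, q)
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1.0)


if __name__ == "__main__":
    # Made-up distributions on a three-letter alphabet.
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.5, 0.3]
    for order in (0.5, 1.0, 2.0):
        print(order, renyi_divergence(p, q, order))
```

The printed values increase with the order, reflecting the known fact that the Rényi divergence is nondecreasing in $\alpha$.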
It is largely taken for granted that differential abundance analysis (DAA) is, by default, the best first step when analyzing genomic data. We argue that this is not necessarily the case. In this article, we identify key limitations that are intrinsic to DAA: it is (a) dependent on unverifiable assumptions, (b) an unreliable construct, and (c) overly reductionist. We formulate an alternative framework, called ratio-based biomarker analysis, which does not suffer from the identified limitations. Moreover, ratio-based biomarkers are highly flexible. Beyond replacing DAA, they can also be used for many other bespoke analyses, including dimension reduction and multi-omics data integration.
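One elementary property of a ratio between two features is that it does not change when every feature in a sample is multiplied by the same unknown factor, such as a sequencing-depth effect. The sketch below checks this numerically. It is an illustration of that general property only, and reading it as the motivation behind the framework above is an assumption; the feature names, counts, and scaling factor are made up.

```python
import math


def log_ratio(sample, numerator, denominator):
    """Log-ratio of two features within a single sample (a dict of abundances)."""
    return math.log(sample[numerator] / sample[denominator])


if __name__ == "__main__":
    # Hypothetical abundances for one sample.
    sample = {"geneA": 120.0, "geneB": 30.0, "geneC": 850.0}
    # The same sample observed at a different, unknown scale.
    rescaled = {name: 3.7 * value for name, value in sample.items()}

    print(log_ratio(sample, "geneA", "geneB"))
    print(log_ratio(rescaled, "geneA", "geneB"))
```

Both printed values agree up to floating-point error, whereas the abundance of any single feature differs between the two versions of the sample.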
Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between species for a fair comparison, normalization serves as a crucial pre-processing step that adjusts for varying sample sequencing depths and other confounding technical effects. Results: In this paper, we propose a scale-based normalization (SCBN) method that takes into account the available knowledge of conserved orthologous genes within a hypothesis testing framework. Considering the different gene lengths and unmapped genes between species, we formulate the problem from the perspective of hypothesis testing and search for the optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. Conclusions: Simulation studies show that the proposed method performs significantly better than the existing competitor in a wide range of settings. An analysis of an RNA-seq dataset from different species is consistent with the conclusion that the proposed method outperforms the existing method. For practical applications, we have also developed an R package named SCBN, which is available at http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html.
Fawzi and Fawzi recently defined the sharp Rényi divergence, $D_\alpha^\#$, for $\alpha \in (1, \infty)$, as an additional quantum Rényi divergence with nice mathematical properties and applications in quantum channel discrimination and quantum communication. One of their open questions was the limit ${\alpha} \to 1$ of this divergence. By finding a new expression of the sharp divergence in terms of a minimization of the geometric Rényi divergence, we show that this limit is equal to the Belavkin-Staszewski relative entropy. Analogous minimizations of arbitrary generalized divergences lead to a new family of generalized divergences that we call kringel divergences, and for which we prove various properties including the data-processing inequality.