No Arabic abstract
We propose the novel augmented Gaussian random field (AGRF), which is a universal framework incorporating the data of observable and derivatives of any order. Rigorous theory is established. We prove that under certain conditions, the observable and its derivatives of any order are governed by a single Gaussian random field, which is the aforementioned AGRF. As a corollary, the statement ``the derivative of a Gaussian process remains a Gaussian process is validated, since the derivative is represented by a part of the AGRF. Moreover, a computational method corresponding to the universal AGRF framework is constructed. Both noiseless and noisy scenarios are considered. Formulas of the posterior distributions are deduced in a nice closed form. A significant advantage of our computational method is that the universal AGRF framework provides a natural way to incorporate arbitrary order derivatives and deal with missing data. We use four numerical examples to demonstrate the effectiveness of the computational method. The numerical examples are composite function, damped harmonic oscillator, Korteweg-De Vries equation, and Burgers equation.
In this paper we consider the nonparametric functional estimation of the drift of Gaussian processes using Paley-Wiener and Karhunen-Lo`eve expansions. We construct efficient estimators for the drift of such processes, and prove their minimaxity using Bayes estimators. We also construct superefficient estimators of Stein type for such drifts using the Malliavin integration by parts formula and stochastic analysis on Gaussian space, in which superharmonic functionals of the process paths play a particular role. Our results are illustrated by numerical simulations and extend the construction of James-Stein type estimators for Gaussian processes by Berger and Wolper.
Performance guarantees for compression in nonlinear models under non-Gaussian observations can be achieved through the use of distributional characteristics that are sensitive to the distance to normality, and which in particular return the value of zero under Gaussian or linear sensing. The use of these characteristics, or discrepancies, improves some previous results in this area by relaxing conditions and tightening performance bounds. In addition, these characteristics are tractable to compute when Gaussian sensing is corrupted by either additive errors or mixing.
Consider a random vector $mathbf{y}=mathbf{Sigma}^{1/2}mathbf{x}$, where the $p$ elements of the vector $mathbf{x}$ are i.i.d. real-valued random variables with zero mean and finite fourth moment, and $mathbf{Sigma}^{1/2}$ is a deterministic $ptimes p$ matrix such that the spectral norm of the population correlation matrix $mathbf{R}$ of $mathbf{y}$ is uniformly bounded. In this paper, we find that the log determinant of the sample correlation matrix $hat{mathbf{R}}$ based on a sample of size $n$ from the distribution of $mathbf{y}$ satisfies a CLT (central limit theorem) for $p/nto gammain (0, 1]$ and $pleq n$. Explicit formulas for the asymptotic mean and variance are provided. In case the mean of $mathbf{y}$ is unknown, we show that after recentering by the empirical mean the obtained CLT holds with a shift in the asymptotic mean. This result is of independent interest in both large dimensional random matrix theory and high-dimensional statistical literature of large sample correlation matrices for non-normal data. At last, the obtained findings are applied for testing of uncorrelatedness of $p$ random variables. Surprisingly, in the null case $mathbf{R}=mathbf{I}$, the test statistic becomes completely pivotal and the extensive simulations show that the obtained CLT also holds if the moments of order four do not exist at all, which conjectures a promising and robust test statistic for heavy-tailed high-dimensional data.
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test if two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov--Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow over a related graph which can be solved using a Ford--Fulkerson algorithm in polynomial time on that number. We apply the test to 10 randomly chosen protein domain families from the seed of Pfam-A database (high quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton--Watson related processes.
Nonparametric latent structure models provide flexible inference on distinct, yet related, groups of observations. Each component of a vector of $d ge 2$ random measures models the distribution of a group of exchangeable observations, while their dependence structure regulates the borrowing of information across different groups. Recent work has quantified the dependence between random measures in terms of Wasserstein distance from the maximally dependent scenario when $d=2$. By solving an intriguing max-min problem we are now able to define a Wasserstein index of dependence $I_mathcal{W}$ with the following properties: (i) it simultaneously quantifies the dependence of $d ge 2$ random measures; (ii) it takes values in [0,1]; (iii) it attains the extreme values ${0,1}$ under independence and complete dependence, respectively; (iv) since it is defined in terms of the underlying Levy measures, it is possible to evaluate it numerically in many Bayesian nonparametric models for partially exchangeable data.