In data science, it is often necessary to estimate dependencies between different data sources. These dependencies are typically calculated using Pearson's correlation, distance correlation, and/or mutual information. However, none of these measures satisfies all of Granger's axioms for an ideal measure. One such ideal measure, proposed by Granger himself, calculates the Bhattacharyya distance between the joint probability density function (pdf) and the product of the marginal pdfs. We call this measure the mutual dependence. However, to date this measure has not been directly computable from data. In this paper, we use our recently introduced maximum likelihood non-parametric estimator for band-limited pdfs to compute the mutual dependence directly from the data. We construct the estimator of mutual dependence and compare its performance to standard measures (Pearson's and distance correlation) for different known pdfs by computing convergence rates, computational complexity, and the ability to capture nonlinear dependencies. Our mutual dependence estimator requires fewer samples to converge to theoretical values, is faster to compute, and captures more complex dependencies than standard measures.
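To make the idea of comparing a joint distribution with the product of its marginals concrete, the following is a minimal sketch for a discrete joint probability table. It is only a discrete analogue under stated assumptions, not the band-limited maximum likelihood estimator described in the abstract; the function names `bhattacharyya_coefficient` and `bhattacharyya_distance` are hypothetical, and the distance is taken as the negative log of the Bhattacharyya coefficient.

```python
import math


def bhattacharyya_coefficient(joint):
    """Bhattacharyya coefficient between a discrete joint table and the
    product of its marginals. `joint` is a list of rows of probabilities
    that sum to 1 (an assumption of this sketch)."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            # sqrt(joint * product of marginals), summed over all cells
            total += math.sqrt(p_ij * row_marginal[i] * col_marginal[j])
    return total


def bhattacharyya_distance(joint):
    """Negative log of the coefficient: 0 when the joint table factorizes
    into its marginals, larger when it does not."""
    return -math.log(bhattacharyya_coefficient(joint))


# Independent case: the joint equals the product of the marginals.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Fully dependent case: the second variable is determined by the first.
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(bhattacharyya_distance(independent))
print(bhattacharyya_distance(dependent))
```

In this toy setting the distance is zero exactly when the table factorizes and strictly positive otherwise, which is the qualitative behaviour a dependence measure built on this comparison relies on.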
Nonparametric latent structure models provide flexible inference on distinct, yet related, groups of observations. Each component of a vector of $d \ge 2$ random measures models the distribution of a group of exchangeable observations, while their dependence structure regulates the borrowing of information across different groups. Recent work has quantified the dependence between random measures in terms of Wasserstein distance from the maximally dependent scenario when $d=2$. By solving an intriguing max-min problem we are now able to define a Wasserstein index of dependence $I_{\mathcal{W}}$ with the following properties: (i) it simultaneously quantifies the dependence of $d \ge 2$ random measures; (ii) it takes values in $[0,1]$; (iii) it attains the extreme values $\{0,1\}$ under independence and complete dependence, respectively; (iv) since it is defined in terms of the underlying Lévy measures, it is possible to evaluate it numerically in many Bayesian nonparametric models for partially exchangeable data.
Fields like public health, public policy, and social science often want to quantify the degree of dependence between variables whose relationships take on unknown functional forms. Typically, in fact, researchers in these fields are attempting to evaluate causal theories, and so want to quantify dependence after conditioning on other variables that might explain, mediate or confound causal relations. One reason conditional mutual information is not more widely used for these tasks is the lack of estimators which can handle combinations of continuous and discrete random variables, common in applications. This paper develops a new method for estimating mutual and conditional mutual information for data samples containing a mix of discrete and continuous variables. We prove that this estimator is consistent and show, via simulation, that it is more accurate than similar estimators.
We consider the problem of undirected graphical model inference. In many applications, instead of perfectly recovering the unknown graph structure, a more realistic goal is to infer some graph invariants (e.g., the maximum degree, the number of connected subgraphs, the number of isolated nodes). In this paper, we propose a new inferential framework for testing nested multiple hypotheses and constructing confidence intervals for the unknown graph invariants under undirected graphical models. Compared to perfect graph recovery, our methods require significantly weaker conditions. This paper makes two major contributions: (i) Methodologically, for testing nested multiple hypotheses, we propose a skip-down algorithm on the whole family of monotone graph invariants (the invariants that are non-decreasing under addition of edges). We further show that the same skip-down algorithm also provides valid confidence intervals for the targeted graph invariants. (ii) Theoretically, we prove that the length of the obtained confidence intervals is optimal and adaptive to the unknown signal strength. We also prove generic lower bounds for the confidence interval length for various invariants. Numerical results on both synthetic simulations and a brain imaging dataset are provided to illustrate the usefulness of the proposed method.
Statistical methods for functional data are of interest for many applications. In this paper, we prove a central limit theorem for random variables taking their values in a Hilbert space. The random variables are assumed to be weakly dependent in the sense of near epoch dependence, where the underlying process fulfills some mixing conditions. As parametric inference in an infinite-dimensional space is difficult, we show that the nonoverlapping block bootstrap is consistent. Furthermore, we show how these results can be used for degenerate von Mises statistics.
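The resampling scheme named above can be illustrated with a minimal sketch of a nonoverlapping block bootstrap. This sketch assumes a real-valued series rather than Hilbert-space-valued data, and the function name `nonoverlapping_block_bootstrap`, its parameters, and the choice of the sample mean as the statistic are all hypothetical choices made for illustration; it does not reproduce the paper's construction or its consistency conditions.

```python
import random


def nonoverlapping_block_bootstrap(series, block_len, n_boot, stat, seed=0):
    """Return `n_boot` bootstrap replicates of `stat`.

    The series is cut into consecutive, disjoint blocks of length
    `block_len` (a trailing remainder is dropped). Each replicate draws
    as many blocks as there are, with replacement, and concatenates them,
    so dependence inside a block is preserved."""
    rng = random.Random(seed)
    n_blocks = len(series) // block_len
    blocks = [
        series[k * block_len:(k + 1) * block_len] for k in range(n_blocks)
    ]
    replicates = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(n_blocks):
            resampled.extend(rng.choice(blocks))
        replicates.append(stat(resampled))
    return replicates


def sample_mean(values):
    return sum(values) / len(values)


# Toy series 0, 1, ..., 11 split into four blocks of length 3.
toy_series = list(range(12))
replicates = nonoverlapping_block_bootstrap(
    toy_series, block_len=3, n_boot=50, stat=sample_mean
)
print(min(replicates), max(replicates))
```

The spread of the replicates serves as an approximation to the sampling variability of the statistic; the block length is a tuning choice that this sketch does not attempt to select.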
We consider the problem of designing experiments for the comparison of two regression curves describing the relation between a predictor and a response in two groups, where the data between and within the groups may be dependent. In order to derive efficient designs, we use results from stochastic analysis to identify the best linear unbiased estimator (BLUE) in a corresponding continuous time model. It is demonstrated that, in general, simultaneous estimation using the data from both groups yields more precise results than estimation of the parameters separately in the two groups. Using the BLUE from simultaneous estimation, we then construct an efficient linear estimator for finite sample size by minimizing, with respect to the weights of the linear estimator, the mean squared error between the optimal solution in the continuous time model and its discrete approximation. Finally, the optimal design points are determined by minimizing the maximal width of a simultaneous confidence band for the difference of the two regression functions. The advantages of the new approach are illustrated by means of a simulation study, where it is shown that the use of the optimal designs yields substantially narrower confidence bands than the application of uniform designs.