The Chi-Square Test of Distance Correlation

Published by Cencheng Shen

Publication date: 2019

Research field: Mathematical statistics, Informatics engineering

Paper language: English

Authors: Cencheng Shen, Sambit Panda, Joshua T. Vogelstein

Machine learning, Statistics theory

Abstract

Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if the underlying random variables are independent, making it an ideal choice for discovering any type of dependency structure given a sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and the choice of metric, a permutation test is typically required to estimate the null and compute the p-value, which is very costly for large amounts of data. To overcome this difficulty, in this paper we propose a chi-square test for distance correlation. Method-wise, the chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any metric of strong negative type or any characteristic kernel. The test exhibits testing power similar to that of the standard permutation test and can be used for K-sample and partial testing. Theory-wise, we show that the underlying chi-square distribution well approximates and dominates the limiting null distribution in the upper tail, prove that the chi-square test can be valid and universally consistent for testing independence, and establish a testing power inequality with respect to the permutation test.
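To make the idea concrete, here is a minimal sketch of such a test for one-dimensional samples with the Euclidean metric. The abstract does not state the test statistic, so the sketch assumes the form p = P(χ²₁ > n·Dcor + 1), where Dcor is the bias-corrected (U-centered) sample distance correlation; treat that formula as an assumption to check against the paper. The function names `u_centered`, `inner`, and `dcorr_chi2_test` are hypothetical, chosen for illustration only.

```python
import math
import random


def u_centered(points):
    """U-centered Euclidean distance matrix of a 1-D sample (bias-correction step)."""
    n = len(points)
    d = [[abs(a - b) for b in points] for a in points]
    row = [sum(r) for r in d]          # row sums (equal to column sums by symmetry)
    total = sum(row)
    return [
        [
            0.0 if i == j else
            d[i][j] - row[i] / (n - 2) - row[j] / (n - 2)
            + total / ((n - 1) * (n - 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def inner(a, b):
    """Unbiased inner product of two U-centered matrices (diagonals are zero)."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * (n - 3))


def dcorr_chi2_test(x, y):
    """Return (bias-corrected distance correlation, chi-square p-value).

    ASSUMPTION: the p-value is P(chi2_1 > n * dcorr + 1); this form is not
    given in the abstract above.
    """
    n = len(x)
    if n < 4 or len(y) != n:
        raise ValueError("need two samples of equal length n >= 4")
    a, b = u_centered(x), u_centered(y)
    denom = math.sqrt(inner(a, a) * inner(b, b))
    stat = inner(a, b) / denom if denom > 0 else 0.0
    z = n * stat + 1
    # Survival function of chi-square with 1 degree of freedom: erfc(sqrt(z / 2)).
    pvalue = math.erfc(math.sqrt(z / 2)) if z > 0 else 1.0
    return stat, pvalue


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0, 1) for _ in range(100)]
    y_dep = [v * v + 0.1 * rng.gauss(0, 1) for v in x]   # nonlinear dependence
    y_ind = [rng.gauss(0, 1) for _ in range(100)]        # independent of x
    print("dependent:  ", dcorr_chi2_test(x, y_dep))
    print("independent:", dcorr_chi2_test(x, y_ind))
```

No resampling is involved: once the two distance matrices are formed, the p-value comes from a single closed-form tail probability, which is where the speed advantage over a permutation test comes from. This pure-Python version costs O(n²) time and memory and is meant only for small samples.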

Pierre Gaillard, 2018

We consider the setting of online linear regression for arbitrary deterministic sequences, with the square loss. We are interested in the aim set by Bartlett et al. (2015): obtain regret bounds that hold uniformly over all competitor vectors. When the feature sequence is known at the beginning of the game, they provided closed-form regret bounds of $2d B^2 \ln T + \mathcal{O}_T(1)$, where $T$ is the number of rounds and $B$ is a bound on the observations. Instead, we derive bounds with an optimal constant of $1$ in front of the $d B^2 \ln T$ term. In the case of sequentially revealed features, we also derive an asymptotic regret bound of $d B^2 \ln T$ for any individual sequence of features and bounded observations. All our algorithms are variants of the online non-linear ridge regression forecaster, either with a data-dependent regularization or with almost no regularization.

Machine learning, Statistics theory

FED-$\chi^2$: Privacy Preserving Federated Correlation Test

Lun Wang, Qi Pang, Shuai Wang, 2021

In this paper, we propose the first secure federated $\chi^2$-test protocol, Fed-$\chi^2$. To minimize both the privacy leakage and the communication cost, we recast the $\chi^2$-test as a second moment estimation problem and can thus take advantage of stable projection to encode the local information in a short vector. As such encodings can be aggregated with only summation, secure aggregation can be naturally applied to hide the individual updates. We formally prove the security guarantee of Fed-$\chi^2$: the joint distribution is hidden in a subspace with exponentially many possible distributions. Our evaluation results show that Fed-$\chi^2$ achieves negligible accuracy drops with small client-side computation overhead. In several real-world case studies, the performance of Fed-$\chi^2$ is comparable to that of the centralized $\chi^2$-test.

Cryptography and security; Distributed, parallel, and cluster computing

From Distance Correlation to Multiscale Graph Correlation

Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein, 2017

Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation (a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments) to the Multiscale Graph Correlation (MGC). By utilizing the characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound Sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to distance correlation and other popular methods.

Machine learning

A new test procedure of independence in copula models via chi-square-divergence

Salim Bouzebda, 2011

We introduce a new test procedure of independence in the framework of parametric copulas with unknown marginals. The method is based essentially on the dual representation of $\chi^2$-divergence on signed finite measures. The asymptotic properties of the proposed estimate and the test statistic are studied under the null and alternative hypotheses, with simple and standard limit distributions whether or not the parameter is an interior point.

Statistics theory

Measuring the Algorithmic Convergence of Randomized Ensembles: The Regression Setting

Miles E. Lopes, Suofei Wu, Thomas C. M. Lee, 2019

When randomized ensemble methods such as bagging and random forests are implemented, a basic question arises: Is the ensemble large enough? In particular, the practitioner desires a rigorous guarantee that a given ensemble will perform nearly as well as an ideal infinite ensemble (trained on the same data). The purpose of the current paper is to develop a bootstrap method for solving this problem in the context of regression, which complements our companion paper in the context of classification (Lopes 2019). In contrast to the classification setting, the current paper shows that theoretical guarantees for the proposed bootstrap can be established under much weaker assumptions. In addition, we illustrate the flexibility of the method by showing how it can be adapted to measure algorithmic convergence for variable selection. Lastly, we provide numerical results demonstrating that the method works well in a range of situations.

Machine learning, Statistics theory