From Distance Correlation to Multiscale Graph Correlation

Published by: Cencheng Shen

Publication date: 2017

Research field: Mathematical Statistics

Paper language: English

Authors: Cencheng Shen, Carey E. Priebe, Joshua T. Vogelstein

Machine Learning

Abstract

Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation (a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions with finite moments) to the Multiscale Graph Correlation (MGC). By utilizing characteristic functions and incorporating the nearest-neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale for a given dependency, and name the optimal local correlation MGC. The new theoretical framework motivates a theoretically sound Sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence, and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to distance correlation and other popular methods.
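As a rough illustration of the starting point that MGC generalizes, the sketch below computes the biased sample distance correlation for two scalar samples in plain Python. It is not the MGC statistic and not the authors' code: the function names are hypothetical, and the absolute difference is assumed as the distance between observations.

```python
import math


def pairwise_distances(values):
    """Absolute-difference distance matrix for a list of scalars (assumed metric)."""
    return [[abs(a - b) for b in values] for a in values]


def double_center(dist):
    """Subtract row and column means from each entry and add back the grand mean."""
    n = len(dist)
    row_means = [sum(row) / n for row in dist]
    col_means = [sum(dist[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_product(a, b):
    """Average of the entrywise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Biased sample distance correlation of two equal-length scalar samples."""
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    dcov_sq = mean_product(a, b)          # squared distance covariance
    dvar_x_sq = mean_product(a, a)        # squared distance variance of x
    dvar_y_sq = mean_product(b, b)        # squared distance variance of y
    denom = math.sqrt(dvar_x_sq * dvar_y_sq)
    if denom == 0.0:
        return 0.0                        # a constant sample carries no dependence
    return math.sqrt(max(dcov_sq, 0.0) / denom)
```

For the symmetric sample x = [-2, -1, 0, 1, 2] with y = x², Pearson's correlation is zero, whereas `distance_correlation(x, y)` is clearly positive, which is the kind of nonlinear dependence the abstract refers to.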

Cencheng Shen, Sambit Panda, Joshua T. Vogelstein, 2019

Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if the variables are independent, making it an ideal choice for discovering any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and metric choice, it typically requires a permutation test to estimate the null and compute the p-value, which is very costly for large amounts of data. To overcome this difficulty, in this paper we propose a chi-square test for distance correlation. Method-wise, the chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any strong negative type metric or characteristic kernel. The test exhibits testing power similar to that of the standard permutation test, and can be utilized for K-sample and partial testing. Theory-wise, we show that the underlying chi-square distribution well approximates and dominates the limiting null distribution in the upper tail, prove that the chi-square test can be valid and universally consistent for testing independence, and establish a testing power inequality with respect to the permutation test.
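The permutation procedure that the abstract identifies as the bottleneck can be sketched as follows. This is a minimal plain-Python illustration of the standard permutation test, not the chi-square test proposed in the paper; the function names, the default of 200 permutations, and the use of the biased distance covariance with an absolute-difference metric are all assumptions made for the example.

```python
import random


def centered_distances(values):
    """Double-centered absolute-difference distance matrix of a scalar sample."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    means = [sum(row) / n for row in dist]  # row means equal column means (symmetry)
    grand = sum(means) / n
    return [
        [dist[i][j] - means[i] - means[j] + grand for j in range(n)]
        for i in range(n)
    ]


def distance_covariance_sq(a, b, order=None):
    """Biased squared distance covariance; `order` relabels the second sample."""
    n = len(a)
    idx = order if order is not None else list(range(n))
    total = sum(a[i][j] * b[idx[i]][idx[j]] for i in range(n) for j in range(n))
    return total / (n * n)


def permutation_pvalue(x, y, num_permutations=200, seed=0):
    """Estimate the p-value of the independence test by permuting y."""
    rng = random.Random(seed)
    a = centered_distances(x)
    b = centered_distances(y)
    observed = distance_covariance_sq(a, b)
    idx = list(range(len(x)))
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(idx)  # break the pairing between x and y
        if distance_covariance_sq(a, b, idx) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the permutations.
    return (exceed + 1) / (num_permutations + 1)
```

Each permutation costs on the order of n² operations, so the whole test scales with the number of permutations times n². That cost is what motivates a closed-form approximation to the null distribution.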

Machine Learning, Statistics Theory

Valid Two-Sample Graph Testing via Optimal Transport Procrustes and Multiscale Graph Correlation with Applications in Connectomics

Jaewon Chung, Bijan Varjavand, Jesus Arroyo, 2019

Testing whether two graphs come from the same distribution is of interest in many real-world scenarios, including brain network analysis. Under the random dot product graph model, the nonparametric hypothesis testing framework consists of embedding the graphs using the adjacency spectral embedding (ASE), followed by aligning the embeddings using the median flip heuristic, and finally applying the nonparametric maximum mean discrepancy (MMD) test to obtain a p-value. Using synthetic data generated from Drosophila brain networks, we show that the median flip heuristic results in an invalid test, and demonstrate that optimal transport Procrustes (OTP) for alignment resolves the invalidity. We further demonstrate that substituting the MMD test with the multiscale graph correlation (MGC) test leads to a more powerful test both in synthetic and in simulated data. Lastly, we apply this powerful test to the right and left hemispheres of the larval Drosophila mushroom body brain networks, and conclude that there is not sufficient evidence to reject the null hypothesis that the two hemispheres are equally distributed.

Methodology

Distance matrix correlation spectrum of graphs

Pengli Lu, Wenzhi Liu, 2020

Let $G$ be a simple, connected graph, $\mathcal{D}(G)$ be the distance matrix of $G$, and $Tr(G)$ be the diagonal matrix of vertex transmissions of $G$. The distance Laplacian matrix and distance signless Laplacian matrix of $G$ are defined by $\mathcal{L}(G) = Tr(G)-\mathcal{D}(G)$ and $\mathcal{Q}(G) = Tr(G)+\mathcal{D}(G)$, respectively. The eigenvalues of $\mathcal{D}(G)$, $\mathcal{L}(G)$ and $\mathcal{Q}(G)$ are called the $\mathcal{D}$-spectrum, $\mathcal{L}$-spectrum and $\mathcal{Q}$-spectrum, respectively. The generalized distance matrix of $G$ is defined as $\mathcal{D}_{\alpha}(G)=\alpha Tr(G)+(1-\alpha)\mathcal{D}(G),~0\leq\alpha\leq1$, and the generalized distance spectral radius of $G$ is the largest eigenvalue of $\mathcal{D}_{\alpha}(G)$. In this paper, we give a complete description of the $\mathcal{D}$-spectrum, $\mathcal{L}$-spectrum and $\mathcal{Q}$-spectrum of some graphs obtained by operations. In addition, we present some new upper and lower bounds on the generalized distance spectral radius of $G$ and of its line graph $L(G)$, based on other graph-theoretic parameters, and characterize the extremal graphs. Finally, we study the generalized distance spectrum of some composite graphs.
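The definitions in this abstract can be made concrete with a small sketch. The code below builds the distance matrix of a connected unweighted graph by breadth-first search, forms the generalized distance matrix from the formula above, and estimates its largest eigenvalue by power iteration. The function names, the adjacency-list input format, and the use of power iteration are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import deque


def distance_matrix(adjacency):
    """All-pairs shortest-path lengths of a connected unweighted graph (BFS)."""
    n = len(adjacency)
    dist = []
    for source in range(n):
        row = [None] * n
        row[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if row[v] is None:
                    row[v] = row[u] + 1
                    queue.append(v)
        if None in row:
            raise ValueError("graph must be connected")
        dist.append(row)
    return dist


def generalized_distance_matrix(adjacency, alpha):
    """D_alpha = alpha * Tr + (1 - alpha) * D, where Tr holds the row sums of D."""
    dist = distance_matrix(adjacency)
    n = len(dist)
    transmission = [sum(row) for row in dist]
    return [
        [alpha * transmission[i] if i == j else (1 - alpha) * dist[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def spectral_radius(matrix, iterations=500):
    """Largest eigenvalue of a symmetric nonnegative matrix via power iteration."""
    n = len(matrix)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return 0.0
        vec = [v / norm for v in nxt]
    image = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(vec[i] * image[i] for i in range(n))  # Rayleigh quotient, |vec| = 1
```

Setting `alpha = 0` recovers the distance matrix itself, `alpha = 1` gives the transmission matrix, and `alpha = 0.5` gives half of the distance signless Laplacian, so one function covers the cases named in the abstract.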

Combinatorics

D2KE: From Distance to Kernel and Embedding

Lingfei Wu, Ian En-Hsu Yen, Fangli Xu, 2018

For many machine learning problem settings, particularly with structured inputs such as sequences or sets of objects, a distance measure between inputs can be specified more naturally than a feature representation. However, most standard machine models are designed for inputs with a vector feature representation. In this work, we consider the estimation of a function $f:\mathcal{X} \rightarrow \mathbb{R}$ based solely on a dissimilarity measure $d:\mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}$ between inputs. In particular, we propose a general framework to derive a family of positive definite kernels from a given dissimilarity measure, which subsumes the widely used representative-set method as a special case, and relates to the well-known distance substitution kernel in a limiting case. We show that functions in the corresponding Reproducing Kernel Hilbert Space (RKHS) are Lipschitz-continuous w.r.t. the given distance metric. We provide a tractable algorithm to estimate a function from this RKHS, and show that it enjoys better generalizability than Nearest-Neighbor estimates. Our approach draws from the literature of Random Features, but instead of deriving feature maps from an existing kernel, we construct novel kernels from a random feature map that we specify given the distance measure. We conduct classification experiments with such disparate domains as strings, time series, and sets of vectors, where our proposed framework compares favorably to existing distance-based learning methods such as $k$-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embedding, and the representative-set method.
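To give a feel for building a kernel from a random feature map specified by a distance, here is a hedged sketch for strings. The abstract does not spell out the feature map, so the form exp(-gamma * d(x, omega)) with randomly drawn reference strings omega is an assumption of this example, as are all function names, the edit distance as the dissimilarity, and the parameter defaults.

```python
import math
import random


def edit_distance(s, t):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]


def sample_reference_strings(count, max_length=5, alphabet="ab", seed=0):
    """Draw random reference strings; this sampling scheme is an assumption."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(1, max_length)))
        for _ in range(count)
    ]


def random_features(item, references, gamma=0.5, distance=edit_distance):
    """Map an item to a vector with one entry per reference object."""
    scale = 1.0 / math.sqrt(len(references))
    return [scale * math.exp(-gamma * distance(item, ref)) for ref in references]


def approximate_kernel(x, y, references, gamma=0.5):
    """Inner product of feature vectors; positive semidefinite by construction."""
    fx = random_features(x, references, gamma)
    fy = random_features(y, references, gamma)
    return sum(a * b for a, b in zip(fx, fy))
```

Because the kernel value is an inner product of explicit feature vectors, any vector-based learner can be trained on `random_features(...)` directly, which is the practical appeal of a random-feature construction.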

Machine Learning

Generalized Canonical Correlation Analysis for Classification

Cencheng Shen, Ming Sun, Minh Tang, 2013

For multiple multivariate data sets, we derive conditions under which Generalized Canonical Correlation Analysis (GCCA) improves classification performance of the projected datasets, compared to standard Canonical Correlation Analysis (CCA) using only two data sets. We illustrate our theoretical results with simulations and a real data experiment.

Machine Learning
