
From Distance Correlation to Multiscale Graph Correlation

Published by: Cencheng Shen
Publication date: 2017
Research field: Mathematical Statistics
Paper language: English





Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big-data age. In this paper, we establish a new framework that generalizes distance correlation (a recently proposed correlation measure shown to be universally consistent for dependence testing against all joint distributions of finite moments) to the Multiscale Graph Correlation (MGC). By utilizing characteristic functions and incorporating nearest-neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the local correlation at the optimal scale MGC. The new theoretical framework motivates a theoretically sound Sample MGC and allows a number of desirable properties to be proved, including universal consistency, convergence, and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies: compared to distance correlation and other popular methods, it loses almost no power against monotone dependencies while achieving better performance against general dependencies.
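To make the local-correlation construction concrete, here is a minimal numpy sketch of the sample statistic: it column-centers the pairwise distance matrices, restricts them to every pair of (k, l)-nearest-neighbor scales, and reports the largest local correlation. This is an illustrative simplification under our own naming (the paper's Sample MGC replaces the plain maximum with a smoothed maximum over the scale map), not the authors' exact estimator.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mgc_sketch(x, y):
    """Largest local distance correlation over all (k, l) neighborhood scales.

    Simplified sketch: Sample MGC uses a smoothed maximum over the scale map
    rather than the plain maximum taken here.
    """
    n = x.shape[0]
    Dx, Dy = cdist(x, x), cdist(y, y)
    # column-center each distance matrix (one of several centering choices)
    A = Dx - Dx.mean(axis=0)
    B = Dy - Dy.mean(axis=0)
    # rank of each entry within its row; rank 0 is the point itself
    Rx = np.argsort(np.argsort(Dx, axis=1), axis=1)
    Ry = np.argsort(np.argsort(Dy, axis=1), axis=1)
    best = -np.inf
    for k in range(2, n + 1):          # neighborhood size in x
        for l in range(2, n + 1):      # neighborhood size in y
            Ak = A * (Rx < k)          # keep only k-nearest-neighbor entries
            Bl = B * (Ry < l)
            denom = np.sqrt((Ak ** 2).sum() * (Bl ** 2).sum())
            if denom > 0:
                best = max(best, (Ak * Bl.T).sum() / denom)
    return best

# noisy quadratic dependence: the optimal scale is local, not global
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (50, 1))
y = x ** 2 + 0.1 * rng.standard_normal((50, 1))
print(mgc_sketch(x, y))
```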




Read also

Distance correlation has gained much recent attention in the data science community: the sample statistic is straightforward to compute and asymptotically equals zero if and only if the variables are independent, making it an ideal choice to discover any type of dependency structure given sufficient sample size. One major bottleneck is the testing process: because the null distribution of distance correlation depends on the underlying random variables and metric choice, it typically requires a permutation test to estimate the null and compute the p-value, which is very costly for large amounts of data. To overcome this difficulty, in this paper we propose a chi-square test for distance correlation. Method-wise, the chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any strong negative type metric or characteristic kernel. The test exhibits a similar testing power as the standard permutation test, and can be utilized for K-sample and partial testing. Theory-wise, we show that the underlying chi-square distribution well approximates and dominates the limiting null distribution in the upper tail, prove that the chi-square test is valid and universally consistent for testing independence, and establish a testing power inequality with respect to the permutation test.
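Below is a hedged sketch of how such a fast test could look in code: compute the bias-corrected (U-centered) distance correlation and read the p-value off a chi-square tail instead of a permutation null. The assumption that the statistic enters as n * dcorr + 1 against chi2(1) reflects the chi2(1) - 1 dominating null described above; the helper names are ours, not the paper's.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import chi2

def u_center(D):
    """U-centering of a distance matrix (Szekely-Rizzo bias correction)."""
    n = D.shape[0]
    U = (D
         - D.sum(axis=0) / (n - 2)          # column sums
         - D.sum(axis=1)[:, None] / (n - 2) # row sums
         + D.sum() / ((n - 1) * (n - 2)))   # grand sum
    np.fill_diagonal(U, 0.0)
    return U

def chi_square_dcorr_test(x, y):
    """Fast independence test: n * unbiased dcorr versus a chi-square tail.

    Sketch assuming p = P(chi2_1 >= n * dcorr + 1); the normalizing
    constants of the U-statistic cancel inside the correlation ratio.
    """
    n = x.shape[0]
    A, B = u_center(cdist(x, x)), u_center(cdist(y, y))
    denom = np.sqrt((A * A).sum() * (B * B).sum())
    dcorr = (A * B).sum() / denom if denom > 0 else 0.0
    pvalue = chi2.sf(n * dcorr + 1, df=1)
    return dcorr, pvalue

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 1))
y = np.abs(x) + 0.2 * rng.standard_normal((100, 1))
print(chi_square_dcorr_test(x, y))   # small p-value: dependence detected
```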
Testing whether two graphs come from the same distribution is of interest in many real-world scenarios, including brain network analysis. Under the random dot product graph model, the nonparametric hypothesis testing framework consists of embedding the graphs using the adjacency spectral embedding (ASE), followed by aligning the embeddings using the median flip heuristic, and finally applying the nonparametric maximum mean discrepancy (MMD) test to obtain a p-value. Using synthetic data generated from Drosophila brain networks, we show that the median flip heuristic results in an invalid test, and demonstrate that optimal transport Procrustes (OTP) for alignment resolves the invalidity. We further demonstrate that substituting the MMD test with the multiscale graph correlation (MGC) test leads to a more powerful test in both synthetic and simulated data. Lastly, we apply this powerful test to the right and left hemispheres of the larval Drosophila mushroom body brain networks, and conclude that there is not sufficient evidence to reject the null hypothesis that the two hemispheres are equally distributed.
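A self-contained numpy sketch of the embed-align-test pipeline on graphs with matched vertices: ASE via a truncated eigendecomposition, alignment via scipy's orthogonal Procrustes (the paper's OTP alignment addresses the harder unmatched setting), and a simple permutation test on the aligned embeddings standing in for MMD/MGC. All function names and the stand-in statistic are ours.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def ase(A, d):
    """Adjacency spectral embedding: top-d scaled eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def two_sample_pvalue(X, Y, reps=500, seed=0):
    """Permutation test with a mean-embedding distance as a stand-in
    statistic; the paper uses MMD or MGC at this step."""
    rng = np.random.default_rng(seed)
    stat = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
    Z, count = np.vstack([X, Y]), 0
    for _ in range(reps):
        perm = rng.permutation(len(Z))
        Xp, Yp = Z[perm[:len(X)]], Z[perm[len(X):]]
        count += np.linalg.norm(Xp.mean(0) - Yp.mean(0)) >= stat
    return (count + 1) / (reps + 1)

def sample_graph(P, z, rng):
    """Sample a symmetric adjacency matrix from a two-block SBM."""
    probs = P[z][:, z]
    A = np.triu((rng.uniform(size=probs.shape) < probs).astype(float), 1)
    return A + A.T

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.2], [0.2, 0.5]])
z = np.repeat([0, 1], 50)
A1, A2 = sample_graph(P, z, rng), sample_graph(P, z, rng)
X1, X2 = ase(A1, 2), ase(A2, 2)
Q, _ = orthogonal_procrustes(X2, X1)  # ASE is identifiable only up to rotation
print(two_sample_pvalue(X1, X2 @ Q))  # large p-value: same distribution
```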
Pengli Lu, Wenzhi Liu (2020)
Let $G$ be a simple, connected graph, $\mathcal{D}(G)$ be the distance matrix of $G$, and $Tr(G)$ be the diagonal matrix of vertex transmissions of $G$. The distance Laplacian matrix and distance signless Laplacian matrix of $G$ are defined by $\mathcal{L}(G) = Tr(G)-\mathcal{D}(G)$ and $\mathcal{Q}(G) = Tr(G)+\mathcal{D}(G)$, respectively. The eigenvalues of $\mathcal{D}(G)$, $\mathcal{L}(G)$ and $\mathcal{Q}(G)$ are called the $\mathcal{D}$-spectrum, $\mathcal{L}$-spectrum and $\mathcal{Q}$-spectrum, respectively. The generalized distance matrix of $G$ is defined as $\mathcal{D}_{\alpha}(G)=\alpha Tr(G)+(1-\alpha)\mathcal{D}(G)$, $0\leq\alpha\leq 1$, and the generalized distance spectral radius of $G$ is the largest eigenvalue of $\mathcal{D}_{\alpha}(G)$. In this paper, we give a complete description of the $\mathcal{D}$-spectrum, $\mathcal{L}$-spectrum and $\mathcal{Q}$-spectrum of some graphs obtained by operations. In addition, we present some new upper and lower bounds on the generalized distance spectral radius of $G$ and of its line graph $L(G)$, based on other graph-theoretic parameters, and characterize the extremal graphs. Finally, we study the generalized distance spectrum of some composite graphs.
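These definitions translate directly into a few lines of numpy; the sketch below computes the D-, L-, and Q-spectra and the generalized distance spectral radius for a sample graph, using networkx for shortest-path distances. The graph and alpha value are illustrative choices.

```python
import networkx as nx
import numpy as np

G = nx.petersen_graph()                     # any simple connected graph
D = np.asarray(nx.floyd_warshall_numpy(G))  # distance matrix D(G)
Tr = np.diag(D.sum(axis=1))                 # vertex transmissions Tr(G)

L = Tr - D                                  # distance Laplacian
Q = Tr + D                                  # distance signless Laplacian
alpha = 0.5
D_alpha = alpha * Tr + (1 - alpha) * D      # generalized distance matrix

print(sorted(np.linalg.eigvalsh(D), reverse=True))  # D-spectrum
print(sorted(np.linalg.eigvalsh(L), reverse=True))  # L-spectrum
print(max(np.linalg.eigvalsh(D_alpha)))     # generalized distance spectral radius
```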
For many machine learning problem settings, particularly with structured inputs such as sequences or sets of objects, a distance measure between inputs can be specified more naturally than a feature representation. However, most standard machine learning models are designed for inputs with a vector feature representation. In this work, we consider the estimation of a function $f:\mathcal{X} \rightarrow \mathbb{R}$ based solely on a dissimilarity measure $d:\mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}$ between inputs. In particular, we propose a general framework to derive a family of positive definite kernels from a given dissimilarity measure, which subsumes the widely used representative-set method as a special case and relates to the well-known distance substitution kernel in a limiting case. We show that functions in the corresponding Reproducing Kernel Hilbert Space (RKHS) are Lipschitz-continuous w.r.t. the given distance metric. We provide a tractable algorithm to estimate a function from this RKHS, and show that it enjoys better generalizability than nearest-neighbor estimates. Our approach draws from the literature on random features, but instead of deriving feature maps from an existing kernel, we construct novel kernels from a random feature map that we specify given the distance measure. We conduct classification experiments with such disparate domains as strings, time series, and sets of vectors, where our proposed framework compares favorably to existing distance-based learning methods such as $k$-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embedding, and the representative-set method.
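A minimal sketch of the representative-set idea the abstract mentions: draw random landmark inputs, map each input to a vector of transformed dissimilarities to the landmarks, and take the inner product of those feature vectors as the kernel, which is positive definite by construction. The exponential transform, landmark count, and toy string dissimilarity are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def make_distance_feature_map(landmarks, d, gamma=1.0):
    """Feature map phi(x) = [exp(-gamma * d(x, z_j))] over landmarks z_j.

    k(x, x') = <phi(x), phi(x')> is positive definite by construction for
    any dissimilarity d, mirroring the representative-set method.
    """
    def phi(x):
        return np.array([np.exp(-gamma * d(x, z)) for z in landmarks])
    return phi

def hamming(a, b):
    """Toy dissimilarity on equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

rng = np.random.default_rng(2)
landmarks = ["".join(rng.choice(list("ab"), 5)) for _ in range(20)]
phi = make_distance_feature_map(landmarks, hamming, gamma=0.5)
k = lambda x, y: phi(x) @ phi(y)   # a valid kernel on strings
print(k("aabab", "ababb"))
```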
For multiple multivariate data sets, we derive conditions under which Generalized Canonical Correlation Analysis (GCCA) improves classification performance of the projected data sets, compared to standard Canonical Correlation Analysis (CCA) using only two data sets. We illustrate our theoretical results with simulations and a real-data experiment.
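For intuition, a compact numpy sketch of one classical GCCA formulation (Carroll's MAXVAR): the shared factors are the top eigenvectors of the sum of each view's column-space projector, with two-view CCA as a special case. This is our simplified reading of GCCA, not necessarily the paper's exact estimator.

```python
import numpy as np

def gcca_maxvar(views, d):
    """MAXVAR GCCA sketch: shared factors G are the top-d eigenvectors of
    the sum of each centered view's column-space projection matrix."""
    projectors = []
    for X in views:
        Xc = X - X.mean(axis=0)
        projectors.append(Xc @ np.linalg.pinv(Xc))  # projection onto col(Xc)
    vals, vecs = np.linalg.eigh(sum(projectors))
    G = vecs[:, np.argsort(vals)[::-1][:d]]
    # per-view loadings mapping each view onto the shared factors
    W = [np.linalg.pinv(X - X.mean(axis=0)) @ G for X in views]
    return G, W

# three noisy views of a common 2-dimensional latent signal
rng = np.random.default_rng(3)
latent = rng.standard_normal((200, 2))
views = [latent @ rng.standard_normal((2, p)) + 0.3 * rng.standard_normal((200, p))
         for p in (5, 8, 6)]
G, W = gcca_maxvar(views, d=2)
print(G.shape, [w.shape for w in W])
```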