Do you want to publish a course? Click here

Testing for Association in Multi-View Network Data

97   0   0.0 ( 0 )
 Added by Lucy Gao
 Publication date 2019
and research's language is English




Ask ChatGPT about the research

In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database (Das and Hint, 2012). We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to co-complex association data. We also extend this proposal to the setting of a network with node covariates.



rate research

Read More

We present a general framework for hypothesis testing on distributions of sets of individual examples. Sets may represent many common data sources such as groups of observations in time series, collections of words in text or a batch of images of a given phenomenon. This observation pattern, however, differs from the common assumptions required for hypothesis testing: each set differs in size, may have differing levels of noise, and also may incorporate nuisance variability, irrelevant for the analysis of the phenomenon of interest; all features that bias test decisions if not accounted for. In this paper, we propose to interpret sets as independent samples from a collection of latent probability distributions, and introduce kernel two-sample and independence tests in this latent space of distributions. We prove the consistency of tests and observe them to outperform in a wide range of synthetic experiments. Finally, we showcase their use in practice with experiments of healthcare and climate data, where previously heuristics were needed for feature extraction and testing.
The practice of pooling several individual test statistics to form aggregate tests is common in many statistical application where individual tests may be underpowered. While selection by aggregate tests can serve to increase power, the selection process invalidates the individual test-statistics, making it difficult to identify the ones that drive the signal in follow-up inference. Here, we develop a general approach for valid inference following selection by aggregate testing. We present novel powerful post-selection tests for the individual null hypotheses which are exact for the normal model and asymptotically justified otherwise. Our approach relies on the ability to characterize the distribution of the individual test statistics after conditioning on the event of selection. We provide efficient algorithms for estimation of the post-selection maximum-likelihood estimates and suggest confidence intervals which rely on a novel switching regime for good coverage guarantees. We validate our methods via comprehensive simulation studies and apply them to data from the Dallas Heart Study, demonstrating that single variant association discovery following selection by an aggregated test is indeed possible in practice.
HIV-1C is the most prevalent subtype of HIV-1 and accounts for over half of HIV-1 infections worldwide. Host genetic influence of HIV infection has been previously studied in HIV-1B, but little attention has been paid to the more prevalent subtype C. To understand the role of host genetics in HIV-1C disease progression, we perform a study to assess the association between longitudinally collected measures of disease and more than 100,000 genetic markers located on chromosome 6. The most common approach to analyzing longitudinal data in this context is linear mixed effects models, which may be overly simplistic in this case. On the other hand, existing non-parametric methods may suffer from low power due to high degrees of freedom (DF) and may be computationally infeasible at the large scale. We propose a functional principal variance component (FPVC) testing framework which captures the nonlinearity in the CD4 and viral load with potentially low DF and is fast enough to carry out thousands or millions of times. The FPVC testing unfolds in two stages. In the first stage, we summarize the markers of disease progression according to their major patterns of variation via functional principal components analysis (FPCA). In the second stage, we employ a simple working model and variance component testing to examine the association between the summaries of disease progression and a set of single nucleotide polymorphisms. We supplement this analysis with simulation results which indicate that FPVC testing can offer large power gains over the standard linear mixed effects model.
142 - Jin Liu , Can Yang , Xingjie Shi 2013
In genome-wide association studies (GWAS), penalization is an important approach for identifying genetic markers associated with trait while mixed model is successful in accounting for a complicated dependence structure among samples. Therefore, penalized linear mixed model is a tool that combines the advantages of penalization approach and linear mixed model. In this study, a GWAS with multiple highly correlated traits is analyzed. For GWAS with multiple quantitative traits that are highly correlated, the analysis using traits marginally inevitably lose some essential information among multiple traits. We propose a penalized-MTMM, a penalized multivariate linear mixed model that allows both the within-trait and between-trait variance components simultaneously for multiple traits. The proposed penalized-MTMM estimates variance components using an AI-REML method and conducts variable selection and point estimation simultaneously using group MCP and sparse group MCP. Best linear unbiased predictor (BLUP) is used to find predictive values and the Pearsons correlations between predictive values and their corresponding observations are used to evaluate prediction performance. Both prediction and selection performance of the proposed approach and its comparison with the uni-trait penalized-LMM are evaluated through simulation studies. We apply the proposed approach to a GWAS data from Genetic Analysis Workshop (GAW) 18.
115 - Bowen Gang , Wenguang Sun , 2020
Consider the online testing of a stream of hypotheses where a real--time decision must be made before the next data point arrives. The error rate is required to be controlled at {all} decision points. Conventional emph{simultaneous testing rules} are no longer applicable due to the more stringent error constraints and absence of future data. Moreover, the online decision--making process may come to a halt when the total error budget, or alpha--wealth, is exhausted. This work develops a new class of structure--adaptive sequential testing (SAST) rules for online false discover rate (FDR) control. A key element in our proposal is a new alpha--investment algorithm that precisely characterizes the gains and losses in sequential decision making. SAST captures time varying structures of the data stream, learns the optimal threshold adaptively in an ongoing manner and optimizes the alpha-wealth allocation across different time periods. We present theory and numerical results to show that the proposed method is valid for online FDR control and achieves substantial power gain over existing online testing rules.
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا