How might one test the hypothesis that graphs were sampled from the same distribution? Here, we compare two statistical tests that address this question. The first uses the observed subgraph densities themselves as estimates of those of the underlying distribution. The second test uses a new approach that converts these subgraph densities into estimates of the graph cumulants of the distribution. We demonstrate -- via theory, simulation, and application to real data -- the superior statistical power of using graph cumulants.
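As a rough illustration of the raw ingredients of the first test (the observed subgraph densities, not the graph-cumulant construction itself), a minimal sketch in Python could look like the following; the function name and the normalisations by counts of possible placements are our own, assuming a simple undirected graph given by a 0/1 adjacency matrix.

    import numpy as np
    from math import comb

    def subgraph_densities(A):
        # A: adjacency matrix of a simple undirected graph (0/1, zero diagonal).
        n = A.shape[0]
        deg = A.sum(axis=1)
        n_edges = A.sum() / 2
        n_wedges = (deg * (deg - 1) / 2).sum()        # paths of length two
        n_triangles = np.trace(A @ A @ A) / 6
        return {
            "edge":     n_edges     / comb(n, 2),
            "wedge":    n_wedges    / (3 * comb(n, 3)),
            "triangle": n_triangles / comb(n, 3),
        }

Converting such densities into graph cumulants (the second test) follows the construction in the paper and is not shown here.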
Frequently, a set of objects has to be evaluated by a panel of assessors, but not every object is assessed by every assessor. A problem facing such panels is how to take into account different standards amongst panel members and varying levels of confidence in their scores. Here, a mathematically based algorithm is developed to calibrate the scores of such assessors, addressing both of these issues. The algorithm is based on the connectivity of the graph of assessors and objects evaluated, incorporating declared confidences as weights on its edges. If the graph is sufficiently well connected, relative standards can be inferred by comparing how assessors rate the objects they assess in common, weighted by the levels of confidence of each assessment. By removing these biases, true values are inferred for all the objects. Reliability estimates for the resulting values are obtained. The algorithm is tested in two case studies, one by computer simulation and another based on realistic evaluation data. The process is compared to the simple averaging procedure in widespread use, and to Fisher's additive incomplete block analysis. It is anticipated that the algorithm will prove useful in a wide variety of situations, such as evaluation of the quality of research submitted to national assessment exercises; appraisal of grant proposals submitted to funding panels; ranking of job applicants; and judgement of performances on degree courses wherein candidates can choose from lists of options.
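For intuition only, a minimal weighted additive-model sketch in this spirit (closer in form to the incomplete block analysis mentioned above than to the full algorithm, and omitting the reliability estimates) is given below; the data layout, helper name, and the assumption that the assessor-object graph is connected are ours.

    import numpy as np

    def calibrate(scores, n_assessors, n_objects):
        # scores: iterable of (assessor, object, score, confidence) tuples,
        # with assessors indexed 0..n_assessors-1 and objects 0..n_objects-1.
        # Model: score = value[object] + bias[assessor] + noise,
        # with each observation weighted by its declared confidence.
        rows = list(scores)
        X = np.zeros((len(rows), n_objects + n_assessors))
        y = np.zeros(len(rows))
        w = np.zeros(len(rows))
        for r, (a, o, s, c) in enumerate(rows):
            X[r, o] = 1.0                 # object's underlying value
            X[r, n_objects + a] = 1.0     # assessor's additive bias
            y[r], w[r] = s, c
        sw = np.sqrt(w)
        # Pin the mean assessor bias to zero to fix the additive indeterminacy
        # (possible exactly when the assessor-object graph is connected).
        constraint = np.zeros((1, n_objects + n_assessors))
        constraint[0, n_objects:] = 1.0
        Xc = np.vstack([X * sw[:, None], constraint])
        yc = np.append(y * sw, 0.0)
        beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        return beta[:n_objects], beta[n_objects:]   # calibrated values, assessor biases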
The popular Alternating Least Squares (ALS) algorithm for tensor decomposition is efficient and easy to implement, but often converges to poor local optima---particularly when the weights of the factors are non-uniform. We propose a modification of the ALS approach that is as efficient as standard ALS, but provably recovers the true factors with random initialization under standard incoherence assumptions on the factors of the tensor. We demonstrate the significant practical superiority of our approach over traditional ALS for a variety of tasks on synthetic data---including tensor factorization on exact, noisy and over-complete tensors, as well as tensor completion---and for computing word embeddings from a third-order word tri-occurrence tensor.
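For reference, a minimal sketch of the standard ALS baseline for a rank-r CP decomposition of a third-order tensor is shown below (plain numpy, helper names ours); the proposed modification that makes random initialization provably succeed is not reproduced here.

    import numpy as np

    def unfold(T, mode):
        # Matricize T along the given mode (row-major column ordering).
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def khatri_rao(U, V):
        # Column-wise Kronecker product, matching the unfolding convention above.
        r = U.shape[1]
        return np.einsum('ir,jr->ijr', U, V).reshape(-1, r)

    def als(T, rank, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        A, B, C = (rng.standard_normal((n, rank)) for n in T.shape)
        for _ in range(n_iter):
            A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
            B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
            C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
        return A, B, C   # T is approximated by sum_r A[:, r] outer B[:, r] outer C[:, r]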
A multilayer network depicts different types of interactions among the same set of nodes. For example, protease networks consist of five to seven layers, where different layers represent distinct types of experimentally confirmed molecular interactions among proteins. In a multilayer protease network, the co-expression layer is obtained through meta-analysis of transcriptomic data from various sources and platforms. While in some studies the co-expression layer is itself represented as a multilayer network, a fundamental problem is how to obtain a single-layer network from the corresponding multilayer network. This process is called multilayer network aggregation. In this work, we propose a maximum a posteriori estimation-based algorithm for multilayer network aggregation. The method aggregates a weighted multilayer network while conserving the core information of the layers. We evaluate the method on an unweighted friendship network and a multilayer gene co-expression network. We compare the aggregated gene co-expression network with a network obtained from conflated datasets and a network obtained from averaged weights. The Von Neumann entropy is adopted to compare the mixedness of the three networks and, together with other network measurements, shows the effectiveness of the proposed method.
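For concreteness, the Von Neumann entropy used for this comparison can be computed from a single-layer weighted network by treating the rescaled graph Laplacian as a density matrix (one common convention); the sketch below assumes a symmetric non-negative adjacency matrix and illustrates only this comparison metric, not the MAP aggregation itself.

    import numpy as np

    def von_neumann_entropy(W):
        # W: symmetric non-negative weighted adjacency matrix.
        L = np.diag(W.sum(axis=1)) - W      # combinatorial graph Laplacian
        rho = L / np.trace(L)               # trace-one density-matrix analogue
        eig = np.linalg.eigvalsh(rho)
        eig = eig[eig > 1e-12]              # convention: 0 * log 0 = 0
        return float(-(eig * np.log(eig)).sum())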
The complexity underlying real-world systems implies that standard statistical hypothesis testing methods may not be adequate for these peculiar applications. Specifically, we show that the likelihood-ratio test's null distribution needs to be modified to accommodate the complexity found in multi-edge network data. When working with independent observations, the p-values of likelihood-ratio tests are approximated using a $\chi^2$ distribution. However, such an approximation should not be used when dealing with multi-edge network data. This type of data is characterized by multiple correlations and competitions that make the standard approximation unsuitable. We address this problem by providing a better approximation of the likelihood-ratio test null distribution through a Beta distribution. Finally, we empirically show that, even for a small multi-edge network, the standard $\chi^2$ approximation provides erroneous results, while the proposed Beta approximation yields the correct p-value estimation.
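Schematically, the two approximations can be compared as follows; the Beta fit below uses Monte Carlo draws of the statistic under the null model and a crude rescaling to (0, 1), and is only a stand-in for the specific parameterization derived in the paper (helper name and rescaling are ours).

    import numpy as np
    from scipy import stats

    def lr_pvalues(lr_obs, lr_null, df):
        # lr_obs:  observed likelihood-ratio statistic, -2 * log(Lambda)
        # lr_null: statistics simulated under the null model
        # df:      nominal degrees of freedom of the test
        p_chi2 = stats.chi2.sf(lr_obs, df)                  # standard asymptotic approximation
        scale = 1.05 * lr_null.max()                        # map the null draws into (0, 1)
        x = np.clip(lr_null / scale, 1e-12, 1 - 1e-12)
        a, b, _, _ = stats.beta.fit(x, floc=0, fscale=1)    # fit Beta shape parameters
        p_beta = stats.beta.sf(lr_obs / scale, a, b)
        return p_chi2, p_beta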
We consider settings in which the data of interest correspond to pairs of ordered times, e.g., the birth times of the first and second child, the times at which a new user creates an account and makes the first purchase on a website, and the entry and survival times of patients in a clinical trial. In these settings, the two times are not independent (the second occurs after the first), yet it is still of interest to determine whether there exists significant dependence {\em beyond} their ordering in time. We refer to this notion as quasi-(in)dependence. For instance, in a clinical trial, to avoid biased selection, we might wish to verify that recruitment times are quasi-independent of survival times, where dependencies might arise due to seasonal effects. In this paper, we propose a nonparametric statistical test of quasi-independence. Our test considers a potentially infinite space of alternatives, making it suitable for complex data where the nature of the possible quasi-dependence is not known in advance. Standard parametric approaches are recovered as special cases, such as the classical conditional Kendall's tau and log-rank tests. The tests apply in the right-censored setting: an essential feature in clinical trials, where patients can withdraw from the study. We provide an asymptotic analysis of our test statistic, and demonstrate in experiments that our test obtains better power than existing approaches, while being more computationally efficient.
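As a point of reference for the classical special case mentioned above, a minimal (uncensored, O(n^2)) sketch of a conditional Kendall's tau for such ordered pairs could look like the following; the function name is ours, and the kernel test proposed in the paper is strictly more general.

    import numpy as np

    def conditional_kendall_tau(t, y):
        # t: first (entry) times, y: second (event) times, with t <= y elementwise.
        # Only pairs comparable under the ordering constraint contribute,
        # i.e. pairs (i, j) with max(t_i, t_j) <= min(y_i, y_j).
        n, num, den = len(t), 0, 0
        for i in range(n):
            for j in range(i + 1, n):
                if max(t[i], t[j]) <= min(y[i], y[j]):
                    den += 1
                    num += np.sign((t[i] - t[j]) * (y[i] - y[j]))
        return num / den if den else float('nan')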