When both the difference between two quantities and their individual values can be measured or computationally predicted, multiple quantities can be determined from measurements or predictions of selected individual quantities and selected pairwise differences. These measurements and predictions form a network connecting the quantities through their differences. Here, I analyze the optimization of such networks, where the trace ($A$-optimal), the largest eigenvalue ($E$-optimal), or the determinant ($D$-optimal) of the covariance matrix associated with the estimated quantities is minimized with respect to the allocation of the measurement (or computational) cost to different measurements (or predictions). My statistical analysis of the performance of such optimal measurement networks -- based on large sets of simulated data -- suggests that they substantially accelerate the determination of the quantities, and that they may be useful in applications such as the computational prediction of binding free energies of candidate drug molecules.
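To make the setup concrete, the following sketch (illustrative only, not the paper's code) builds the Fisher information matrix for a small difference network and minimizes the trace of the resulting covariance matrix ($A$-optimality) over the cost allocation. The design vectors, per-unit-cost noise scales, and total budget are all assumed for illustration.

```python
# Minimal sketch of A-optimal cost allocation on a small measurement network.
# Assumes each measurement's variance scales as s_i^2 / c_i, where c_i is the
# cost allocated to it; the design and noise scales below are illustrative.
import numpy as np
from scipy.optimize import minimize

# Design vectors: rows are measurements of three quantities.
A = np.array([
    [1.0,  0.0,  0.0],    # measure quantity 0 directly
    [0.0,  1.0,  0.0],    # measure quantity 1 directly
    [1.0, -1.0,  0.0],    # difference 0 - 1
    [0.0,  1.0, -1.0],    # difference 1 - 2
    [1.0,  0.0, -1.0],    # difference 0 - 2
])
s2 = np.array([1.0, 1.0, 0.5, 0.5, 0.5])   # per-unit-cost variances (assumed)
total_cost = 100.0

def cov_trace(c):
    # Fisher information: sum_i (c_i / s_i^2) a_i a_i^T
    F = (A.T * (c / s2)) @ A
    return np.trace(np.linalg.inv(F))       # A-optimality criterion

cons = ({'type': 'eq', 'fun': lambda c: c.sum() - total_cost},)
bnds = [(1e-6, None)] * len(s2)
c0 = np.full(len(s2), total_cost / len(s2))
res = minimize(cov_trace, c0, bounds=bnds, constraints=cons)
print("optimal cost allocation:", np.round(res.x, 2))
print("trace of covariance:", cov_trace(res.x))
```

Replacing the trace with the largest eigenvalue or the (log-)determinant of the inverse Fisher information gives the corresponding $E$- and $D$-optimal allocations.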
Alchemical binding free energy (BFE) calculations offer an efficient and thermodynamically rigorous approach to in silico binding affinity predictions. As a result of decades of methodological improvements and recent advances in computer technology, alchemical BFE calculations are now widely used in drug discovery research. They help guide the prioritization of candidate drug molecules by predicting their binding affinities for a biomolecular target of interest (and potentially selectivity against undesirable anti-targets). Statistical variance associated with such calculations, however, may undermine the reliability of their predictions, introducing uncertainty both in ranking candidate molecules and in benchmarking their predictive accuracy. Here, we present a computational method that substantially improves the statistical precision of BFE calculations for a set of ligands binding to a common receptor by dynamically allocating computational resources to different BFE calculations according to an optimality objective established in a previous work from our group and extended in this work. Our method, termed Network Binding Free Energy (NetBFE), performs adaptive BFE calculations in iterations, re-optimizing the allocations in each iteration based on the statistical variances estimated from previous iterations. Using examples of NetBFE calculations for congeneric ligand series binding to proteins, we demonstrate that NetBFE approaches the optimal allocation in a small number ($\leq 5$) of iterations and that NetBFE reduces the statistical variance in the binding free energy estimates by approximately a factor of two when compared to a previously published and widely used allocation method at the same total computational cost.
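A toy illustration of the adaptive-iteration idea follows (a hypothetical stand-in, not the NetBFE implementation): per-edge variances are re-estimated after each batch of sampling and the next batch is re-allocated accordingly. For brevity it uses a simple proportional-to-standard-deviation rule for independent edge estimates rather than the full network optimality criterion described above; the noise levels and budgets are assumed.

```python
# Toy sketch of adaptive re-allocation across BFE "edges" (ligand pairs).
import numpy as np

rng = np.random.default_rng(1)
true_sigma = np.array([0.5, 1.0, 2.0, 1.5])        # hypothetical per-edge noise
samples = [list(rng.normal(0.0, s, 5)) for s in true_sigma]   # pilot batch

batch_budget = 40
for it in range(5):                                # converges in <= 5 iterations
    sigma_hat = np.array([np.std(s, ddof=1) for s in samples])
    alloc = batch_budget * sigma_hat / sigma_hat.sum()        # more noise, more cost
    for k, n_new in enumerate(np.maximum(np.round(alloc).astype(int), 1)):
        samples[k].extend(rng.normal(0.0, true_sigma[k], n_new))
    var_of_means = sum(np.var(s, ddof=1) / len(s) for s in samples)
    print(f"iteration {it + 1}: alloc = {np.round(alloc)}  "
          f"total variance of edge means = {var_of_means:.4f}")
```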
In many applications it is important to know whether the amount of fluctuation in a series of observations changes over time. In this article, we investigate different tests for detecting changes in the scale of mean-stationary time series. The classical approach, based on the CUSUM test applied to the squared centered observations, is very vulnerable to outliers and impractical for heavy-tailed data, which leads us to consider test statistics based on alternative, less outlier-sensitive scale estimators. It turns out that tests based on Gini's mean difference (the average of all pairwise distances) or generalized Qn estimators (sample quantiles of all pairwise distances) are very suitable candidates. They improve upon the classical test not only under heavy tails or in the presence of outliers, but also under normality. An explanation for this at first counterintuitive result is that the corresponding long-run variance estimates are less affected by a scale change than in the case of the sample-variance-based test. We use recent results on the process convergence of U-statistics and U-quantiles for dependent sequences to derive the limiting distribution of the test statistics and propose estimators for the long-run variance. We perform a simulation study to investigate the finite-sample behavior of the tests and their power. Furthermore, we demonstrate the applicability of the new change-point detection methods on two real-life data examples from hydrology and finance.
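As a rough illustration of the idea (simplified; it omits the long-run variance normalization that the actual tests require), the sketch below compares Gini's mean difference on the first $k$ observations with that of the full sample in a CUSUM-like fashion.

```python
# Illustrative CUSUM-type comparison of Gini's mean difference on growing
# prefixes versus the full sample (no long-run variance normalization).
import numpy as np

def gini_mean_difference(x):
    """Average absolute difference over all pairs of observations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))

def cusum_scale_statistic(x, min_frac=0.1):
    x = np.asarray(x, dtype=float)
    n = len(x)
    g_full = gini_mean_difference(x)
    vals = [(k / n) * abs(gini_mean_difference(x[:k]) - g_full)
            for k in range(int(min_frac * n), n)]
    return np.sqrt(n) * max(vals)

rng = np.random.default_rng(0)
no_change = rng.standard_normal(200)
with_change = np.concatenate([rng.standard_normal(100),
                              3.0 * rng.standard_normal(100)])
print("no scale change:    ", round(cusum_scale_statistic(no_change), 2))
print("scale change at 100:", round(cusum_scale_statistic(with_change), 2))
```

A large value of the statistic on the second series, relative to the first, signals the change in scale at the midpoint.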
Network similarity measures quantify how and when two networks are symmetrically related. These include measures of statistical association, such as pairwise distances or other correlation measures between networks or between the layers of a multiplex network, but such measures can neither directly unveil whether hidden confounding network factors are present nor establish whether an observed correlation is underpinned by a causal relation. In this work we extend this pairwise conceptual framework to triplets of networks and quantify how and when a network is related to a second network directly or via the indirect mediation of, or interaction with, a third network. Accordingly, we develop a simple and intuitive set-theoretic approach to quantify mediation and suppression between networks. We validate our theory with synthetic models and further apply it to triplets of real-world networks, unveiling mediation and suppression effects that emerge when considering different modes of interaction in online social networks and different routes of information processing in the brain.
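A minimal illustration of the set-theoretic viewpoint, treating networks as edge sets and splitting the overlap of two networks into the part shared with a third network and the remainder; these quantities are illustrative only and are not the specific mediation and suppression measures defined in the paper.

```python
# Treat three networks X, Y, Z as sets of undirected edges and decompose the
# X-Y overlap according to whether it is also present in Z.
def edge_set(edges):
    return {frozenset(e) for e in edges}                 # undirected edges

X = edge_set([(1, 2), (2, 3), (3, 4), (4, 5)])
Y = edge_set([(1, 2), (2, 3), (4, 5), (5, 6)])
Z = edge_set([(2, 3), (4, 5), (6, 7)])

jaccard_XY = len(X & Y) / len(X | Y)                     # pairwise association
shared_via_Z = len(X & Y & Z) / len(X & Y)               # overlap also in Z
shared_not_Z = len((X & Y) - Z) / len(X & Y)             # overlap outside Z

print(f"Jaccard(X, Y)          = {jaccard_XY:.2f}")
print(f"share of X∩Y inside Z  = {shared_via_Z:.2f}")
print(f"share of X∩Y outside Z = {shared_not_Z:.2f}")
```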
The complexity underlying real-world systems implies that standard statistical hypothesis testing methods may not be adequate for such applications. Specifically, we show that the null distribution of the likelihood-ratio test needs to be modified to accommodate the complexity found in multi-edge network data. When working with independent observations, the p-values of likelihood-ratio tests are approximated using a $\chi^2$ distribution. However, such an approximation should not be used when dealing with multi-edge network data. This type of data is characterized by multiple correlations and competitions that make the standard approximation unsuitable. We address this problem by approximating the null distribution of the likelihood-ratio test with a Beta distribution. Finally, we empirically show that even for a small multi-edge network the standard $\chi^2$ approximation provides erroneous results, while the proposed Beta approximation yields correct p-value estimates.
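The sketch below illustrates the kind of comparison involved, using a multinomial allocation of edges over node pairs as a stand-in null model for a small multi-edge network: the Monte-Carlo null distribution of the likelihood-ratio statistic is compared against the standard $\chi^2$ approximation and against a Beta distribution fitted to the corresponding Wilks lambda. The model, sample sizes, and Beta parameterization here are assumptions for illustration and may differ from those used in the paper.

```python
# Compare chi^2 and Beta approximations of a likelihood-ratio test null
# distribution for multinomially distributed edge counts (toy stand-in).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
K, m = 6, 30                       # 6 node pairs, 30 edges (small network)
p0 = np.full(K, 1.0 / K)           # null: edges placed uniformly at random

def lr_statistic(counts):
    expected = m * p0
    mask = counts > 0
    return 2.0 * np.sum(counts[mask] * np.log(counts[mask] / expected[mask]))

# Null distribution by simulation
null_counts = rng.multinomial(m, p0, size=20000)
null_stats = np.array([lr_statistic(c) for c in null_counts])

obs = rng.multinomial(m, [0.3, 0.25, 0.15, 0.1, 0.1, 0.1])   # an "observed" network
g_obs = lr_statistic(obs)

# Standard chi^2 approximation (K - 1 degrees of freedom)
p_chi2 = stats.chi2.sf(g_obs, df=K - 1)
# Beta approximation fitted to the Wilks lambda exp(-G/2) in (0, 1)
lam_null = np.clip(np.exp(-null_stats / 2.0), 1e-12, 1 - 1e-12)
a, b, loc, scale = stats.beta.fit(lam_null, floc=0, fscale=1)
p_beta = stats.beta.cdf(np.exp(-g_obs / 2.0), a, b)          # small lambda = extreme
p_mc = np.mean(null_stats >= g_obs)                          # Monte-Carlo reference

print(f"chi^2 p-value: {p_chi2:.4f}  Beta p-value: {p_beta:.4f}  Monte-Carlo: {p_mc:.4f}")
```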
The normal distribution is widely used as a universal probability distribution; however, we found that it does not agree well with data from real-life dynamical systems. We collected and analyzed representative naturally occurring data series (e.g., Earth-environment records, sunspots, brain waves, electrocardiograms, classic chaotic systems, and social activities). We find that the probability density functions (PDFs) of first- or higher-order differences of these datasets are consistently fat-tailed, bell-shaped curves, and that their associated cumulative distribution functions (CDFs) are consistently S-shaped when compared to the near-straight line of the normal-distribution CDF. We demonstrate that this profile is not due to numerical or measurement error, and that the t-distribution is a good approximation. This kind of PDF/CDF appears to be a universal phenomenon for independent time- and space-series data, which should prompt researchers to reconsider some hypotheses underlying stochastic dynamical models such as the Wiener process, and therefore merits further investigation.
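The following sketch shows the kind of check described above, using a synthetic heavy-tailed series as a stand-in for the natural datasets (which are not reproduced here): first-order differences are computed, and normal and Student's t fits of the differences are compared.

```python
# Compare normal and Student's t fits to first-order differences of a
# synthetic heavy-tailed series (a stand-in for the datasets in the paper).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_t(3, size=5000))   # toy heavy-tailed series
diffs = np.diff(series)                            # first-order differences

mu, sd = stats.norm.fit(diffs)
df_t, loc_t, scale_t = stats.t.fit(diffs)

# Log-likelihoods: the heavier-tailed t fit should win for data like these
ll_norm = stats.norm.logpdf(diffs, mu, sd).sum()
ll_t = stats.t.logpdf(diffs, df_t, loc_t, scale_t).sum()
print(f"normal log-likelihood: {ll_norm:.1f}")
print(f"t log-likelihood:      {ll_t:.1f}  (fitted df = {df_t:.2f})")

# Positive excess kurtosis indicates fatter tails than the normal distribution
print(f"excess kurtosis of differences: {stats.kurtosis(diffs):.2f}")
```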