Making binary decisions is a common data-analytical task in scientific research and industrial applications. In data science, there are two related but distinct strategies: hypothesis testing and binary classification. In practice, how to choose between the two can be unclear and confusing. Here we summarize the key distinctions between these two strategies in three aspects and list five practical guidelines to help data analysts choose the appropriate strategy for a specific analysis need. We demonstrate the use of these guidelines in a cancer driver gene prediction example.
Persistent homology is a vital tool for topological data analysis. Previous work has developed statistical estimators for characteristics of collections of persistence diagrams. However, tools that provide statistical inference for observations that are persistence diagrams remain limited. In particular, there is a need for tests that can assess the strength of evidence against the claim that two samples arise from the same population or process. We propose randomization-style null hypothesis significance tests (NHST) for this situation. The test is based on a loss function that aggregates the pairwise distances between the elements of one sample and all the elements of the other sample. We use this method to analyze a range of simulated and experimental data. Through these examples we empirically explore the power of the test. Our results show that the randomization-style NHST based on pairwise distances can distinguish between samples from different processes, which suggests that its use for hypothesis tests on persistence diagrams is reasonable. We also demonstrate its application to a real fMRI dataset from patients with ADHD.
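For concreteness, a minimal sketch of such a randomization test is given below. It assumes the caller supplies a distance between persistence diagrams (e.g., a bottleneck or Wasserstein distance from a library such as persim); the loss and permutation scheme shown here are a plain reading of the description above, not necessarily the exact implementation used in the paper.

```python
import numpy as np

def cross_sample_loss(sample_a, sample_b, dist):
    # Sum of pairwise distances between every diagram in one sample
    # and every diagram in the other sample.
    return sum(dist(a, b) for a in sample_a for b in sample_b)

def randomization_test(sample_a, sample_b, dist, n_perm=1000, seed=None):
    # Randomization-style NHST: repeatedly shuffle the group labels and
    # recompute the loss; a large observed loss relative to the shuffled
    # losses indicates the samples come from different processes.
    rng = np.random.default_rng(seed)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    observed = cross_sample_loss(sample_a, sample_b, dist)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        group_a = [pooled[i] for i in perm[:n_a]]
        group_b = [pooled[i] for i in perm[n_a:]]
        if cross_sample_loss(group_a, group_b, dist) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # randomization p-value
```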
We describe the utility of point processes for modeling failure rates, and the most common point process used for this purpose, the Poisson point process. Next, we describe the uniformly most powerful test for comparing the rates of two Poisson point processes in a one-sided setting (henceforth referred to as the rate test). A common argument against using this test is that real-world data rarely follow a Poisson point process. We thus investigate what happens when the distributional assumptions of tests like these are violated and the test is applied anyway. We find a non-pathological example (applying the rate test to a Compound Poisson distribution with Binomial compounding) where violating the distributional assumptions of the rate test makes it perform better (lower error rates). We also find that if we replace the distribution of the test statistic under the null hypothesis with any other arbitrary distribution, the performance of the test (described in terms of the false-negative-rate to false-positive-rate trade-off) remains exactly the same. Next, we compare the performance of the rate test to a version of the Wald test customized to the Negative Binomial point process and find it to perform very similarly while being much more general and versatile. Finally, we discuss applications to Microsoft Azure. The code for all experiments performed is open source and linked in the introduction.
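As a concrete illustration, one common formulation of this exact one-sided comparison conditions on the total count: if N1 ~ Poisson(lambda1 * t1) and N2 ~ Poisson(lambda2 * t2), then given N1 + N2 = n, N1 is Binomial(n, t1 / (t1 + t2)) under equal rates, so the rate comparison reduces to an exact binomial test. The sketch below (with the hypothetical helper name poisson_rate_test) follows that standard formulation; the exact variant used in the paper may differ in details.

```python
from scipy.stats import binomtest

def poisson_rate_test(n1, t1, n2, t2, alternative="greater"):
    # Exact one-sided test of H0: lambda1 <= lambda2 vs H1: lambda1 > lambda2,
    # where n1, n2 are event counts observed over exposures t1, t2.
    # Conditional on the total count, n1 ~ Binomial(n1 + n2, t1 / (t1 + t2))
    # under equal rates, so the comparison reduces to a binomial test.
    p0 = t1 / (t1 + t2)
    return binomtest(n1, n1 + n2, p0, alternative=alternative).pvalue

# Example: 18 failures over 100 machine-hours vs. 8 failures over 100 machine-hours.
print(poisson_rate_test(18, 100.0, 8, 100.0))
```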
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov--Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum, over the space of trees, of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved using a Ford--Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high-quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton--Watson related processes.
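The graph construction that encodes the supremum over context trees is specific to the paper; the toy sketch below only illustrates the final computational step, i.e., solving a max-flow problem in polynomial time with an augmenting-path (Ford--Fulkerson-style) solver, here via networkx. The graph, capacities, and node names are invented for illustration.

```python
import networkx as nx
from networkx.algorithms.flow import edmonds_karp  # augmenting-path max-flow solver

# Toy capacitated graph standing in for the graph derived from the two samples.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3.0)
G.add_edge("s", "b", capacity=2.0)
G.add_edge("a", "b", capacity=1.0)
G.add_edge("a", "t", capacity=2.0)
G.add_edge("b", "t", capacity=3.0)

flow_value, flow_dict = nx.maximum_flow(G, "s", "t", flow_func=edmonds_karp)
print(flow_value)  # 5.0 for this toy graph
```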
We consider the problem of distributed binary hypothesis testing of two sequences that are generated by an i.i.d. doubly-binary symmetric source. Each sequence is observed by a different terminal. The two hypotheses correspond to different levels of correlation between the two source components, i.e., to different crossover probabilities between them. The terminals communicate with a decision function via rate-limited noiseless links. We analyze the tradeoff between the exponential decay rates of the two error probabilities associated with the hypothesis test and the communication rates. We first consider the side-information setting in which one encoder is allowed to send its full sequence. For this setting, previous work exploits the fact that a decoding error of the source does not necessarily lead to an erroneous decision on the hypothesis. We provide improved achievability results by carrying out a tighter analysis of the effect of binning errors; the results are also more complete, as they cover the full exponent tradeoff and all possible correlations. We then turn to the setting of symmetric rates, for which we utilize Korner-Marton coding to generalize the results, with little degradation relative to the performance under a one-sided constraint (the side-information setting).
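For reference, one standard formalization of the doubly-binary symmetric source setup described above (the notation below is ours, chosen for concreteness):

```latex
X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(1/2), \qquad
Y_i = X_i \oplus Z_i, \qquad
Z_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(p), \quad i = 1,\dots,n,
\qquad
H_0 : p = p_0 \quad \text{vs.} \quad H_1 : p = p_1 .
```

Terminal 1 observes X^n and terminal 2 observes Y^n; each sends a rate-limited message to the detector, and the quantities traded off are the communication rates and the exponents -\frac{1}{n}\log of the two error probabilities (deciding H_1 under H_0 and H_0 under H_1).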
The classical binary hypothesis testing problem is revisited. We observe that when one of the hypotheses is composite, there is an inherent difficulty in defining an optimality criterion that is both informative and well justified. For testing in the simple normal location problem (that is, testing for the mean of multivariate Gaussians), we overcome this difficulty as follows. In this problem there exists a natural hardness order between parameters: for different parameters, the error-probability curves (when the parameter is known) are either identical or one dominates the other. We can thus define minimax performance as the worst case among parameters that are below some hardness level. Fortunately, there exists a universal minimax test, in the sense that it is minimax for all hardness levels simultaneously. Under this criterion we also find the optimal test for composite hypothesis testing with training data. The criterion extends to the wide class of locally asymptotically normal models, in an asymptotic sense where the approximation of the error probabilities is additive. Since we have the asymptotically optimal tests for composite hypothesis testing both with and without training data, we quantify the loss due to universality and the gain from training data for these models.
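To make the hardness order concrete (our notation, stated only as an illustration of why such an order exists): in the normal location problem one tests

```latex
H_0 : X \sim \mathcal{N}(0, I_d)
\qquad \text{vs.} \qquad
H_1 : X \sim \mathcal{N}(\theta, I_d), \quad \theta \in \Theta .
```

When \theta is known, the optimal likelihood-ratio test at false-alarm level \alpha has miss probability

```latex
P_{\mathrm{miss}}(\theta) \;=\; \Phi\!\bigl(\Phi^{-1}(1-\alpha) - \lVert\theta\rVert\bigr),
```

so the error-probability curve depends on \theta only through \lVert\theta\rVert, and parameters with smaller norm are uniformly harder; this is the kind of dominance that the hardness order above refers to.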