Change-point detection has recently regained attention as a tool for analyzing array or sequencing data for copy number variation (CNV) detection. In such applications, the true signals are typically very short and buried in a long data sequence, which makes it challenging to identify the variations efficiently and accurately. In this article, we propose a new change-point detection method, a backward procedure, which is not only fast and simple enough to handle high-dimensional data but also performs very well at detecting short signals. Although motivated by CNV detection, the backward procedure is generally applicable to assorted change-point problems that arise in a variety of scientific applications. Both simulated and real CNV data illustrate that backward detection has clear advantages over competing methods, especially when the true signal is short.
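The abstract does not spell out the backward procedure, so the following is only a minimal sketch of one common bottom-up reading of the idea: start with every observation as its own segment and repeatedly merge the most similar adjacent pair until the smallest remaining gap exceeds a threshold. The function name `backward_merge`, the size-weighted squared mean difference used as the gap, and the `threshold` parameter are all assumptions made for illustration, not the authors' definitions.

```python
def backward_merge(y, threshold):
    """Bottom-up segmentation sketch: merge the most similar neighbours first.

    Assumption: similarity is the size-weighted squared difference of segment
    means, i.e. the increase in within-segment sum of squares caused by a merge.
    """
    # Each segment is stored as (start index, length, sum of values).
    segs = [(i, 1, float(v)) for i, v in enumerate(y)]

    def gap(a, b):
        na, nb = a[1], b[1]
        return (na * nb) / (na + nb) * (a[2] / na - b[2] / nb) ** 2

    while len(segs) > 1:
        gaps = [gap(segs[i], segs[i + 1]) for i in range(len(segs) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > threshold:
            break  # every remaining boundary is kept as a change point
        a, b = segs[i], segs[i + 1]
        segs[i:i + 2] = [(a[0], a[1] + b[1], a[2] + b[2])]

    # Change points are the start indices of all segments except the first.
    return [s[0] for s in segs[1:]]


# A short signal (3 points) inside a long flat sequence.
signal = [0.0] * 20 + [3.0] * 3 + [0.0] * 20
change_points = backward_merge(signal, threshold=5.0)
```

Because merging proceeds from the finest resolution upward, a short segment is only absorbed once its gap to a neighbour becomes the smallest one left, which is one intuition for why a bottom-up scheme can preserve short signals.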
Change point detection algorithms have numerous applications in fields of scientific and economic importance. We consider the problem of change point detection on compositional multivariate data (each sample is a probability mass function), which is a practically important sub-class of general multivariate data. While change point detection is well studied in the univariate setting, and there are few viable implementations for general multivariate data, the existing methods do not perform well on compositional data. In this paper, we propose a parametric approach for change point detection in compositional data. Moreover, using simple transformations of the data, we extend our approach to handle general multivariate data. Experimentally, we show that our method performs significantly better on compositional data and is competitive on general data compared with the available state-of-the-art implementations.
Dynamic and temporal graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address two main challenges associated with this problem: I) how to compare graph snapshots across time, II) how to capture temporal dependencies. To solve the above challenges, we propose Laplacian Anomaly Detection (LAD) which uses the spectrum of the Laplacian matrix of the graph structure at each snapshot to obtain low dimensional embeddings. LAD explicitly models short term and long term dependencies by applying two sliding windows. In synthetic experiments, LAD outperforms the state-of-the-art method. We also evaluate our method on three real dynamic networks: UCI message network, US senate co-sponsorship network and Canadian bill voting network. In all three datasets, we demonstrate that our method can more effectively identify anomalous time points according to significant real world events.
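A rough sketch of the idea described above, assuming NumPy is available: each snapshot is summarised by its largest Laplacian eigenvalues, and each new summary is compared with the "typical" summary of a short and a long window of preceding snapshots. The function names, the use of the principal singular vector as the typical behaviour, the cosine-based score, and taking the maximum over the two windows are illustrative assumptions, not necessarily the exact choices made in LAD.

```python
import numpy as np


def laplacian_spectrum(adj, k):
    """Top-k eigenvalues of the unnormalised Laplacian, scaled to unit length."""
    lap = np.diag(adj.sum(axis=1)) - adj
    vals = np.linalg.eigvalsh(lap)[::-1][:k]  # eigvalsh returns ascending order
    norm = np.linalg.norm(vals)
    return vals / norm if norm > 0 else vals


def lad_scores(snapshots, k, short, long):
    """Anomaly score per snapshot; the first `long` snapshots are scored 0."""
    sigs = np.array([laplacian_spectrum(a, k) for a in snapshots])
    scores = np.zeros(len(sigs))
    for t in range(long, len(sigs)):
        z = []
        for w in (short, long):
            # Principal left singular vector of the window = typical spectrum.
            u, _, _ = np.linalg.svd(sigs[t - w:t].T, full_matrices=False)
            typical = np.abs(u[:, 0])  # spectra are non-negative; fix the sign
            z.append(1.0 - float(typical @ sigs[t]))  # 1 - cosine similarity
        scores[t] = max(z)
    return scores


# Toy dynamic graph: a ring for 8 steps, then a star for 6 steps.
n = 10
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1.0
star = np.zeros((n, n))
star[0, 1:] = star[1:, 0] = 1.0
snapshots = [ring] * 8 + [star] * 6
scores = lad_scores(snapshots, k=3, short=2, long=4)
```

In this toy example the score is largest at the first star snapshot, because both windows still contain only ring spectra at that point.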
In many modern applications, large-scale sensor networks are used to perform statistical inference tasks. In this paper, we propose Bayesian methods for multiple change-point detection using a sensor network in which a fusion center (FC) can receive a data stream from each sensor. Due to communication limitations, the FC monitors only a subset of the sensors at each time slot. Since the number of change points can be high, we adopt the false discovery rate (FDR) criterion for controlling the rate of false alarms, while minimizing the average detection delay (ADD). We propose two Bayesian detection procedures that handle the communication limitations by monitoring the subset of the sensors with the highest posterior probabilities of change points having occurred. This monitoring policy aims to minimize the delay between the occurrence of each change point and its declaration using the corresponding posterior probabilities. One of the proposed procedures is more conservative than the second one in terms of having lower FDR at the expense of higher ADD. It is analytically shown that both procedures control the FDR under a specified tolerated level and are also scalable in the sense that they attain an ADD that does not increase asymptotically with the number of sensors. In addition, it is demonstrated that the proposed detection procedures are useful for trading off between reduced ADD and reduced average number of observations drawn until discovery. Numerical simulations are conducted for validating the analytical results and for demonstrating the properties of the proposed procedures.
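The two ingredients mentioned above, monitoring the sensors with the highest posterior probabilities and controlling the FDR, can be illustrated with a small sketch. It uses a generic Bayesian FDR rule (declare the largest set whose average posterior probability of "no change" stays below a level `alpha`), which is an assumption made here for illustration and is not claimed to be either of the paper's procedures. The function names and the example posteriors are hypothetical.

```python
def select_sensors(posteriors, budget):
    """Indices of the `budget` sensors with the highest posterior of a change."""
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    return order[:budget]


def declare_changes(posteriors, alpha):
    """Largest top-ranked set whose estimated FDR does not exceed alpha.

    The estimated FDR of a set is the average of (1 - posterior) over the set.
    With posteriors sorted in decreasing order this average is non-decreasing,
    so stopping at the first violation yields the largest admissible set.
    """
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    declared, false_mass = [], 0.0
    for i in order:
        false_mass += 1.0 - posteriors[i]
        if false_mass / (len(declared) + 1) > alpha:
            break
        declared.append(i)
    return declared


# Hypothetical posterior probabilities that a change has occurred, per sensor.
posteriors = [0.99, 0.20, 0.95, 0.60, 0.90]
monitored = select_sensors(posteriors, budget=2)
declared = declare_changes(posteriors, alpha=0.1)
```

Here sensors 0, 2 and 4 are declared: adding sensor 3 would raise the average posterior probability of a false declaration to 0.14, above the tolerated level of 0.1.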
Several statistical approaches based on reproducing kernels have been proposed to detect abrupt changes arising in the full distribution of the observations and not only in the mean or variance. Some of these approaches enjoy good statistical properties (oracle inequality, \ldots). Nonetheless, they have a high computational cost both in terms of time and memory. This makes their application difficult even for small and medium sample sizes ($n < 10^4$). This computational issue is addressed by first describing a new efficient and exact algorithm for kernel multiple change-point detection with an improved worst-case complexity that is quadratic in time and linear in space. It allows dealing with medium size signals (up to $n \approx 10^5$). Second, a faster but approximate algorithm is described. It is based on a low-rank approximation to the Gram matrix and is linear in time and space. This approximation algorithm can be applied to large-scale signals ($n \geq 10^6$). These exact and approximation algorithms have been implemented in \texttt{R} and \texttt{C} for various kernels. The computational and statistical performances of these new algorithms have been assessed through empirical experiments. The runtime of the new algorithms is observed to be faster than that of other considered procedures. Finally, simulations confirmed the higher statistical accuracy of kernel-based approaches to detect changes that are not only in the mean. These simulations also illustrate the flexibility of kernel-based approaches to analyze complex biological profiles made of DNA copy number and allele B frequencies. An R package implementing the approach will be made available on github.
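To make the dynamic-programming idea concrete, here is a small sketch of exact kernel segmentation into a fixed number of segments. It is a simplification under stated assumptions: a Gaussian kernel on scalar data, a known number of segments, and a full table of Gram-matrix prefix sums, which costs quadratic memory and is therefore not the linear-space algorithm the abstract refers to. The function name and parameters are illustrative.

```python
import math


def kernel_segmentation(x, n_segments, bandwidth=1.0):
    """Change points minimising the kernel least-squares cost (sketch)."""
    n = len(x)

    def k(a, b):
        return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

    # prefix[i][j] = sum of k(x[p], x[q]) over p < i and q < j.
    prefix = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            prefix[i + 1][j + 1] = (k(x[i], x[j]) + prefix[i][j + 1]
                                    + prefix[i + 1][j] - prefix[i][j])

    def cost(a, b):
        # Segment [a, b): sum of k(x_i, x_i) minus the mean of the Gram block.
        # For the Gaussian kernel k(x_i, x_i) = 1, so the first term is b - a.
        block = prefix[b][b] - prefix[a][b] - prefix[b][a] + prefix[a][a]
        return (b - a) - block / (b - a)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    arg = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for d in range(1, n_segments + 1):
        for t in range(d, n + 1):
            for s in range(d - 1, t):
                c = best[d - 1][s] + cost(s, t)
                if c < best[d][t]:
                    best[d][t], arg[d][t] = c, s

    # Backtrack the start index of each segment except the first.
    cps, t = [], n
    for d in range(n_segments, 1, -1):
        t = arg[d][t]
        cps.append(t)
    return cps[::-1]


data = [0.0] * 8 + [5.0] * 4 + [0.0] * 8
change_points = kernel_segmentation(data, n_segments=3)
```

With the prefix sums in place each segment cost is evaluated in constant time, so the recursion takes time proportional to the number of segments times $n^2$; swapping in a different kernel only changes the function `k` and the diagonal term in `cost`.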
We consider the detection and localization of change points in the distribution of an offline sequence of observations. Based on a nonparametric framework that uses a similarity graph among observations, we propose new test statistics for the case where at most one change point occurs and generalize them to the multiple change-point setting. The proposed statistics leverage edge weight information in the graphs, exhibiting substantial improvements in testing power and localization accuracy in simulations. We derive the null limiting distribution, provide accurate analytic approximations to control type I error, and establish theoretical guarantees on the power consistency under contiguous alternatives for the one change-point setting, as well as the minimax localization rate. In the multiple change-point setting, the asymptotic correctness of the estimated number and locations of change points is also guaranteed. The methods are illustrated on the MIT proximity network data.
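As background for the similarity-graph framework, the sketch below implements the basic unweighted edge-count idea rather than the weighted statistics proposed above, whose exact form the abstract does not give. A minimum spanning tree is built on the observations, and for each candidate time $t$ the number of tree edges joining an observation before $t$ to one after $t$ is compared with its expectation under random ordering, $|G| \cdot 2t(n-t)/(n(n-1))$. The function names, the scalar data, the `margin` parameter, and the use of a plain ratio instead of a variance-standardised statistic are simplifying assumptions.

```python
def mst_edges(x):
    """Prim's algorithm on absolute differences of scalar observations."""
    n = len(x)
    in_tree = [False] * n
    dist = [float("inf")] * n
    parent = [-1] * n
    dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=dist.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            d = abs(x[u] - x[v])
            if not in_tree[v] and d < dist[v]:
                dist[v], parent[v] = d, u
    return edges


def edge_count_scan(x, margin=3):
    """Return (t, ratio) minimising observed / expected crossing edges."""
    n = len(x)
    edges = mst_edges(x)
    best_t, best_ratio = None, float("inf")
    for t in range(margin, n - margin + 1):
        # An edge crosses t if exactly one endpoint is observed before time t.
        crossing = sum(1 for i, j in edges if (i < t) != (j < t))
        expected = len(edges) * 2.0 * t * (n - t) / (n * (n - 1))
        ratio = crossing / expected
        if ratio < best_ratio:
            best_t, best_ratio = t, ratio
    return best_t, best_ratio


# Two well-separated groups of 15 observations, shuffled within each group.
x = ([((7 * i) % 15) * 0.1 for i in range(15)]
     + [10.0 + ((7 * i) % 15) * 0.1 for i in range(15)])
t_hat, ratio = edge_count_scan(x)
```

Unusually few crossing edges at some $t$ indicate that observations before and after $t$ rarely resemble each other, which is the signal this family of tests builds on.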