No Arabic abstract
Outlier detection methods have become increasingly relevant in recent years due to increased security concerns and because of its vast application to different fields. Recently, Pauwels and Lasserre (2016) noticed that the sublevel sets of the inverse Christoffel function accurately depict the shape of a cloud of data using a sum-of-squares polynomial and can be used to perform outlier detection. In this work, we propose a kernelized variant of the inverse Christoffel function that makes it computationally tractable for data sets with a large number of features. We compare our approach to current methods on 15 different data sets and achieve the best average area under the precision recall curve (AUPRC) score, the best average rank and the lowest root mean square deviation.
The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed Kernel Minimum Regularized Covariance Determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.
Outliers are ubiquitous in modern data sets. Distance-based techniques are a popular non-parametric approach to outlier detection as they require no prior assumptions on the data generating distribution and are simple to implement. Scaling these techniques to massive data sets without sacrificing accuracy is a challenging task. We propose a novel algorithm based on the intuition that outliers have a significant influence on the quality of divergence-based clustering solutions. We propose sensitivity - the worst-case impact of a data point on the clustering objective - as a measure of outlierness. We then prove that influence, a (non-trivial) upper-bound on the sensitivity, can be computed by a simple linear time algorithm. To scale beyond a single machine, we propose a communication efficient distributed algorithm. In an extensive experimental evaluation, we demonstrate the effectiveness and establish the statistical significance of the proposed approach. In particular, it outperforms the most popular distance-based approaches while being several orders of magnitude faster.
We consider functional outlier detection from a geometric perspective, specifically: for functional data sets drawn from a functional manifold which is defined by the datas modes of variation in amplitude and phase. Based on this manifold, we develop a conceptualization of functional outlier detection that is more widely applicable and realistic than previously proposed. Our theoretical and experimental analyses demonstrate several important advantages of this perspective: It considerably improves theoretical understanding and allows to describe and analyse complex functional outlier scenarios consistently and in full generality, by differentiating between structurally anomalous outlier data that are off-manifold and distributionally outlying data that are on-manifold but at its margins. This improves practical feasibility of functional outlier detection: We show that simple manifold learning methods can be used to reliably infer and visualize the geometric structure of functional data sets. We also show that standard outlier detection methods requiring tabular data inputs can be applied to functional data very successfully by simply using their vector-valued representations learned from manifold learning methods as input features. Our experiments on synthetic and real data sets demonstrate that this approach leads to outlier detection performances at least on par with existing functional data-specific methods in a large variety of settings, without the highly specialized, complex methodology and narrow domain of application these methods often entail.
In practical data analysis under noisy environment, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference of the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable to the case where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions by using homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance on both synthetic data and real data experiments.
Spectral features of the empirical moment matrix constitute a resourceful tool for unveiling properties of a cloud of points, among which, density, support and latent structures. It is already well known that the empirical moment matrix encodes a great deal of subtle attributes of the underlying measure. Starting from this object as base of observations we combine ideas from statistics, real algebraic geometry, orthogonal polynomials and approximation theory for opening new insights relevant for Machine Learning (ML) problems with data supported on singular sets. Refined concepts and results from real algebraic geometry and approximation theory are empowering a simple tool (the empirical moment matrix) for the task of solving non-trivial questions in data analysis. We provide (1) theoretical support, (2) numerical experiments and, (3) connections to real world data as a validation of the stamina of the empirical moment matrix approach.