
Random Subspace Learning Approach to High-Dimensional Outliers Detection

Published by: Bohan Liu
Publication date: 2015
Research field: Mathematical Statistics
Paper language: English





We introduce and develop a novel approach to outlier detection based on an adaptation of random subspace learning. Our proposed method handles both high-dimensional low-sample-size and traditional low-dimensional high-sample-size datasets. Essentially, we avoid the computational bottleneck of techniques such as the minimum covariance determinant (MCD) by computing the needed determinants and associated measures in much lower-dimensional subspaces. Both the theoretical and computational development of our approach reveal that it is computationally more efficient than regularized methods in the high-dimensional low-sample-size setting, and it often competes favorably with existing methods in terms of the percentage of correctly detected outliers.
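The core idea is to replace one expensive full-dimensional MCD fit with many cheap determinant-based fits in random low-dimensional feature subspaces and to aggregate the resulting robust distances. The Python sketch below is only a minimal illustration under assumed choices (subspace dimension, number of subspaces, averaging as the aggregation rule, and a simple quantile cutoff); it is not the authors' exact algorithm.

    # Minimal sketch of random-subspace outlier scoring. The subspace size, number
    # of draws, aggregation rule, and cutoff below are illustrative assumptions.
    import numpy as np
    from sklearn.covariance import MinCovDet

    def random_subspace_outlier_scores(X, n_subspaces=100, subspace_dim=5, seed=0):
        """Average robust (squared Mahalanobis) distances over random feature subspaces."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        d = min(subspace_dim, p)
        scores = np.zeros(n)
        for _ in range(n_subspaces):
            features = rng.choice(p, size=d, replace=False)      # random feature subset
            mcd = MinCovDet(random_state=0).fit(X[:, features])  # MCD is cheap in d << p dimensions
            scores += mcd.mahalanobis(X[:, features])            # squared robust distances
        return scores / n_subspaces

    # Toy usage: n = 200 samples in p = 1000 dimensions, with 10 shifted outliers.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 1000))
    X[:10] += 4.0                                                 # contaminate the first 10 rows
    scores = random_subspace_outlier_scores(X)
    flagged = np.where(scores > np.quantile(scores, 0.95))[0]     # simple empirical cutoff
    print(flagged)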


Read also

The subspace approximation problem with outliers, for given $n$ points in $d$ dimensions $x_{1},\ldots,x_{n} \in \mathbb{R}^{d}$, an integer $1 \leq k \leq d$, and an outlier parameter $0 \leq \alpha \leq 1$, is to find a $k$-dimensional linear subspace of $\mathbb{R}^{d}$ that minimizes the sum of squared distances to its nearest $(1-\alpha)n$ points. More generally, the $\ell_{p}$ subspace approximation problem with outliers minimizes the sum of $p$-th powers of distances instead of the sum of squared distances. Even the case of robust PCA is non-trivial, and previous work requires additional assumptions on the input. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the $(1-\alpha)n$ inliers in the optimal solution are promised to lie exactly on a $k$-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard. We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared-distance error of the optimal $k$-dimensional subspace summed over the optimal $(1-\alpha)n$ inliers is at least $\delta$ times its squared error summed over all $n$ points, for some $0 < \delta \leq 1 - \alpha$. With this assumption, we give an efficient algorithm to find a subset of $\mathrm{poly}(k/\epsilon)\,\log(1/\delta)\,\log\log(1/\delta)$ points whose span contains a $k$-dimensional subspace that gives a multiplicative $(1+\epsilon)$-approximation to the optimal solution. The running time of our algorithm is linear in $n$ and $d$. Interestingly, our results hold even when the fraction of outliers $\alpha$ is large, as long as the obvious condition $0 < \delta \leq 1 - \alpha$ is satisfied.
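As a concrete reading of the assumption above, the largest admissible $\delta$ for a given dataset and candidate subspace is simply the ratio of the squared error over the $(1-\alpha)n$ closest points to the squared error over all $n$ points. The sketch below computes that ratio; the candidate subspace (a plain SVD basis used as a stand-in for the optimum) and all parameters are illustrative assumptions, not part of the paper's algorithm.

    # Minimal check of the delta-assumption: inlier squared-error must be at least
    # delta times the total squared-error, with 0 < delta <= 1 - alpha.
    # The candidate basis (plain SVD) and all parameters are illustrative stand-ins.
    import numpy as np

    def delta_ratio(X, basis, alpha):
        """Inlier-to-total squared-error ratio for a k-dimensional basis (d x k)."""
        residual = X - X @ basis @ basis.T                  # component orthogonal to the subspace
        sq_err = np.sum(residual ** 2, axis=1)
        n_inliers = int(np.floor((1 - alpha) * len(X)))
        return np.sum(np.sort(sq_err)[:n_inliers]) / np.sum(sq_err)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    alpha, k = 0.2, 3
    basis = np.linalg.svd(X, full_matrices=False)[2][:k].T  # top-k right singular vectors
    delta = delta_ratio(X, basis, alpha)
    print(delta, 0 < delta <= 1 - alpha)                    # the assumption requires this to hold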
Outliers arise in networks for different reasons, such as the fraudulent behavior of malicious users or faults in measurement instruments, and they can significantly impair network analyses. In addition, real-life networks are likely to be incompletely observed, with links missing due to individual non-response or machine failures. Identifying outliers in the presence of missing links is therefore a crucial problem in network analysis. In this work, we introduce a new algorithm that detects outliers in a network while simultaneously predicting the missing links. The proposed method is statistically sound: we prove that, under fairly general assumptions, our algorithm exactly detects the outliers and achieves the best known error for the prediction of missing links at polynomial computational cost. It is also computationally efficient: we prove sub-linear convergence of our algorithm. We provide a simulation study that demonstrates the good behavior of the algorithm in terms of outlier detection and prediction of missing links. We also illustrate the method with an application in epidemiology and with the analysis of a political Twitter network. The method is freely available as an R package on the Comprehensive R Archive Network.
Tyler Maunu, Gilad Lerman, 2019
We study the problem of robust subspace recovery (RSR) in the presence of adversarial outliers. That is, we seek a subspace that contains a large portion of a dataset when some fraction of the data points are arbitrarily corrupted. We first examine a theoretical estimator that is intractable to calculate and use it to derive information-theoretic bounds of exact recovery. We then propose two tractable estimators: a variant of RANSAC and a simple relaxation of the theoretical estimator. The two estimators are fast to compute and achieve state-of-the-art theoretical performance in a noiseless RSR setting with adversarial outliers. The former estimator achieves better theoretical guarantees in the noiseless case, while the latter estimator is robust to small noise, and its guarantees significantly improve with non-adversarial models of outliers. We give a complete comparison of guarantees for the adversarial RSR problem, as well as a short discussion on the estimation of affine subspaces.
Yunbo Ouyang, Feng Liang, 2017
We propose an empirical Bayes estimator based on a Dirichlet process mixture model for estimating the sparse normalized mean difference, which can be directly applied to high-dimensional linear classification. In theory, we build a bridge connecting the estimation error of the mean difference with the misclassification error, and we provide sufficient conditions for sub-optimal and optimal classifiers. In implementation, a variational Bayes algorithm is developed to compute the posterior efficiently; it can be parallelized to handle the ultra-high-dimensional case.
Mice vocalize in the ultrasonic range during social interactions. These vocalizations are used in neuroscience and clinical studies to tap into complex behaviors and states. The analysis of these ultrasonic vocalizations (USVs) has traditionally been a manual process, which is prone to errors and human bias and does not scale to large analyses. We propose a new method to automatically create a dictionary of USVs based on a two-step spectral clustering approach, in which we split the set of USVs into inlier and outlier data sets. This approach is motivated by the known degradation in performance of sparse subspace clustering in the presence of outliers. We apply spectral clustering to the inlier data set and then find the clusters for the outliers. We propose quantitative and qualitative performance measures to evaluate our method in this setting, where there is no ground truth. Our approach outperforms two baselines based on k-means and spectral clustering on all of the proposed performance measures, showing greater distances between clusters and more variability across clusters.
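A minimal version of this split-then-cluster idea can be sketched as follows. The inlier/outlier split rule (a k-nearest-neighbor distance quantile), the number of clusters, and the nearest-centroid assignment of outliers are assumptions made for illustration; the authors' pipeline operates on USV spectrogram features and may differ in each of these choices.

    # Minimal two-step sketch: split the data into inliers/outliers, run spectral
    # clustering on the inliers only, then assign outliers to the nearest cluster
    # centroid. The split rule and all parameters are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.neighbors import NearestNeighbors

    def two_step_spectral(X, n_clusters=4, outlier_quantile=0.9, n_neighbors=10, seed=0):
        # Step 1: flag points with a large mean distance to their nearest neighbors.
        knn_dist = NearestNeighbors(n_neighbors=n_neighbors).fit(X).kneighbors(X)[0].mean(axis=1)
        is_outlier = knn_dist > np.quantile(knn_dist, outlier_quantile)

        # Step 2: spectral clustering on the inlier set only.
        inlier_labels = SpectralClustering(
            n_clusters=n_clusters, affinity="nearest_neighbors", random_state=seed
        ).fit_predict(X[~is_outlier])

        # Step 3: assign each outlier to the closest inlier-cluster centroid.
        centroids = np.vstack(
            [X[~is_outlier][inlier_labels == c].mean(axis=0) for c in range(n_clusters)]
        )
        labels = np.empty(len(X), dtype=int)
        labels[~is_outlier] = inlier_labels
        X_out = X[is_outlier]
        dists = np.linalg.norm(X_out[:, None, :] - centroids[None, :, :], axis=2)
        labels[is_outlier] = np.argmin(dists, axis=1)
        return labels, is_outlier

    # Toy usage on synthetic 2-D features (a stand-in for USV spectrogram features).
    rng = np.random.default_rng(0)
    centers = [(0, 0), (4, 0), (0, 4), (4, 4)]
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centers]
                  + [rng.uniform(-2, 6, size=(15, 2))])       # a few scattered outliers
    labels, is_outlier = two_step_spectral(X)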