
Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers


Added by Lunjia Hu

Publication date 2020

Field: Informatics Engineering

Research language: English

Authors Lunjia Hu - Omer Reingold


We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $\varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $\varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time $\widetilde{O}(Nd)$. Our results extend recent work on computationally-efficient robust estimation to a more widely applicable incomplete-data setting.
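To make the setting concrete, here is a minimal sketch of the data model described above: an $N \times d$ array in which most entries are missing and a small fraction of rows is corrupted. The estimator shown is a plain coordinate-wise median over the observed entries, which is only a simple baseline and is not the algorithm from the paper; unlike the guarantees stated in the abstract, its Euclidean error can grow with the dimension. The names `coordinatewise_median`, `true_mean`, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def coordinatewise_median(samples):
    """Median of each coordinate over the examples where it is observed.

    samples: (N, d) float array, with np.nan marking a missing entry.
    Returns a length-d array of per-coordinate medians.
    """
    samples = np.asarray(samples, dtype=float)
    return np.nanmedian(samples, axis=0)


# Illustrative parameters (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
N, d, eps = 2000, 5, 0.05
true_mean = np.arange(d, dtype=float)

# Clean examples drawn around the true mean.
data = rng.normal(loc=true_mean, scale=1.0, size=(N, d))

# Arbitrarily corrupt eps * N examples.
n_bad = int(eps * N)
data[:n_bad] = 1e6

# Hide roughly 70% of all entries, so most coordinates are missing.
missing = rng.random((N, d)) < 0.7
data[missing] = np.nan

estimate = coordinatewise_median(data)  # baseline robust estimate
naive = np.nanmean(data, axis=0)        # non-robust mean of observed entries

print("median error:", np.abs(estimate - true_mean).max())
print("mean error:  ", np.abs(naive - true_mean).max())
```

The comparison with `np.nanmean` is there only to show why corruption matters: a handful of observed outliers is enough to pull the plain average far from the true mean, while the median stays close on each coordinate.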


Mean Isoperimetry with Control on Outliers: Exact and Approximation Algorithms

Morteza Alimi, Amir Daneshgar, Mohammad-Hadi Foroughmand-Araabi (2018)

Given a weighted graph $G=(V,E)$ with weight functions $c: E \to \mathbb{R}_+$ and $\pi: V \to \mathbb{R}_+$, and a subset $U \subseteq V$, the normalized cut value for $U$ is defined as the sum of the weights of edges exiting $U$ divided by the weight of vertices in $U$. The mean isoperimetry problem, $\mathsf{ISO}^1(G,k)$, for a weighted graph $G$ is a generalization of the classical uniform sparsest cut problem in which, given a parameter $k$, the objective is to find $k$ disjoint nonempty subsets of $V$ minimizing the average normalized cut value of the parts. The robust version of the problem seeks an optimizer where the number of vertices that fall out of the subpartition is bounded by some given integer $0 \leq \rho \leq |V|$. Our main result states that $\mathsf{ISO}^1(G,k)$, as well as its robust version, $\mathsf{CRISO}^1(G,k,\rho)$, subject to the condition that each part of the subpartition induces a connected subgraph, are solvable in time $O(k^2 \rho^2 \pi(V(T))^3)$ on any weighted tree $T$, in which $\pi(V(T))$ is the sum of the vertex-weights. This result implies that $\mathsf{ISO}^1(G,k)$ is strongly polynomial-time solvable on weighted trees when the vertex-weights are polynomially bounded and may be compared to the fact that the problem is NP-hard for weighted trees in general. Also, using this, we show that both mentioned problems, $\mathsf{ISO}^1(G,k)$ and $\mathsf{CRISO}^1(G,k,\rho)$, as well as the ordinary robust mean isoperimetry problem $\mathsf{RISO}^1(G,k,\rho)$, admit polynomial-time $O(\log^{1.5}|V| \log\log |V|)$-approximation algorithms for weighted graphs with polynomially bounded weights, using the Räcke-Shah tree cut sparsifier.
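The normalized cut value and the averaged objective defined above can be written out directly. The sketch below only evaluates the objective for a given sub-partition; it does not search for an optimal one and is not the algorithm from the paper. The function names and the edge-list representation `(u, v, weight)` are assumptions made for illustration.

```python
def normalized_cut(edges, vertex_weight, part):
    """Weight of edges leaving `part` divided by the vertex weight of `part`.

    edges: iterable of (u, v, c) triples, c being the edge weight.
    vertex_weight: dict mapping each vertex to its weight (pi in the text).
    part: a nonempty collection of vertices (U in the text).
    """
    part = set(part)
    if not part:
        raise ValueError("part must be nonempty")
    # An edge exits the part when exactly one endpoint lies inside it.
    boundary = sum(c for u, v, c in edges if (u in part) != (v in part))
    volume = sum(vertex_weight[v] for v in part)
    if volume == 0:
        raise ValueError("part must have positive vertex weight")
    return boundary / volume


def mean_normalized_cut(edges, vertex_weight, parts):
    """Average normalized cut value over disjoint nonempty parts."""
    parts = [set(p) for p in parts]
    seen = set()
    for p in parts:
        if seen & p:
            raise ValueError("parts must be pairwise disjoint")
        seen |= p
    return sum(normalized_cut(edges, vertex_weight, p) for p in parts) / len(parts)


# Hypothetical example: the path a - b - c - d.
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 3.0)]
pi = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 2.0}

left = normalized_cut(edges, pi, {"a", "b"})    # 2.0 / 2.0
right = normalized_cut(edges, pi, {"c", "d"})   # 2.0 / 4.0
objective = mean_normalized_cut(edges, pi, [{"a", "b"}, {"c", "d"}])
print(left, right, objective)
```

In the robust variants, some vertices may be left out of every part; this evaluator already allows that, since the parts only need to be disjoint and need not cover the whole vertex set.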

Data Structures and Algorithms

A Robust Spectral Clustering Algorithm for Sub-Gaussian Mixture Models with Outliers

Prateek R. Srivastava, Purnamrita Sarkar, Grani A. Hanasusanto (2019)

We consider the problem of clustering datasets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly on datasets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of data points. We analyze the performance of our algorithm under the assumption that the good data points (which we term inliers) are generated from a mixture of sub-Gaussians, while the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world datasets to demonstrate that our algorithm is less sensitive to outliers than other state-of-the-art algorithms proposed in the literature.

Machine Learning Machine Learning Statistics Theory

Towards Robust Medical Image Segmentation on Small-Scale Data with Incomplete Labels

Nanqing Dong, Michael Kampffmeyer, Xiaodan Liang (2020)

The data-driven nature of deep learning models for semantic segmentation requires a large number of pixel-level annotations. However, large-scale and fully labeled medical datasets are often unavailable for practical tasks. Recently, partially supervised methods have been proposed to utilize images with incomplete labels to mitigate the data scarcity problem in the medical domain. As an emerging research area, the breakthroughs made by existing methods rely on either large-scale data or complex model design, which makes them 1) less practical for certain real-life tasks and 2) less robust for small-scale data. It is time to step back and think about the robustness of partially supervised methods and how to maximally utilize small-scale and partially labeled data for medical image segmentation tasks. To bridge the methodological gaps in label-efficient deep learning with partial supervision, we propose RAMP, a simple yet efficient data augmentation framework for partially supervised medical image segmentation that exploits the assumption that patients share anatomical similarities. We systematically evaluate RAMP and the previous methods in various controlled multi-structure segmentation tasks. Compared to the mainstream approaches, RAMP consistently improves the performance of traditional segmentation networks on small-scale partially labeled data and can utilize additional image-wise weak annotations.

Computer Vision and Pattern Recognition Machine Learning Image and Video Processing

Robust Learning of Mixtures of Gaussians

Daniel M. Kane (2020)

We resolve one of the major outstanding problems in robust statistics. In particular, if $X$ is an evenly weighted mixture of two arbitrary $d$-dimensional Gaussians, we devise a polynomial-time algorithm that, given access to samples from $X$, an $\epsilon$-fraction of which have been adversarially corrupted, learns $X$ to error $\mathrm{poly}(\epsilon)$ in total variation distance.

Data Structures and Algorithms Machine Learning Statistics Theory

Robust Matrix Completion with Mixed Data Types

Daqian Sun, Martin T. Wells (2020)

We consider the matrix completion problem of recovering a structured low-rank matrix from partially observed entries with mixed data types. The vast majority of existing solutions propose computationally feasible estimators with strong statistical guarantees for the case where the underlying distribution of data in the matrix is continuous. A few recent approaches have used similar ideas to extend these estimators to the case where the underlying distribution belongs to the exponential family. Most of these approaches assume that there is only one underlying distribution and that the low-rank constraint is regularized by the matrix Schatten norm. We propose a computationally feasible statistical approach with strong recovery guarantees, along with an algorithmic framework suited for parallelization, to recover a low-rank matrix with partially observed entries for mixed data types in one step. We also provide extensive simulation evidence that corroborates our theoretical results.

Machine Learning Machine Learning Statistics Theory
