Do you want to publish a course? Click here

Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers

163   0   0.0 ( 0 )
 Added by Lunjia Hu
 Publication date 2020
and research's language is English




Ask ChatGPT about the research

We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time $widetilde O(Nd)$. Our results extend recent work on computationally-efficient robust estimation to a more widely applicable incomplete-data setting.



rate research

Read More

Given a weighted graph $G=(V,E)$ with weight functions $c:Eto mathbb{R}_+$ and $pi:Vto mathbb{R}_+$, and a subset $Usubseteq V$, the normalized cut value for $U$ is defined as the sum of the weights of edges exiting $U$ divided by the weight of vertices in $U$. The {it mean isoperimetry problem}, $mathsf{ISO}^1(G,k)$, for a weighted graph $G$ is a generalization of the classical uniform sparsest cut problem in which, given a parameter $k$, the objective is to find $k$ disjoint nonempty subsets of $V$ minimizing the average normalized cut value of the parts. The robust version of the problem seeks an optimizer where the number of vertices that fall out of the subpartition is bounded by some given integer $0 leq rho leq |V|$. Our main result states that $mathsf{ISO}^1(G,k)$, as well as its robust version, $mathsf{CRISO}^1(G,k,rho)$, subjected to the condition that each part of the subpartition induces a connected subgraph, are solvable in time $O(k^2 rho^2 pi(V(T)^3)$ on any weighted tree $T$, in which $pi(V(T))$ is the sum of the vertex-weights. This result implies that $mathsf{ISO}^1(G,k)$ is strongly polynomial-time solvable on weighted trees when the vertex-weights are polynomially bounded and may be compared to the fact that the problem is NP-Hard for weighted trees in general. Also, using this, we show that both mentioned problems, $mathsf{ISO}^1(G,k)$ and $mathsf{CRISO}^1(G,k,rho)$ as well as the ordinary robust mean isoperimetry problem $mathsf{RISO}^1(G,k,rho)$, admit polynomial-time $O(log^{1.5}|V| loglog |V|)$-approximation algorithms for weighted graphs with polynomially bounded weights, using the R{a}cke-Shah tree cut sparsifier.
We consider the problem of clustering datasets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly for datasets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of data points. We analyze the performance of our algorithm under the assumption that the good data points are generated from a mixture of sub-gaussians (we term these inliers), while the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches with the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world datasets to demonstrate that our algorithm is less sensitive to outliers compared to other state-of-the-art algorithms proposed in the literature.
The data-driven nature of deep learning models for semantic segmentation requires a large number of pixel-level annotations. However, large-scale and fully labeled medical datasets are often unavailable for practical tasks. Recently, partially supervised methods have been proposed to utilize images with incomplete labels to mitigate the data scarcity problem in the medical domain. As an emerging research area, the breakthroughs made by existing methods rely on either large-scale data or complex model design, which makes them 1) less practical for certain real-life tasks and 2) less robust for small-scale data. It is time to step back and think about the robustness of partially supervised methods and how to maximally utilize small-scale and partially labeled data for medical image segmentation tasks. To bridge the methodological gaps in label-efficient deep learning with partial supervision, we propose RAMP, a simple yet efficient data augmentation framework for partially supervised medical image segmentation by exploiting the assumption that patients share anatomical similarities. We systematically evaluate RAMP and the previous methods in various controlled multi-structure segmentation tasks. Compared to the mainstream approaches, RAMP consistently improves the performance of traditional segmentation networks on small-scale partially labeled data and utilize additional image-wise weak annotations.
81 - Daniel M. Kane 2020
We resolve one of the major outstanding problems in robust statistics. In particular, if $X$ is an evenly weighted mixture of two arbitrary $d$-dimensional Gaussians, we devise a polynomial time algorithm that given access to samples from $X$ an $eps$-fraction of which have been adversarially corrupted, learns $X$ to error $poly(eps)$ in total variation distance.
We consider the matrix completion problem of recovering a structured low rank matrix with partially observed entries with mixed data types. Vast majority of the solutions have proposed computationally feasible estimators with strong statistical guarantees for the case where the underlying distribution of data in the matrix is continuous. A few recent approaches have extended using similar ideas these estimators to the case where the underlying distributions belongs to the exponential family. Most of these approaches assume that there is only one underlying distribution and the low rank constraint is regularized by the matrix Schatten Norm. We propose a computationally feasible statistical approach with strong recovery guarantees along with an algorithmic framework suited for parallelization to recover a low rank matrix with partially observed entries for mixed data types in one step. We also provide extensive simulation evidence that corroborate our theoretical results.

suggested questions

comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا