The subspace approximation problem with outliers is the following: given $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$, an integer $1 \leq k \leq d$, and an outlier parameter $0 \leq \alpha \leq 1$, find a $k$-dimensional linear subspace of $\mathbb{R}^d$ that minimizes the sum of squared distances to its nearest $(1-\alpha)n$ points. More generally, the $\ell_p$ subspace approximation problem with outliers minimizes the sum of $p$-th powers of distances instead of the sum of squared distances. Even the case of robust PCA is non-trivial, and previous work requires additional assumptions on the input. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the $(1-\alpha)n$ inliers in the optimal solution are promised to lie exactly on a $k$-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard. We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared-distance error of the optimal $k$-dimensional subspace summed over the optimal $(1-\alpha)n$ inliers is at least $\delta$ times its squared error summed over all $n$ points, for some $0 < \delta \leq 1 - \alpha$. Under this assumption, we give an efficient algorithm to find a subset of $\mathrm{poly}(k/\epsilon)\, \log(1/\delta)\, \log\log(1/\delta)$ points whose span contains a $k$-dimensional subspace that gives a multiplicative $(1+\epsilon)$-approximation to the optimal solution. The running time of our algorithm is linear in $n$ and $d$. Interestingly, our results hold even when the fraction of outliers $\alpha$ is large, as long as the obvious condition $0 < \delta \leq 1 - \alpha$ is satisfied.
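The sampling-based bi-criteria approach can be made concrete with a short sketch. The following Python fragment is illustrative only: the round count, per-round sample size, and the residual-based outlier-trimming rule are simplified stand-ins for the paper's analysis, not the stated $\mathrm{poly}(k/\epsilon)\,\log(1/\delta)\,\log\log(1/\delta)$ bounds. Each round trims the $\alpha$-fraction of points with the largest residuals and samples new points proportional to squared residual distance to the span of the points picked so far.

```python
import numpy as np

def adaptive_sample_with_outliers(X, k, alpha, rounds=3, per_round=None, rng=None):
    """Sketch of adaptive sampling for subspace approximation with outliers.

    Each round: project onto the span of the points sampled so far, treat
    the alpha-fraction of points with the largest residuals as tentative
    outliers, and sample new points among the rest with probability
    proportional to squared residual distance.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = per_round or max(k, 1)
    S = np.empty((0, d))
    for _ in range(rounds):
        if S.shape[0] == 0:
            resid = X                        # no subspace yet: residual = the point itself
        else:
            Q, _ = np.linalg.qr(S.T)         # orthonormal basis of span(S)
            resid = X - (X @ Q) @ Q.T
        sq = np.einsum('ij,ij->i', resid, resid)
        keep = np.argsort(sq)[: int(np.ceil((1 - alpha) * n))]   # tentative inliers
        total = sq[keep].sum()
        if total == 0:                       # tentative inliers already spanned exactly
            break
        idx = rng.choice(keep, size=m, p=sq[keep] / total)
        S = np.vstack([S, X[idx]])
    return S      # span(S) should contain a near-optimal k-dimensional subspace
```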
We study the problem of robust subspace recovery (RSR) in the presence of adversarial outliers. That is, we seek a subspace that contains a large portion of a dataset when some fraction of the data points are arbitrarily corrupted. We first examine a theoretical estimator that is intractable to compute and use it to derive information-theoretic bounds for exact recovery. We then propose two tractable estimators: a variant of RANSAC and a simple relaxation of the theoretical estimator. Both estimators are fast to compute and achieve state-of-the-art theoretical performance in a noiseless RSR setting with adversarial outliers. The former achieves better theoretical guarantees in the noiseless case, while the latter is robust to small noise, and its guarantees improve significantly under non-adversarial models of outliers. We conclude with a complete comparison of guarantees for the adversarial RSR problem and a short discussion of the estimation of affine subspaces.
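As a concrete point of reference, here is a minimal RANSAC-style estimator for the noiseless setting. This is a generic sketch, not the specific variant analyzed in the paper; the trial count needed for recovery guarantees depends on the inlier fraction, and the default here is arbitrary.

```python
import numpy as np

def ransac_subspace(X, k, trials=500, tol=1e-9, rng=None):
    """Generic RANSAC-style estimator for noiseless robust subspace recovery.

    Repeatedly sample k points, take their span as a candidate subspace,
    count the points that lie (numerically) on it, and return the
    candidate with the most inliers.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    best_basis, best_count = None, -1
    for _ in range(trials):
        idx = rng.choice(n, size=k, replace=False)
        Q, _ = np.linalg.qr(X[idx].T)          # orthonormal basis of candidate span
        resid = X - (X @ Q) @ Q.T
        count = int((np.einsum('ij,ij->i', resid, resid) <= tol).sum())
        if count > best_count:
            best_basis, best_count = Q, count
    return best_basis, best_count
```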
In the Euclidean TSP with neighborhoods (TSPN), we are given a collection of n regions (neighborhoods) and we seek a shortest tour that visits each region. As a generalization of the classical Euclidean TSP, TSPN is also NP-hard. In this paper, we present new approximation results for the TSPN, including (1) a constant-factor approximation algorithm for the case of arbitrary connected neighborhoods having comparable diameters; and (2) a PTAS for the important special case of disjoint unit disk neighborhoods (or nearly disjoint, nearly-unit disks). Our methods also yield improved approximation ratios for various special classes of neighborhoods, which have previously been studied. Further, we give a linear-time O(1)-approximation algorithm for the case of neighborhoods that are (infinite) straight lines.
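To make the objective concrete, the following naive baseline (not one of the algorithms above) commits to one representative point per neighborhood and runs a nearest-neighbor tour on those points. For connected regions of comparable diameters, detouring from any fixed point of a region to the optimal visiting point costs at most a constant times the diameter per region, which is part of the intuition behind constant-factor guarantees.

```python
import numpy as np

def tspn_baseline(regions):
    """Naive TSPN baseline: fix one point per neighborhood, then build a
    nearest-neighbor tour over the representatives.

    'regions' is a list of (m_i, 2) arrays of points sampled from each
    region; any fixed point of the region works for the intuition above.
    """
    reps = np.array([r[0] for r in regions])   # one representative per region
    n = len(reps)
    unvisited = set(range(1, n))
    tour, cur = [0], 0
    while unvisited:                           # greedy nearest-neighbor tour
        nxt = min(unvisited, key=lambda j: np.linalg.norm(reps[cur] - reps[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
        cur = nxt
    length = sum(np.linalg.norm(reps[tour[i]] - reps[tour[(i + 1) % n]])
                 for i in range(n))
    return tour, length
```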
We improve the running times of $O(1)$-approximation algorithms for the set cover problem in geometric settings, specifically, covering points by disks in the plane, or covering points by halfspaces in three dimensions. In the unweighted case, Agarwal and Pan [SoCG 2014] gave a randomized $O(n\log^4 n)$-time, $O(1)$-approximation algorithm, by using variants of the multiplicative weight update (MWU) method combined with geometric data structures. We simplify the data structure requirement in one of their methods and obtain a deterministic $O(n\log^3 n\log\log n)$-time algorithm. With further new ideas, we obtain a still faster randomized $O(n\log n(\log\log n)^{O(1)})$-time algorithm. For the weighted problem, we also give a randomized $O(n\log^4 n\log\log n)$-time, $O(1)$-approximation algorithm, by simple modifications to the MWU method and the quasi-uniform sampling technique.
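The MWU schemes here descend from the Brönnimann–Goodrich iterative-reweighting technique, which the sketch below illustrates for an abstract hitting-set instance. Covering points by disks is the dual range space: the ground set is the set of disks, and each input point induces the range of disks containing it. The sample size and round bound are illustrative, and a weighted random sample stands in for the $\epsilon$-nets that the actual algorithms compute with geometric data structures.

```python
import numpy as np

def bg_hitting_set(ranges, n, t, rng=None, max_rounds=2000):
    """Broennimann-Goodrich iterative reweighting, sketched for hitting set.

    'ranges' is a list of index arrays into a ground set of size n, and t
    is a guess for the optimum size.  Keep weights on ground-set elements,
    draw a weighted sample as a (1/2t)-net candidate, and double the
    weights inside any range the sample misses.
    """
    rng = np.random.default_rng(rng)
    w = np.ones(n)
    net_size = max(1, int(4 * t * np.log(max(t, 2)) + 4 * t))   # illustrative O(t log t)
    for _ in range(max_rounds):
        net = np.unique(rng.choice(n, size=min(net_size, n), p=w / w.sum()))
        missed = next((r for r in ranges if np.intersect1d(r, net).size == 0), None)
        if missed is None:
            return net        # hits every range; size O(t log t)
        w[missed] *= 2        # a "light" range was missed: reweight its elements
    return None               # the guess t was likely too small
```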
Given a set of $n$ terminals, which are points in $d$-dimensional Euclidean space, the minimum Manhattan network problem (MMN) asks for a minimum-length rectilinear network that connects each pair of terminals by a Manhattan path, that is, a path consisting of axis-parallel segments whose total length equals the pair's Manhattan distance. Even for $d=2$, the problem is NP-hard, but constant-factor approximations are known. For $d \ge 3$, the problem is APX-hard; it is known to admit, for any $\epsilon > 0$, an $O(n^{\epsilon})$-approximation. In the generalized minimum Manhattan network problem (GMMN), we are given a set $R$ of $n$ terminal pairs, and the goal is to find a minimum-length rectilinear network such that each pair in $R$ is connected by a Manhattan path. GMMN is a generalization of both MMN and the well-known rectilinear Steiner arborescence problem (RSA). So far, only special cases of GMMN have been considered. We present an $O(\log^{d+1} n)$-approximation algorithm for GMMN (and, hence, MMN) in $d \ge 2$ dimensions and an $O(\log n)$-approximation algorithm for 2D. We show that an existing $O(\log n)$-approximation algorithm for RSA in 2D generalizes easily to $d>2$ dimensions.
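For concreteness, a Manhattan path between two points can always be realized with at most one bend per coordinate; the sketch below constructs such a monotone staircase path in any dimension. Connecting every terminal pair this way yields a feasible rectilinear network whose length is simply the sum of the pairs' Manhattan distances, a trivial upper bound that the approximation algorithms above beat by sharing segments between pairs.

```python
def manhattan_path(p, q):
    """An axis-parallel path from p to q, fixing one coordinate at a time,
    whose total length equals the Manhattan distance sum(|p_i - q_i|).
    Works in any dimension d; returns the sequence of bend points.
    """
    path = [tuple(p)]
    cur = list(p)
    for i in range(len(p)):        # move along one axis at a time
        if cur[i] != q[i]:
            cur[i] = q[i]
            path.append(tuple(cur))
    return path

# Example: manhattan_path((0, 0), (3, 2)) -> [(0, 0), (3, 0), (3, 2)],
# of length |3-0| + |2-0| = 5, the Manhattan distance of the pair.
```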
We consider the problem of subset selection for $\ell_p$ subspace approximation, i.e., given $n$ points in $d$ dimensions, we need to pick a small, representative subset of the given points such that its span gives a $(1+\epsilon)$-approximation to the best $k$-dimensional subspace that minimizes the sum of $p$-th powers of distances of all the points to this subspace. Sampling-based subset selection techniques require adaptive sampling iterations with multiple passes over the data. Matrix sketching techniques give a single-pass $(1+\epsilon)$-approximation for $\ell_p$ subspace approximation but require additional passes for subset selection. In this work, we propose an MCMC algorithm to reduce the number of passes required by previous subset selection algorithms based on adaptive sampling. For $p=2$, our algorithm gives subset selection of nearly optimal size in only $2$ passes, whereas the number of passes required in previous work depends on $k$. Our algorithm picks a subset of size $\mathrm{poly}(k/\epsilon)$ that gives a $(1+\epsilon)$-approximation to the optimal subspace. The running time of the algorithm is $nd + d\,\mathrm{poly}(k/\epsilon)$. We extend our results to the case when outliers are present in the dataset, and give a two-pass algorithm for that setting. Our ideas also extend to reduce the number of passes required by adaptive sampling algorithms for $\ell_p$ subspace approximation and subset selection, for $p \geq 2$.
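The role of MCMC here can be illustrated with a Metropolis-Hastings sketch. This mirrors the general idea of replacing a full adaptive-sampling pass with a chain driven by a norm-based proposal, not the paper's exact algorithm; the parameters m and chain_len are hypothetical defaults.

```python
import numpy as np

def mcmc_adaptive_round(X, Q, m, chain_len=50, rng=None):
    """Metropolis-Hastings stand-in for one adaptive-sampling round in l2
    subspace approximation.

    Target: sample index i with probability proportional to the squared
    residual distance of x_i to span(Q), where Q has orthonormal columns.
    Proposal: squared row norms, computable in a single pass; the chain
    then evaluates residuals only for proposed rows, instead of
    re-scanning all n points in another adaptive pass.
    """
    rng = np.random.default_rng(rng)
    norms = np.einsum('ij,ij->i', X, X)                       # proposal q(i) ~ ||x_i||^2
    q = norms / norms.sum()
    resid_sq = lambda i: norms[i] - np.sum((X[i] @ Q) ** 2)   # residual via Pythagoras
    picked = []
    for _ in range(m):
        cur = rng.choice(len(X), p=q)
        for _ in range(chain_len):                            # MH chain for p(i) ~ resid_sq(i)
            cand = rng.choice(len(X), p=q)
            ratio = (resid_sq(cand) * q[cur]) / max(resid_sq(cur) * q[cand], 1e-300)
            if rng.random() < ratio:
                cur = cand
        picked.append(cur)
    return np.array(picked)
```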