No Arabic abstract
We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $ell_p$-norm with $1 leq p < 2$. We study the $ell_p$ norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A in mathbb{R}^{d times n}$ ($n gg d$), our algorithm achieves a multiplicative $k^{frac{1}{p} - frac{1}{2}}text{poly}(log nd)$-approximation to the error with respect to the best possible column subset of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $ell_{1, 2}$ norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.
A streaming algorithm is said to be adversarially robust if its accuracy guarantees are maintained even when the data stream is chosen maliciously, by an adaptive adversary. We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. This connection allows us to design new adversarially robust streaming algorithms that outperform the current state-of-the-art constructions for many interesting regimes of parameters.
Given a set (or multiset) S of n numbers and a target number t, the subset sum problem is to decide if there is a subset of S that sums up to t. There are several methods for solving this problem, including exhaustive search, divide-and-conquer method, and Bellmans dynamic programming method. However, none of them could generate universal and light code. In this paper, we present a new deterministic algorithm based on a novel data arrangement, which could generate such code and return all solutions. If n is small enough, it is efficient for usual purpose. We also present a probabilistic version with one-sided error and a greedy algorithm which could generate a solution with minimized variance.
In the Subset Sum problem we are given a set of $n$ positive integers $X$ and a target $t$ and are asked whether some subset of $X$ sums to $t$. Natural parameters for this problem that have been studied in the literature are $n$ and $t$ as well as the maximum input number $rm{mx}_X$ and the sum of all input numbers $Sigma_X$. In this paper we study the dense case of Subset Sum, where all these parameters are polynomial in $n$. In this regime, standard pseudo-polynomial algorithms solve Subset Sum in polynomial time $n^{O(1)}$. Our main question is: When can dense Subset Sum be solved in near-linear time $tilde{O}(n)$? We provide an essentially complete dichotomy by designing improved algorithms and proving conditional lower bounds, thereby determining essentially all settings of the parameters $n,t,rm{mx}_X,Sigma_X$ for which dense Subset Sum is in time $tilde{O}(n)$. For notational convenience we assume without loss of generality that $t ge rm{mx}_X$ (as larger numbers can be ignored) and $t le Sigma_X/2$ (using symmetry). Then our dichotomy reads as follows: - By reviving and improving an additive-combinatorics-based approach by Galil and Margalit [SICOMP91], we show that Subset Sum is in near-linear time $tilde{O}(n)$ if $t gg rm{mx}_X Sigma_X/n^2$. - We prove a matching conditional lower bound: If Subset Sum is in near-linear time for any setting with $t ll rm{mx}_X Sigma_X/n^2$, then the Strong Exponential Time Hypothesis and the Strong k-Sum Hypothesis fail. We also generalize our algorithm from sets to multi-sets, albeit with non-matching upper and lower bounds.
We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyds algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.
We present new lower bounds that show that a polynomial number of passes are necessary for solving some fundamental graph problems in the streaming model of computation. For instance, we show that any streaming algorithm that finds a weighted minimum $s$-$t$ cut in an $n$-vertex undirected graph requires $n^{2-o(1)}$ space unless it makes $n^{Omega(1)}$ passes over the stream. To prove our lower bounds, we introduce and analyze a new four-player communication problem that we refer to as the hidden-pointer chasing problem. This is a problem in spirit of the standard pointer chasing problem with the key difference that the pointers in this problem are hidden to players and finding each one of them requires solving another communication problem, namely the set intersection problem. Our lower bounds for graph problems are then obtained by reductions from the hidden-pointer chasing problem. Our hidden-pointer chasing problem appears flexible enough to find other applications and is therefore interesting in its own right. To showcase this, we further present an interesting application of this problem beyond streaming algorithms. Using a reduction from hidden-pointer chasing, we prove that any algorithm for submodular function minimization needs to make $n^{2-o(1)}$ value queries to the function unless it has a polynomial degree of adaptivity.