No Arabic abstract
We study the distinct elements and $ell_p$-heavy hitters problems in the sliding window model, where only the most recent $n$ elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $ell_p$-heavy hitters that are nearly optimal in both $n$ and $epsilon$. Applying our new composable histogram framework, we provide an algorithm that outputs a $(1+epsilon)$-approximation to the number of distinct elements in the sliding window model and uses $mathcal{O}left(frac{1}{epsilon^2}log nlogfrac{1}{epsilon}loglog n+frac{1}{epsilon}log^2 nright)$ bits of space. For $ell_p$-heavy hitters, we provide an algorithm using space $mathcal{O}left(frac{1}{epsilon^p}log^2 nleft(log^2log n+logfrac{1}{epsilon}right)right)$ for $0<ple 2$, improving upon the best-known algorithm for $ell_2$-heavy hitters (Braverman et al., COCOON 2014), which has space complexity $mathcal{O}left(frac{1}{epsilon^4}log^3 nright)$. We also show complementing nearly optimal lower bounds of $Omegaleft(frac{1}{epsilon}log^2 n+frac{1}{epsilon^2}log nright)$ for distinct elements and $Omegaleft(frac{1}{epsilon^p}log^2 nright)$ for $ell_p$-heavy hitters, both tight up to $mathcal{O}left(loglog nright)$ and $mathcal{O}left(logfrac{1}{epsilon}right)$ factors.
We study the heavy hitters and related sparse recovery problems in the low-failure probability regime. This regime is not well-understood, and has only been studied for non-adaptive schemes. The main previous work is one on sparse recovery by Gilbert et al.(ICALP13). We recognize an error in their analysis, improve their results, and contribute new non-adaptive and adaptive sparse recovery algorithms, as well as provide upper and lower bounds for the heavy hitters problem with low failure probability.
We explore clustering problems in the streaming sliding window model in both general metric spaces and Euclidean space. We present the first polylogarithmic space $O(1)$-approximation to the metric $k$-median and metric $k$-means problems in the sliding window model, answering the main open problem posed by Babcock, Datar, Motwani and OCallaghan, which has remained unanswered for over a decade. Our algorithm uses $O(k^3 log^6 n)$ space and $operatorname{poly}(k, log n)$ update time. This is an exponential improvement on the space required by the technique due to Babcock, et al. We introduce a data structure that extends smooth histograms as introduced by Braverman and Ostrovsky to operate on a broader class of functions. In particular, we show that using only polylogarithmic space we can maintain a summary of the current window from which we can construct an $O(1)$-approximate clustering solution. Merge-and-reduce is a generic method in computational geometry for adapting offline algorithms to the insertion-only streaming model. Several well-known coreset constructions are maintainable in the insertion-only streaming model using this method, including well-known coreset techniques for the $k$-median, $k$-means in both low-and high-dimensional Euclidean spaces. Previous work has adapted these techniques to the insertion-deletion model, but translating them to the sliding window model has remained a challenge. We give the first algorithm that, given an insertion-only streaming coreset construction of space $s$, maintains a $(1pmepsilon)$-approximate coreset in the sliding window model using $O(s^2epsilon^{-2}log n)$ space. For clustering problems, our results constitute the first significant step towards resolving problem number 20 from the List of Open Problems in Sublinear Algorithms.
In this paper, we study the tradeoff between the approximation guarantee and adaptivity for the problem of maximizing a monotone submodular function subject to a cardinality constraint. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Adaptivity is an important consideration in settings where the objective function is estimated using samples and in applications where adaptivity is the main running time bottleneck. Previous algorithms achieving a nearly-optimal $1 - 1/e - epsilon$ approximation require $Omega(n)$ rounds of adaptivity. In this work, we give the first algorithm that achieves a $1 - 1/e - epsilon$ approximation using $O(ln{n} / epsilon^2)$ rounds of adaptivity. The number of function evaluations and additional running time of the algorithm are $O(n mathrm{poly}(log{n}, 1/epsilon))$.
We consider time-space tradeoffs for exactly computing frequency moments and order statistics over sliding windows. Given an input of length 2n-1, the task is to output the function of each window of length n, giving n outputs in total. Computations over sliding windows are related to direct sum problems except that inputs to instances almost completely overlap. We show an average case and randomized time-space tradeoff lower bound of TS in Omega(n^2) for multi-way branching programs, and hence standard RAM and word-RAM models, to compute the number of distinct elements, F_0, in sliding windows over alphabet [n]. The same lower bound holds for computing the low-order bit of F_0 and computing any frequency moment F_k for k not equal to 1. We complement this lower bound with a TS in tilde O(n^2) deterministic RAM algorithm for exactly computing F_k in sliding windows. We show time-space separations between the complexity of sliding-window element distinctness and that of sliding-window $F_0bmod 2$ computation. In particular for alphabet [n] there is a very simple errorless sliding-window algorithm for element distinctness that runs in O(n) time on average and uses O(log{n}) space. We show that any algorithm for a single element distinctness instance can be extended to an algorithm for the sliding-window version of element distinctness with at most a polylogarithmic increase in the time-space product. Finally, we show that the sliding-window computation of order statistics such as the maximum and minimum can be computed with only a logarithmic increase in time, but that a TS in Omega(n^2) lower bound holds for sliding-window computation of order statistics such as the median, a nearly linear increase in time when space is small.
An $(epsilon,phi)$-expander decomposition of a graph $G=(V,E)$ is a clustering of the vertices $V=V_{1}cupcdotscup V_{x}$ such that (1) each cluster $V_{i}$ induces subgraph with conductance at least $phi$, and (2) the number of inter-cluster edges is at most $epsilon|E|$. In this paper, we give an improved distributed expander decomposition. Specifically, we construct an $(epsilon,phi)$-expander decomposition with $phi=(epsilon/log n)^{2^{O(k)}}$ in $O(n^{2/k}cdottext{poly}(1/phi,log n))$ rounds for any $epsilonin(0,1)$ and positive integer $k$. For example, a $(0.01,1/text{poly}log n)$-expander decomposition can be computed in $O(n^{gamma})$ rounds, for any arbitrarily small constant $gamma>0$. Previously, the algorithm by Chang, Pettie, and Zhang can construct a $(1/6,1/text{poly}log n)$-expander decomposition using $tilde{O}(n^{1-delta})$ rounds for any $delta>0$, with a caveat that the algorithm is allowed to throw away a set of edges into an extra part which forms a subgraph with arboricity at most $n^{delta}$. Our algorithm does not have this caveat. By slightly modifying the distributed algorithm for routing on expanders by Ghaffari, Kuhn and Su [PODC17], we obtain a triangle enumeration algorithm using $tilde{O}(n^{1/3})$ rounds. This matches the lower bound by Izumi and Le Gall [PODC17] and Pandurangan, Robinson and Scquizzato [SPAA18] of $tilde{Omega}(n^{1/3})$ which holds even in the CONGESTED CLIQUE model. This provides the first non-trivial example for a distributed problem that has essentially the same complexity (up to a polylogarithmic factor) in both CONGEST and CONGESTED CLIQUE. The key technique in our proof is the first distributed approximation algorithm for finding a low conductance cut that is as balanced as possible. Previous distributed sparse cut algorithms do not have this nearly most balanced guarantee.