No Arabic abstract
We explore clustering problems in the streaming sliding window model in both general metric spaces and Euclidean space. We present the first polylogarithmic space $O(1)$-approximation to the metric $k$-median and metric $k$-means problems in the sliding window model, answering the main open problem posed by Babcock, Datar, Motwani and OCallaghan, which has remained unanswered for over a decade. Our algorithm uses $O(k^3 log^6 n)$ space and $operatorname{poly}(k, log n)$ update time. This is an exponential improvement on the space required by the technique due to Babcock, et al. We introduce a data structure that extends smooth histograms as introduced by Braverman and Ostrovsky to operate on a broader class of functions. In particular, we show that using only polylogarithmic space we can maintain a summary of the current window from which we can construct an $O(1)$-approximate clustering solution. Merge-and-reduce is a generic method in computational geometry for adapting offline algorithms to the insertion-only streaming model. Several well-known coreset constructions are maintainable in the insertion-only streaming model using this method, including well-known coreset techniques for the $k$-median, $k$-means in both low-and high-dimensional Euclidean spaces. Previous work has adapted these techniques to the insertion-deletion model, but translating them to the sliding window model has remained a challenge. We give the first algorithm that, given an insertion-only streaming coreset construction of space $s$, maintains a $(1pmepsilon)$-approximate coreset in the sliding window model using $O(s^2epsilon^{-2}log n)$ space. For clustering problems, our results constitute the first significant step towards resolving problem number 20 from the List of Open Problems in Sublinear Algorithms.
We study the distinct elements and $ell_p$-heavy hitters problems in the sliding window model, where only the most recent $n$ elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $ell_p$-heavy hitters that are nearly optimal in both $n$ and $epsilon$. Applying our new composable histogram framework, we provide an algorithm that outputs a $(1+epsilon)$-approximation to the number of distinct elements in the sliding window model and uses $mathcal{O}left(frac{1}{epsilon^2}log nlogfrac{1}{epsilon}loglog n+frac{1}{epsilon}log^2 nright)$ bits of space. For $ell_p$-heavy hitters, we provide an algorithm using space $mathcal{O}left(frac{1}{epsilon^p}log^2 nleft(log^2log n+logfrac{1}{epsilon}right)right)$ for $0<ple 2$, improving upon the best-known algorithm for $ell_2$-heavy hitters (Braverman et al., COCOON 2014), which has space complexity $mathcal{O}left(frac{1}{epsilon^4}log^3 nright)$. We also show complementing nearly optimal lower bounds of $Omegaleft(frac{1}{epsilon}log^2 n+frac{1}{epsilon^2}log nright)$ for distinct elements and $Omegaleft(frac{1}{epsilon^p}log^2 nright)$ for $ell_p$-heavy hitters, both tight up to $mathcal{O}left(loglog nright)$ and $mathcal{O}left(logfrac{1}{epsilon}right)$ factors.
This paper presents universal algorithms for clustering problems, including the widely studied $k$-median, $k$-means, and $k$-center objectives. The input is a metric space containing all potential client locations. The algorithm must select $k$ cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithms solution and that of an optimal solution. A universal algorithms solution $SOL$ for a clustering problem is said to be an $(alpha, beta)$-approximation if for all subsets of clients $C$, it satisfies $SOL(C) leq alpha cdot OPT(C) + beta cdot MR$, where $OPT(C)$ is the cost of the optimal solution for clients $C$ and $MR$ is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of $k$-median, $k$-means, and $k$-center that achieve $(O(1), O(1))$-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other $ell_p$-objectives and the setting where some subset of the clients are fixed. We also give hardness results showing that $(alpha, beta)$-approximation is NP-hard if $alpha$ or $beta$ is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, $(O(1), O(1))$-approximation is the strongest type of guarantee obtainable for universal clustering.
In this paper we develop a unified approach for solving a wide class of sequential selection problems. This class includes, but is not limited to, selection problems with no-information, rank-dependent rewards, and considers both fixed as well as random problem horizons. The proposed framework is based on a reduction of the original selection problem to one of optimal stopping for a sequence of judiciously constructed independent random variables. We demonstrate that our approach allows exact and efficient computation of optimal policies and various performance metrics thereof for a variety of sequential selection problems, several of which have not been solved to date.
We consider time-space tradeoffs for exactly computing frequency moments and order statistics over sliding windows. Given an input of length 2n-1, the task is to output the function of each window of length n, giving n outputs in total. Computations over sliding windows are related to direct sum problems except that inputs to instances almost completely overlap. We show an average case and randomized time-space tradeoff lower bound of TS in Omega(n^2) for multi-way branching programs, and hence standard RAM and word-RAM models, to compute the number of distinct elements, F_0, in sliding windows over alphabet [n]. The same lower bound holds for computing the low-order bit of F_0 and computing any frequency moment F_k for k not equal to 1. We complement this lower bound with a TS in tilde O(n^2) deterministic RAM algorithm for exactly computing F_k in sliding windows. We show time-space separations between the complexity of sliding-window element distinctness and that of sliding-window $F_0bmod 2$ computation. In particular for alphabet [n] there is a very simple errorless sliding-window algorithm for element distinctness that runs in O(n) time on average and uses O(log{n}) space. We show that any algorithm for a single element distinctness instance can be extended to an algorithm for the sliding-window version of element distinctness with at most a polylogarithmic increase in the time-space product. Finally, we show that the sliding-window computation of order statistics such as the maximum and minimum can be computed with only a logarithmic increase in time, but that a TS in Omega(n^2) lower bound holds for sliding-window computation of order statistics such as the median, a nearly linear increase in time when space is small.
Suppose that two independent sets $I$ and $J$ of a graph with $vert I vert = vert J vert$ are given, and a token is placed on each vertex in $I$. The Sliding Token problem is to determine whether there exists a sequence of independent sets which transforms $I$ into $J$ so that each independent set in the sequence results from the previous one by sliding exactly one token along an edge in the graph. It is one of the representative reconfiguration problems that attract the attention from the viewpoint of theoretical computer science. For a yes-instance of a reconfiguration problem, finding a shortest reconfiguration sequence has a different aspect. In general, even if it is polynomial time solvable to decide whether two instances are reconfigured with each other, it can be $mathsf{NP}$-hard to find a shortest sequence between them. In this paper, we show that the problem for finding a shortest sequence between two independent sets is polynomial time solvable for spiders (i.e., trees having exactly one vertex of degree at least three).