No Arabic abstract
We consider time-space tradeoffs for exactly computing frequency moments and order statistics over sliding windows. Given an input of length 2n-1, the task is to output the function of each window of length n, giving n outputs in total. Computations over sliding windows are related to direct sum problems except that inputs to instances almost completely overlap. We show an average case and randomized time-space tradeoff lower bound of TS in Omega(n^2) for multi-way branching programs, and hence standard RAM and word-RAM models, to compute the number of distinct elements, F_0, in sliding windows over alphabet [n]. The same lower bound holds for computing the low-order bit of F_0 and computing any frequency moment F_k for k not equal to 1. We complement this lower bound with a TS in tilde O(n^2) deterministic RAM algorithm for exactly computing F_k in sliding windows. We show time-space separations between the complexity of sliding-window element distinctness and that of sliding-window $F_0bmod 2$ computation. In particular for alphabet [n] there is a very simple errorless sliding-window algorithm for element distinctness that runs in O(n) time on average and uses O(log{n}) space. We show that any algorithm for a single element distinctness instance can be extended to an algorithm for the sliding-window version of element distinctness with at most a polylogarithmic increase in the time-space product. Finally, we show that the sliding-window computation of order statistics such as the maximum and minimum can be computed with only a logarithmic increase in time, but that a TS in Omega(n^2) lower bound holds for sliding-window computation of order statistics such as the median, a nearly linear increase in time when space is small.
The Windows Scheduling Problem, also known as the Pinwheel Problem, is to schedule periodic jobs subject to their processing frequency demands. Instances are given as a set of jobs that have to be processed infinitely often such that the time interval between two consecutive executions of the same job j is no longer than the jobs given period $p_j$. The key contribution of this work is a new interpretation of the problem variant with exact periods, where the time interval between consecutive executions must be strictly $p_j$. We show that this version is equivalent to a natural combinatorial problem we call Partial Coding. Reductions in both directions can be realized in polynomial time, so that both hardness proofs and algorithms for Partial Coding transfer to Windows Scheduling. Applying this new perspective, we obtain a number of new results regarding the computational complexity of various Windows Scheduling Problem variants. We prove that even the case of one processor and unit-length jobs does not admit a pseudo-polynomial time algorithm unless SAT can be solved by a randomized method in expected quasi-polynomial time. This result also extends to the case of inexact periods, which answers a question that has remained open for more than two decades. Furthermore, we report an error found in a hardness proof previously given for the multi-machine case without machine migration, and we show that this variant reduces to the single-machine case. Finally, we prove that even with unit-length jobs the problem is co-NP-hard when jobs are allowed to migrate between machines.
We explore clustering problems in the streaming sliding window model in both general metric spaces and Euclidean space. We present the first polylogarithmic space $O(1)$-approximation to the metric $k$-median and metric $k$-means problems in the sliding window model, answering the main open problem posed by Babcock, Datar, Motwani and OCallaghan, which has remained unanswered for over a decade. Our algorithm uses $O(k^3 log^6 n)$ space and $operatorname{poly}(k, log n)$ update time. This is an exponential improvement on the space required by the technique due to Babcock, et al. We introduce a data structure that extends smooth histograms as introduced by Braverman and Ostrovsky to operate on a broader class of functions. In particular, we show that using only polylogarithmic space we can maintain a summary of the current window from which we can construct an $O(1)$-approximate clustering solution. Merge-and-reduce is a generic method in computational geometry for adapting offline algorithms to the insertion-only streaming model. Several well-known coreset constructions are maintainable in the insertion-only streaming model using this method, including well-known coreset techniques for the $k$-median, $k$-means in both low-and high-dimensional Euclidean spaces. Previous work has adapted these techniques to the insertion-deletion model, but translating them to the sliding window model has remained a challenge. We give the first algorithm that, given an insertion-only streaming coreset construction of space $s$, maintains a $(1pmepsilon)$-approximate coreset in the sliding window model using $O(s^2epsilon^{-2}log n)$ space. For clustering problems, our results constitute the first significant step towards resolving problem number 20 from the List of Open Problems in Sublinear Algorithms.
We study the distinct elements and $ell_p$-heavy hitters problems in the sliding window model, where only the most recent $n$ elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $ell_p$-heavy hitters that are nearly optimal in both $n$ and $epsilon$. Applying our new composable histogram framework, we provide an algorithm that outputs a $(1+epsilon)$-approximation to the number of distinct elements in the sliding window model and uses $mathcal{O}left(frac{1}{epsilon^2}log nlogfrac{1}{epsilon}loglog n+frac{1}{epsilon}log^2 nright)$ bits of space. For $ell_p$-heavy hitters, we provide an algorithm using space $mathcal{O}left(frac{1}{epsilon^p}log^2 nleft(log^2log n+logfrac{1}{epsilon}right)right)$ for $0<ple 2$, improving upon the best-known algorithm for $ell_2$-heavy hitters (Braverman et al., COCOON 2014), which has space complexity $mathcal{O}left(frac{1}{epsilon^4}log^3 nright)$. We also show complementing nearly optimal lower bounds of $Omegaleft(frac{1}{epsilon}log^2 n+frac{1}{epsilon^2}log nright)$ for distinct elements and $Omegaleft(frac{1}{epsilon^p}log^2 nright)$ for $ell_p$-heavy hitters, both tight up to $mathcal{O}left(loglog nright)$ and $mathcal{O}left(logfrac{1}{epsilon}right)$ factors.
This paper investigates the problem of utilizing network topology and partial timestamps to detect the information source in a network. The problem incurs prohibitive cost under canonical maximum likelihood estimation (MLE) of the source due to the exponential number of possible infection paths. Our main idea of source detection, however, is to approximate the MLE by an alternative infection path based estimator, the essence of which is to identify the most likely infection path that is consistent with observed timestamps. The source node associated with that infection path is viewed as the estimated source $hat{v}$. We first study the case of tree topology, where by transforming the infection path based estimator into a linear integer programming, we find a reduced search region that remarkably improves the time efficiency. Within this reduced search region, the estimator $hat{v}$ is provably always on a path which we term as emph{candidate path}. This notion enables us to analyze the distribution of $d(v^{ast},hat{v})$, the error distance between $hat{v}$ and the true source $v^{ast}$, on arbitrary tree, which allows us to obtain for the first time, in the literature provable performance guarantee of the estimator under limited timestamps. Specifically, on the infinite $g$-regular tree with uniform sampled timestamps, we get a refined performance guarantee in the sense of a constant bounded $d(v^{ast},hat{v})$. By virtue of time labeled BFS tree, the estimator still performs fairly well when extended to more general graphs. Experiments on both synthetic and real datasets further demonstrate the superior performance of our proposed algorithms.
We study the complexity of Boolean constraint satisfaction problems (CSPs) when the assignment must have Hamming weight in some congruence class modulo M, for various choices of the modulus M. Due to the known classification of tractable Boolean CSPs, this mainly reduces to the study of three cases: 2-SAT, HORN-SAT, and LIN-2 (linear equations mod 2). We classify the moduli M for which these respective problems are polynomial time solvable, and when they are not (assuming the ETH). Our study reveals that this modular constraint lends a surprising richness to these classic, well-studied problems, with interesting broader connections to complexity theory and coding theory. The HORN-SAT case is connected to the covering complexity of polynomials representing the NAND function mod M. The LIN-2 case is tied to the sparsity of polynomials representing the OR function mod M, which in turn has connections to modular weight distribution properties of linear codes and locally decodable codes. In both cases, the analysis of our algorithm as well as the hardness reduction rely on these polynomial representations, highlighting an interesting algebraic common ground between hard cases for our algorithms and the gadgets which show hardness. These new complexity measures of polynomial representations merit further study. The inspiration for our study comes from a recent work by Nagele, Sudakov, and Zenklusen on submodular minimization with a global congruence constraint. Our algorithm for HORN-SAT has strong similarities to their algorithm, and in particular identical kind of set systems arise in both cases. Our connection to polynomial representations leads to a simpler analysis of such set systems, and also sheds light on (but does not resolve) the complexity of submodular minimization with a congruency requirement modulo a composite M.