No Arabic abstract
We study the problem of computing an approximate maximum cardinality matching in the semi-streaming model when edges arrive in a emph{random} order. In the semi-streaming model, the edges of the input graph G = (V,E) are given as a stream e_1, ..., e_m, and the algorithm is allowed to make a single pass over this stream while using $O(n textrm{polylog}(n))$ space ($m = |E|$ and $n = |V|$). If the order of edges is adversarial, a simple single-pass greedy algorithm yields a $1/2$-approximation in $O(n)$ space; achieving a better approximation in adversarial streams remains an elusive open question. A line of recent work shows that one can improve upon the $1/2$-approximation if the edges of the stream arrive in a random order. The state of the art for this model is two-fold: Assadi et al. [SODA 2019] show how to compute a $2/3(sim.66)$-approximate matching, but the space requirement is $O(n^{1.5} textrm{polylog}(n))$. Very recently, Farhadi et al. [SODA 2020] presented an algorithm with the desired space usage of $O(n textrm{polylog}(n))$, but a worse approximation ratio of $6/11(sim.545)$, or $3/5(=.6)$ in bipartite graphs. In this paper, we present an algorithm that computes a $2/3(sim.66)$-approximate matching using only $O(n log(n))$ space, improving upon both results above. We also note that for adversarial streams, a lower bound of Kapralov [SODA 2013] shows that any algorithm that achieves a $1-1/e(sim.63)$-approximation requires $(n^{1+Omega(1/loglog(n))})$ space. Our result for random-order streams is the first to go beyond the adversarial-order lower bound, thus establishing that computing a maximum matching is provably easier in random-order streams.
We study the maximum matching problem in the random-order semi-streaming setting. In this problem, the edges of an arbitrary $n$-vertex graph $G=(V, E)$ arrive in a stream one by one and in a random order. The goal is to have a single pass over the stream, use $n cdot poly(log n)$ space, and output a large matching of $G$. We prove that for an absolute constant $epsilon_0 > 0$, one can find a $(2/3 + epsilon_0)$-approximate maximum matching of $G$ using $O(n log n)$ space with high probability. This breaks the natural boundary of $2/3$ for this problem prevalent in the prior work and resolves an open problem of Bernstein [ICALP20] on whether a $(2/3 + Omega(1))$-approximation is achievable.
We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.
In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a emph{coreset}, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for $k$-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores $mathcal{O}(klog(hDelta)+h)$ points where $h$ is the half-life of the decay function and $Delta$ is the aspect ratio of the dataset. Our techniques extend to $k$-means clustering and $M$-estimators as well.
Given two strings $S$ and $P$, the Episode Matching problem is to compute the length of the shortest substring of $S$ that contains $P$ as a subsequence. The best known upper bound for this problem is $tilde O(nm)$ by Das et al. (1997), where $n,m$ are the lengths of $S$ and $P$, respectively. Although the problem is well studied and has many applications in data mining, this bound has never been improved. In this paper we show why this is the case by proving that an $O((nm)^{1-epsilon})$ algorithm (even for binary strings) would refute the popular Strong Exponential Time Hypothesis (SETH). The proof is based on a simple reduction from Orthogonal Vectors.
In the streaming model, the order of the stream can significantly affect the difficulty of a problem. A $t$-semirandom stream was introduced as an interpolation between random-order ($t=1$) and adversarial-order ($t=n$) streams where an adversary intercepts a random-order stream and can delay up to $t$ elements at a time. IITK Sublinear Open Problem #15 asks to find algorithms whose performance degrades smoothly as $t$ increases. We show that the celebrated online facility location algorithm achieves an expected competitive ratio of $O(frac{log t}{log log t})$. We present a matching lower bound that any randomized algorithm has an expected competitive ratio of $Omega(frac{log t}{log log t})$. We use this result to construct an $O(1)$-approximate streaming algorithm for $k$-median clustering that stores $O(k log t)$ points and has $O(k log t)$ worst-case update time. Our technique generalizes to any dissimilarity measure that satisfies a weak triangle inequality, including $k$-means, $M$-estimators, and $ell_p$ norms. The special case $t=1$ yields an optimal $O(k)$ space algorithm for random-order streams as well as an optimal $O(nk)$ time algorithm in the RAM model, closing a long line of research on this problem.