No Arabic abstract
A retrieval data structure for a static function $f:Srightarrow {0,1}^r$ supports queries that return $f(x)$ for any $x in S$. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ) (i.e., a Bloom filter alternative) with false positive rate $2^{-r}$. The information-theoretic lower bound for both tasks is $r|S|$ bits. While succinct theoretical constructions using $(1+o(1))r|S|$ bits were known, these could not achieve very small overheads in practice because they have an unfavorable space-time tradeoff hidden in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation BuRR achieves space overheads well below $1,%$ while being faster than most previously used retrieval data structures (typically with space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead $geq 44,%$). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
Approximate Membership Query structures (AMQs) rely on randomisation for time- and space-efficiency, while introducing a possibility of false positive and false negative answers. Correctness proofs of such structures involve subtle reasoning about bounds on probabilities of getting certain outcomes. Because of these subtleties, a number of unsound arguments in such proofs have been made over the years. In this work, we address the challenge of building rigorous and reusable computer-assisted proofs about probabilistic specifications of AMQs. We describe the framework for systematic decomposition of AMQs and their properties into a series of interfaces and reusable components. We implement our framework as a library in the Coq proof assistant and showcase it by encoding in it a number of non-trivial AMQs, such as Bloom filters, counting filters, quotient filters and blocked constructions, and mechanising the proofs of their probabilistic specifications. We demonstrate how AMQs encoded in our framework guarantee the absence of false negatives by construction. We also show how the proofs about probabilities of false positives for complex AMQs can be obtained by means of verified reduction to the implementations of their simpler counterparts. Finally, we provide a library of domain-specific theorems and tactics that allow a high degree of automation in probabilistic proofs.
We present a novel hashing strategy for approximate furthest neighbor search that selects projection bases using the data distribution. This strategy leads to an algorithm, which we call DrusillaHash, that is able to outperform existing approximate furthest neighbor strategies. Our strategy is motivated by an empirical study of the behavior of the furthest neighbor search problem, which lends intuition for where our algorithm is most useful. We also present a variant of the algorithm that gives an absolute approximation guarantee; to our knowledge, this is the first such approximate furthest neighbor hashing approach to give such a guarantee. Performance studies indicate that DrusillaHash can achieve comparable levels of approximation to other algorithms while giving up to an order of magnitude speedup. An implementation is available in the mlpack machine learning library (found at http://www.mlpack.org).
The membership problem asks to maintain a set $Ssubseteq[u]$, supporting insertions and membership queries, i.e., testing if a given element is in the set. A data structure that computes exact answers is called a dictionary. When a (small) false positive rate $epsilon$ is allowed, the data structure is called a filter. The space usages of the standard dictionaries or filters usually depend on the upper bound on the size of $S$, while the actual set can be much smaller. Pagh, Segev and Wieder (FOCS13) were the first to study filters with varying space usage based on the current $|S|$. They showed in order to match the space with the current set size $n=|S|$, any filter data structure must use $(1-o(1))n(log(1/epsilon)+(1-O(epsilon))loglog n)$ bits, in contrast to the well-known lower bound of $Nlog(1/epsilon)$ bits, where $N$ is an upper bound on $|S|$. They also presented a data structure with almost optimal space of $(1+o(1))n(log(1/epsilon)+O(loglog n))$ bits provided that $n>u^{0.001}$, with expected amortized constant insertion time and worst-case constant lookup time. In this work, we present a filter data structure with improvements in two aspects: - it has constant worst-case time for all insertions and lookups with high probability; - it uses space $(1+o(1))n(log (1/epsilon)+loglog n)$ bits when $n>u^{0.001}$, achieving optimal leading constant for all $epsilon=o(1)$. We also present a dictionary that uses $(1+o(1))nlog(u/n)$ bits of space, matching the optimal space in terms of the current size, and performs all operations in constant time with high probability.
The recent breakthrough paper by Calude et al. has given the first algorithm for solving parity games in quasi-polynomial time, where previously the best algorithms were mildly subexponential. We devise an alternative quasi-polynomial time algorithm based on progress measures, which allows us to reduce the space required from quasi-polynomial to nearly linear. Our key technical tools are a novel concept of ordered tree coding, and a succinct tree coding result that we prove using bounded adaptive multi-counters, both of which are interesting in their own right.
Let $mathbb{F}[X]$ be the polynomial ring over the variables $X={x_1,x_2, ldots, x_n}$. An ideal $I=langle p_1(x_1), ldots, p_n(x_n)rangle$ generated by univariate polynomials ${p_i(x_i)}_{i=1}^n$ is a emph{univariate ideal}. We study the ideal membership problem for the univariate ideals and show the following results. item Let $f(X)inmathbb{F}[ell_1, ldots, ell_r]$ be a (low rank) polynomial given by an arithmetic circuit where $ell_i : 1leq ileq r$ are linear forms, and $I=langle p_1(x_1), ldots, p_n(x_n)rangle$ be a univariate ideal. Given $vec{alpha}in {mathbb{F}}^n$, the (unique) remainder $f(X) pmod I$ can be evaluated at $vec{alpha}$ in deterministic time $d^{O(r)}cdot poly(n)$, where $d=max{deg(f),deg(p_1)ldots,deg(p_n)}$. This yields an $n^{O(r)}$ algorithm for minimum vertex cover in graphs with rank-$r$ adjacency matrices. It also yields an $n^{O(r)}$ algorithm for evaluating the permanent of a $ntimes n$ matrix of rank $r$, over any field $mathbb{F}$. Over $mathbb{Q}$, an algorithm of similar run time for low rank permanent is due to Barvinok[Bar96] via a different technique. item Let $f(X)inmathbb{F}[X]$ be given by an arithmetic circuit of degree $k$ ($k$ treated as fixed parameter) and $I=langle p_1(x_1), ldots, p_n(x_n)rangle$. We show in the special case when $I=langle x_1^{e_1}, ldots, x_n^{e_n}rangle$, we obtain a randomized $O^*(4.08^k)$ algorithm that uses $poly(n,k)$ space. item Given $f(X)inmathbb{F}[X]$ by an arithmetic circuit and $I=langle p_1(x_1), ldots, p_k(x_k) rangle$, membership testing is $W[1]$-hard, parameterized by $k$. The problem is $MINI[1]$-hard in the special case when $I=langle x_1^{e_1}, ldots, x_k^{e_k}rangle$.