How to Store a Random Walk


Abstract in English

Motivated by storage applications, we study the following data structure problem: An encoder wishes to store a collection of jointly-distributed files $overline{X}:=(X_1,X_2,ldots, X_n) sim mu$ which are emph{correlated} ($H_mu(overline{X}) ll sum_i H_mu(X_i)$), using as little (expected) memory as possible, such that each individual file $X_i$ can be recovered quickly with few (ideally constant) memory accesses. In the case of independent random files, a dramatic result by Pat (FOCS08) and subsequently by Dodis, Pat and Thorup (STOC10) shows that it is possible to store $overline{X}$ using just a emph{constant} number of extra bits beyond the information-theoretic minimum space, while at the same time decoding each $X_i$ in constant time. However, in the (realistic) case where the files are correlated, much weaker results are known, requiring at least $Omega(n/polylg n)$ extra bits for constant decoding time, even for simple joint distributions $mu$. We focus on the natural case of compressingemph{Markov chains}, i.e., storing a length-$n$ random walk on any (possibly directed) graph $G$. Denoting by $kappa(G,n)$ the number of length-$n$ walks on $G$, we show that there is a succinct data structure storing a random walk using $lg_2 kappa(G,n) + O(lg n)$ bits of space, such that any vertex along the walk can be decoded in $O(1)$ time on a word-RAM. For the harder task of matching the emph{point-wise} optimal space of the walk, i.e., the empirical entropy $sum_{i=1}^{n-1} lg (deg(v_i))$, we present a data structure with $O(1)$ extra bits at the price of $O(lg n)$ decoding time, and show that any improvement on this would lead to an improved solution on the long-standing Dictionary problem. All of our data structures support the emph{online} version of the problem with constant update and query time.

Download