The advent of massive datasets (and the consequent design of high-performing distributed storage systems) has reignited the interest of the scientific and engineering community in the design of lossless data compressors that achieve an effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed against compressed-space efficiency. Each of the existing implementations offers a fixed trade-off between space occupancy and decompression speed, so software engineers have to content themselves with picking the one that comes closest to the requirements of the application at hand. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading these two resources optimally by introducing the Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing that minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice versa). This way, software engineers can set their space (or time) requirements and then derive the LZ77 parsing that optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently, in O(n log^2 n) time and optimal linear space, within a small additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77 parsings of the input file. A preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory and practice.
We consider the problem of decoding a discrete signal of categorical variables from the observation of several histograms of pooled subsets of it. We present an Approximate Message Passing (AMP) algorithm for recovering the signal in the random dense setting where each observed histogram involves a random subset of entries of size proportional to $n$. We characterize the performance of the algorithm in the asymptotic regime where the number of observations $m$ tends to infinity proportionally to $n$, by deriving the corresponding State Evolution (SE) equations and studying their dynamics. We initiate the analysis of the multi-dimensional SE dynamics by proving their convergence to a fixed point, along with some further properties of the iterates. The analysis reveals sharp phase transition phenomena where the behavior of AMP changes from exact recovery to weak correlation with the signal as $m/n$ crosses a threshold. We derive formulae for the threshold in some special cases and show that they accurately match experimental behavior.
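The observation model described above can be illustrated with a minimal sketch. It only generates pooled histograms; it does not implement AMP or State Evolution. The function name `pooled_histograms`, the parameter `inclusion_prob`, and the choice of including each entry independently (so that pool sizes concentrate around `inclusion_prob * n`) are assumptions made for illustration, not details taken from the abstract.

```python
import random


def pooled_histograms(signal, num_categories, m, inclusion_prob=0.5, seed=None):
    """Generate m pooled histograms of a categorical signal.

    signal: list of ints in range(num_categories), length n.
    Each pool includes every entry independently with probability
    inclusion_prob (an assumed way to get pools of size proportional to n).
    Returns (pools, histograms), where histograms[k][c] counts how many
    entries of pool k carry category c.
    """
    rng = random.Random(seed)
    pools, histograms = [], []
    for _ in range(m):
        pool = [i for i in range(len(signal)) if rng.random() < inclusion_prob]
        hist = [0] * num_categories
        for i in pool:
            hist[signal[i]] += 1
        pools.append(pool)
        histograms.append(hist)
    return pools, histograms


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.randrange(3) for _ in range(20)]  # hidden signal, 3 categories
    pools, hists = pooled_histograms(x, num_categories=3, m=5, seed=1)
    for pool, hist in zip(pools, hists):
        print(len(pool), hist)
```

The decoder sees only the pools and their histograms and must recover `x`; the ratio `m / n` is the quantity whose threshold the abstract refers to.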
In this paper we discuss a novel data compression technique for binary symmetric sources based on the cavity method over a Galois field of order q (GF(q)). We present a scheme of low complexity and near-optimal empirical performance. The compression step is based on a reduction of sparse low-density parity-check codes over GF(q) and is carried out through the so-called reinforced belief-propagation equations. These reduced codes appear to induce a non-trivial geometrical modification of the space of codewords, which makes such compression computationally feasible. The computational complexity is O(d n q log q) per iteration, where d is the average degree of the check nodes and n is the number of bits. For our code ensemble, decompression can be done in time linear in the code's length by a simple leaf-removal algorithm.
The sliding window scheme is well known in information theory, computer science, prediction, and statistics. Let a source with unknown statistics generate a word $\ldots x_{-1} x_{0} x_{1} x_{2} \ldots$ over some alphabet $A$. At every moment $t$, $t = \ldots, -1, 0, 1, \ldots$, one stores the word (window) $x_{t-w} x_{t-w+1} \ldots x_{t-1}$, where $w$, $w \geq 1$, is called the window length. In the theory of universal coding, the code of $x_{t}$ depends on the source statistics estimated from the window; in the prediction problem, each letter $x_{t}$ is predicted using the information in the window; and so on. After that, the letter $x_{t}$ is appended to the window on the right, while $x_{t-w}$ is removed from the window. This is the sliding window scheme. It has two merits: it allows one (i) to estimate the source statistics quite precisely and (ii) to adapt the code when the source statistics change. However, the scheme has a drawback, namely the need to store the window (i.e. the word $x_{t-w} \ldots x_{t-1}$), which requires a large amount of memory for large $w$. A new scheme, named the Imaginary Sliding Window (ISW), is constructed. The gist of this scheme is that not the oldest element $x_{t-w}$ but rather a random one is removed from the window. This retains both merits of the sliding window while making it unnecessary to store the window, thus significantly decreasing the memory size.
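A minimal sketch of the idea, under stated assumptions: only per-letter counts summing to $w$ are kept, and when a new letter arrives the letter to remove is drawn with probability proportional to its current count, which mimics removing a uniformly random position of a window that is never stored. The class name `ImaginarySlidingWindow`, the even initial counts, and the frequency estimate `counts[a] / w` are illustrative choices, not details given in the abstract.

```python
import random


class ImaginarySlidingWindow:
    """Keeps letter counts of an imaginary window of length w (no window stored)."""

    def __init__(self, alphabet, w, seed=None):
        if w < 1:
            raise ValueError("window length w must be at least 1")
        self.alphabet = list(alphabet)
        self.w = w
        self.rng = random.Random(seed)
        # Assumed initialization: spread the w imaginary positions evenly.
        base, extra = divmod(w, len(self.alphabet))
        self.counts = {
            a: base + (1 if i < extra else 0) for i, a in enumerate(self.alphabet)
        }

    def estimate(self, a):
        """Empirical frequency of letter a in the imaginary window."""
        return self.counts[a] / self.w

    def update(self, x):
        """Remove a random letter (chosen proportionally to counts), then add x."""
        weights = [self.counts[a] for a in self.alphabet]
        removed = self.rng.choices(self.alphabet, weights=weights)[0]
        self.counts[removed] -= 1
        self.counts[x] += 1


if __name__ == "__main__":
    isw = ImaginarySlidingWindow("ab", w=16, seed=0)
    for letter in "a" * 200:
        isw.update(letter)
    print(isw.counts, isw.estimate("a"))
```

Memory use here depends on the alphabet size and on the number of bits needed to hold a count, not on storing $w$ letters, which is the saving the abstract describes.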
This paper provides an extensive study of the behavior of the best achievable rate (and other related fundamental limits) in variable-length lossless compression. In the non-asymptotic regime, the fundamental limits of fixed-to-variable lossless compression with and without prefix constraints are shown to be tightly coupled. Several precise, quantitative bounds are derived, connecting the distribution of the optimal codelengths to the source information spectrum, and an exact analysis of the best achievable rate for arbitrary sources is given. Fine asymptotic results are proved for arbitrary (not necessarily prefix) compressors on general mixing sources. Non-asymptotic, explicit Gaussian approximation bounds are established for the best achievable rate on Markov sources. The source dispersion and the source varentropy rate are defined and characterized. Together with the entropy rate, the varentropy rate serves to tightly approximate the fundamental non-asymptotic limits of fixed-to-variable compression for all but very small blocklengths.
Suppose there is a large file that should be transmitted (or stored) and there are several (say, m) admissible data compressors. It seems natural to try all the compressors and then choose the best one, i.e. the one that gives the shortest compressed file, and then to transfer (or store) the index of the best compressor (which requires log m bits) together with the compressed file. The only problem is the running time, which increases substantially because the file must be compressed m times in order to find the best compressor. We propose a method that encodes the file with the optimal compressor but uses relatively little additional time: the ratio of this extra time to the total computation time can be bounded by an arbitrary positive constant. More generally, in many situations it may be necessary to find the best data compressor in a given set, which is often done by comparing the candidates empirically. One of the goals of this work is to turn such a selection process into a part of the data compression method, automating and optimizing it.
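The naive baseline described at the start of this abstract (compress with every candidate, keep the shortest output, and prefix the index of the winner) can be sketched as follows. This is not the proposed time-saving method. The choice of `zlib`, `bz2`, and `lzma` as the m = 3 candidates, the function names, and the use of a whole byte for the index instead of log m bits are assumptions made for illustration.

```python
import bz2
import lzma
import zlib

# Assumed candidate set: (compress, decompress) pairs from the standard library.
COMPRESSORS = [
    (zlib.compress, zlib.decompress),
    (bz2.compress, bz2.decompress),
    (lzma.compress, lzma.decompress),
]


def encode_best(data: bytes) -> bytes:
    """Compress data with every candidate and keep the shortest result.

    The output is one index byte followed by the winning compressed stream.
    Cost: the file is compressed len(COMPRESSORS) times.
    """
    candidates = [compress(data) for compress, _ in COMPRESSORS]
    best = min(range(len(candidates)), key=lambda i: len(candidates[i]))
    return bytes([best]) + candidates[best]


def decode_best(blob: bytes) -> bytes:
    """Read the index byte and decompress with the matching candidate."""
    _, decompress = COMPRESSORS[blob[0]]
    return decompress(blob[1:])


if __name__ == "__main__":
    payload = b"abracadabra " * 1000
    blob = encode_best(payload)
    assert decode_best(blob) == payload
    print("chosen compressor index:", blob[0], "size:", len(blob))
```

The m-fold compression cost visible in `encode_best` is exactly the overhead that the abstract's method aims to reduce to an arbitrarily small fraction of the total time.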