
Reordering Rows for Better Compression: Beyond the Lexicographic Order

Published by: Daniel Lemire
Publication date: 2012
Research field: Informatics Engineering
Paper language: English





Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.
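As a rough illustration of the run-minimization objective, the sketch below counts the runs of identical values per column and greedily orders rows by Hamming distance. This is a generic Nearest Neighbor baseline, not the paper's Multiple Lists or Vortex heuristics, and all names and the toy table are illustrative.

```python
# Illustrative sketch only: a generic greedy Nearest Neighbor ordering,
# not the paper's Multiple Lists or Vortex heuristics.

def count_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    ncols = len(rows[0])
    runs = ncols  # the first row opens one run per column
    for prev, cur in zip(rows, rows[1:]):
        runs += sum(1 for c in range(ncols) if prev[c] != cur[c])
    return runs

def hamming(a, b):
    """Number of columns on which two rows differ (i.e., new runs started)."""
    return sum(1 for x, y in zip(a, b) if x != y)

def nearest_neighbor_order(rows):
    """Greedily append the unused row closest (in Hamming distance) to the
    last row placed. Quadratic in the number of rows."""
    remaining = list(rows)
    ordered = [remaining.pop(0)]
    while remaining:
        best = min(range(len(remaining)),
                   key=lambda i: hamming(ordered[-1], remaining[i]))
        ordered.append(remaining.pop(best))
    return ordered

table = [("ca", "red", 1), ("ny", "red", 1), ("ca", "blue", 1),
         ("ca", "red", 2), ("ny", "blue", 2), ("ca", "blue", 2)]
print("lexicographic order runs:", count_runs(sorted(table)))                           # 11
print("nearest-neighbor runs   :", count_runs(nearest_neighbor_order(sorted(table))))   # 9
```

On this toy table the greedy ordering already produces fewer column runs than the lexicographic sort; the paper's heuristics pursue the same objective at scale while trading compression against running time.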



Read also

We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.
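A minimal sketch of the decomposition step described above, assuming the libbpg command-line tools bpgenc/bpgdec are installed; file names are placeholders, and the learned, BPG-conditioned residual coder that is the paper's contribution is only indicated by a comment.

```python
# Sketch of the lossy-plus-residual decomposition (assumes libbpg's
# bpgenc/bpgdec command-line tools are on PATH; file names are placeholders).
import subprocess
import numpy as np
from PIL import Image

def decompose(png_path):
    subprocess.run(["bpgenc", "-o", "lossy.bpg", png_path], check=True)
    subprocess.run(["bpgdec", "-o", "recon.png", "lossy.bpg"], check=True)
    original = np.asarray(Image.open(png_path).convert("RGB"), dtype=np.int16)
    recon = np.asarray(Image.open("recon.png").convert("RGB"), dtype=np.int16)
    residual = original - recon  # the original is exactly recon + residual
    return recon, residual

# The paper's system entropy-codes `residual` with a CNN-based probability
# model conditioned on `recon`; storing lossy.bpg plus that bitstream
# reproduces the original image losslessly.
```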
Yuqing Zhu, Yanzhe An, Yuan Zi (2021)
We present a record-breaking result and lessons learned in practicing TPCx-IoT benchmarking for a real-world use case. We find that more system characteristics need to be benchmarked for its application to real-world use cases. We introduce an extension to the TPCx-IoT benchmark, covering fundamental requirements of time-series data management for IoT infrastructure. We characterize them as data compression and system scalability. To evaluate these two important features of IoT databases, we propose IoTDataBench and update four aspects of TPCx-IoT, i.e., data generation, workloads, metrics and test procedures. Preliminary evaluation results show systems that fail to effectively compress data or flexibly scale can negatively affect the redesigned metrics, while systems with high compression ratios and linear scalability are rewarded in the final metrics. Such systems have the ability to scale up computing resources on demand and can thus save dollar costs.
Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets.
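A toy calculation (not Squish's Bayesian-network and arithmetic-coding pipeline) showing why exploiting attribute dependencies helps: a column that is nearly determined by another column costs far fewer bits under a conditional model. The city/country data below is hypothetical.

```python
# Toy illustration: bits per value needed for a column alone vs. conditioned
# on a correlated column. The data is hypothetical.
from collections import Counter
from math import log2

rows = [("paris", "fr"), ("paris", "fr"), ("lyon", "fr"),
        ("berlin", "de"), ("berlin", "de"), ("munich", "de")]

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# H(country): cost of coding the country column independently.
h_country = entropy([country for _, country in rows])

# H(country | city): average cost once the city is known.
by_city = {}
for city, country in rows:
    by_city.setdefault(city, []).append(country)
h_conditional = sum(len(v) / len(rows) * entropy(v) for v in by_city.values())

print(f"H(country)        = {h_country:.2f} bits/value")      # 1.00
print(f"H(country | city) = {h_conditional:.2f} bits/value")   # 0.00: city determines country
```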
Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best.
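A minimal sketch of the Roaring partitioning idea: 32-bit values are split by their upper 16 bits into chunks, and each chunk is stored either as a small array of 16-bit values (sparse) or as a 2^16-bit bitmap (dense). The 4096-element threshold matches the published design, but production implementations (e.g., CRoaring, RoaringBitmap) keep array containers sorted and add vectorized set operations that are omitted here.

```python
# Minimal sketch of the Roaring layout: chunk by the upper 16 bits, store each
# chunk as a small array when sparse and as a 2**16-bit bitmap when dense.
ARRAY_LIMIT = 4096  # threshold used by the published format

class TinyRoaring:
    def __init__(self):
        self.containers = {}  # upper 16 bits -> list of 16-bit values, or bytearray bitmap

    def add(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        c = self.containers.get(hi)
        if c is None:
            self.containers[hi] = [lo]
        elif isinstance(c, list):
            if lo not in c:
                c.append(lo)
                if len(c) > ARRAY_LIMIT:       # promote to a bitmap container
                    bitmap = bytearray(8192)   # 2**16 bits
                    for v in c:
                        bitmap[v >> 3] |= 1 << (v & 7)
                    self.containers[hi] = bitmap
        else:
            c[lo >> 3] |= 1 << (lo & 7)

    def __contains__(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        c = self.containers.get(hi)
        if c is None:
            return False
        if isinstance(c, list):
            return lo in c
        return bool(c[lo >> 3] & (1 << (lo & 7)))

bm = TinyRoaring()
for i in range(0, 100_000, 3):
    bm.add(i)
print(99_999 in bm, 99_998 in bm)  # True False
```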
System logs record detailed runtime information of software systems and are used as the main data source for many tasks around software engineering. As modern software systems are evolving into large scale and complex structures, logs have become one type of fast-growing big data in industry. In particular, such logs often need to be stored for a long time in practice (e.g., a year), in order to analyze recurrent problems or track security issues. However, archiving logs consumes a large amount of storage space and computing resources, which in turn incurs high operational cost. Data compression is essential to reduce the cost of log storage. Traditional compression tools (e.g., gzip) work well for general texts, but are not tailored for system logs. In this paper, we propose a novel and effective log compression method, namely logzip. Logzip is capable of extracting hidden structures from raw logs via fast iterative clustering and further generating coherent intermediate representations that allow for more effective compression. We evaluate logzip on five large log datasets of different system types, with a total of 63.6 GB in size. The results show that logzip can save about half of the storage space on average over traditional compression tools. Meanwhile, the design of logzip is highly parallel and only incurs negligible overhead. In addition, we share our industrial experience of applying logzip to Huawei's real products.
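A simplified illustration of the structure-extraction idea (logzip discovers templates via iterative clustering; a numeric regex stands in here): each line is split into a constant template and its variable parameters, and the two streams are compressed separately. The function names and sample log lines are illustrative.

```python
# Simplified structure-aware log compression: separate constant templates from
# variable parameters and compress the two streams independently.
# (logzip discovers templates by iterative clustering; a regex stands in here.)
import gzip
import re

NUMERIC = re.compile(r"\d+")

def split_line(line):
    """Return (template, parameters) with numeric fields replaced by <*>."""
    return NUMERIC.sub("<*>", line), NUMERIC.findall(line)

logs = [
    "connection from 10.0.0.1 port 51234 established in 12 ms",
    "connection from 10.0.0.2 port 51240 established in 9 ms",
    "connection from 10.0.0.1 port 51301 established in 15 ms",
]

templates, params = zip(*(split_line(line) for line in logs))
raw_stream = "\n".join(logs).encode()
structured_stream = ("\n".join(templates) + "\n"
                     + "\n".join(",".join(p) for p in params)).encode()

# On large, heterogeneous logs the highly repetitive template stream and the
# homogeneous parameter stream typically compress better than the raw text.
print("raw gzip size       :", len(gzip.compress(raw_stream)))
print("structured gzip size:", len(gzip.compress(structured_stream)))
```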