System logs record detailed runtime information about software systems and serve as the main data source for many software engineering tasks. As modern software systems grow in scale and complexity, logs have become a fast-growing type of big data in industry. In practice, such logs often need to be stored for a long time (e.g., a year) in order to analyze recurrent problems or track security issues. However, archiving logs consumes a large amount of storage space and computing resources, which in turn incurs high operational cost. Data compression is essential to reduce the cost of log storage. Traditional compression tools (e.g., gzip) work well for general text but are not tailored to system logs. In this paper, we propose a novel and effective log compression method, namely logzip. Logzip extracts hidden structures from raw logs via fast iterative clustering and then generates coherent intermediate representations that allow for more effective compression. We evaluate logzip on five large log datasets of different system types, with a total size of 63.6 GB. The results show that logzip can save about half of the storage space on average over traditional compression tools. Meanwhile, the design of logzip is highly parallel and incurs only negligible overhead. In addition, we share our industrial experience of applying logzip to Huawei's real products.
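The idea of separating a log's fixed structure from its variable parts before compression can be illustrated with a minimal Python sketch. This is not logzip's implementation: the abstract describes fast iterative clustering, whereas the sketch below swaps in a much cruder heuristic (any whitespace-delimited token containing a digit is treated as a parameter). The function names `encode`, `decode`, `compress`, and `decompress` and the JSON layout are hypothetical, chosen only for illustration.

```python
import gzip
import json
import re

# Assumption for illustration only: a token that contains a digit is a
# variable parameter. logzip itself derives structure by iterative clustering.
_HAS_DIGIT = re.compile(r"\d")


def encode(lines):
    """Build an intermediate representation: templates, template IDs, parameters."""
    templates = {}  # template (tuple of tokens, None = parameter slot) -> ID
    ids, params = [], []
    for line in lines:
        # Keep the whitespace separators as tokens so decoding is exact.
        tokens = re.split(r"(\s+)", line)
        key = tuple(None if _HAS_DIGIT.search(t) else t for t in tokens)
        ids.append(templates.setdefault(key, len(templates)))
        params.append([t for t in tokens if _HAS_DIGIT.search(t)])
    return {"templates": [list(k) for k in templates], "ids": ids, "params": params}


def decode(obj):
    """Rebuild the original lines by filling parameter slots in order."""
    lines = []
    for tid, line_params in zip(obj["ids"], obj["params"]):
        slot_values = iter(line_params)
        template = obj["templates"][tid]
        lines.append("".join(next(slot_values) if t is None else t for t in template))
    return lines


def compress(lines):
    """Serialize the intermediate representation and hand it to gzip."""
    return gzip.compress(json.dumps(encode(lines)).encode("utf-8"))


def decompress(blob):
    """Inverse of compress: gunzip, parse, and rebuild the lines."""
    return decode(json.loads(gzip.decompress(blob).decode("utf-8")))
```

Calling `decompress(compress(lines))` on a list of log lines returns the same list, so the sketch is lossless. Whether this representation actually compresses better than running gzip on the raw text depends on the data and on the quality of the extracted templates; the sketch makes no claim about reproducing the savings reported above.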
Organizations routinely accumulate semi-structured log datasets generated as the output of code; these datasets remain unused and uninterpreted, and occupy wasted space. This phenomenon has been colloquially referred to as the data lake problem. One app
We outline the approach being developed in the neuGRID project to use provenance management techniques for the purposes of capturing and preserving the provenance data that emerges in the specification and execution of workflows in biomedical analyse
Given a dataset and an existing clustering as input, alternative clustering aims to find an alternative partition. One of the state-of-the-art approaches is Kernel Dimension Alternative Clustering (KDAC). We propose a novel Iterative Spectral Method
We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NC
In recent years, emerging hardware storage technologies have focused on divergent goals: better performance or lower cost-per-bit of storage. Correspondingly, data systems that employ these new technologies are optimized either to be fast (but expens