
The advent of massive datasets (and the consequent design of high-performing distributed storage systems) has reignited the interest of the scientific and engineering community in the design of lossless data compressors that achieve an effective compression ratio together with very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed for compressed-space efficiency. Each of the existing implementations offers a different trade-off between space occupancy and decompression speed, so software engineers have to content themselves with picking the one that comes closest to the requirements of the application at hand. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading the consumption of these two resources optimally and in a principled way, by introducing the Bicriteria LZ77-Parsing problem, which formalizes what data compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice versa). This way, the software engineer can set their space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently in O(n log^2 n) time and optimal linear space, within a small additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77 parsings of the input file. A preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory and practice.
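To make the bicriteria formulation concrete, the toy Python sketch below models an LZ77 parsing as a path in a DAG of phrase candidates and minimises the encoded size subject to a bound on the number of phrases, used here as a crude stand-in for decompression time. The phrase costs, window size and brute-force dynamic program are illustrative assumptions only; they do not reflect the paper's cost model or its O(n log^2 n) algorithm.

```python
# Toy illustration of the Bicriteria LZ77-Parsing idea: every LZ77 parsing of a
# text is a path in a DAG whose nodes are text positions and whose edges are
# admissible phrases, weighted by an encoded size in bits.  Decoding cost is
# crudely modelled as "one unit per phrase"; the real paper uses a far more
# refined cost model, so this brute-force dynamic program only conveys the
# problem statement, not the paper's algorithm.

def phrase_candidates(text, i, window=1 << 10):
    """Yield (length, bit_cost) for phrases starting at position i:
    a single literal, plus every copy of an earlier occurrence (toy costs)."""
    yield 1, 9                          # literal: flag bit + 8-bit symbol
    start = max(0, i - window)
    for j in range(start, i):
        l = 0
        while i + l < len(text) and text[j + l] == text[i + l]:
            l += 1
        if l >= 2:
            yield l, 1 + 16 + 8         # flag + fixed-size distance and length fields

def bicriteria_parse(text, max_phrases):
    """Minimise total bits over all parsings using at most `max_phrases` phrases
    (our crude stand-in for a decompression-time bound)."""
    n = len(text)
    INF = float("inf")
    # best[i][k] = min bits needed to encode text[:i] with exactly k phrases
    best = [[INF] * (max_phrases + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n):
        for k in range(max_phrases):
            if best[i][k] == INF:
                continue
            for length, bits in phrase_candidates(text, i):
                cand = best[i][k] + bits
                if cand < best[i + length][k + 1]:
                    best[i + length][k + 1] = cand
    return min(best[n])                 # fewest bits within the phrase budget (inf if infeasible)

if __name__ == "__main__":
    text = "abracadabra abracadabra"
    for budget in (10, 14, 23):         # tighter budget = stricter decompression-time bound
        print(budget, bicriteria_parse(text, budget))
```

With a richer (e.g. distance-dependent) cost model, tightening the phrase budget genuinely trades compressed space for decompression speed, which is exactly the trade-off the Bicriteria LZ77-Parsing problem optimizes.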
We address the problem of cross-referencing text fragments with Wikipedia pages, in a way that resolves synonymy and polysemy issues accurately and efficiently. We take inspiration from a recent flow of work [Cucerzan 2007, Mihalcea and Csomai 2007, Milne and Witten 2008, Chakrabarti et al. 2009], and extend their scenario from the annotation of long documents to the annotation of short texts, such as snippets of search-engine results, tweets, news, blogs, etc. These short and poorly composed texts pose new challenges in terms of efficiency and effectiveness of the annotation process, which we address by designing and engineering TAGME, the first system that performs an accurate and on-the-fly annotation of these short textual fragments. A large set of experiments shows that TAGME outperforms state-of-the-art algorithms when they are adapted to work on short texts, and that it is fast and competitive on long texts.
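As a rough illustration of the annotation task itself (and emphatically not of TAGME's actual pipeline), the hypothetical Python sketch below links anchors found in a short text to Wikipedia pages using a toy anchor dictionary, a prior "commonness" score and a simple context-overlap heuristic; every name and number in it is invented for the example.

```python
# Hypothetical sketch of the entity-annotation task: map anchors occurring in a
# short text to Wikipedia pages, resolving polysemy with a crude score.
# The dictionaries and the scoring are toy assumptions, not TAGME's data
# structures or disambiguation algorithm.

# toy anchor dictionary: surface form -> {candidate page: prior probability}
ANCHORS = {
    "jaguar": {"Jaguar_(animal)": 0.55, "Jaguar_Cars": 0.45},
    "amazon": {"Amazon_River": 0.40, "Amazon_(company)": 0.60},
}

# toy context keywords associated with each candidate page
CONTEXT = {
    "Jaguar_(animal)": {"rainforest", "predator", "species"},
    "Jaguar_Cars": {"car", "engine", "luxury"},
    "Amazon_River": {"rainforest", "river", "basin"},
    "Amazon_(company)": {"shopping", "cloud", "retail"},
}

def annotate(text):
    """Return (anchor, page) pairs for every dictionary anchor occurring in text."""
    tokens = set(text.lower().split())
    annotations = []
    for anchor, candidates in ANCHORS.items():
        if anchor not in tokens:
            continue
        # score = prior commonness + overlap with the rest of the short text
        best = max(candidates,
                   key=lambda page: candidates[page] + len(CONTEXT[page] & tokens))
        annotations.append((anchor, best))
    return annotations

if __name__ == "__main__":
    print(annotate("the jaguar hunts along the amazon rainforest"))
```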
In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size $n$, they use only $n$ bits of disk working space while all previous approaches use $\Theta(n \log n)$ bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses $\Theta(n)$ bits of working space, and a lightweight \emph{internal-memory} algorithm for computing the BWT which is the fastest in the literature when the available working space is $o(n)$ bits. Finally, we prove \emph{lower} bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space $\times$ number of passes over the disk data.
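For readers unfamiliar with the transform itself, here is a minimal in-memory Python sketch of the BWT and its naive inversion. It is only meant to fix notation and bears no resemblance to the lightweight, scan-based external-memory algorithms described above, which never materialise the rotation table.

```python
# Naive Burrows-Wheeler transform and inverse, fully in memory (O(n^2 log n) time),
# purely as a reference point for what the external-memory algorithms compute.

def bwt(s, terminator="\0"):
    """Return the BWT of s; the terminator is assumed not to occur in s."""
    s += terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def ibwt(b, terminator="\0"):
    """Invert the BWT by repeatedly prepending b to the sorted table."""
    table = [""] * len(b)
    for _ in range(len(b)):
        table = sorted(c + row for c, row in zip(b, table))
    row = next(r for r in table if r.endswith(terminator))
    return row.rstrip(terminator)

if __name__ == "__main__":
    text = "banana"
    transformed = bwt(text)
    assert ibwt(transformed) == text
    print(repr(transformed))
```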
