Fast Label Extraction in the CDAWG

Published by Djamal Belazzougui

Publication date: 2017

Research field: Informatics Engineering

Paper language: English

Authors: Djamal Belazzougui - Fabio Cunial

Data Structures and Algorithms


Abstract

The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log\log n)$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m\log\log n+\mathtt{occ})$ to $O(m+\mathtt{occ})$ in the time needed to locate all the $\mathtt{occ}$ occurrences of the pattern. We also reduce from $O(k\log\log n)$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(m\log\log n)$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
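As a rough illustration of what extracting characters from a straight-line program (SLP) looks like, the following minimal sketch reads $k$ characters from the expansion of a nonterminal by descending the grammar with an explicit stack. Everything here is an assumption made for illustration: the rule encoding, the names `expansion_lengths` and `extract`, and the toy grammar are hypothetical and are not the data structure of the paper. This naive descent costs time proportional to $k$ plus the height of the grammar, whereas the abstract states $O(k)$ for the authors' construction.

```python
from typing import List, Tuple, Union

# Hypothetical toy encoding: rule i is either a single character (terminal)
# or a pair (left, right) of ids of earlier rules (concatenation).
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string produced by each rule, computed bottom-up."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def extract(rules: List[Rule], lengths: List[int],
            start: int, offset: int, k: int) -> str:
    """Return up to k characters of the expansion of `start`, from `offset`."""
    out: List[str] = []
    # Each stack entry is (rule id, number of characters still to skip in it).
    stack = [(start, offset)]
    while stack and len(out) < k:
        rid, skip = stack.pop()
        rule = rules[rid]
        if isinstance(rule, str):
            if skip == 0:
                out.append(rule)
            continue
        left, right = rule
        if skip >= lengths[left]:
            # The whole left child is skipped: continue in the right child only.
            stack.append((right, skip - lengths[left]))
        else:
            # Visit the left child first, then the right child from its start.
            stack.append((right, 0))
            stack.append((left, skip))
    return "".join(out)


# Toy Fibonacci-style grammar; rule 5 expands to "abaababa".
rules: List[Rule] = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = expansion_lengths(rules)
print(extract(rules, lengths, 5, 3, 4))
```

The precomputed lengths let the descent skip whole subtrees before the requested offset, so only the path to the first character and the subtrees that actually contribute output are visited.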

115 - Djamal Belazzougui, Fabio Cunial 2017

Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T+e_{\overline{T}})$ words of space, where $\overline{T}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(\log\log n)$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(\log n)$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(\log n)$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might have independent interest.

Data Structures and Algorithms

Fast Multi-label Learning

135 - Xiuwen Gong, Dong Yuan, Wei Bao 2021

Embedding approaches have become one of the most pervasive techniques for multi-label classification. However, the training process of embedding methods usually involves a complex quadratic or semidefinite programming problem, or the model may even involve an NP-hard problem. Thus, such methods are prohibitive on large-scale applications. More importantly, much of the literature has already shown that the binary relevance (BR) method is usually good enough for some applications. Unfortunately, BR runs slowly due to its linear dependence on the size of the input data. The goal of this paper is to provide a simple method, yet with provable guarantees, which can achieve competitive performance without a complex training process. To achieve our goal, we provide a simple stochastic sketch strategy for multi-label classification and present theoretical results from both algorithmic and statistical learning perspectives. Our comprehensive empirical studies corroborate our theoretical findings and demonstrate the superiority of the proposed methods.

Machine Learning

The Landscape of Minimum Label Cut (Hedge Connectivity) Problem

135 - Rupei Xu, Andras Farago 2019

The Minimum Label Cut (or Hedge Connectivity) problem is defined as follows: given an undirected graph $G=(V, E)$ with $n$ vertices and $m$ edges, in which each edge is labeled (with one or multiple labels) from a label set $L=\{\ell_1,\ell_2, \dots, \ell_{|L|}\}$, and in which the edges may be weighted with weight set $W=\{w_1, w_2, \dots, w_m\}$, the label cut (hedge connectivity) problem asks for the minimum number of edge sets (each edge set, or hedge, consists of the edges with the same label) whose removal disconnects the source-sink pair of vertices or the whole graph, with minimum total weight (minimum cardinality for the unweighted version). This problem is more general than the edge connectivity and hypergraph edge connectivity problems and has many applications in MPLS, IP networks, synchronous optical networks, image segmentation, and other areas. However, due to limited communication between different communities, this problem was studied under different names, with some important existing literature citations missing, and sometimes with misleading results that contain errors. In this paper, we make a further investigation of this problem, give uniform definitions, fix existing errors, provide new insights and show some new results. Specifically, we show the relationship between the non-overlapping version (each edge only has one label) and the overlapping version (each edge has multiple labels), by fixing the error in the existing literature; the hardness and approximation performance of the weighted version versus the unweighted version; and some useful properties for further research.

Data Structures and Algorithms, Combinatorics, Optimization and Control

UniRE: A Unified Label Space for Entity Relation Extraction

86 - Yijun Wang, Changzhi Sun, Yuanbin Wu 2021

Many joint entity relation extraction models set up two separated label spaces for the two sub-tasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations. In this work, we propose to eliminate the different treatment of the two sub-tasks' label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cell's label, which unifies the learning of the two sub-tasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles from tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves competitive accuracy with the best extractor, and is faster.

Computation and Language, Machine Learning

A Linear Time Parameterized Algorithm for Node Unique Label Cover

123 - Daniel Lokshtanov, M. S. Ramanujan, Saket Saurabh 2016

The optimization version of the Unique Label Cover problem is at the heart of the Unique Games Conjecture which has played an important role in the proof of several tight inapproximability results. In recent years, this problem has been also studied extensively from the point of view of parameterized complexity. Cygan et al. [FOCS 2012] proved that this problem is fixed-parameter tractable (FPT) and Wahlstrom [SODA 2014] gave an FPT algorithm with an improved parameter dependence. Subsequently, Iwata, Wahlstrom and Yoshida [2014] proved that the edge version of Unique Label Cover can be solved in linear FPT-time. That is, there is an FPT algorithm whose dependence on the input-size is linear. However, such an algorithm for the node version of the problem was left as an open problem. In this paper, we resolve this question by presenting the first linear-time FPT algorithm for Node Unique Label Cover.

Data Structures and Algorithms
