ترغب بنشر مسار تعليمي؟ اضغط هنا

Fast Label Extraction in the CDAWG

83   0   0.0 ( 0 )
 نشر من قبل Djamal Belazzougui
 تاريخ النشر 2017
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(mlog{log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(mlog{log{n}}+mathtt{occ})$ to $O(m+mathtt{occ})$ in the time needed to locate all the $mathtt{occ}$ occurrences of the pattern. We also reduce from $O(klog{log{n}})$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(mlog{log{n}})$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.



قيم البحث

اقرأ أيضاً

Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T+e_{{overline{T}}})$ words of space, where ${overline{T}}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(log{log{n}})$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(log{n})$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(log{n})$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might have independent interest.
135 - Xiuwen Gong , Dong Yuan , Wei Bao 2021
Embedding approaches have become one of the most pervasive techniques for multi-label classification. However, the training process of embedding methods usually involves a complex quadratic or semidefinite programming problem, or the model may even i nvolve an NP-hard problem. Thus, such methods are prohibitive on large-scale applications. More importantly, much of the literature has already shown that the binary relevance (BR) method is usually good enough for some applications. Unfortunately, BR runs slowly due to its linear dependence on the size of the input data. The goal of this paper is to provide a simple method, yet with provable guarantees, which can achieve competitive performance without a complex training process. To achieve our goal, we provide a simple stochastic sketch strategy for multi-label classification and present theoretical results from both algorithmic and statistical learning perspectives. Our comprehensive empirical studies corroborate our theoretical findings and demonstrate the superiority of the proposed methods.
135 - Rupei Xu , Andras Farago 2019
Minimum Label Cut (or Hedge Connectivity) problem is defined as follows: given an undirected graph $G=(V, E)$ with $n$ vertices and $m$ edges, in which, each edge is labeled (with one or multiple labels) from a label set $L={ell_1,ell_2, ..., ell_{|L |}}$, the edges may be weighted with weight set $W ={w_1, w_2, ..., w_m}$, the label cut problem(hedge connectivity) problem asks for the minimum number of edge sets(each edge set (or hedge) is the edges with the same label) whose removal disconnects the source-sink pair of vertices or the whole graph with minimum total weights(minimum cardinality for unweighted version). This problem is more general than edge connectivity and hypergraph edge connectivity problem and has a lot of applications in MPLS, IP networks, synchronous optical networks, image segmentation, and other areas. However, due to limited communications between different communities, this problem was studied in different names, with some important existing literature citations missing, or sometimes the results are misleading with some errors. In this paper, we make a further investigation of this problem, give uniform definitions, fix existing errors, provide new insights and show some new results. Specifically, we show the relationship between non-overlapping version(each edge only has one label) and overlapping version(each edge has multiple labels), by fixing the error in the existing literature; hardness and approximation performance between weighted version and unweighted version and some useful properties for further research.
Many joint entity relation extraction models setup two separated label spaces for the two sub-tasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations . In this work, we propose to eliminate the different treatment on the two sub-tasks label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cells label, which unifies the learning of two sub-tasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles from tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves competitive accuracy with the best extractor, and is faster.
The optimization version of the Unique Label Cover problem is at the heart of the Unique Games Conjecture which has played an important role in the proof of several tight inapproximability results. In recent years, this problem has been also studied extensively from the point of view of parameterized complexity. Cygan et al. [FOCS 2012] proved that this problem is fixed-parameter tractable (FPT) and Wahlstrom [SODA 2014] gave an FPT algorithm with an improved parameter dependence. Subsequently, Iwata, Wahlstrom and Yoshida [2014] proved that the edge version of Unique Label Cover can be solved in linear FPT-time. That is, there is an FPT algorithm whose dependence on the input-size is linear. However, such an algorithm for the node version of the problem was left as an open problem. In this paper, we resolve this question by presenting the first linear-time FPT algorithm for Node Unique Label Cover.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا