LCS Graph Kernel Based on Wasserstein Distance in Longest Common Subsequence Metric Space

53 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Hiroyuki Kasai

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jianming Huang - Zhongxi Fang - Hiroyuki Kasai

التعلم الآلي الذكاء الاصطناعي بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

For graph classification tasks, many methods use a common strategy to aggregate information of vertex neighbors. Although this strategy provides an efficient means of extracting graph topological features, it brings excessive amounts of information that might greatly reduce its accuracy when dealing with large-scale neighborhoods. Learning graphs using paths or walks will not suffer from this difficulty, but many have low utilization of each path or walk, which might engender information loss and high computational costs. To solve this, we propose a graph kernel using a longest common subsequence (LCS kernel) to compute more comprehensive similarity between paths and walks, which resolves substructure isomorphism difficulties. We also combine it with optimal transport theory to extract more in-depth features of graphs. Furthermore, we propose an LCS metric space and apply an adjacent point merge operation to reduce its computational costs. Finally, we demonstrate that our proposed method outperforms many state-of-the-art graph kernel methods.

قيم البحث

544 - Paola Bonizzoni , Gianluca Della Vedova , Riccardo Dondi 2009

In this work, we consider a variant of the classical Longest Common Subsequence problem called Doubly-Constrained Longest Common Subsequence (DC-LCS). Given two strings s1 and s2 over an alphabet A, a set C_s of strings, and a function Co from A to N , the DC-LCS problem consists in finding the longest subsequence s of s1 and s2 such that s is a supersequence of all the strings in Cs and such that the number of occurrences in s of each symbol a in A is upper bounded by Co(a). The DC-LCS problem provides a clear mathematical formulation of a sequence comparison problem in Computational Biology and generalizes two other constrained variants of the LCS problem: the Constrained LCS and the Repetition-Free LCS. We present two results for the DC-LCS problem. First, we illustrate a fixed-parameter algorithm where the parameter is the length of the solution. Secondly, we prove a parameterized hardness result for the Constrained LCS problem when the parameter is the number of the constraint strings and the size of the alphabet A. This hardness result also implies the parameterized hardness of the DC-LCS problem (with the same parameters) and its NP-hardness when the size of the alphabet is constant.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

Lower Bounds and Improved Algorithms for Asymmetric Streaming Edit Distance and Longest Common Subsequence

154 - Xin Li , Yu Zheng 2021

In this paper, we study edit distance (ED) and longest common subsequence (LCS) in the asymmetric streaming model, introduced by Saks and Seshadhri [SS13]. As an intermediate model between the random access model and the streaming model, this model a llows one to have streaming access to one string and random access to the other string. Our first main contribution is a systematic study of space lower bounds for ED and LCS in the asymmetric streaming model. Previously, there are no explicitly stated results in this context, although some lower bounds about LCS can be inferred from the lower bounds for longest increasing subsequence (LIS) in [SW07][GG10][EJ08]. Yet these bounds only work for large alphabet size. In this paper, we develop several new techniques to handle ED in general and LCS for small alphabet size, thus establishing strong lower bounds for both problems. In particular, our lower bound for ED provides an exponential separation between edit distance and Hamming distance in the asymmetric streaming model. Our lower bounds also extend to LIS and longest non-decreasing sequence (LNS) in the standard streaming model. Together with previous results, our bounds provide an almost complete picture for these two problems. As our second main contribution, we give improved algorithms for ED and LCS in the asymmetric streaming model. For ED, we improve the space complexity of the constant factor approximation algorithms in [FHRS20][CJLZ20] from $tilde{O}(frac{n^delta}{delta})$ to $O(frac{d^delta}{delta};mathsf{polylog}(n))$, where $n$ is the length of each string and $d$ is the edit distance between the two strings. For LCS, we give the first $1/2+epsilon$ approximation algorithm with space $n^{delta}$ for any constant $delta>0$, over a binary alphabet.

بنى وهياكل البيانات والخوارزميات

Heuristic algorithms for the Longest Filled Common Subsequence Problem

121 - Radu Stefan Mincu University ofn Bucharest 2019

At CPM 2017, Castelli et al. define and study a new variant of the Longest Common Subsequence Problem, termed the Longest Filled Common Subsequence Problem (LFCS). For the LFCS problem, the input consists of two strings $A$ and $B$ and a multiset of characters $mathcal{M}$. The goal is to insert the characters from $mathcal{M}$ into the string $B$, thus obtaining a new string $B^*$, such that the Longest Common Subsequence (LCS) between $A$ and $B^*$ is maximized. Casteli et al. show that the problem is NP-hard and provide a 3/5-approximation algorithm for the problem. In this paper we study the problem from the experimental point of view. We introduce, implement and test new heuristic algorithms and compare them with the approximation algorithm of Casteli et al. Moreover, we introduce an Integer Linear Program (ILP) model for the problem and we use the state of the art ILP solver, Gurobi, to obtain exact solution for moderate sized instances.

بنى وهياكل البيانات والخوارزميات

A Linear-Time $n^{0.4}$-Approximation for Longest Common Subsequence

106 - Karl Bringmann , Vincent Cohen-Addad , 2021

We consider the classic problem of computing the Longest Common Subsequence (LCS) of two strings of length $n$. While a simple quadratic algorithm has been known for the problem for more than 40 years, no faster algorithm has been found despite an ex tensive effort. The lack of progress on the problem has recently been explained by Abboud, Backurs, and Vassilevska Williams [FOCS15] and Bringmann and Kunnemann [FOCS15] who proved that there is no subquadratic algorithm unless the Strong Exponential Time Hypothesis fails. This has led the community to look for subquadratic approximation algorithms for the problem. Yet, unlike the edit distance problem for which a constant-factor approximation in almost-linear time is known, very little progress has been made on LCS, making it a notoriously difficult problem also in the realm of approximation. For the general setting, only a naive $O(n^{varepsilon/2})$-approximation algorithm with running time $tilde{O}(n^{2-varepsilon})$ has been known, for any constant $0 < varepsilon le 1$. Recently, a breakthrough result by Hajiaghayi, Seddighin, Seddighin, and Sun [SODA19] provided a linear-time algorithm that yields a $O(n^{0.497956})$-approximation in expectation; improving upon the naive $O(sqrt{n})$-approximation for the first time. In this paper, we provide an algorithm that in time $O(n^{2-varepsilon})$ computes an $tilde{O}(n^{2varepsilon/5})$-approximation with high probability, for any $0 < varepsilon le 1$. Our result (1) gives an $tilde{O}(n^{0.4})$-approximation in linear time, improving upon the bound of Hajiaghayi, Seddighin, Seddighin, and Sun, (2) provides an algorithm whose approximation scales with any subquadratic running time $O(n^{2-varepsilon})$, improving upon the naive bound of $O(n^{varepsilon/2})$ for any $varepsilon$, and (3) instead of only in expectation, succeeds with high probability.

بنى وهياكل البيانات والخوارزميات

An $O(n^2)$ Algorithm for Computing Longest Common Cyclic Subsequence

724 - Shihabur Rahman Chowdhury , Masud Hasan , Sumaiya Iqbal 2009

The {em longest common subsequence (LCS)} problem is a classic and well-studied problem in computer science. LCS is a central problem in stringology and finds broad applications in text compression, error-detecting codes and biological sequence compa rison. However, in numerous contexts, words represent cyclic sequences of symbols and LCS must be generalized to consider all circular shifts of the strings. This occurs especially in computational biology when genetic material is sequenced form circular DNA or RNA molecules. This initiates the problem of {em longest common cyclic subsequence (LCCS)} which finds the longest subsequence between all circular shifts of two strings. In this paper, we give an $O(n^2)$ algorithm for solving LCCS problem where $n$ is the number of symbols in the strings.

بنى وهياكل البيانات والخوارزميات