A new distance based on minimal absent words and applications to biological sequences

56 0 0.0 ( 0 )

Download Cite

Added by Giuseppa Castiglione

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Giuseppa Castiglione - Jia Gao - Sabrina Mantaci

Formal Languages and Automata Theory

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

A minimal absent word of a sequence x, is a sequence yt hat is not a factorof x, but all of its proper factors are factors of x as well. The set of minimal absent words uniquely defines the sequence itself. In recent times minimal absent words have been used in order to compare sequences. In fact, to do this, one can compare the sets of their minimal absent words. Chairungasee and Crochemorein [2] define a distance between pairs of sequences x and y, where the symmetric difference of the sets of minimal absent words of x and y is involved. Here, weconsider a different distance, introduced in [1], based on a specific subset of such symmetric difference that, in our opinion, better capture the different features ofthe considered sequences. We show the result of some experiments where the distance is tested on a dataset of genetic sequences by 11 living species, in order to compare the new distance with the ones existing in literature.

rate research

Minimal Forbidden Factors of Circular Words

80 - Gabriele Fici , Antonio Restivo , Laura Rizzo 2017

Minimal forbidden factors are a useful tool for investigating properties of words and languages. Two factorial languages are distinct if and only if they have different (antifactorial) sets of minimal forbidden factors. There exist algorithms for computing the minimal forbidden factors of a word, as well as of a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an algorithm that, given the trie recognizing a finite antifactorial language $M$, computes a DFA recognizing the language whose set of minimal forbidden factors is $M$. In the same paper, they showed that the obtained DFA is minimal if the input trie recognizes the minimal forbidden factors of a single word. We generalize this result to the case of a circular word. We discuss several combinatorial properties of the minimal forbidden factors of a circular word. As a byproduct, we obtain a formal definition of the factor automaton of a circular word. Finally, we investigate the case of minimal forbidden factors of the circular Fibonacci words.

Formal Languages and Automata Theory Discrete Mathematics

On Sets of Words of Rank Two

132 - Giuseppa Castiglione , Gabriele Fici , Antonio Restivo 2018

Given a (finite or infinite) subset $X$ of the free monoid $A^*$ over a finite alphabet $A$, the rank of $X$ is the minimal cardinality of a set $F$ such that $X subseteq F^*$. A submonoid $M$ generated by $k$ elements of $A^*$ is $k$-maximal if there does not exist another submonoid generated by at most $k$ words containing $M$. We call a set $X subseteq A^*$ primitive if it is the basis of a $|X|$-maximal submonoid. This extends the notion of primitive word: indeed, ${w}$ is a primitive set if and only if $w$ is a primitive word. By definition, for any set $X$, there exists a primitive set $Y$ such that $X subseteq Y^*$. The set $Y$ is therefore called a primitive root of $X$. As a main result, we prove that if a set has rank $2$, then it has a unique primitive root. This result cannot be extended to sets of rank larger than 2. For a single word $w$, we say that the set ${x,y}$ is a {em binary root} of $w$ if $w$ can be written as a concatenation of copies of $x$ and $y$ and ${x,y}$ is a primitive set. We prove that every primitive word $w$ has at most one binary root ${x,y}$ such that $|x|+|y|<sqrt{|w|}$. That is, the binary root of a word is unique provided the length of the word is sufficiently large with respect to the size of the root. Our results are also compared to previous approaches that investigate pseudo-repetitions, where a morphic involutive function $theta$ is defined on $A^*$. In this setting, the notions of $theta$-power, $theta$-primitive and $theta$-root are defined, and it is shown that any word has a unique $theta$-primitive root. This result can be obtained with our approach by showing that a word $w$ is $theta$-primitive if and only if ${w, theta(w)}$ is a primitive set.

Formal Languages and Automata Theory Discrete Mathematics Combinatorics

Ranked enumeration of MSO logic on words

112 - Pierre Bourhis , Alejandro Grez , Louis Jachiet 2020

In the last years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order how the solutions are given and this order usually depends on the techniques used and not on the relevance for the user. In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows to express MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework which subsumes, for instance, document spanners and regular complex event processing queries and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm is based on using functional data structures, in particular, by extending functional Brodal queues to suit with the ranked enumeration of MSO on words.

Formal Languages and Automata Theory Data Structures and Algorithms

On balanced sequences and their critical exponent

50 - Francesco Dolce , Lubomira Dvorakova , Edita Pelantova 2021

We study aperiodic balanced sequences over finite alphabets. A sequence v of this type is fully characterised by a Sturmian sequence u and two constant gap sequences y and y. We show that the language of v is eventually dendric and we focus on return words to its factors. We deduce a method computing critical exponent and asymptotic critical exponent of balanced sequences provided the associated Sturmian sequence u has a quadratic slope. The method is based on looking for the shortest return words to bispecial factors in v. We illustrate our method on several examples, in particular we confirm a conjecture of Rampersad, Shallit and Vandomme that two specific sequences have the least critical exponent among all balanced sequences over 9- resp. 10-letter alphabets.

Formal Languages and Automata Theory Combinatorics

Transfinite Lyndon words

57 - Olivier Carton , Luc Boasson 2018

In this paper, we extend the notion of Lyndon word to transfinite words. We prove two main results. We first show that, given a transfinite word, there exists a unique factorization in Lyndon words that are densely non-increasing, a relaxation of the condition used in the case of finite words. In the annex, we prove that the factorization of a rational word has a special form and that it can be computed from a rational expression describing the word.

Formal Languages and Automata Theory