Efficient Index for Weighted Sequences

52 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jakub Radoszewski

تاريخ النشر 2016

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Carl Barton - Tomasz Kociumaka - Solon P. Pissis

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time improving upon the state of the art by a factor of $z log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.

قيم البحث

اقرأ أيضاً

Minimum Weighted Szeged Index Trees

221 - Pavol Hell , Cesar Hernandez-Cruz , Seyyed Aliasghar Hosseini 2020

Weighted Szeged index is a recently introduced extension of the well-known Szeged index. In this paper, we present a new tool to analyze and characterize minimum weighted Szeged index trees. We exhibit the best trees with up to 81 vertices and use th is information, together with our results, to propose various conjectures on the structure of minimum weighted Szeged index trees.

التوافقية

A Weighted Correlation Index for Rankings with Ties

216 - Sebastiano Vigna 2014

Understanding the correlation between two different scores for the same set of items is a common problem in information retrieval, and the most commonly used statistics that quantifies this correlation is Kendalls $tau$. However, the standard definit ion fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular while comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in web and social graphs. We propose to extend Kendalls definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an $O(nlog n)$ algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data.

الشبكات الاجتماعية والمعلومات استرجاع المعلومات

Improved Online Algorithm for Weighted Flow Time

81 - Yossi Azar , Noam Touitou 2017

We discuss one of the most fundamental scheduling problem of processing jobs on a single machine to minimize the weighted flow time (weighted response time). Our main result is a $O(log P)$-competitive algorithm, where $P$ is the maximum-to-minimum p rocessing time ratio, improving upon the $O(log^{2}P)$-competitive algorithm of Chekuri, Khanna and Zhu (STOC 2001). We also design a $O(log D)$-competitive algorithm, where $D$ is the maximum-to-minimum density ratio of jobs. Finally, we show how to combine these results with the result of Bansal and Dhamdhere (SODA 2003) to achieve a $O(log(min(P,D,W)))$-competitive algorithm (where $W$ is the maximum-to-minimum weight ratio), without knowing $P,D,W$ in advance. As shown by Bansal and Chan (SODA 2009), no constant-competitive algorithm is achievable for this problem.

بنى وهياكل البيانات والخوارزميات

Online Algorithms for Weighted Paging with Predictions

184 - Zhihao Jiang , Debmalya Panigrahi , Kevin Sun 2020

In this paper, we initiate the study of the weighted paging problem with predictions. This continues the recent line of work in online algorithms with predictions, particularly that of Lykouris and Vassilvitski (ICML 2018) and Rohatgi (SODA 2020) on unweighted paging with predictions. We show that unlike unweighted paging, neither a fixed lookahead nor knowledge of the next request for every page is sufficient information for an algorithm to overcome existing lower bounds in weighted paging. However, a combination of the two, which we call the strong per request prediction (SPRP) model, suffices to give a 2-competitive algorithm. We also explore the question of gracefully degrading algorithms with increasing prediction error, and give both upper and lower bounds for a set of natural measures of prediction error.

بنى وهياكل البيانات والخوارزميات

The Distribution and Deposition Algorithm for Multiple Sequences Sets

214 - Kang Ning , Hon Wai Leong 2009

Sequences set is a mathematical model used in many applications. As the number of the sequences becomes larger, single sequence set model is not appropriate for the rapidly increasing problem sizes. For example, more and more text processing applicat ions separate a single big text file into multiple files before processing. For these applications, the underline mathematical model is multiple sequences sets (MSS). Though there is increasing use of MSS, there is little research on how to process MSS efficiently. To process multiple sequences sets, sequences are first distributed to different sets, and then sequences for each set are processed. Deriving effective algorithm for MSS processing is both interesting and challenging. In this paper, we have defined the cost functions and performance ratio for analysis of the quality of synthesis sequences. Based on these, the problem of Process of Multiple Sequences Sets (PMSS) is formulated. We have first proposed two greedy algorithms for the PMSS problem, which are based on generalization of algorithms for single sequences set. Then based on the analysis of the characteristics of multiple sequences sets, we have proposed the Distribution and Deposition (DDA) algorithm and DDA* algorithm for PMSS problem. In DDA algorithm, the sequences are first distributed to multiple sets according to their alphabet contents; then sequences in each set are deposited by the deposition algorithm. The DDA* algorithm differs from the DDA algorithm in that the DDA* algorithm distributes sequences by clustering based on sequence profiles. Experiments show that DDA and DDA* always output results with smaller costs than other algorithms, and DDA* outperforms DDA in most instances. The DDA and DDA* algorithms are also efficient both in time and space.

بنى وهياكل البيانات والخوارزميات النظم الموزعة والتوازية والحوسبة العنقودية الرياضيات المتقطعة

سجل دخول لتتمكن من نشر تعليقات