ترغب بنشر مسار تعليمي؟ اضغط هنا

Convergence of Datalog over (Pre-) Semirings

72   0   0.0 ( 0 )
 نشر من قبل Remy Wang
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Recursive queries have been traditionally studied in the framework of datalog, a language that restricts recursion to monotone queries over sets, which is guaranteed to converge in polynomial time in the size of the input. But modern big data systems require recursive computations beyond the Boolean space. In this paper we study the convergence of datalog when it is interpreted over an arbitrary semiring. We consider an ordered semiring, define the semantics of a datalog program as a least fixpoint in this semiring, and study the number of steps required to reach that fixpoint, if ever. We identify algebraic properties of the semiring that correspond to certain convergence properties of datalog programs. Finally, we describe a class of ordered semirings on which one can use the semi-naive evaluation algorithm on any datalog program.



قيم البحث

اقرأ أيضاً

Materialisation is often used in RDF systems as a preprocessing step to derive all facts implied by given RDF triples and rules. Although widely used, materialisation considers all possible rule applications and can use a lot of memory for storing th e derived facts, which can hinder performance. We present a novel materialisation technique that compresses the RDF triples so that the rules can sometimes be applied to multiple facts at once, and the derived facts can be represented using structure sharing. Our technique can thus require less space, as well as skip certain rule applications. Our experiments show that our technique can be very effective: when the rules are relatively simple, our system is both faster and requires less memory than prominent state-of-the-art RDF systems.
The multiplicative semigroup $M_n(F)$ of $ntimes n$ matrices over a field $F$ is well understood, in particular, it is a regular semigroup. This paper considers semigroups of the form $M_n(S)$, where $S$ is a semiring, and the subsemigroups $UT_n(S)$ and $U_n(S)$ of $M_n(S)$ consisting of upper triangular and unitriangular matrices. Our main interest is in the case where $S$ is an idempotent semifield, where we also consider the subsemigroups $UT_n(S^*)$ and $U_n(S^*)$ consisting of those matrices of $UT_n(S)$ and $U_n(S)$ having all elements on and above the leading diagonal non-zero. Our guiding examples of such $S$ are the 2-element Boolean semiring $mathbb{B}$ and the tropical semiring $mathbb{T}$. In the first case, $M_n(mathbb{B})$ is isomorphic to the semigroup of binary relations on an $n$-element set, and in the second, $M_n(mathbb{T})$ is the semigroup of $ntimes n$ tropical matrices. Ilin has proved that for any semiring $R$ and $n>2$, the semigroup $M_n(R)$ is regular if and only if $R$ is a regular ring. We therefore base our investigations for $M_n(S)$ and its subsemigroups on the analogous but weaker concept of being Fountain (formerly, weakly abundant). These notions are determined by the existence and behaviour of idempotent left and right identities for elements, lying in particular equivalence classes. We show that certain subsemigroups of $M_n(S)$, including several generalisations of well-studied monoids of binary relations (Hall relations, reflexive relations, unitriangular Boolean matrices), are Fountain. We give a detailed study of a family of Fountain semigroups arising in this way that has particularly interesting and unusual properties.
Arguing for the need to combine declarative and probabilistic programming, Barany et al. (TODS 2017) recently introduced a probabilistic extension of Datalog as a purely declarative probabilistic programming language. We revisit this language and pro pose a more principled approach towards defining its semantics based on stochastic kernels and Markov processes - standard notions from probability theory. This allows us to extend the semantics to continuous probability distributions, thereby settling an open problem posed by Barany et al. We show that our semantics is fairly robust, allowing both parallel execution and arbitrary chase orders when evaluating a program. We cast our semantics in the framework of infinite probabilistic databases (Grohe and Lindner, ICDT 2020), and show that the semantics remains meaningful even when the input of a probabilistic Datalog program is an arbitrary probabilistic database.
Recursive query processing has experienced a recent resurgence, as a result of its use in many modern application domains, including data integration, graph analytics, security, program analysis, networking and decision making. Due to the large volum es of data being processed, several research efforts, across multiple communities, have explored how to scale up recursive queries, typically expressed in Datalog. Our experience with these tools indicated that their performance does not translate across domains (e.g., a tool design for large-scale graph analytics does not exhibit the same performance on program-analysis tasks, and vice versa). As a result, we designed and implemented a general-purpose Datalog engine, called RecStep, on top of a parallel single-node relational system. In this paper, we outline the different techniques we use in RecStep, and the contribution of each technique to overall performance. We also present results from a detailed set of experiments comparing RecStep with a number of other Datalog systems using both graph analytics and program-analysis tasks, summarizing pros and cons of existing techniques based on the analysis of our observations. We show that RecStep generally outperforms the state-of-the-art parallel Datalog engines on complex and large-scale Datalog program evaluation, by a 4-6X margin. An additional insight from our work is that we show that it is possible to build a high-performance Datalog system on top of a relational engine, an idea that has been dismissed in past work in this area.
Cloud computing refers to maximizing efficiency by sharing computational and storage resources, while data-parallel systems exploit the resources available in the cloud to perform parallel transformations over large amounts of data. In the same line, considerable emphasis has been recently given to two apparently disjoint research topics: data-parallel, and eventually consistent, distributed systems. Declarative networking has been recently proposed to ease the task of programming in the cloud, by allowing the programmer to express only the desired result and leave the implementation details to the responsibility of the run-time system. In this context, we propose a study on a logic-programming-based computational model for eventually consistent, data-parallel systems, the keystone of which is provided by the recent finding that the class of programs that can be computed in an eventually consistent, coordination-free way is that of monotonic programs. This principle is called CALM and has been proven by Ameloot et al. for distributed, asynchronous settings. We advocate that CALM should be employed as a basic theoretical tool also for data-parallel systems, wherein computation usually proceeds synchronously in rounds and where communication is assumed to be reliable. It is general opinion that coordination-freedom can be seen as a major discriminant factor. In this work we make the case that the current form of CALM does not hold in general for data-parallel systems, and show how, using novel techniques, the satisfiability of the CALM principle can still be obtained although just for the subclass of programs called connected monotonic queries. We complete the study with considerations on the relationships between our model and the one employed by Ameloot et al., showing that our techniques subsume the latter when the synchronization constraints imposed on the system are loosened.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا