Arguing for the need to combine declarative and probabilistic programming, Barany et al. (TODS 2017) recently introduced a probabilistic extension of Datalog as a purely declarative probabilistic programming language. We revisit this language and propose a more principled approach towards defining its semantics based on stochastic kernels and Markov processes, standard notions from probability theory. This allows us to extend the semantics to continuous probability distributions, thereby settling an open problem posed by Barany et al. We show that our semantics is fairly robust, allowing both parallel execution and arbitrary chase orders when evaluating a program. We cast our semantics in the framework of infinite probabilistic databases (Grohe and Lindner, ICDT 2020), and show that the semantics remains meaningful even when the input of a probabilistic Datalog program is an arbitrary probabilistic database.
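To make the chase-based semantics concrete, here is a minimal Python sketch of one generative rule in the style of the paper, written informally as Temp(c, Normal[m, v]) :- City(c, m, v); each applicable rule instance draws its sample independently, which is the intuition behind robustness under parallel execution and chase reordering. The relation and attribute names are invented for illustration.

    import random

    city = [("paris", 12.0, 4.0), ("oslo", 5.0, 9.0)]   # City(c, mean, variance)

    def chase(rng):
        # One rule application per City tuple; every tuple draws its
        # value independently, so any chase order (or a parallel chase)
        # induces the same distribution over output instances.
        return {(c, rng.gauss(m, v ** 0.5)) for (c, m, v) in city}

    print(chase(random.Random(42)))   # one sampled instance of Temp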
Statistical models of real-world data typically involve continuous probability distributions such as normal, Laplace, or exponential distributions. Such distributions are supported by many probabilistic modelling formalisms, including probabilistic database systems. Yet the traditional theoretical framework of probabilistic databases focusses entirely on finite probabilistic databases. Only recently have we set out to develop the mathematical theory of infinite probabilistic databases. The present paper is an exposition of two recent papers that are cornerstones of this theory. In (Grohe, Lindner; ICDT 2020) we propose a very general framework for probabilistic databases, possibly involving continuous probability distributions, and show that queries have a well-defined semantics in this framework. In (Grohe, Kaminski, Katoen, Lindner; PODS 2020) we extend the declarative probabilistic programming language Generative Datalog, proposed by (Barany et al. 2017) for discrete probability distributions, to continuous probability distributions and show that such programs yield generative models of continuous probabilistic databases.
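As a hedged illustration of what a well-defined query semantics buys in this framework: a continuous probabilistic database is a probability measure over instances, and a Boolean query denotes the measure of the set of instances satisfying it, a quantity one can estimate by sampling. The schema and distributions below are invented for the example.

    import random

    def sample_instance(rng):
        # One instance of a continuous probabilistic database: each city
        # receives a normally distributed temperature.
        return [("paris", rng.gauss(12.0, 2.0)), ("oslo", rng.gauss(5.0, 3.0))]

    def q(instance):
        # Boolean query Q: "some city is warmer than 15 degrees".
        return any(t > 15.0 for (_, t) in instance)

    rng = random.Random(0)
    n = 100_000
    print(sum(q(sample_instance(rng)) for _ in range(n)) / n)   # Monte Carlo estimate of Pr[Q]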
We study termination of higher-order probabilistic functional programs with recursion, stochastic conditioning and sampling from continuous distributions. Reasoning about the termination probability of programs with continuous distributions is hard, because the enumeration of terminating executions cannot provide any non-trivial bounds. We present a new operational semantics based on traces of intervals, which is sound and complete with respect to the standard sampling-based semantics, and in which (countable) enumeration can provide arbitrarily tight lower bounds. Consequently, we obtain the first proof that deciding almost-sure termination (AST) for programs with continuous distributions is $\Pi^0_2$-complete. We also provide a compositional representation of our semantics in terms of an intersection type system. In the second part, we present a method for proving AST of non-affine programs, i.e., recursive programs that can, during the evaluation of the recursive body, make multiple recursive calls (of a first-order function) from distinct call sites. Unlike in a deterministic language, the number of recursive call sites has direct consequences for the termination probability. Our framework supports a proof system that can verify AST for programs well beyond the scope of existing methods. We have built prototype implementations of both our method for computing lower bounds on the termination probability and our AST verification.
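To see why the number of call sites matters, consider a non-affine program that, with probability p, makes two recursive calls and terminates otherwise. Its termination probability is the least fixpoint of $t = (1 - p) + p t^2$, which equals 1 exactly when $p \leq 1/2$. The Python sketch below is not the paper's proof system, only an illustration of how iterating from 0 yields arbitrarily tight lower bounds, mirroring how enumeration over interval traces bounds the sampling semantics from below.

    def termination_lower_bound(p, steps):
        # Least-fixpoint iteration for t = (1 - p) + p * t**2, the
        # termination probability of the program
        #   def f():
        #       if uniform(0, 1) < p: f(); f()   # two recursive call sites
        # Every iterate is a sound lower bound; the sequence converges
        # to the true termination probability from below.
        t = 0.0
        for _ in range(steps):
            t = (1.0 - p) + p * t * t
        return t

    print(termination_lower_bound(0.5, 100_000))   # tends to 1 (AST)
    print(termination_lower_bound(0.6, 100_000))   # tends to 2/3 (not AST)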
Recursive queries have traditionally been studied in the framework of datalog, a language that restricts recursion to monotone queries over sets and whose evaluation is guaranteed to converge in polynomial time in the size of the input. But modern big data systems require recursive computations beyond the Boolean space. In this paper we study the convergence of datalog when it is interpreted over an arbitrary semiring. We consider an ordered semiring, define the semantics of a datalog program as a least fixpoint in this semiring, and study the number of steps required to reach that fixpoint, if it is reached at all. We identify algebraic properties of the semiring that correspond to certain convergence properties of datalog programs. Finally, we describe a class of ordered semirings on which the semi-naive evaluation algorithm can be used on any datalog program.
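A standard example of this setting, sketched below in Python with invented data: over the tropical semiring (min, +), the transitive-closure program Path(x,y) :- Edge(x,y); Path(x,y) :- Path(x,z), Edge(z,y) computes shortest-path weights, and naive evaluation iterates the immediate-consequence operator up to the least fixpoint. Whether, and how fast, this loop converges is exactly the kind of question the paper ties to algebraic properties of the semiring.

    INF = float("inf")
    edge = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0}   # weighted Edge(x, y)

    def naive_eval(edge):
        path = dict(edge)                    # Path(x, y) :- Edge(x, y)
        changed = True
        while changed:                       # iterate to the least fixpoint
            changed = False
            for (x, z), w1 in list(path.items()):
                for (z2, y), w2 in edge.items():
                    # semiring "multiplication" is +, "addition" is min
                    if z == z2 and w1 + w2 < path.get((x, y), INF):
                        path[(x, y)] = w1 + w2
                        changed = True
        return path

    print(naive_eval(edge))   # {('a','b'): 1.0, ('b','c'): 2.0, ('a','c'): 3.0}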
In this paper, a new technique for the optimization of (partially) bound queries over disjunctive Datalog programs with stratified negation is presented. The technique exploits the propagation of query bindings and extends the Magic Set (MS) optimization technique. An important feature of disjunctive Datalog is nonmonotonicity, which calls for nondeterministic implementations such as backtracking search. A distinguishing characteristic of the new method is that the optimization can also be exploited during the nondeterministic phase. In particular, after some assumptions have been made during the computation, parts of the program may become irrelevant to a query under these assumptions. This allows for dynamic pruning of the search space. In contrast, the effect of the previously defined MS methods for disjunctive Datalog is limited to the deterministic portion of the process. In this way, the potential performance gain from the proposed method can be exponential, as we observed empirically. The correctness of MS is established via a strong relationship between MS and unfounded sets that had not been studied in the literature before; this relationship also allows for a natural extension of the method to programs with stratified negation. The proposed method has been implemented in DLV, and various experiments have been conducted. Experimental results on synthetic data confirm the utility of MS for disjunctive Datalog and highlight the computational gain obtained by the new method over the previously proposed MS methods for disjunctive Datalog programs. Further experiments on real-world data show the benefits of MS in an application scenario that has received considerable attention in recent years: answering user queries over possibly inconsistent databases that originate from the integration of autonomous information sources.
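The core binding-propagation idea behind MS can be sketched on a plain (non-disjunctive, negation-free) Datalog query with invented data; the paper's contribution is extending this to the nondeterministic model search of disjunctive programs with stratified negation. For the bound query ?- Anc(tom, Y), the rewriting seeds a "magic" set of relevant constants from the binding, so evaluation never touches facts irrelevant to the query:

    parent = {("tom", "bob"), ("bob", "ann"), ("eve", "joe")}   # Parent(x, y)

    def anc(seed):
        # Anc(X, Y) :- Parent(X, Y)
        # Anc(X, Y) :- Parent(X, Z), Anc(Z, Y)
        # with X bound to `seed`: the magic set (seed plus `result`)
        # grows from the binding, so Eve's facts are never inspected.
        result, frontier = set(), {seed}
        while frontier:
            step = {c for (p, c) in parent if p in frontier} - result
            result |= step
            frontier = step
        return result

    print(anc("tom"))   # {'bob', 'ann'} (set order may vary)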
Materialisation is often used in RDF systems as a preprocessing step to derive all facts implied by given RDF triples and rules. Although widely used, materialisation considers all possible rule applications and can use a lot of memory to store the derived facts, which can hinder performance. We present a novel materialisation technique that compresses the RDF triples so that rules can sometimes be applied to multiple facts at once, and the derived facts can be represented using structure sharing. Our technique can thus require less space as well as skip certain rule applications. Our experiments show that the technique can be very effective: when the rules are relatively simple, our system is faster and requires less memory than prominent state-of-the-art RDF systems.
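A hedged sketch of the compression idea, with invented data structures rather than the paper's actual representation: subjects sharing the same property-value signature are grouped under one pattern, a rule then fires once per pattern rather than once per triple, and the derived facts point at the shared group instead of being copied per subject.

    from collections import defaultdict

    triples = {(s, "rdf:type", "Person") for s in ("s1", "s2", "s3")}

    # Compress: group subjects by their (predicate, object) pair.
    groups = defaultdict(set)
    for s, p, o in triples:
        groups[(p, o)].add(s)

    # Rule (?x, rdf:type, Person) -> (?x, rdf:type, Agent):
    # one application covers the whole group, and the derived pattern
    # shares the subject set instead of materialising one triple each.
    derived = {}
    for (p, o), subjects in groups.items():
        if (p, o) == ("rdf:type", "Person"):
            derived[("rdf:type", "Agent")] = subjects   # structure sharing

    print(derived)   # {('rdf:type', 'Agent'): {'s1', 's2', 's3'}} (set order may vary)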