ترغب بنشر مسار تعليمي؟ اضغط هنا

Aggregation by Provenance Types: A Technique for Summarising Provenance Graphs

155   0   0.0 ( 0 )
 نشر من قبل EPTCS
 تاريخ النشر 2015
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English
 تأليف Luc Moreau




اسأل ChatGPT حول البحث

As users become confronted with a deluge of provenance data, dedicated techniques are required to make sense of this kind of information. We present Aggregation by Provenance Types, a provenance graph analysis that is capable of generating provenance graph summaries. It proceeds by converting provenance paths up to some length k to attributes, referred to as provenance types, and by grouping nodes that have the same provenance types. The summary also includes numeric values representing the frequency of nodes and edges in the original graph. A quantitative evaluation and a complexity analysis show that this technique is tractable; with small values of k, it can produce useful summaries and can help detect outliers. We illustrate how the generated summaries can further be used for conformance checking and visualization.

قيم البحث

اقرأ أيضاً

Scientific workflow systems increasingly store provenance information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions. However, authors/owners of workflows may wish to keep some of this information confidential. In particular, a module may be proprietary, and users should not be able to infer its behavior by seeing mappings between all data inputs and outputs. The problem we address in this paper is the following: Given a workflow, abstractly modeled by a relation R, a privacy requirement Gamma and costs associated with data. The owner of the workflow decides which data (attributes) to hide, and provides the user with a view R which is the projection of R over attributes which have not been hidden. The goal is to minimize the cost of hidden data while guaranteeing that individual modules are Gamma -private. We call this the secureview problem. We formally define the problem, study its complexity, and offer algorithmic solutions.
Explaining why an answer is in the result of a query or why it is missing from the result is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. Both types of questions, i.e ., why and why-not provenance, have been studied extensively. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Our approach is based on a rewriting of Datalog rules (called firing rules) that captures successful rule derivations within the context of a Datalog query. We extend this rewriting to support negation and to capture failed derivations that explain missing answers. Given a (why or why-not) provenance question, we compute an explanation, i.e., the part of the provenance that is relevant to answer the question. We introduce optimizations that prune parts of a provenance graph early on if we can determine that they will not be part of the explanation for a given question. We present an implementation that runs on top of a relational database using SQL to compute explanations. Our experiments demonstrate that our approach scales to large instances and significantly outperforms an earlier approach which instantiates the full provenance to compute explanations.
87 - James Cheney , Amal Ahmed , 2009
Provenance is information recording the source, derivation, or history of some information. Provenance tracking has been studied in a variety of settings; however, although many design points have been explored, the mathematical or semantic foundatio ns of data provenance have received comparatively little attention. In this paper, we argue that dependency analysis techniques familiar from program analysis and program slicing provide a formal foundation for forms of provenance that are intended to show how (part of) the output of a query depends on (parts of) its input. We introduce a semantic characterization of such dependency provenance, show that this form of provenance is not computable, and provide dynamic and static approximation techniques.
231 - Xing Niu , Ziyu Liu , Pengyuan Li 2021
Database systems analyze queries to determine upfront which data is needed for answering them and use indexes and other physical design techniques to speed-up access to that data. However, for important classes of queries, e.g., HAVING and top-k quer ies, it is impossible to determine up-front what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps. Our approach significantly improves performance for both disk-based and main-memory database systems.
Knowledge graphs (KG) that model the relationships between entities as labeled edges (or facts) in a graph are mostly constructed using a suite of automated extractors, thereby inherently leading to uncertainty in the extracted facts. Modeling the un certainty as probabilistic confidence scores results in a probabilistic knowledge graph. Graph queries over such probabilistic KGs require answer computation along with the computation of those result probabilities, aka, probabilistic inference. We propose a system, HAPPI (How Provenance of Probabilistic Inference), to handle such query processing. Complying with the standard provenance semiring model, we propose a novel commutative semiring to symbolically compute the probability of the result of a query. These provenance-polynomiallike symbolic expressions encode fine-grained information about the probability computation process. We leverage this encoding to efficiently compute as well as maintain the probability of results as the underlying KG changes. Focusing on a popular class of conjunctive basic graph pattern queries on the KG, we compare the performance of HAPPI against a possible-world model of computation and a knowledge compilation tool over two large datasets. We also propose an adaptive system that leverages the strengths of both HAPPI and compilation based techniques. Since existing systems for probabilistic databases mostly focus on query computation, they default to re-computation when facts in the KG are updated. HAPPI, on the other hand, does not just perform probabilistic inference and maintain their provenance, but also provides a mechanism to incrementally maintain them as the KG changes. We extend this maintainability as part of our proposed adaptive system.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا