Knowledge Graphs for Processing Scientific Data: Challenges and Prospects

69 0 0.0 ( 0 )

Download Cite

Added by Masoud Salehpour

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Masoud Salehpour - Joseph G. Davis

Databases

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

There is growing interest in the use of Knowledge Graphs (KGs) for the representation, exchange, and reuse of scientific data. While KGs offer the prospect of improving the infrastructure for working with scalable and reusable scholarly data consistent with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, the state-of-the-art Data Management Systems (DMSs) for processing large KGs leave somewhat to be desired. In this paper, we studied the performance of some of the major DMSs in the context of querying KGs with the goal of providing a finely-grained, comparative analysis of DMSs representing each of the four major DMS types. We experimented with four well-known scientific KGs, namely, Allie, Cellcycle, DrugBank, and LinkedSPL against Virtuoso, Blazegraph, RDF-3X, and MongoDB as the representative DMSs. Our results suggest that the DMSs display limitations in processing complex queries on the KG datasets. Depending on the query type, the performance differentials can be several orders of magnitude. Also, no single DMS appears to offer consistently superior performance. We present an analysis of the underlying issues and outline two integrated approaches and proposals for resolving the problem.

rate research

Automating Data Science: Prospects and Challenges

353 - Tijl De Bie , Luc De Raedt , Jose Hernandez-Orallo 2021

Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights: * Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. * Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction. * Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.

Databases Machine Learning

LargeEA: Aligning Entities for Large-scale Knowledge Graphs

81 - Congcong Ge , Xiaoze Liu , Lu Chen 2021

Entity alignment (EA) aims to find equivalent entities in different knowledge graphs (KGs). Current EA approaches suffer from scalability issues, limiting their usage in real-world EA scenarios. To tackle this challenge, we propose LargeEA to align entities between large-scale KGs. LargeEA consists of two channels, i.e., structure channel and name channel. For the structure channel, we present METIS-CPS, a memory-saving mini-batch generation strategy, to partition large KGs into smaller mini-batches. LargeEA, designed as a general tool, can adopt any existing EA approach to learn entities structural features within each mini-batch independently. For the name channel, we first introduce NFF, a name feature fusion method, to capture rich name features of entities without involving any complex training process. Then, we exploit a name-based data augmentation to generate seed alignment without any human intervention. Such design fits common real-world scenarios much better, as seed alignment is not always available. Finally, LargeEA derives the EA results by fusing the structural features and name features of entities. Since no widely-acknowledged benchmark is available for large-scale EA evaluation, we also develop a large-scale EA benchmark called DBP1M extracted from real-world KGs. Extensive experiments confirm the superiority of LargeEA against state-of-the-art competitors.

Databases

iTelos- Building reusable knowledge graphs

63 - Fausto Giunchiglia , Simone Bocca , Mattia Fumagalli 2021

It is a fact that, when developing a new application, it is virtually impossible to reuse, as-is, existing datasets. This difficulty is the cause of additional costs, with the further drawback that the resulting application will again be hardly reusable. It is a negative loop which consistently reinforces itself and for which there seems to be no way out. iTelos is a general purpose methodology designed to break this loop. Its main goal is to generate reusable Knowledge Graphs (KGs), built reusing, as much as possible, already existing data. The key assumption is that the design of a KG should be done middle-out meaning by this that the design should take into consideration, in all phases of the development: (i) the purpose to be served, that we formalize as a set of competency queries, (ii) a set of pre-existing datasets, possibly extracted from existing KGs, and (iii) a set of pre-existing reference schemas, whose goal is to facilitate sharability. We call these reference schemas, teleologies, as distinct from ontologies, meaning by this that, while having a similar purpose, they are designed to be easily adapted, thus becoming a key enabler of itelos.

Databases Artificial Intelligence

Gaia: Organisation and challenges for the data processing

432 - F. Mignard , C. Bailer-Jones , U. Bastian 2007

Gaia is an ambitious space astrometry mission of ESA with a main objective to map the sky in astrometry and photometry down to a magnitude 20 by the end of the next decade. While the mission is built and operated by ESA and an industrial consortium, the data processing is entrusted to a consortium formed by the scientific community, which was formed in 2006 and formally selected by ESA one year later. The satellite will downlink around 100 TB of raw telemetry data over a mission duration of 5 years from which a very complex iterative processing will lead to the final science output: astrometry with a final accuracy of a few tens of microarcseconds, epoch photometry in wide and narrow bands, radial velocity and spectra for the stars brighter than 17 mag. We discuss the general principles and main difficulties of this very large data processing and present the organisation of the European Consortium responsible for its design and implementation.

Materializing Knowledge Bases via Trigger Graphs

362 - Efthymia Tsamoura , David Carral , Enrico Malizia 2021

The chase is a well-established family of algorithms used to materialize Knowledge Bases (KBs), like Knowledge Graphs (KGs), to tackle important tasks like query answering under dependencies or data cleaning. A general problem of chase algorithms is that they might perform redundant computations. To counter this problem, we introduce the notion of Trigger Graphs (TGs), which guide the execution of the rules avoiding redundant computations. We present the results of an extensive theoretical and empirical study that seeks to answer when and how TGs can be computed and what are the benefits of TGs when applied over real-world KBs. Our results include introducing algorithms that compute (minimal) TGs. We implemented our approach in a new engine, and our experiments show that it can be significantly more efficient than the chase enabling us to materialize KBs with 17B facts in less than 40 min on commodity machines.

Databases Artificial Intelligence