No Arabic abstract
There is growing interest in the use of Knowledge Graphs (KGs) for the representation, exchange, and reuse of scientific data. While KGs offer the prospect of improving the infrastructure for working with scalable and reusable scholarly data consistent with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, the state-of-the-art Data Management Systems (DMSs) for processing large KGs leave somewhat to be desired. In this paper, we studied the performance of some of the major DMSs in the context of querying KGs with the goal of providing a finely-grained, comparative analysis of DMSs representing each of the four major DMS types. We experimented with four well-known scientific KGs, namely, Allie, Cellcycle, DrugBank, and LinkedSPL against Virtuoso, Blazegraph, RDF-3X, and MongoDB as the representative DMSs. Our results suggest that the DMSs display limitations in processing complex queries on the KG datasets. Depending on the query type, the performance differentials can be several orders of magnitude. Also, no single DMS appears to offer consistently superior performance. We present an analysis of the underlying issues and outline two integrated approaches and proposals for resolving the problem.
Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights: * Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. * Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction. * Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.
Entity alignment (EA) aims to find equivalent entities in different knowledge graphs (KGs). Current EA approaches suffer from scalability issues, limiting their usage in real-world EA scenarios. To tackle this challenge, we propose LargeEA to align entities between large-scale KGs. LargeEA consists of two channels, i.e., structure channel and name channel. For the structure channel, we present METIS-CPS, a memory-saving mini-batch generation strategy, to partition large KGs into smaller mini-batches. LargeEA, designed as a general tool, can adopt any existing EA approach to learn entities structural features within each mini-batch independently. For the name channel, we first introduce NFF, a name feature fusion method, to capture rich name features of entities without involving any complex training process. Then, we exploit a name-based data augmentation to generate seed alignment without any human intervention. Such design fits common real-world scenarios much better, as seed alignment is not always available. Finally, LargeEA derives the EA results by fusing the structural features and name features of entities. Since no widely-acknowledged benchmark is available for large-scale EA evaluation, we also develop a large-scale EA benchmark called DBP1M extracted from real-world KGs. Extensive experiments confirm the superiority of LargeEA against state-of-the-art competitors.
It is a fact that, when developing a new application, it is virtually impossible to reuse, as-is, existing datasets. This difficulty is the cause of additional costs, with the further drawback that the resulting application will again be hardly reusable. It is a negative loop which consistently reinforces itself and for which there seems to be no way out. iTelos is a general purpose methodology designed to break this loop. Its main goal is to generate reusable Knowledge Graphs (KGs), built reusing, as much as possible, already existing data. The key assumption is that the design of a KG should be done middle-out meaning by this that the design should take into consideration, in all phases of the development: (i) the purpose to be served, that we formalize as a set of competency queries, (ii) a set of pre-existing datasets, possibly extracted from existing KGs, and (iii) a set of pre-existing reference schemas, whose goal is to facilitate sharability. We call these reference schemas, teleologies, as distinct from ontologies, meaning by this that, while having a similar purpose, they are designed to be easily adapted, thus becoming a key enabler of itelos.
Gaia is an ambitious space astrometry mission of ESA with a main objective to map the sky in astrometry and photometry down to a magnitude 20 by the end of the next decade. While the mission is built and operated by ESA and an industrial consortium, the data processing is entrusted to a consortium formed by the scientific community, which was formed in 2006 and formally selected by ESA one year later. The satellite will downlink around 100 TB of raw telemetry data over a mission duration of 5 years from which a very complex iterative processing will lead to the final science output: astrometry with a final accuracy of a few tens of microarcseconds, epoch photometry in wide and narrow bands, radial velocity and spectra for the stars brighter than 17 mag. We discuss the general principles and main difficulties of this very large data processing and present the organisation of the European Consortium responsible for its design and implementation.
The chase is a well-established family of algorithms used to materialize Knowledge Bases (KBs), like Knowledge Graphs (KGs), to tackle important tasks like query answering under dependencies or data cleaning. A general problem of chase algorithms is that they might perform redundant computations. To counter this problem, we introduce the notion of Trigger Graphs (TGs), which guide the execution of the rules avoiding redundant computations. We present the results of an extensive theoretical and empirical study that seeks to answer when and how TGs can be computed and what are the benefits of TGs when applied over real-world KBs. Our results include introducing algorithms that compute (minimal) TGs. We implemented our approach in a new engine, and our experiments show that it can be significantly more efficient than the chase enabling us to materialize KBs with 17B facts in less than 40 min on commodity machines.