Query-driven Data Completeness Management (PhD Thesis)

Added by Simon Razniewski
Publication date: 2014
Language: English





Knowledge about data completeness is essential in data-supported decision making. In this thesis we present a framework for metadata-based assessment of database completeness. We discuss how to express information about data completeness and how to use such information to draw conclusions about the completeness of query answers. In particular, we introduce formalisms for stating completeness for parts of relational databases. We then present techniques for drawing inferences between such statements and statements about the completeness of query answers, and show how the techniques can be extended to databases that contain null values. We show that the framework for relational databases can be transferred to RDF data, and that a similar framework can also be applied to spatial data. We also discuss how completeness information can be verified over processes, and introduce a data-aware process model that allows this verification.
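As a purely illustrative sketch (the notation and the example relation are assumed here, not quoted from the thesis), a completeness statement over part of a relation, together with the query-completeness conclusion one would like to infer from it, could look as follows:

\[
  \mathit{Compl}\bigl(\mathit{pupil}(n, s);\; s = \text{'HoferSchool'}\bigr)
  \;\models\;
  \mathit{Compl}(Q),
  \qquad
  Q(n) \leftarrow \mathit{pupil}(n, s),\ s = \text{'HoferSchool'}.
\]

The statement on the left asserts that the available database contains every pupil tuple for 'HoferSchool'; under this assumption, the answer to the query Q, which asks exactly for those pupils, is guaranteed to be complete even though the database as a whole may be missing other tuples.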



Related research

Ivan Zolotukhin (2010)
The Virtual Observatory has reached sufficient maturity for its routine scientific exploitation by astronomers. To prove this statement, I present here a brief description of the complete VO-powered PhD thesis entitled Galactic and extragalactic research with modern surveys and the Virtual Observatory, comprising four science cases covering various aspects of astrophysical research. These comprise: (1) a homogeneous search for, and measurement of, the main physical parameters of Galactic open star clusters in huge multi-band photometric surveys; (2) a study of optical-to-NIR galaxy colors using a large homogeneous dataset including spectroscopy and photometry from SDSS and UKIDSS; (3) a study of the faint low-mass X-ray binary population in modern observational archives; (4) a search for optical counterparts of unidentified X-ray objects with large positional uncertainties in the Galactic Plane. All these projects make heavy use of VO technologies and tools and would not be achievable without them. The refereed papers published as part of this thesis can therefore be added to the growing list of VO-based research works.
G. E. Astrakharchik (2014)
This dissertation presents the results of a thorough study of ultracold bosonic and fermionic gases in three-dimensional and quasi-one-dimensional systems. Although the analyses are carried out within various theoretical frameworks (Gross-Pitaevskii, Bethe ansatz, local density approximation, etc.), the main tool of the study is the quantum Monte Carlo method in different variants (variational Monte Carlo, diffusion Monte Carlo, and fixed-node Monte Carlo methods). We benchmark our Monte Carlo calculations by recovering known analytical results (perturbative theories in dilute limits, exactly solvable models, etc.) and extend the calculations to regimes where the results were so far unknown. In particular, we calculate the equation of state and correlation functions for gases in various geometries and with various interatomic interactions.
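As background on the central method mentioned above (standard textbook material, not specific to this dissertation), variational Monte Carlo estimates the energy of a trial wave function \(\psi_T\) by averaging the local energy over configurations sampled from \(|\psi_T|^2\):

\[
  E_V \;=\; \frac{\langle \psi_T | \hat H | \psi_T \rangle}{\langle \psi_T | \psi_T \rangle}
      \;=\; \int p(\mathbf{R})\, E_L(\mathbf{R})\, d\mathbf{R},
  \qquad
  p(\mathbf{R}) = \frac{| \psi_T(\mathbf{R}) |^2}{\int | \psi_T(\mathbf{R}') |^2\, d\mathbf{R}'},
  \qquad
  E_L(\mathbf{R}) = \frac{\hat H \psi_T(\mathbf{R})}{\psi_T(\mathbf{R})},
\]

with the variational principle guaranteeing \(E_V \ge E_0\); diffusion Monte Carlo then refines such estimates by stochastically projecting onto the ground state.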
A Data Lake (DL) is a Big Data analysis solution that ingests raw data in their native format and allows users to process these data upon usage. Data ingestion is not a simple copy-and-paste of data; it is a complex and important phase that ensures ingested data are findable, accessible, interoperable and reusable at all times. Our solution is threefold. Firstly, we propose a metadata model that includes information about external data sources, data ingestion processes, ingested data, dataset veracity and dataset security. Secondly, we present the algorithms that ensure the ingestion phase (data storage and metadata instantiation). Thirdly, we introduce a metadata management system we developed, whereby users can easily consult the different elements stored in the DL.
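A minimal sketch of the kind of ingestion metadata the abstract alludes to is shown below. The class and field names are hypothetical and only illustrate the categories of information listed above (external source, ingestion process, veracity, security); they are not the authors' actual metadata model.

# Hypothetical sketch of data-lake ingestion metadata; names and fields are
# illustrative, not the metadata model proposed in the paper.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SourceInfo:
    name: str     # external data source, e.g. a file share or API feed
    format: str   # native format of the raw data ("csv", "json", ...)
    uri: str      # where the raw data was pulled from

@dataclass
class IngestedDataset:
    dataset_id: str
    source: SourceInfo
    ingested_at: datetime = field(default_factory=datetime.utcnow)  # ingestion process info
    veracity_score: float = 1.0          # crude data-quality indicator in [0, 1]
    access_roles: tuple = ("analyst",)   # simple security/access annotation
    tags: tuple = ()                     # keywords that keep the dataset findable

# Usage: register a raw file in the lake together with its metadata.
record = IngestedDataset(
    dataset_id="sensor-dump-2021-03",
    source=SourceInfo(name="field-sensors", format="csv", uri="s3://lake/raw/sensors.csv"),
    veracity_score=0.8,
    tags=("raw", "unvalidated"),
)
print(record.dataset_id, record.ingested_at.isoformat())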
Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeling. Many of these efforts are based on batch-oriented inference, which inhibits a real-time workflow. One important task is entity resolution (ER). ER is the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with extractions from a newswire dataset containing 71 million mentions.
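Since the techniques above build on Metropolis-Hastings sampling, the generic form of that algorithm is sketched below. This is illustrative only; it is not the paper's query-driven ER sampler, which additionally biases proposals toward the entities a query touches.

# Generic Metropolis-Hastings sampler with a symmetric proposal; illustrative
# only, not the query-driven ER implementation described in the paper.
import math
import random

def metropolis_hastings(log_target, propose, x0, n_samples):
    """Draw n_samples from the unnormalised log density log_target,
    assuming propose(x) is a symmetric proposal (Hastings term cancels)."""
    x = x0
    samples = []
    for _ in range(n_samples):
        candidate = propose(x)
        log_alpha = log_target(candidate) - log_target(x)
        # Accept with probability min(1, exp(log_alpha)).
        if random.random() < math.exp(min(0.0, log_alpha)):
            x = candidate
        samples.append(x)
    return samples

# Usage: sample from a standard normal via Gaussian random-walk proposals.
draws = metropolis_hastings(
    log_target=lambda x: -0.5 * x * x,
    propose=lambda x: x + random.gauss(0.0, 1.0),
    x0=0.0,
    n_samples=10_000,
)
print(sum(draws) / len(draws))  # sample mean should be close to 0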
Pengfei Liu (2021)
With new emerging technologies such as satellites and drones, archaeologists collect data over large areas. However, it becomes difficult to process such data in a timely manner. Archaeological data also come in many different formats (images, texts, sensor data) and can be structured, semi-structured or unstructured. Such variety makes data difficult to collect, store, manage, search and analyze effectively. A few approaches have been proposed, but none of them covers the full data lifecycle or provides an efficient data management system. Hence, we propose the use of a data lake to provide centralized data stores hosting heterogeneous data, as well as tools for data quality checking, cleaning, transformation and analysis. In this paper, we propose a generic, flexible and complete data lake architecture. Our metadata management system exploits goldMEDAL, which is the most complete metadata model currently available. Finally, we detail the concrete implementation of this architecture dedicated to an archaeological project.