Assessing and improving the quality of data are fundamental challenges for data-intensive systems and have given rise to many applications targeting the transformation and cleaning of data. However, while schema design, data cleaning, and data migration are now reasonably well understood in isolation, little attention has been given to the interplay between the tools that address issues in these areas. We focus on the problem of determining whether the available data-processing procedures can be used together to bring the given data to the desired quality. Consider, for instance, an organization introducing new data-analysis tasks. Depending on the task, it may be a priority to determine whether the data can be processed and transformed, using the available data-processing tools, so that it satisfies the properties or quality guarantees the task needs to succeed. While the organization may control some of its tools, others may be external or proprietary, with only basic information available on how they process data. The problem, then, is to decide which tools to apply, and in which order, to make the data ready for the new tasks. Toward addressing this problem, we develop a new framework that abstracts data-processing tools as black-box procedures exposing only some of their properties, such as their applicability requirements, the parts of the data they modify, and the conditions the data satisfies once a procedure has been applied. We show how common tasks such as data cleaning and data migration are encapsulated in our framework and, as a proof of concept, we study its basic properties for the case of procedures described by standard relational constraints. While reasoning in this framework may be computationally infeasible in general, we show that there exist well-behaved special cases with potential practical applications.
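To make the black-box abstraction concrete, the following is a minimal, illustrative Python sketch, not the paper's formalism: the relational constraints of the framework are reduced here to named boolean properties, and the names `Procedure`, `plan`, and the example tools (`deduplicate`, `migrate_schema`) are assumptions introduced only for illustration. It shows the idea of procedures exposed through preconditions, modified parts, and postconditions, and a simple search for an order of application that reaches a target quality.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Procedure:
    """A black-box data-processing tool, exposed only through:
    - pre:  properties the data must satisfy before the tool applies,
    - adds: properties guaranteed to hold after application,
    - mods: properties that may be invalidated because the tool
            modifies the part of the data they refer to."""
    name: str
    pre: frozenset
    adds: frozenset
    mods: frozenset = frozenset()

def plan(initial, goal, procedures, max_len=6):
    """Breadth-first search for a sequence of procedures that takes the
    data from the `initial` properties to a state satisfying `goal`."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, seq = queue.popleft()
        if goal <= state:
            return seq
        if len(seq) >= max_len:
            continue
        for p in procedures:
            if p.pre <= state:                   # applicability requirement
                nxt = (state - p.mods) | p.adds  # effect of the black box
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, seq + [p.name]))
    return None

# Example: clean, then migrate, to reach the quality needed by a new task.
dedup   = Procedure("deduplicate", pre=frozenset({"loaded"}),
                    adds=frozenset({"no_duplicates"}))
migrate = Procedure("migrate_schema", pre=frozenset({"no_duplicates"}),
                    adds=frozenset({"target_schema"}),
                    mods=frozenset({"legacy_schema"}))

print(plan({"loaded", "legacy_schema"},
           frozenset({"no_duplicates", "target_schema"}),
           [dedup, migrate]))   # -> ['deduplicate', 'migrate_schema']
```

In the framework itself the pre- and postconditions are standard relational constraints rather than atomic flags, which is what makes the reasoning problem nontrivial in general; the sketch only conveys the shape of the decision problem.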