We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user finding a table using the organization. We propose the data lake organization problem as the problem of finding an organization that maximizes the expected probability of discovering tables by navigating an organization. We propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users discover relevant tables that cannot be found by keyword search. In addition, in our study, 42% of users preferred the use of navigation and 58% preferred keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes. Our experiments show that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.
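To make the navigation model above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how transition probabilities are computed, so they are hard-coded here. The toy graph, the table names, and the rule that a table counts as discovered when at least one of its attributes is reached are all hypothetical choices made for the example.

```python
# Toy organization (hypothetical): each node maps to {child: transition probability}.
# An edge parent -> child means the child's attribute set is a subset of the parent's.
# Nodes without children stand for single attributes.
ORG = {
    "root": {"geo": 0.6, "finance": 0.4},
    "geo": {"city": 0.5, "country": 0.5},
    "finance": {"budget": 0.7, "country": 0.3},
    "city": {},
    "country": {},
    "budget": {},
}

# Hypothetical tables, each described by the attributes it contains.
TABLES = {"t1": ["city", "country"], "t2": ["budget"]}


def reach_probabilities(org, root="root"):
    """Probability that a user starting at `root` reaches each node of the DAG."""
    parents = {node: [] for node in org}
    for parent, children in org.items():
        for child, prob in children.items():
            parents[child].append((parent, prob))

    memo = {root: 1.0}

    def reach(node):
        # Sum over all incoming edges: P(parent reached) * P(parent -> node).
        if node not in memo:
            memo[node] = sum(reach(par) * prob for par, prob in parents[node])
        return memo[node]

    return {node: reach(node) for node in org}


def table_discovery_probability(attributes, reach):
    """Assumed rule: a table is found if at least one of its attributes is reached,
    treating the attribute events as independent."""
    miss = 1.0
    for attribute in attributes:
        miss *= 1.0 - reach[attribute]
    return 1.0 - miss


def expected_discovery(org, tables, root="root"):
    """Average discovery probability over all tables: the quantity to maximize."""
    reach = reach_probabilities(org, root)
    scores = [table_discovery_probability(attrs, reach) for attrs in tables.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(reach_probabilities(ORG))
    print(expected_discovery(ORG, TABLES))
```

Under these assumptions, the data lake organization problem amounts to searching over graphs like `ORG` for the one with the highest `expected_discovery` value. The sketch only scores a given organization and does not attempt that search.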
A data lake (DL) is a big data analysis solution that ingests raw data in its native format and lets users process the data at the time of use. Data ingestion is not a simple copy and paste of data; it is a complicated and important phase to ensure t
With new emerging technologies, such as satellites and drones, archaeologists collect data over large areas. However, it becomes difficult to process such data in time. Archaeological data also have many different formats (images, texts, sensor data)
The rise of big data has revolutionized data exploitation practices and led to the emergence of new concepts. Among them, data lakes have emerged as large heterogeneous data repositories that can be analyzed by various methods. An efficient data lake
In 2010, the concept of the data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although trendy in both industry and academia, the
Organizations routinely accumulate semi-structured log datasets generated as the output of code. These datasets remain unused and uninterpreted and waste storage space, a phenomenon colloquially referred to as the data lake problem. One app