Organizations routinely accumulate semi-structured log datasets generated as the output of code; these datasets remain unused and uninterpreted and waste storage space, a phenomenon colloquially referred to as the data lake problem. One approach to leveraging these semi-structured datasets is to convert them into a structured relational format, after which they can be analyzed in conjunction with other datasets. We present Datamaran, a tool that extracts structure from semi-structured log datasets with no human supervision. Datamaran automatically identifies field and record endpoints, separates the structured parts from unstructured noise or formatting, and can tease apart multiple structures within a single dataset, so that structured relational datasets can be extracted from semi-structured log datasets efficiently, at scale, and with high accuracy. Unlike other unsupervised log dataset extraction tools developed in prior work, Datamaran does not require record boundaries to be known beforehand, making it much more applicable to the noisy log files that are ubiquitous in data lakes. Datamaran successfully extracts structured information from all datasets used in prior work, and achieves 95% extraction accuracy on log datasets automatically collected from GitHub, a substantial 66% increase in accuracy compared to unsupervised schemes from prior work. Our user study further demonstrates that the extraction results of Datamaran are closer to the desired structure than those of competing algorithms.
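To make the idea of separating field values from formatting more concrete, the following is a minimal toy sketch in Python. It is not Datamaran's algorithm: it assumes that every record occupies exactly one line (precisely the assumption Datamaran avoids), and the names `FORMATTING`, `template_of`, `fields_of`, `extract_tables`, and `min_support` are hypothetical, introduced here only for illustration. Each line is reduced to a structural template by replacing field values with a placeholder; templates that recur are treated as structure, and the rest as noise.

```python
import re
from collections import defaultdict

# Assumption: these characters count as formatting rather than field content.
FORMATTING = r"\[\]\(\)\{\}:,;=|"
FIELD = re.compile(rf"[^\s{FORMATTING}]+")


def template_of(line):
    """Replace every field value with 'F', keeping formatting and spaces."""
    return FIELD.sub("F", line.strip())


def fields_of(line):
    """Return the field values of a line, in order of appearance."""
    return FIELD.findall(line.strip())


def extract_tables(lines, min_support=2):
    """Group lines by template; templates seen fewer than min_support
    times are treated as noise (a hypothetical threshold)."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            groups[template_of(line)].append(fields_of(line))
    tables = {t: rows for t, rows in groups.items() if len(rows) >= min_support}
    noise = [row for t, rows in groups.items() if len(rows) < min_support
             for row in rows]
    return tables, noise


log_lines = [
    "[INFO] user=alice action=login",
    "[WARN] user=bob action=logout",
    "*** service restarted ***",
]
tables, noise = extract_tables(log_lines)
```

Here the two bracketed lines share one template and become rows of a single table, while the restart banner has no recurring template and is set aside as noise. A real extractor would also have to discover record boundaries that span several lines, which this sketch does not attempt.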
We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating s
A data lake (DL) is a big data analysis solution that ingests raw data in its native format and lets users process the data at the time of use. Data ingestion is not a simple copy and paste of data; it is a complicated and important phase to ensure t
With new emerging technologies, such as satellites and drones, archaeologists collect data over large areas. However, it becomes difficult to process such data in time. Archaeological data also have many different formats (images, texts, sensor data)
System logs record detailed runtime information of software systems and are used as the main data source for many software engineering tasks. As modern software systems evolve into large-scale, complex structures, logs have become one
In 2010, the concept of the data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although trendy in both industry and academia, the