ترغب بنشر مسار تعليمي؟ اضغط هنا

Two-level Data Staging ETL for Transaction Data

223   0   0.0 ( 0 )
 نشر من قبل Xiufeng Liu
 تاريخ النشر 2014
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English
 تأليف Xiufeng Liu




اسأل ChatGPT حول البحث

In data warehousing, Extract-Transform-Load (ETL) extracts the data from data sources into a central data warehouse regularly for the support of business decision-makings. The data from transaction processing systems are featured with the high frequent changes of insertion, update, and deletion. It is challenging for ETL to propagate the changes to the data warehouse, and maintain the change history. Moreover, ETL jobs typically run in a sequential order when processing the data with dependencies, which is not optimal, eg, when processing early-arriving data. In this paper, we propose a two-level data staging ETL for handling transaction data. The proposed method detects the changes of the data from transactional processing systems, identifies the corresponding operation codes for the changes, and uses two staging databases to facilitate the data processing in an ETL process. The proposed ETL provides the one-stop method for fast-changing, slowly-changing and early-arriving data processing.



قيم البحث

اقرأ أيضاً

Next Generation Sequencing (NGS) technology has resulted in massive amounts of proteomics and genomics data. This data is of no use if it is not properly analyzed. ETL (Extraction, Transformation, Loading) is an important step in designing data analy tics applications. ETL requires proper understanding of features of data. Data format plays a key role in understanding of data, representation of data, space required to store data, data I/O during processing of data, intermediate results of processing, in-memory analysis of data and overall time required to process data. Different data mining and machine learning algorithms require input data in specific types and formats. This paper explores the data formats used by different tools and algorithms and also presents modern data formats that are used on Big Data Platform. It will help researchers and developers in choosing appropriate data format to be used for a particular tool or algorithm.
In visual exploration and analysis of data, determining how to select and transform the data for visualization is a challenge for data-unfamiliar or inexperienced users. Our main hypothesis is that for many data sets and common analysis tasks, there are relatively few data slices that result in effective visualizations. By focusing human users on appropriate and suitably transformed parts of the underlying data sets, these data slices can help the users carry their task to correct completion. To verify this hypothesis, we develop a framework that permits us to capture exemplary data slices for a user task, and to explore and parse visual-exploration sequences into a format that makes them distinct and easy to compare. We develop a recommendation system, DataSlicer, that matches a currently viewed data slice with the most promising next effective data slices for the given exploration task. We report the results of controlled experiments with an implementation of the DataSlicer system, using four common analytical task types. The experiments demonstrate statistically significant improvements in accuracy and exploration speed versus users without access to our system.
A blockchain is an append-only linked-list of blocks, which is maintained at each participating node. Each block records a set of transactions and their associated metadata. Blockchain transactions act on the identical ledger data stored at each node . Blockchain was first perceived by Satoshi Nakamoto as a peer-to-peer digital-commodity (also known as crypto-currency) exchange system. Blockchains received traction due to their inherent property of immutability-once a block is accepted, it cannot be reverted.
In Big data era, information integration often requires abundant data extracted from massive data sources. Due to a large number of data sources, data source selection plays a crucial role in information integration, since it is costly and even impos sible to access all data sources. Data Source selection should consider both efficiency and effectiveness issues. For efficiency, the approach should achieve high performance and be scalability to fit large data source amount. From effectiveness aspect, data quality and overlapping of sources are to be considered, since data quality varies much from data sources, with significant differences in the accuracy and coverage of the data provided, and the overlapping of sources can even lower the quality of data integrated from selected data sources. In this paper, we study source selection problem in textit{Big Data Era} and propose methods which can scale to datasets with up to millions of data sources and guarantee the quality of results. Motivated by this, we propose a new object function taking the expected number of true values a source can provide as a criteria to evaluate the contribution of a data source. Based on our proposed index we present a scalable algorithm and two pruning strategies to improve the efficiency without sacrificing precision. Experimental results on both real world and synthetic data sets show that our methods can select sources providing a large proportion of true values efficiently and can scale to massive data sources.
124 - Pengfei Liu 2021
With new emerging technologies, such as satellites and drones, archaeologists collect data over large areas. However, it becomes difficult to process such data in time. Archaeological data also have many different formats (images, texts, sensor data) and can be structured, semi-structured and unstructured. Such variety makes data difficult to collect, store, manage, search and analyze effectively. A few approaches have been proposed, but none of them covers the full data lifecycle nor provides an efficient data management system. Hence, we propose the use of a data lake to provide centralized data stores to host heterogeneous data, as well as tools for data quality checking, cleaning, transformation, and analysis. In this paper, we propose a generic, flexible and complete data lake architecture. Our metadata management system exploits goldMEDAL, which is the most complete metadata model currently available. Finally, we detail the concrete implementation of this architecture dedicated to an archaeological project.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا