The Unified Logging Infrastructure for Data Analytics at Twitter

Added by Jimmy Lin

Publication date 2012

Fields Informatics Engineering

Language English

Authors George Lee - Jimmy Lin - Chuang Liu

Databases

In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and its evolution from application-specific logging to a unified client events log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize session sequences, which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.
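The idea of pre-materializing session sequences can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `ClientEvent` fields, the event names, and the 30-minute inactivity threshold are all assumptions chosen for illustration, and plain in-memory lists stand in for Thrift messages and Hadoop jobs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a session ends after 30 minutes without activity.
SESSION_GAP_SECONDS = 30 * 60


@dataclass(frozen=True)
class ClientEvent:
    """Hypothetical stand-in for one unified client event message."""
    user_id: int
    timestamp: int   # seconds since the epoch
    event_name: str  # illustrative name, e.g. "web:home:tweet:click"


def build_session_sequences(events: List[ClientEvent]) -> Dict[int, List[List[str]]]:
    """Group events by user, order them by time, and split them into sessions."""
    by_user: Dict[int, List[ClientEvent]] = defaultdict(list)
    for event in events:
        by_user[event.user_id].append(event)

    sessions: Dict[int, List[List[str]]] = {}
    for user_id, user_events in by_user.items():
        user_events.sort(key=lambda e: e.timestamp)
        user_sessions: List[List[str]] = []
        last_timestamp = None
        for event in user_events:
            # Start a new session on the first event or after a long gap.
            if last_timestamp is None or event.timestamp - last_timestamp > SESSION_GAP_SECONDS:
                user_sessions.append([])
            user_sessions[-1].append(event.event_name)
            last_timestamp = event.timestamp
        sessions[user_id] = user_sessions
    return sessions
```

Once sequences like these are stored, a query such as "how many sessions contain a given event" scans short per-session lists instead of the raw logs, which is the kind of saving the abstract attributes to compact summaries.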

Modern Data Formats for Big Bioinformatics Data Analytics

Shahzad Ahmed, M. Usman Ali, Javed Ferzund 2017

Next Generation Sequencing (NGS) technology has resulted in massive amounts of proteomics and genomics data. These data are of no use unless they are properly analyzed. ETL (Extraction, Transformation, Loading) is an important step in designing data analytics applications, and it requires a proper understanding of the features of the data. The data format plays a key role in how data are understood and represented, the space required to store them, data I/O during processing, intermediate results, in-memory analysis, and the overall processing time. Different data mining and machine learning algorithms require input data in specific types and formats. This paper explores the data formats used by different tools and algorithms and also presents modern data formats that are used on Big Data platforms. It will help researchers and developers choose an appropriate data format for a particular tool or algorithm.

Databases Computers and Society Distributed Parallel and Cluster Computing

LOgging UnifieD for ASTRI Mini Array

Federico Incardona, Alessandro Costa, Kevin Munari 2021

The ASTRI (Astrofisica con Specchi a Tecnologia Replicante Italiana) Mini-Array (MA) project is an international collaboration led by the Italian National Institute for Astrophysics (INAF). ASTRI MA is composed of nine Cherenkov telescopes operating in the energy range 1-100 TeV, and it aims to study very high-energy gamma-ray astrophysics and optical intensity interferometry of bright stars. ASTRI MA is currently under construction, and will be installed at the site of the Teide Observatory in Tenerife (Spain). The hardware and software system that is responsible for monitoring and controlling all the operations carried out at the ASTRI MA site is the Supervision Control and Data Acquisition (SCADA) system. The LOgging UnifieD (LOUD) subsystem is one of the main components of SCADA. It provides the service responsible for collecting, filtering, exposing and storing log events collected by all the array elements (telescopes, LIDAR, devices, etc.). In this paper, we present the LOUD architecture and the software stack explicitly designed for distributed computing environments exploiting Internet of Things (IoT) technologies.

Instrumentation and Methods for Astrophysics

ArchaeoDAL: A Data Lake for Archaeological Data Management and Analytics

Pengfei Liu 2021

With new emerging technologies, such as satellites and drones, archaeologists collect data over large areas. However, it becomes difficult to process such data in time. Archaeological data also come in many different formats (images, texts, sensor data) and can be structured, semi-structured or unstructured. Such variety makes data difficult to collect, store, manage, search and analyze effectively. A few approaches have been proposed, but none of them covers the full data lifecycle or provides an efficient data management system. Hence, we propose the use of a data lake to provide centralized data stores to host heterogeneous data, as well as tools for data quality checking, cleaning, transformation, and analysis. In this paper, we propose a generic, flexible and complete data lake architecture. Our metadata management system exploits goldMEDAL, which is the most complete metadata model currently available. Finally, we detail the concrete implementation of this architecture dedicated to an archaeological project.

Databases

Adaptive Logging for Distributed In-memory Databases

Chang Yao, Divyakant Agrawal, Gang Chen 2015

A new type of log, the command log, is being employed to replace the traditional data log (e.g., ARIES log) in in-memory databases. Instead of recording how the tuples are updated, a command log only tracks the transactions being executed, thereby effectively reducing the size of the log and improving performance. On the other hand, command logging increases the cost of recovery, because all the transactions in the log after the last checkpoint must be completely redone in case of a failure. In this paper, we first extend the command logging technique to a distributed environment, where all the nodes can perform recovery in parallel. We then propose an adaptive logging approach that combines data logging and command logging. The ratio of data logging to command logging becomes an optimization trade-off between transaction processing performance and recovery performance, to suit different OLTP applications. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.
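The trade-off between the two log types can be seen in a toy Python sketch. This is not H-Store's or the authors' implementation: the dictionary-backed store, the `transfer` procedure, and all function names are hypothetical, and the data log here records only new values rather than a full ARIES-style record.

```python
from typing import Callable, Dict, List, Tuple

Store = Dict[str, int]
CommandEntry = Tuple[str, Tuple]  # (procedure name, arguments)
DataEntry = Tuple[str, int]       # (key, new value)


def transfer(store: Store, src: str, dst: str, amount: int) -> None:
    """Hypothetical stored procedure: move `amount` from src to dst."""
    store[src] -= amount
    store[dst] += amount


PROCEDURES: Dict[str, Callable[..., None]] = {"transfer": transfer}


def run_with_logs(store: Store, name: str, args: Tuple,
                  command_log: List[CommandEntry],
                  data_log: List[DataEntry]) -> None:
    """Execute one transaction and append to both logs for comparison."""
    before = dict(store)
    PROCEDURES[name](store, *args)
    # Command log: one small entry per transaction.
    command_log.append((name, args))
    # Data log: one entry per modified key.
    for key, value in store.items():
        if before.get(key) != value:
            data_log.append((key, value))


def recover_from_command_log(checkpoint: Store, log: List[CommandEntry]) -> Store:
    """Recovery re-executes every logged transaction after the checkpoint."""
    store = dict(checkpoint)
    for name, args in log:
        PROCEDURES[name](store, *args)
    return store


def recover_from_data_log(checkpoint: Store, log: List[DataEntry]) -> Store:
    """Recovery reapplies logged values without re-running any logic."""
    store = dict(checkpoint)
    for key, value in log:
        store[key] = value
    return store
```

In this sketch the command log stays smaller during normal processing, while recovery from it must re-run each procedure; the data log is larger but replays with simple writes. An adaptive scheme of the kind the abstract describes would choose between the two per transaction.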

Databases

A Hybrid ICT-Solution for Smart Meter Data Analytics

Xiufeng Liu, Per Sieverts Nielsen 2016

Smart meters are increasingly used worldwide. They are advanced meters capable of measuring energy consumption at fine-grained time intervals, e.g., every 15 minutes. In analytics, smart meter data are typically bundled with socio-economic data, such as meter geographic locations, weather conditions and user information, which makes the data sets very sizable and the analytics complex. Data mining and emerging cloud computing technologies make collecting, processing, and analyzing the so-called big data possible. This paper proposes an innovative ICT-solution to streamline smart meter data analytics. The proposed solution offers an information integration pipeline for ingesting data from smart meters, a scalable platform for processing and mining big data sets, and a web portal for visualizing analytics results. The implemented system has a hybrid architecture that uses Spark or Hive for big data processing and the machine learning toolkit MADlib for in-database analytics in a PostgreSQL database. This paper evaluates the key technologies of the proposed ICT-solution, and the results show the effectiveness and efficiency of using the system for both batch and online analytics.

Databases