Lachesis: Automatic Partitioning for UDF-Centric Analytics

80 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jia Zou

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jia Zou - Amitabh Das - Pratik Barhate

قواعد البيانات النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Persistent partitioning is effective in avoiding expensive shuffling operations. However it remains a significant challenge to automate this process for Big Data analytics workloads that extensively use user defined functions (UDFs), where sub-computations are hard to be reused for partitionings compared to relational applications. In addition, functional dependency that is widely utilized for partitioning selection is often unavailable in the unstructured data that is ubiquitous in UDF-centric analytics. We propose the Lachesis system, which represents UDF-centric workloads as workflows of analyzable and reusable sub-computations. Lachesis further adopts a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then applied to automatically optimize the storage of the data across applications to improve the performance and users productivity.

قيم البحث

146 - Jeff LeFevre , Jagan Sankaranarayanan , Hakan Hacigumus 2013

Emerging data analysis involves the ingestion and exploration of new data sets, application of complex functions, and frequent query revisions based on observing prior query answers. We call this new type of analysis evolutionary analytics and identi fy its properties. This type of analysis is not well represented by current benchmark workloads. In this paper, we present a workload and identify several metrics to test system support for evolutionary analytics. Along with our metrics, we present methodologies for running the workload that capture this analytical scenario.

قواعد البيانات النظم الموزعة والتوازية والحوسبة العنقودية الأداء

Pruned Landmark Labeling Meets Vertex Centric Computation: A Surprisingly Happy Marriage!

75 - Ruoming Jin , Zhen Peng , Wendell Wu 2019

In this paper, we study how the Pruned Landmark Labeling (PPL) algorithm can be parallelized in a scalable fashion, producing the same results as the sequential algorithm. More specifically, we parallelize using a Vertex-Centric (VC) computational mo del on a modern SIMD powered multicore architecture. We design a new VC-PLL algorithm that resolves the apparent mismatch between the inherent sequential dependence of the PLL algorithm and the Vertex- Centric (VC) computing model. Furthermore, we introduce a novel batch execution model for VC computation and the BVC-PLL algorithm to reduce the computational inefficiency in VC-PLL. Quite surprisingly, the theoretical analysis reveals that under a reasonable assumption, BVC-PLL has lower computational and memory access costs than PLL and indicates it may run faster than PLL as a sequential algorithm. We also demonstrate how BVC-PLL algorithm can be extended to handle directed graphs and weighted graphs and how it can utilize the hierarchical parallelism on a modern parallel computing architecture. Extensive experiments on real-world graphs not only show the sequential BVC-PLL can run more than two times faster than the original PLL, but also demonstrates its parallel efficiency and scalability.

قواعد البيانات النظم الموزعة والتوازية والحوسبة العنقودية

Metall: A Persistent Memory Allocator For Data-Centric Analytics

190 - Keita Iwabuchi 2021

Data analytics applications transform raw input data into analytics-specific data structures before performing analytics. Unfortunately, such data ingestion step is often more expensive than analytics. In addition, various types of NVRAM devices are already used in many HPC systems today. Such devices will be useful for storing and reusing data structures beyond a single process life cycle. We developed Metall, a persistent memory allocator built on top of the memory-mapped file mechanism. Metall enables applications to transparently allocate custom C++ data structures into various types of persistent memories. Metall incorporates a concise and high-performance memory management algorithm inspired by Supermalloc and the rich C++ interface developed by Boost.Interprocess library. On a dynamic graph construction workload, Metall achieved up to 11.7x and 48.3x performance improvements over Boost.Interprocess and memkind (PMEM kind), respectively. We also demonstrate Metalls high adaptability by integrating Metall into a graph processing framework, GraphBLAS Template Library. This studys outcomes indicate that Metall will be a strong tool for accelerating future large-scale data analytics by allowing applications to leverage persistent memory efficiently.

النظم الموزعة والتوازية والحوسبة العنقودية

Modern Data Formats for Big Bioinformatics Data Analytics

105 - Shahzad Ahmed , M. Usman Ali , Javed Ferzund 2017

Next Generation Sequencing (NGS) technology has resulted in massive amounts of proteomics and genomics data. This data is of no use if it is not properly analyzed. ETL (Extraction, Transformation, Loading) is an important step in designing data analy tics applications. ETL requires proper understanding of features of data. Data format plays a key role in understanding of data, representation of data, space required to store data, data I/O during processing of data, intermediate results of processing, in-memory analysis of data and overall time required to process data. Different data mining and machine learning algorithms require input data in specific types and formats. This paper explores the data formats used by different tools and algorithms and also presents modern data formats that are used on Big Data Platform. It will help researchers and developers in choosing appropriate data format to be used for a particular tool or algorithm.

قواعد البيانات أجهزة الكمبيوتر والمجتمع النظم الموزعة والتوازية والحوسبة العنقودية

The Unified Logging Infrastructure for Data Analytics at Twitter

376 - George Lee , Jimmy Lin , Chuang Liu 2012

In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are colle cted and structured to begin with. In this paper, we present Twitters production logging infrastructure and its evolution from application-specific logging to a unified client events log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize session sequences, which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.

قواعد البيانات

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

المعهد الوطني للإدارة العامة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Lachesis: Automatic Partitioning for UDF-Centric Analytics

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً