
A Fast, Scalable, Universal Approach For Distributed Data Aggregations

Added by Niranda Perera
Publication date: 2020
Language: English

In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new applications in both enterprise and research communities. Aggregations (also termed reduce in functional programming) are an integral functionality in these applications. They are traditionally aimed at generating meaningful information from large datasets, and today they are also used to engineer more effective features for complex AI models. Aggregations are usually carried out on top of data abstractions such as tables/arrays and are combined with other operations such as grouping of values. There are frameworks that excel in each of these domains individually, but we believe there is an essential requirement for a data analytics tool that can universally integrate with existing frameworks and thereby increase the productivity and efficiency of the entire data analytics pipeline. Cylon endeavors to fill this void. In this paper, we present Cylon's fast and scalable aggregation operations, implemented on top of a distributed in-memory table structure that universally integrates with existing frameworks.
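
As an illustration of the pattern such distributed aggregations follow (a minimal sketch using mpi4py and pandas, not Cylon's actual API): each process pre-aggregates its local partition, hash-partitions the partial results by key so equal keys meet on one process, and reduces again.

    # Minimal sketch of a distributed groupby-sum over a partitioned table,
    # illustrating the shuffle-then-aggregate pattern described above.
    # Uses mpi4py + pandas for illustration; this is not Cylon's API.
    from mpi4py import MPI
    import pandas as pd

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each rank holds one partition of the table (toy data here).
    local = pd.DataFrame({"key": [rank, rank + 1, 0], "val": [1.0, 2.0, 3.0]})

    # 1) Pre-aggregate locally to shrink the shuffle volume.
    partial = local.groupby("key", as_index=False)["val"].sum()

    # 2) Hash-partition rows by key so equal keys land on the same rank.
    outbound = [partial[partial["key"] % size == dest] for dest in range(size)]
    inbound = comm.alltoall(outbound)

    # 3) Final aggregation over the received partial results.
    result = pd.concat(inbound).groupby("key", as_index=False)["val"].sum()
    print(f"rank {rank}: {result.to_dict('records')}")

Run under MPI (e.g. mpirun -n 4). Pre-aggregating before the shuffle is the standard trick for cutting communication volume; a distributed in-memory table implementation applies the same pattern over columnar buffers rather than pickled data frames.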

Related research

In large-scale distributed file systems, efficient metadata operations are critical, since most file operations have to interact with metadata servers first. In existing distributed hash table (DHT) based metadata management systems, the lookup service can be a performance bottleneck due to its significant CPU overhead. Our investigations showed that the lookup service could reduce system throughput by up to 70% and increase system latency by a factor of up to 8 compared to ideal scenarios. In this paper, we present MetaFlow, a scalable metadata lookup service that utilizes software-defined networking (SDN) techniques to distribute the lookup workload over network components. MetaFlow tackles the lookup bottleneck by leveraging a B-tree, constructed over the physical topology, to manage flow tables for SDN-enabled switches; metadata requests can therefore be forwarded to the appropriate servers using only switches. Extensive performance evaluations in both simulations and a testbed showed that MetaFlow increases system throughput by a factor of up to 3.2 and reduces system latency by a factor of up to 5 compared to DHT-based systems. We also deployed MetaFlow in a distributed file system and demonstrated significant performance improvement.
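
The lookup idea can be sketched apart from the networking machinery: hash a file path, then resolve it through a sorted range table, which is the role the B-tree and switch flow tables play in MetaFlow. A toy Python sketch, with hypothetical server names, of the mapping the switches effectively compute:

    # Sketch of the range-lookup idea behind a switch-resident metadata
    # directory: hash a path, then route by sorted key ranges. This is an
    # illustration of the concept, not the paper's implementation.
    import bisect
    import hashlib

    # Each metadata server owns a contiguous slice of the hash space;
    # 'bounds' holds the upper bound of each slice, kept sorted.
    SERVERS = ["mds-0", "mds-1", "mds-2", "mds-3"]   # hypothetical names
    SPACE = 2 ** 32
    bounds = [SPACE * (i + 1) // len(SERVERS) for i in range(len(SERVERS))]

    def lookup(path: str) -> str:
        """Return the metadata server responsible for 'path'."""
        h = int.from_bytes(hashlib.sha1(path.encode()).digest()[:4], "big")
        return SERVERS[bisect.bisect_left(bounds, h)]

    print(lookup("/home/alice/data.csv"))
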
J. Lowell Wofford, 2021
As distributed systems grow in scale and complexity, the need for flexible automation of systems management functions also grows. We outline a framework for building tools that provide distributed, scalable, declarative, modular, and continuous automation for distributed systems. We focus on four points of design: 1) a state-management approach that prescribes source-of-truth for configured and discovered system states; 2) a technique to solve the declarative unification problem for a class of automation problems, providing state convergence and modularity; 3) an eventual-consistency approach to state synchronization which provides automation at scale; 4) an event-driven architecture that provides always-on state enforcement. We describe the methodology, software architecture for the framework, and constraints for these techniques to apply to an automation problem. We overview a reference application built on this framework that provides state-aware system provisioning and node lifecycle management, highlighting key advantages. We conclude with a discussion of current and future applications.
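
The declarative-unification idea in point 2 can be sketched briefly: compare a configured (desired) state against the discovered state and emit only the mutations needed to converge. The toy Python sketch below uses made-up field names; the actual framework layers modular state providers and event-driven enforcement on top of this core.

    # Toy sketch of declarative state convergence: diff the configured
    # (desired) state against the discovered state and emit mutations.
    # Field names are illustrative, not the framework's schema.
    desired    = {"power": "on", "image": "compute-v2", "kernel": "5.15"}
    discovered = {"power": "on", "image": "compute-v1", "kernel": "5.15"}

    def converge(desired: dict, discovered: dict) -> list:
        """Return the (key, old, new) mutations needed to reach 'desired'."""
        return [(k, discovered.get(k), v)
                for k, v in desired.items() if discovered.get(k) != v]

    for key, old, new in converge(desired, discovered):
        print(f"mutate {key}: {old!r} -> {new!r}")  # would trigger a module
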
To achieve reliability in distributed storage systems, data has usually been replicated across different nodes. However, the increasing volume of data to be stored has motivated the introduction of erasure codes, a storage-efficient alternative to replication, particularly suited for archival in data centers, where old (rarely accessed) datasets can be erasure encoded while replicas are maintained only for the latest data. Many recent works consider the design of new storage-centric erasure codes for improved repairability. In contrast, this paper addresses the migration from replication to encoding: traditionally, erasure coding is an atomic operation in which a single node holding the whole object encodes and uploads all the encoded pieces. Although large datasets can be archived concurrently by distributing individual object encodings among different nodes, the network and computing capacity of individual nodes constrain the archival process due to this atomicity. We propose a new pipelined coding strategy that distributes the network and computing load of single-object encodings among different nodes, which also speeds up multi-object archival. We further present RapidRAID codes, an explicit family of pipelined erasure codes that provides fast archival without compromising either data reliability or storage overheads. Finally, we provide a real implementation of RapidRAID codes and benchmark its performance using both a cluster of 50 nodes and a set of Amazon EC2 instances. Experiments show that RapidRAID codes reduce a single object's coding time by up to 90%, while when multiple objects are encoded concurrently the reduction is up to 20%.
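
The pipelining idea can be illustrated with a toy sketch: a partial coded block travels through the nodes holding the object's replicated blocks, and each node folds in its own contribution, so no single node reads the whole object or does all the coding work. The sketch below uses GF(2) (plain XOR) coefficients and an in-process loop in place of network hops; RapidRAID's actual codes work over a larger Galois field.

    # Minimal sketch of pipelined encoding: a partial coded block flows
    # through the nodes holding the object's data blocks, each folding in
    # its local contribution. Coefficients are in GF(2) (XOR) for brevity.
    import os

    K, BLOCK = 4, 16                       # 4 data blocks of 16 bytes each
    blocks = [bytearray(os.urandom(BLOCK)) for _ in range(K)]  # one per node

    # One row of a binary generator matrix: which blocks this parity mixes.
    row = [1, 0, 1, 1]

    def node_step(partial: bytearray, local: bytearray, coeff: int) -> bytearray:
        """Work done at one pipeline node: fold its block into the partial sum."""
        if coeff:
            partial = bytearray(p ^ b for p, b in zip(partial, local))
        return partial                      # forwarded to the next node

    parity = bytearray(BLOCK)               # starts as zeros at the head node
    for i in range(K):                      # each hop = one node in the pipeline
        parity = node_step(parity, blocks[i], row[i])

    # No single node ever read all K blocks or did all K coding passes.
    print(parity.hex())
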
Shilong Wang, 2020
LDA is a statistical approach for topic modeling with a wide range of applications. However, there have been very few attempts to accelerate LDA on GPUs, which offer exceptional computing and memory throughput. To this end, we introduce EZLDA, which achieves efficient and scalable LDA training on GPUs with the following three contributions. First, EZLDA introduces a three-branch sampling method that takes advantage of the convergence heterogeneity of various tokens to reduce redundant sampling work. Second, to enable sparsity-aware formats for both D and W on GPUs with fast sampling and updating, we introduce a hybrid format for W, along with a corresponding token partitioning of T and inverted index designs. Third, we design a hierarchical workload-balancing solution to address the extremely skewed workload imbalance on GPUs and to scale EZLDA across multiple GPUs. Taken together, EZLDA achieves superior performance over state-of-the-art attempts with lower memory consumption.
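
The hybrid-format idea for W can be sketched simply: rows dense in topics stay dense, while mostly-zero rows store (topic, count) pairs. The cutoff and names below are illustrative, not EZLDA's actual layout.

    # Toy sketch of a hybrid sparse/dense row layout for a word-topic
    # matrix: hot rows stay dense, sparse rows keep (topic, count) pairs.
    DENSE_CUTOFF = 0.5   # fraction of nonzeros above which a row goes dense

    def pack_row(row):
        """Choose a dense or sparse representation for one row of W."""
        nnz = sum(1 for c in row if c)
        if nnz / len(row) >= DENSE_CUTOFF:
            return ("dense", list(row))
        return ("sparse", [(t, c) for t, c in enumerate(row) if c])

    def get(packed, topic):
        """Read one count back, whichever representation was chosen."""
        kind, data = packed
        if kind == "dense":
            return data[topic]
        return next((c for t, c in data if t == topic), 0)

    w_row = pack_row([0, 3, 0, 0, 1, 0, 0, 0])   # mostly zeros -> sparse
    print(w_row, get(w_row, 4))
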
We consider the problem of computing an aggregation function in a secure and scalable way. Whereas previous distributed solutions with similar security guarantees have a communication cost of $O(n^3)$, we present a distributed protocol that requires only a communication complexity of $O(n \log^3 n)$, which we prove is near-optimal. Our protocol ensures perfect security against a computationally bounded adversary, tolerates $(1/2 - \epsilon)n$ malicious nodes for any constant $1/2 > \epsilon > 0$ (not depending on $n$), and outputs the exact value of the aggregated function with high probability.
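
To make the asymptotic gap concrete (taking logarithms base 2): for $n = 1024$ nodes, so $\log_2 n = 10$, the earlier protocols exchange on the order of $n^3 = 2^{30} \approx 1.1 \times 10^9$ messages, while $n \log^3 n = 1024 \cdot 10^3 \approx 1.0 \times 10^6$, roughly a thousandfold reduction.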