Building Analytics Pipelines for Querying Big Streams and Data Histories with H-STREAM

130 0 0.0 ( 0 )

Download Cite

Added by Genoveva Vargas-Solar

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Genoveva Vargas-Solar - Javier A. Espinosa-Oviedo

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper introduces H-STREAM, a big stream/data processing pipelines evaluation engine that proposes stream processing operators as micro-services to support the analysis and visualisation of Big Data streams stemming from IoT (Internet of Things) environments. H-STREAM micro-services combine stream processing and data storage techniques tuned depending on the number of things producing streams, the pace at which they produce them, and the physical computing resources available for processing them online and delivering them to consumers. H-STREAM delivers stream processing and visualisation micro-services installed in a cloud environment. Micro-services can be composed for implementing specific stream aggregation analysis pipelines as queries. The paper presents an experimental validation using Microsoft Azure as a deployment environment for testing the capacity of H-STREAM for dealing with velocity and volume challenges in an (i) a neuroscience experiment and (in) a social connectivity analysis scenario running on IoT farms.

rate research

Trevor: Automatic configuration and scaling of stream processing pipelines

272 - Manu Bansal , Eyal Cidon , Arjun Balasingam 2018

Operating a distributed data stream processing workload efficiently at scale is hard. The operator of the workload must parallelize and lay out tasks of the workload with resources that match the requirement of target data rate. The challenge is that neither the operator nor the programmer is typically aware of the scaling behavior of the workload as a function of resources. An operator manually searches for a safe operating point that can handle predicted peak load and deploys with ample headroom for absorbing unpredictable spikes. Such empirical, static over-provisioning is wasteful of both compute and human resources. We show that precise performance models can be automatically learned for distributed stream processing systems that can predict the execution performance of a job even before deployment. Further, those models can be used to optimally schedule logically specified jobs onto available physical hardware. Finally, those models and the derived execution schedules can be refined online to dynamically adapt to unpredictable changes in the runtime environment or auto-scale with variations in job load.

Distributed Parallel and Cluster Computing

Cloud Big Data Mining and Analytics: Bringing Greenness and Acceleration in the Cloud

111 - Hrishav Bakul Barua , Kartick Chandra Mondal 2021

Big data is gaining overwhelming attention since the last decade. Almost all the fields of science and technology have experienced a considerable impact from it. The cloud computing paradigm has been targeted for big data processing and mining in a more efficient manner using the plethora of resources available from computing nodes to efficient storage. Cloud data mining introduces the concept of performing data mining and analytics of huge data in the cloud availing the cloud resources. But can we do better? Yes, of course! The main contribution of this chapter is the identification of four game-changing technologies for the acceleration of computing and analysis of data mining tasks in the cloud. Graphics Processing Units can be used to further accelerate the mining or analytic process, which is called GPU accelerated analytics. Further, Approximate Computing can also be introduced in big data analytics for bringing efficacy in the process by reducing time and energy and hence facilitating greenness in the entire computing process. Quantum Computing is a paradigm that is gaining pace in recent times which can also facilitate efficient and fast big data analytics in very little time. We have surveyed these three technologies and established their importance in big data mining with a holistic architecture by combining these three game-changers with the perspective of big data. We have also talked about another future technology, i.e., Neural Processing Units or Neural accelerators for researchers to explore the possibilities. A brief explanation of big data and cloud data mining concepts are also presented here.

Distributed Parallel and Cluster Computing

EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms

99 - Christian Pilato , Stanislav Bohm , Fabien Brocheton 2021

High-Performance Big Data Analytics (HPDA) applications are characterized by huge volumes of distributed and heterogeneous data that require efficient computation for knowledge extraction and decision making. Designers are moving towards a tight integration of computing systems combining HPC, Cloud, and IoT solutions with artificial intelligence (AI). Matching the application and data requirements with the characteristics of the underlying hardware is a key element to improve the predictions thanks to high performance and better use of resources. We present EVEREST, a novel H2020 project started on October 1st, 2020 that aims at developing a holistic environment for the co-design of HPDA applications on heterogeneous, distributed, and secure platforms. EVEREST focuses on programmability issues through a data-driven design approach, the use of hardware-accelerated AI, and an efficient runtime monitoring with virtualization support. In the different stages, EVEREST combines state-of-the-art programming models, emerging communication standards, and novel domain-specific extensions. We describe the EVEREST approach and the use cases that drive our research.

Distributed Parallel and Cluster Computing

Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines

525 - Francisco Romero , Mark Zhao , Neeraja J. Yadwadkar 2021

The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, Is this intersection congested? The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation (e.g., sampling rate, batch size, or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on users desired latency and cost targets, (c) input video contents may exercise different paths in the DAG and produce a variable amount intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. We present Llama: a heterogeneous and serverless framework for auto-tuning video pipelines. Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency target. This makes the problem of auto-tuning large video pipelines tractable and allows us to handle input-dependent behavior, conditional branches in the DAG, and execution variability. We describe the algorithms in Llama and evaluate it on a cloud platform using serverless CPU and GPU resources. We show that compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x cost reduction on average.

Distributed Parallel and Cluster Computing

SoA-Fog: Secure Service-Oriented Edge Computing Architecture for Smart Health Big Data Analytics

73 - Rabindra K. Barik , Harishchandra Dubey , Kunal Mankodiya 2017

The smart health paradigms employ Internet-connected wearables for telemonitoring, diagnosis for providing inexpensive healthcare solutions. Fog computing reduces latency and increases throughput by processing data near the body sensor network. In this paper, we proposed a secure serviceorientated edge computing architecture that is validated on recently released public dataset. Results and discussions support the applicability of proposed architecture for smart health applications. We proposed SoA-Fog i.e. a three-tier secure framework for efficient management of health data using fog devices. It discuss the security aspects in client layer, fog layer and the cloud layer. We design the prototype by using win-win spiral model with use case and sequence diagram. Overlay analysis was performed using proposed framework on malaria vector borne disease positive maps of Maharastra state in India from 2011 to 2014. The mobile clients were taken as test case. We performed comparative analysis between proposed secure fog framework and state-of-the art cloud-based framework.

Distributed Parallel and Cluster Computing