
A Survey on Spark Ecosystem for Big Data Processing

Added by Shanjiang Tang
Publication date: 2018
Research language: English





With the explosive growth of big data in industry and academia, large-scale data processing systems are needed to analyze it. Spark is arguably the state of the art in large-scale data computing systems today, owing to its generality, fault tolerance, high performance for in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of transformation and action operators whose operating functions can be customized by users according to their applications. It was originally positioned as a fast and general data processing system, and since its introduction a large body of research has sought to make it more efficient (faster) and more general across a variety of circumstances. In this survey, we provide a thorough review of optimization techniques that improve the generality and performance of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate and classify the solution techniques in the literature. Moreover, we introduce the various data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss open issues and challenges for large-scale in-memory data processing with Spark.
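To make the RDD model above concrete, here is a minimal sketch in Scala (Spark's native language) showing lazy transformations followed by an action that triggers execution; the dataset and threshold are illustrative placeholders, not from the survey.

import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RddSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)      // source RDD
    val squares = numbers.map(n => n.toLong * n)    // transformation: lazy, records lineage
    val large   = squares.filter(_ > 1000L)         // transformation: lazy
    val total   = large.reduce(_ + _)               // action: triggers the distributed job

    println(s"sum of large squares = $total")
    spark.stop()
  }
}

Because transformations only record lineage, Spark can recompute lost partitions after a failure, which is the fault-tolerance property the abstract refers to.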



Related research

With the emergence of the big data age, how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized using a hybrid approach that combines data-parallel and task-parallel optimization. For data-parallel optimization, a vertical data-partitioning method reduces the data communication cost effectively, and a data-multiplexing method allows the training dataset to be reused, diminishing the volume of data. For task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependences among the Resilient Distributed Dataset (RDD) objects. Different task schedulers are then invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy on large, high-dimensional, and noisy data, a dimension-reduction approach is applied in the training process and a weighted voting approach in the prediction process, prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over related algorithms implemented in Spark MLlib and other studies in terms of classification accuracy, performance, and scalability.
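The PRF optimizations themselves (vertical partitioning, the task DAG, dual schedulers) are internal to the authors' implementation; as a point of reference, the Spark MLlib random forest they compare against can be invoked roughly as follows. The input path and column names are placeholders.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RfBaseline").getOrCreate()

// Assumes LIBSVM-format training data; the path is a placeholder.
val training = spark.read.format("libsvm").load("data/training.libsvm")

val rf = new RandomForestClassifier()
  .setLabelCol("label")        // MLlib's default column names
  .setFeaturesCol("features")
  .setNumTrees(100)            // ensemble size; an ordinary tuning knob

val model = rf.fit(training)   // distributed training across the cluster

PRF's reported gains come from restructuring this training process, not from different API calls.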
Network Traffic Monitoring and Analysis (NTMA) is a key component of network management, especially for guaranteeing the correct operation of large-scale networks such as the Internet. As the complexity of Internet services and the volume of traffic continue to increase, it becomes difficult to design scalable NTMA applications. Applications such as traffic classification and policing require real-time, scalable approaches. Anomaly detection and security mechanisms must quickly identify and react to unpredictable events while processing millions of heterogeneous events. Finally, the system has to collect, store, and process massive sets of historical data for post-mortem analysis. These are precisely the challenges faced by general big data approaches: Volume, Velocity, Variety, and Veracity. This survey brings together NTMA and big data. We catalog previous work on NTMA that adopts big data approaches, to understand to what extent the potential of big data is being explored in NTMA. The survey focuses mainly on approaches and technologies for managing big NTMA data, and additionally discusses big data analytics (e.g., machine learning) briefly, for the sake of NTMA. Finally, we provide guidelines for future work, discussing lessons learned and research directions.
With the era of big data, an enormous amount of information is now available. This explosive increase of big data in both academia and industry calls for large-scale data processing systems. A large body of research has gone into optimizing Spark's performance, making it a state-of-the-art, fast, and general data processing system. Many science and engineering fields have advanced with big data analytics, including biology, finance, and transportation. Intelligent transportation systems (ITS) have gained popularity and benefit directly from this richness of information. The objective is to improve the safety and management of transportation networks by reducing congestion and incidents. The first step toward this goal is to understand, model, and detect congestion across a network efficiently and effectively. In this study, we introduce an efficient congestion detection model. The underlying network consists of 3017 segments of the I-35, I-80, I-29, and I-380 freeways, with an overall length of 1570 miles and an average of 0.4-0.6 miles per segment. The congestion detection results show the proposed method is 90% accurate while reducing computation time by 99.88%.
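The abstract does not spell out the detection mechanism, so the following is only a hypothetical sketch of segment-level congestion flagging on Spark: it assumes per-segment speed readings with columns segment_id, speed, and free_flow_speed, and a 0.6 speed-ratio threshold, all invented for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("CongestionSketch").getOrCreate()

// Placeholder input: one row per (segment, timestamp) speed reading.
val readings = spark.read.parquet("data/segment_speeds.parquet")

val flagged = readings
  .groupBy("segment_id")
  .agg(avg("speed").as("avg_speed"),
       first("free_flow_speed").as("ffs"))
  .withColumn("congested", col("avg_speed") < col("ffs") * 0.6) // assumed threshold

flagged.filter(col("congested")).show()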
We propose hMDAP, a hybrid framework for large-scale data analytical processing on Spark that supports multi-paradigm processing (including OLAP, machine learning, and graph analysis) in distributed environments. The framework features a three-layer data process module and a business process module that controls the former. We will demonstrate the strength of hMDAP using real-world traffic scenarios.
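hMDAP's own API is not shown in the abstract; the sketch below only illustrates the multi-paradigm capability it builds on, namely that one Spark session can serve both OLAP-style SQL and graph analysis. The toy trip data is invented.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MultiParadigm").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// OLAP-style: aggregate trip counts per origin with Spark SQL.
val trips = Seq((1L, 2L, 12), (2L, 3L, 7), (1L, 3L, 5)).toDF("src", "dst", "trips")
trips.groupBy("src").sum("trips").show()

// Graph analysis: the same trips viewed as a road graph in GraphX.
val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 12), Edge(2L, 3L, 7), Edge(1L, 3L, 5)))
Graph(vertices, edges).inDegrees.collect().foreach(println)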
A.V. Vaniachine, 2013
The ever-increasing volumes of scientific data present new challenges for distributed computing and Grid technologies. The emerging big data revolution drives exploration in scientific fields including nanotechnology, astrophysics, high-energy physics, biology, and medicine. New initiatives are transforming data-driven scientific fields, enabling massive data analysis in new ways. In petascale data processing, scientists deal with datasets, not individual files. As a result, a task (comprising many jobs) has become the unit of petascale data processing on the Grid. Splitting a large data processing task into jobs enables fine-granularity checkpointing, analogous to splitting a large file into smaller TCP/IP packets during data transfers. Transferring large data in small packets achieves reliability through automatic re-sending of dropped TCP/IP packets. Similarly, transient job failures on the Grid can be recovered by automatic retries, achieving reliable six-sigma production quality in petascale data processing on the Grid. The computing experience of the ATLAS and CMS experiments provides a foundation for reliability engineering that scales Grid technologies for data processing beyond the petascale.
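The retry mechanism described above is generic enough to sketch; the helper below is an illustrative pattern, not ATLAS/CMS production code, and the failure rate in the usage example is invented.

import scala.util.{Failure, Try}

// Re-run a job up to maxAttempts times; transient failures trigger another try.
def withRetries[T](maxAttempts: Int)(job: => T): Try[T] = {
  var attempt = 0
  var result: Try[T] = Failure(new RuntimeException("not run"))
  while (attempt < maxAttempts && result.isFailure) {
    attempt += 1
    result = Try(job)
  }
  result
}

// Usage: a flaky job that fails ~30% of the time, retried up to 3 times.
val outcome = withRetries(3) {
  if (scala.util.Random.nextDouble() < 0.3) throw new RuntimeException("transient failure")
  "job succeeded"
}
println(outcome)

Bounded retries shrink a per-attempt failure probability p to roughly p^n after n independent attempts, which is how per-task reliability approaches six-sigma quality at scale.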