The objective of this work was to use BigBench [1] as a Big Data benchmark to evaluate and compare two processing engines: MapReduce [2] and Spark [3]. MapReduce is the established engine for processing data on Hadoop, and Spark is a popular alternative that promises faster processing times. BigBench was chosen for this comparison because it is the first end-to-end analytics Big Data benchmark and is currently under public review as TPCx-BB [4]. One of our goals was to evaluate the benchmark itself by performing various scalability tests and validating that it is able to stress-test the processing engines. First, we analyzed the steps necessary to execute the available MapReduce implementation of BigBench [1] on Spark. Then, all 30 BigBench queries were executed on MapReduce/Hive with different scale factors to see how performance changes as the data size increases. Next, the group of HiveQL queries was executed on Spark SQL and compared with the respective Hive runtimes. This report gives a detailed overview of how to set up an experimental Hadoop cluster and execute BigBench on both Hive and Spark SQL. It provides the absolute times for all experiments performed at the different scale factors, as well as query results that can be used to validate correct benchmark execution. Additionally, we describe multiple issues encountered during our work, together with the workarounds used to resolve them. An evaluation of the resource utilization (CPU, memory, disk, and network usage) of a subset of representative BigBench queries is presented to illustrate the behavior of the different query groups on both processing engines. Finally, it is important to mention that large parts of this report are taken from the master's thesis of Max-Georg Beer, entitled Evaluation of BigBench on Apache Spark Compared to MapReduce [5].