ﻻ يوجد ملخص باللغة العربية
Distributed data processing ecosystems are widespread and their components are highly specialized, such that efficient interoperability is urgent. Recently, Apache Arrow was chosen by the community to serve as a format mediator, providing efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers. This enables practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on-par or more performant than native Spark.
Experimental Particle Physics has been at the forefront of analyzing the worlds largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and syst
With the era of big data, an explosive amount of information is now available. This enormous increase of Big Data in both academia and industry requires large-scale data processing systems. A large body of research is behind optimizing Sparks perform
Identifying the causal relationships between subjects or variables remains an important problem across various scientific fields. This is particularly important but challenging in complex systems, such as those involving human behavior, sociotechnica
We propose TRANSMUT-Spark, a tool that automates the mutation testing process of Big Data processing code within Spark programs. Apache Spark is an engine for Big Data Processing. It hides the complexity inherent to Big Data parallel and distributed
As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and dat