Identifying causal relationships between subjects or variables remains an important problem across scientific fields. It is especially challenging in complex systems, such as those involving human behavior, sociotechnical contexts, and natural ecosystems. By exploiting state space reconstruction via lagged embedding of time series, convergent cross mapping (CCM) serves as an important method for addressing this problem. While powerful, CCM is computationally costly; moreover, its results are highly sensitive to several parameter values. Best practice entails exploring a range of parameter settings when assessing causal relationships, but the resulting computational burden can raise barriers to practical use, especially for long time series exhibiting weak causal linkages. We demonstrate here several means of accelerating CCM by harnessing the distributed Apache Spark platform. We characterize and report on the results of several experiments with parallelized solutions that demonstrate high scalability and a capacity for over an order of magnitude performance improvement for the baseline configuration. Such economies in computation time can speed learning and robust identification of causal drivers in complex systems.
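To make the idea of cross mapping over a lagged embedding concrete, the following is a minimal single-machine sketch in plain Python. It is not the implementation evaluated in this work. The parameter names `E` (embedding dimension), `tau` (lag), and `L` (library length), the helper names, and the toy coupled logistic maps are all assumptions introduced for illustration. The sketch follows the commonly described CCM recipe: embed one series, find the `E + 1` nearest neighbours of each embedded point, and use exponentially weighted neighbours to estimate the other series.

```python
import math
from itertools import product


def lagged_embedding(series, E, tau):
    """Return (start, vectors) with vectors[i] = (s[t], s[t-tau], ..., s[t-(E-1)*tau]), t = start + i."""
    start = (E - 1) * tau
    vectors = [tuple(series[t - j * tau] for j in range(E))
               for t in range(start, len(series))]
    return start, vectors


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def cross_map_skill(x, y, E=2, tau=1, L=None):
    """Estimate x from the shadow manifold of y and return the correlation with the true x.

    Skill that is high and grows with L is usually read as evidence that x influences y.
    """
    start, manifold = lagged_embedding(y, E, tau)
    if L is not None:
        manifold = manifold[:L]          # library length: use only the first L embedded points
    if len(manifold) < E + 2:
        raise ValueError("library too short for E + 1 neighbours")
    targets = x[start:start + len(manifold)]
    predictions = []
    for i, point in enumerate(manifold):
        # E + 1 nearest neighbours of this point, excluding the point itself
        nearest = sorted((math.dist(point, q), j)
                         for j, q in enumerate(manifold) if j != i)[:E + 1]
        d0 = nearest[0][0]
        if d0 == 0:
            weights = [1.0 if d == 0 else 0.0 for d, _ in nearest]
        else:
            weights = [math.exp(-d / d0) for d, _ in nearest]
        total = sum(weights)
        predictions.append(sum(w * targets[j] for w, (_, j) in zip(weights, nearest)) / total)
    return pearson(predictions, targets)


def sweep(x, y, Es, taus, Ls):
    """Evaluate every (E, tau, L) combination; each evaluation is independent of the others."""
    return {(E, tau, L): cross_map_skill(x, y, E, tau, L)
            for E, tau, L in product(Es, taus, Ls)}


def coupled_logistic(n, x0=0.4, y0=0.2):
    """Toy data (illustrative coefficients): x evolves on its own and weakly drives y."""
    xs, ys = [x0], [y0]
    for _ in range(n - 1):
        x, y = xs[-1], ys[-1]
        xs.append(x * (3.8 - 3.8 * x))
        ys.append(y * (3.5 - 3.5 * y - 0.1 * x))
    return xs, ys


if __name__ == "__main__":
    xs, ys = coupled_logistic(200)
    for params, skill in sorted(sweep(xs, ys, [2, 3], [1, 2], [50, 150]).items()):
        print(params, round(skill, 3))
```

Because each `(E, tau, L)` configuration in `sweep` is computed independently of the others, a parameter grid of this kind is a natural candidate for distribution, for example by mapping the same function over the grid with PySpark's `sc.parallelize(grid).map(...)`. How the work is actually partitioned in the reported experiments is not stated in this abstract.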
With the era of big data, an explosive amount of information is now available. This enormous increase of Big Data in both academia and industry requires large-scale data processing systems. A large body of research is behind optimizing Spark's perform
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and syst
Distributed data processing ecosystems are widespread and their components are highly specialized, such that efficient interoperability is urgent. Recently, Apache Arrow was chosen by the community to serve as a format mediator, providing efficient i
As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and dat
We propose TRANSMUT-Spark, a tool that automates the mutation testing process of Big Data processing code within Spark programs. Apache Spark is an engine for Big Data Processing. It hides the complexity inherent to Big Data parallel and distributed