New challenges in Astronomy and Astrophysics (AA) are urging the need for a large number of exceptionally computationally intensive simulations. Exascale (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound phase of innovation, in which the primary challenge on the path to Exascale is power consumption. The goal of this work is to give some insights into the performance and energy footprint of contemporary architectures for a real astrophysical application in an HPC context. We use a state-of-the-art N-body application that we re-engineered and optimized to fully exploit the underlying heterogeneous hardware. We quantitatively evaluate the impact of computation on energy consumption when running on four different platforms. Two of them represent current HPC systems (Intel-based and equipped with NVIDIA GPUs), one is a micro-cluster based on ARM MPSoCs, and one is a prototype towards Exascale equipped with ARM MPSoCs tightly coupled with FPGAs. We investigate the behavior of the different devices: the high-end GPUs excel in terms of time-to-solution, while the MPSoC-FPGA systems outperform the GPUs in power consumption. Our experience shows that considering FPGAs for computationally intensive applications seems very promising, as their performance is improving to meet the requirements of scientific applications. This work can serve as a reference for the development of future platforms for astrophysics applications where computationally intensive calculations are required.
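As a rough illustration (not the authors' code) of the kind of kernel such an N-body application offloads to GPUs or FPGAs, a direct-summation acceleration update might look like the sketch below; the function name, softening parameter, and unit convention are assumptions made only for this example.

    import numpy as np

    def accelerations(pos, mass, softening=1e-3):
        # pos: (N, 3) positions; mass: (N,) masses; G = 1 units assumed.
        diff = pos[None, :, :] - pos[:, None, :]            # pairwise separations r_j - r_i
        dist2 = (diff ** 2).sum(axis=-1) + softening ** 2   # softened squared distances
        inv_d3 = dist2 ** -1.5
        np.fill_diagonal(inv_d3, 0.0)                       # drop self-interaction
        # a_i = sum_j m_j (r_j - r_i) / |r_j - r_i|^3
        return (diff * (mass[None, :] * inv_d3)[:, :, None]).sum(axis=1)

The quadratic pairwise work is the part that benefits from GPU throughput, and it is also the part a fixed FPGA data path can execute at lower power, consistent with the behavior reported above.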
The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other, less well-known machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In this paper, we introduce StreamBrain, a framework that allows neural networks based on BCPNN to be practically deployed in High-Performance Computing systems. StreamBrain is a domain-specific language (DSL), similar in concept to existing machine learning (ML) frameworks, and supports backends for CPUs, GPUs, and even FPGAs. We empirically demonstrate that StreamBrain can train on the well-known ML benchmark dataset MNIST within seconds, and we are the first to demonstrate BCPNN on networks of STL-10 size. We also show how StreamBrain can be used to train with custom floating-point formats, and illustrate the impact of different bfloat variants on BCPNN using FPGAs.
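A minimal sketch of how a custom bfloat-style format can be emulated in software for this kind of precision study, assuming truncation of float32 mantissa bits (this is not the StreamBrain API; the function and parameter names are invented for illustration):

    import numpy as np

    def truncate_to_bfloat(x, mantissa_bits=7):
        # bfloat16 keeps 7 explicit mantissa bits; smaller values emulate
        # narrower bfloat-like variants while keeping the float32 exponent range.
        x = np.asarray(x, dtype=np.float32)
        bits = x.view(np.uint32)
        drop = 23 - mantissa_bits                        # float32 has 23 mantissa bits
        mask = np.uint32((0xFFFFFFFF >> drop) << drop)   # zero the discarded bits
        return (bits & mask).view(np.float32)

Note that this sketch truncates (rounds toward zero), whereas hardware conversions typically round to nearest even, so it only approximates the precision behavior studied above.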
In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional approach of storing the discretised derivatives in global arrays for solution advancement is found to be inefficient in terms of both energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on-the-fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal with respect to both energy consumption and runtime. On all three hardware architectures considered, a speed-up of ~2 and an energy saving of ~2 are observed for the compute-intensive algorithms compared to the memory-intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed, and the GPU achieves an energy saving of ~5 compared to the same algorithm on the CPU node.
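A minimal sketch contrasting the two algorithm classes, assuming a 1D periodic grid and second-order central differences (the names and the model equation are illustrative, not the paper's code):

    import numpy as np

    def rhs_stored(u, dx):
        # Derivative materialised in a full array before use: memory-bandwidth
        # bound, matching the "store to global arrays" class described above.
        dudx = (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * dx)
        return -u * dudx

    def rhs_on_the_fly(u, dx):
        # Derivative recomputed per point and kept in a local scalar: more
        # arithmetic per memory access, matching the "on-the-fly" class.
        out = np.empty_like(u)
        n = u.size
        for i in range(n):
            dudx_i = (u[(i + 1) % n] - u[(i - 1) % n]) / (2.0 * dx)
            out[i] = -u[i] * dudx_i
        return out

In a compiled stencil code the second pattern raises arithmetic intensity and reduces global-memory traffic, which is the property the abstract associates with better runtime and energy figures.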
Understanding network and application performance is essential for debugging, improving user experience, and performance comparison. Meanwhile, modern mobile systems are optimized for energy-efficient computation and communication, which may limit network and application performance. In recent years, several tools have emerged that analyze the network performance of mobile applications in situ with the help of the VPN service. There is a limited understanding of how these measurement tools and system optimizations affect network and application performance. In this study, we first demonstrate that mobile systems employ energy-aware system hardware tuning, which affects application performance and network throughput. We next show that VPN-based application performance measurement tools, such as Lumen, PrivacyGuard, and Video Optimizer, yield ambiguous network performance measurements and degrade application performance. Our findings suggest that sound application and network performance measurement on Android devices requires a good understanding of the device, networks, measurement tools, and applications.
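As one illustration of the kind of energy-aware system state worth recording during such measurements, the standard Linux/Android cpufreq sysfs interface exposes the active governor and the current frequency; the paths below are the usual sysfs locations, while the helper name is invented for this sketch.

    def cpu_state(cpu=0):
        # Read the cpufreq governor and current frequency for one core.
        base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
        with open(f"{base}/scaling_governor") as f:
            governor = f.read().strip()
        with open(f"{base}/scaling_cur_freq") as f:
            freq_khz = int(f.read().strip())
        return governor, freq_khz

    # e.g. sample before and during a throughput test to observe frequency scaling
    print(cpu_state(0))

Logging such state alongside throughput samples helps separate measurement-tool overhead from the hardware tuning effects described above.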
The movement of large-scale (tens of terabytes and larger) data sets between high performance computing (HPC) facilities is an important and increasingly critical capability. A growing number of scientific collaborations rely on HPC facilities for tasks that either require large-scale data sets as input or produce large-scale data sets as output. In order to enable the transfer of these data sets as needed by the scientific community, HPC facilities must design and deploy the appropriate data transfer capabilities to allow users to perform data placement at scale. This paper describes the Petascale DTN Project, an effort undertaken by four HPC facilities, which succeeded in achieving routine data transfer rates of over 1 PB/week between the facilities. We describe the design and configuration of the Data Transfer Node (DTN) clusters used for large-scale data transfers at these facilities, the software tools used, and the performance tuning that enabled this capability.
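For scale, 1 PB/week corresponds to a sustained average rate of roughly 13 Gb/s; a quick back-of-the-envelope check (not taken from the paper):

    petabyte_bytes = 1e15                    # SI petabyte
    seconds_per_week = 7 * 24 * 3600
    rate_gbps = petabyte_bytes * 8 / seconds_per_week / 1e9
    print(f"{rate_gbps:.1f} Gb/s")           # ~13.2 Gb/s sustained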
Current detectors for Very-High-Energy $\gamma$-ray astrophysics are either pointing instruments with a small field of view (Cherenkov telescopes), or large field-of-view instruments with relatively large energy thresholds (extensive air shower detectors). In this article, we propose a new hybrid extensive air shower detector sensitive in an energy region starting from about 100 GeV. The detector combines a small water-Cherenkov detector, able to provide a calorimetric measurement of shower particles at ground, with resistive plate chambers which contribute significantly to the accurate shower geometry reconstruction. A full simulation of this detector concept shows that it is able to reach better sensitivity than any previous gamma-ray wide field-of-view experiment in the sub-TeV energy region. It is expected to detect, with a $5\sigma$ significance, a source fainter than the Crab Nebula in one year at $100\,$GeV and, above $1\,$TeV, a source as faint as 10% of it. As such, this instrument is suited to detect transient phenomena, making it a very powerful tool to trigger observations of variable sources and to detect transients coupled to gravitational waves and gamma-ray bursts.