
Ookami: Deployment and Initial Experiences

Added by Eva Siegmann
Publication date: 2021
Language: English


Ookami is a computer technology testbed supported by the United States National Science Foundation. It provides researchers with access to the A64FX processor, developed by Fujitsu in collaboration with RIKEN for the Japanese path to exascale computing and deployed in Fugaku, the fastest computer in the world. By focusing on crucial architectural details, the Arm-based, multi-core, 512-bit SIMD-vector processor with ultrahigh-bandwidth memory promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications. We review relevant technology and system details; the main body of the paper focuses on initial experiences with the hardware and software ecosystem for micro-benchmarks, mini-apps, and full applications, and starts to answer questions about where such technologies fit into the NSF ecosystem.
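The micro-benchmarks mentioned above typically probe sustained memory bandwidth, the headline feature of the A64FX's high-bandwidth memory. As an illustration only (not code from the paper), a STREAM-triad-style probe can be sketched in NumPy; the array size and GB/s accounting are arbitrary choices for the demo:

```python
import time
import numpy as np

def stream_triad(n=10_000_000, scalar=3.0):
    """STREAM-style triad a[i] = b[i] + scalar * c[i], a standard
    micro-benchmark for sustained memory bandwidth."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    a = np.empty_like(b)
    t0 = time.perf_counter()
    np.multiply(c, scalar, out=a)  # a = scalar * c, written in place
    np.add(a, b, out=a)            # a += b, no temporary arrays
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * n * 8        # nominal STREAM accounting: read b, read c, write a
    return bytes_moved / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    print(f"Triad bandwidth: {stream_triad():.1f} GB/s")
```

A single NumPy process will not saturate an A64FX node; real measurements use all cores and the compiler's SVE vectorisation, but the traffic accounting is the same.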



Related Research

Low-Power Wide-Area Network (LPWAN) is an enabling Internet-of-Things (IoT) technology that supports long-range, low-power, and low-cost connectivity to numerous devices. To avoid the crowd in the limited ISM band (where most LPWANs operate) and cost of licensed band, the recently proposed SNOW (Sensor Network over White Spaces) is a promising LPWAN platform that operates over the TV white spaces. As it is a very recent technology and is still in its infancy, the current SNOW implementation uses the USRP devices as LPWAN nodes, which has high costs (~$750 USD per device) and large form-factors, hindering its applicability in practical deployment. In this paper, we implement SNOW using low-cost, low form-factor, low-power, and widely available commercial off-the-shelf (COTS) devices to enable its practical and large-scale deployment. Our choice of the COTS device (TI CC13x0: CC1310 or CC1350) consequently brings down the cost and form-factor of a SNOW node by 25x and 10x, respectively. Such implementation of SNOW on the CC13x0 devices, however, faces a number of challenges to enable link reliability and communication range. Our implementation addresses these challenges by handling peak-to-average power ratio problem, channel state information estimation, carrier frequency offset estimation, and near-far power problem. Our deployment in the city of Detroit, Michigan demonstrates that CC13x0-based SNOW can achieve uplink and downlink throughputs of 11.2kbps and 4.8kbps per node, respectively, over a distance of 1km. Also, the overall throughput in the uplink increases linearly with the increase in the number of SNOW nodes.
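The carrier frequency offset (CFO) estimation mentioned above can be illustrated with a classic correlation-based (Moose-style) estimator over a repeated preamble; this is a generic sketch, not the SNOW/CC13x0 implementation, and the preamble length, sample rate, and offset are invented for the demo:

```python
import numpy as np

def estimate_cfo(rx, half_len, sample_rate):
    """Moose-style CFO estimate: if the second half of the preamble
    repeats the first, a constant frequency offset rotates it by a
    fixed phase, recovered from the correlation of the two halves."""
    r1 = rx[:half_len]
    r2 = rx[half_len:2 * half_len]
    phase = np.angle(np.sum(np.conj(r1) * r2))
    return phase * sample_rate / (2 * np.pi * half_len)  # offset in Hz

# Synthetic check: impose a known 50 Hz offset on a repeated preamble
fs, n = 48_000, 256
preamble = np.exp(2j * np.pi * np.random.rand(n))  # unit-magnitude symbols
tx = np.tile(preamble, 2)
t = np.arange(2 * n) / fs
rx = tx * np.exp(2j * np.pi * 50.0 * t)
print(round(estimate_cfo(rx, n, fs), 1))  # → 50.0
```

The phase wraps at ±π, so this estimator is unambiguous only for offsets below fs / (2 · half_len); larger offsets need a coarse stage first.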
We describe R-GMA (Relational Grid Monitoring Architecture), which is being developed within the European DataGrid Project as a Grid Information and Monitoring System. It is based on the GMA from GGF, which is a simple Consumer-Producer model. The special strength of this implementation comes from the power of the relational model. We offer a global view of the information as if each VO had one large relational database. We provide a number of different Producer types with different characteristics; for example, some support streaming of information. We also provide combined Consumer/Producers, which are able to combine information and republish it. At the heart of the system is the mediator, which for any query is able to find and connect to the best Producers to do the job. We are able to invoke MDS info-provider scripts and publish the resulting information via R-GMA, in addition to having some of our own sensors. APIs are available which allow the user to deploy monitoring and information services for any application that may be needed in the future. We have used it both for information about the grid (primarily to find what services are available at any one time) and for application monitoring. R-GMA has been deployed in Grid testbeds; we describe the results and experiences of this deployment.
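The Consumer-Producer model over a "virtual" VO-wide relational database can be sketched with SQLite standing in for the global view; the table name, columns, and helper functions below are illustrative assumptions, and the mediator's logic for selecting the best Producers is elided:

```python
import sqlite3

# One in-memory database stands in for the VO-wide relational view;
# in R-GMA the mediator routes each SQL query to the right Producers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE service (site TEXT, name TEXT, status TEXT)")

def producer_publish(site, name, status):
    """A Producer publishes one monitoring tuple into the global view."""
    db.execute("INSERT INTO service VALUES (?, ?, ?)", (site, name, status))

def consumer_query(sql):
    """A Consumer poses an SQL query against the global view."""
    return db.execute(sql).fetchall()

producer_publish("RAL", "gatekeeper", "up")
producer_publish("CERN", "gatekeeper", "down")
print(consumer_query("SELECT site FROM service WHERE status = 'up'"))
# → [('RAL',)]
```

A Consumer/Producer that republishes combined information would simply query this view and insert derived tuples back under another table.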
Shen Li, Yanli Zhao, Rohan Varma (2020)
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
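The gradient-averaging step that keeps model replicas consistent can be sketched without PyTorch; this toy NumPy version mirrors only the arithmetic of an all-reduce mean, not DDP's bucketing or its overlap of communication with computation:

```python
import numpy as np

def allreduce_mean(grads):
    """Average gradients across replicas (the role the all-reduce plays
    in distributed data parallel) so every replica applies the same update."""
    return np.mean(grads, axis=0)

# Two replicas start from the same weights but see different data shards,
# so their locally computed gradients differ.
w = np.array([1.0, -2.0])
local_grads = [np.array([0.2, 0.4]), np.array([0.6, 0.0])]

g = allreduce_mean(local_grads)  # identical averaged gradient everywhere
w_new = w - 0.1 * g              # identical update keeps replicas consistent
print(g, w_new)
```

Bucketing amounts to running this averaging on groups of gradients as they become ready, so communication for early buckets overlaps with the backward pass that is still producing later ones.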
The complexity of modern and upcoming computing architectures poses severe challenges for code developers and application specialists, forcing them to expose the highest possible degree of parallelism in order to make the best use of the available hardware. The second-generation Intel Xeon Phi (code-named Knights Landing, henceforth KNL) is the latest many-core system, implementing several interesting hardware features such as a large number of cores per node (up to 72), 512-bit-wide vector registers, and high-bandwidth memory. The unique features of KNL make this platform a powerful testbed for modern HPC applications, and the performance of codes on KNL is therefore a useful proxy of their readiness for future architectures. In this work we describe the lessons learnt during the optimisation of the widely used computational astrophysics codes P-Gadget-3, Flash, and Echo. Moreover, we present results for the visualisation and analysis tools VisIt and yt. These examples show that modern architectures benefit from code optimisation at several levels, even more than traditional multi-core systems. However, the level of modernisation of typical community codes still needs to improve for them to fully utilise the resources of novel architectures.
Application users have now had about a year of experience with the standardized resource brokering services provided by the workload management package of the EU DataGrid project (WP1). Understanding, shaping, and pushing the limits of the system has provided valuable feedback on both its design and implementation. We give a digest of the lessons and better practices that were learned and applied towards the second major release of the software.
