Flare: Flexible In-Network Allreduce

Added by Daniele De Sensi PhD

Publication date 2021

Fields Informatics Engineering

Research language English

Authors Daniele De Sensi - Salvatore Di Girolamo - Saleh Ashkboos

Abstract

The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send the aggregated result back to them. However, existing solutions offer limited customization opportunities and might deliver suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch using PsPIN, a RISC-V architecture implementing the sPIN programming model, as a building block. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
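To make the idea in the abstract concrete, the following is a minimal host-side sketch of what a single aggregating switch does logically: it combines the vectors received from the hosts element-wise and returns the same result to every host. It is only an illustration under stated assumptions, not the algorithm or API described in the paper. The function name `switch_allreduce`, the `(host id, vector)` packet layout, and the `fixed_order` flag are all hypothetical. The flag shows one simple way the reproducibility concern can arise and be avoided: floating-point addition is not associative, so folding packets in arrival order can change the result, while folding in a fixed host order cannot.

```python
import operator
from typing import Callable, List, Sequence, Tuple

# Hypothetical packet layout for this sketch: (host id, data vector).
Packet = Tuple[int, Sequence[float]]


def switch_allreduce(
    packets: Sequence[Packet],
    op: Callable[[float, float], float] = operator.add,
    fixed_order: bool = False,
) -> List[List[float]]:
    """Aggregate packets element-wise as one switch would, then
    return one copy of the aggregated vector per host."""
    if not packets:
        raise ValueError("need at least one packet")
    # Fold in arrival order unless a fixed host-id order is requested.
    ordered = sorted(packets, key=lambda p: p[0]) if fixed_order else list(packets)
    length = len(ordered[0][1])
    if any(len(vec) != length for _, vec in ordered):
        raise ValueError("all hosts must contribute vectors of equal length")
    result = list(ordered[0][1])
    for _, vec in ordered[1:]:
        result = [op(a, b) for a, b in zip(result, vec)]
    # "Send back" the result: every host receives the same vector.
    return [list(result) for _ in ordered]


# Same data, two different arrival orders at the switch.
arrival_a = [(0, [1e16]), (1, [-1e16]), (2, [1.0])]
arrival_b = [(0, [1e16]), (2, [1.0]), (1, [-1e16])]

print(switch_allreduce(arrival_a)[0], switch_allreduce(arrival_b)[0])
print(switch_allreduce(arrival_a, fixed_order=True)[0],
      switch_allreduce(arrival_b, fixed_order=True)[0])
```

Passing a different binary function as `op` (for example the built-in `max`) stands in for a custom operator. A real switch would additionally have to handle packetization, buffering, and sparse data, none of which is modeled here.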

Large Scale Low Power Computing System - Status of Network Design in ExaNeSt and EuroExa Projects

Roberto Ammendola, Andrea Biagioni, Fabrizio Capuani 2018

The deployment of the next generation of computing platforms at ExaFlops scale requires solving new technological challenges, mainly related to the impressive number (up to 10^6) of compute elements required. This impacts system power consumption, in terms of feasibility and costs, as well as system scalability and computing efficiency. From this perspective, the analysis, exploration, and evaluation of technologies characterized by low power, high efficiency, and a high degree of customization are strongly needed. Among the various European initiatives targeting the design of ExaFlops systems, ExaNeSt and EuroExa are EU-H2020 funded initiatives leveraging high-end MPSoC FPGAs. Last-generation MPSoC FPGAs can be seen as non-mainstream but powerful HPC Exascale-enabling components, thanks to the integration of embedded multi-core, ARM-based low-power CPUs and a huge number of hardware resources usable to co-design application-oriented accelerators and to develop a low-latency, high-bandwidth network architecture. In this paper we introduce ExaNet, the FPGA-based, scalable, direct network architecture of the ExaNeSt system. ExaNet allows us to explore different interconnection topologies, to evaluate advanced routing functions for congestion control and fault tolerance, and to design specific hardware components for the acceleration of collective operations. After a brief introduction to the motivations and goals of the ExaNeSt and EuroExa projects, we report on the status of the network architecture design and its hardware/software testbed, adding preliminary bandwidth and latency results.

Distributed, Parallel, and Cluster Computing; Hardware Architecture; Networking and Internet Architecture

Meta-level issues in Offloading: Scoping, Composition, Development, and their Automation

Andre DeHon, Hans Giesen, Nik Sultana 2021

This paper argues for an accelerator development toolchain that takes into account the whole system containing the accelerator. With whole-system visibility, the toolchain can better assist accelerator scoping and composition in the context of the expected workloads and intended performance objectives. Despite being focused on the meta-level of accelerators, this would build on existing and ongoing DSLs and toolchains for accelerator design. Basing this on our experience in programmable networking and reconfigurable-hardware programming, we propose an integrative approach that relies on three activities: (i) generalizing the focus of acceleration to offloading to accommodate a broader variety of non-functional needs, such as security and power use, while using similar implementation approaches, (ii) discovering what to offload, and to what hardware, through semi-automated analysis of a whole system that might compose different offload choices that change over time, (iii) connecting with research and state-of-the-art approaches for using domain-specific languages (DSLs) and high-level synthesis (HLS) systems for custom offload development. We outline how this integration can drive new development tooling that accepts models of programs and resources to assist system designers through design-space exploration for the accelerated system.

Distributed, Parallel, and Cluster Computing; Hardware Architecture; Networking and Internet Architecture

A Non-anchored Unified Naming System for Ad Hoc Computing Environments

Yoo Chul Chung, Dongman Lee 2006

A ubiquitous computing environment consists of many resources that need to be identified by users and applications. Users and developers require some way to identify resources by human-readable names. In addition, ubiquitous computing environments impose additional requirements such as the ability to work well in ad hoc situations and the provision of names that depend on context. The Non-anchored Unified Naming (NUN) system was designed to satisfy these requirements. It is based on relative naming among resources and provides the ability to name arbitrary types of resources. By having resources themselves take part in naming, resources are able to contribute their specialized knowledge to the name resolution process, making context-dependent mapping of names to resources possible. The ease with which new resource types can be added makes it simple to incorporate new types of contextual information within names. In this paper, we describe the naming system and evaluate its use.

Distributed, Parallel, and Cluster Computing; Hardware Architecture; Networking and Internet Architecture

Optimised allgatherv, reduce_scatter and allreduce communication in message-passing systems

Andreas Jocksch, Noe Ohana, Emmanuel Lanti 2020

Collective communications, namely the patterns allgatherv, reduce_scatter, and allreduce in message-passing systems, are optimised based on measurements at the installation time of the library. The algorithms used are set up in an initialisation phase of the communication, similar to the method used in so-called persistent collective communication introduced in the literature. For allgatherv and reduce_scatter, the existing algorithms, recursive multiply/divide and cyclic shift (Bruck's algorithm), are applied with a flexible number of communication ports per node. The algorithms for equal message sizes are used with non-equal message sizes together with a heuristic for rank reordering. The two communication patterns are applied in a plasma physics application that uses a specialised matrix-vector multiplication. For the allreduce pattern, the cyclic shift algorithm is applied with a prefix operation. The data is gathered and scattered by the cores within the node, and the communication algorithms are applied across the nodes. In general, our routines outperform the non-persistent counterparts in established MPI libraries by up to one order of magnitude or show equal performance, with a few exceptions for certain numbers of nodes and message sizes.

Distributed, Parallel, and Cluster Computing

Job Edge-Fog Interconnection Network Creation Game in Internet of Things

Rupei Xu, Andras Farago, Jason P. Jue 2019

This is the first paper to address the topology structure of the Job Edge-Fog interconnection network from the perspective of a network creation game. A two-level network creation game model is given, in which the first level is similar to the traditional network creation game with a total-length objective to other nodes. The second level adopts two types of cost functions: one is based on the Jackson-Wolinsky type of distance-based utility, and the other is based on the Network-Only Cost in the IoT literature. We show the performance (Price of Anarchy) of this two-level game. This work discloses how the selfish strategies of each individual device can influence the global topology structure of the job edge-fog interconnection network and provides theoretical foundations for IoT infrastructure construction. A significant advantage of this framework is that it avoids solving the traditional expensive and impractical quadratic assignment problem, which was the typical framework used to study this task. Furthermore, it can control the systematic performance based only on one or two cost parameters of the job edge-fog networks, independently and in a distributed way.

Distributed, Parallel, and Cluster Computing; Computer Science and Game Theory; Networking and Internet Architecture
