No Arabic abstract
This paper argues for an accelerator development toolchain that takes into account the whole system containing the accelerator. With whole-system visibility, the toolchain can better assist accelerator scoping and composition in the context of the expected workloads and intended performance objectives. Despite being focused on the meta-level of accelerators, this would build on existing and ongoing DSLs and toolchains for accelerator design. Basing this on our experience in programmable networking and reconfigurable-hardware programming, we propose an integrative approach that relies on three activities: (i) generalizing the focus of acceleration to offloading to accommodate a broader variety of non-functional needs -- such as security and power use -- while using similar implementation approaches, (ii) discovering what to offload, and to what hardware, through semi-automated analysis of a whole system that might compose different offload choices that changeover time, (iii) connecting with research and state-of-the-art approaches for using domain-specific languages (DSLs) and high-level synthesis (HLS) systems for custom offload development. We outline how this integration can drive new development tooling that accepts models of programs and resources to assist system designers through design-space exploration for the accelerated system.
The deployment of the next generation computing platform at ExaFlops scale requires to solve new technological challenges mainly related to the impressive number (up to 10^6) of compute elements required. This impacts on system power consumption, in terms of feasibility and costs, and on system scalability and computing efficiency. In this perspective analysis, exploration and evaluation of technologies characterized by low power, high efficiency and high degree of customization is strongly needed. Among the various European initiative targeting the design of ExaFlops system, ExaNeSt and EuroExa are EU-H2020 funded initiatives leveraging on high end MPSoC FPGAs. Last generation MPSoC FPGAs can be seen as non-mainstream but powerful HPC Exascale enabling components thanks to the integration of embedded multi-core, ARM-based low power CPUs and a huge number of hardware resources usable to co-design application oriented accelerators and to develop a low latency high bandwidth network architecture. In this paper we introduce ExaNet the FPGA-based, scalable, direct network architecture of ExaNeSt system. ExaNet allow us to explore different interconnection topologies, to evaluate advanced routing functions for congestion control and fault tolerance and to design specific hardware components for acceleration of collective operations. After a brief introduction of the motivations and goals of ExaNeSt and EuroExa projects, we will report on the status of network architecture design and its hardware/software testbed adding preliminary bandwidth and latency achievements.
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
A ubiquitous computing environment consists of many resources that need to be identified by users and applications. Users and developers require some way to identify resources by human readable names. In addition, ubiquitous computing environments impose additional requirements such as the ability to work well with ad hoc situations and the provision of names that depend on context. The Non-anchored Unified Naming (NUN) system was designed to satisfy these requirements. It is based on relative naming among resources and provides the ability to name arbitrary types of resources. By having resources themselves take part in naming, resources are able to able contribute their specialized knowledge into the name resolution process, making context-dependent mapping of names to resources possible. The ease of which new resource types can be added makes it simple to incorporate new types of contextual information within names. In this paper, we describe the naming system and evaluate its use.
Mobile devices supporting the Internet of Things (IoT), often have limited capabilities in computation, battery energy, and storage space, especially to support resource-intensive applications involving virtual reality (VR), augmented reality (AR), multimedia delivery and artificial intelligence (AI), which could require broad bandwidth, low response latency and large computational power. Edge cloud or edge computing is an emerging topic and technology that can tackle the deficiency of the currently centralized-only cloud computing model and move the computation and storage resource closer to the devices in support of the above-mentioned applications. To make this happen, efficient coordination mechanisms and offloading algorithms are needed to allow the mobile devices and the edge cloud to work together smoothly. In this survey paper, we investigate the key issues, methods, and various state-of-the-art efforts related to the offloading problem. We adopt a new characterizing model to study the whole process of offloading from mobile devices to the edge cloud. Through comprehensive discussions, we aim to draw an overall big picture on the existing efforts and research directions. Our study also indicates that the offloading algorithms in edge cloud have demonstrated profound potentials for future technology and application development.
Machine Learning (ML) and Internet of Things (IoT) are complementary advances: ML techniques unlock complete potentials of IoT with intelligence, and IoT applications increasingly feed data collected by sensors into ML models, thereby employing results to improve their business processes and services. Hence, orchestrating ML pipelines that encompasses model training and implication involved in holistic development lifecycle of an IoT application often leads to complex system integration. This paper provides a comprehensive and systematic survey on the development lifecycle of ML-based IoT application. We outline core roadmap and taxonomy, and subsequently assess and compare existing standard techniques used in individual stage.