Rainbow: A Composable Coherence Protocol for Multi-Chip Servers

62 0 0.0 ( 0 )

Download Cite

Added by Valentin Puente

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Lucia G. Menezo - Valentin Puente -

Hardware Architecture

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The use of multi-chip modules (MCM) and/or multi-socket boards is the most suitable approach to increase the computation density of servers while keep chip yield attained. This paper introduces a new coherence protocol suitable, in terms of complexity and scalability, for this class of systems. The proposal uses two complementary ideas: (1) A mechanism that dissociates complexity from performance by means of colored-token counting, (2) A construct that optimizes performance and cost by means of two functionally symmetrical modules working in the last level cache of each chip (D|F-LLC) and each memory controller (D|F-MEM). Each of these structures is divided into two parts: (2.1) The first one consists of a small loosely inclusive sparse directory where only the most actively shared data are tracked in the chip (D-LLC) from each memory controller (D-MEM) and, (2.2) The second is a d-left Counting Bloom Filter which stores approximate information about the blocks allocated, either inside the chip (F-LLC) or in the home memory controller (F-MEM). The coordinated work of both structures minimizes the coherence-related effects on the average memory latency perceived by the processor. Our proposal is able to improve on the performance of a HyperTransport-like coherence protocol by from 25%-to-60%.

rate research

A Case for Reversible Coherence Protocol

366 - You Wu , Xuehai Qian 2020

We propose the first Reversible Coherence Protocol (RCP), a new protocol designed from ground up that enables invisible speculative load. RCP takes a bold approach by including the speculative loads and merge/purge operation in the interface between processor and cache coherence, and allowing them to participate in the coherence protocol. It means, speculative load, ordinary load/store, and merge/purge can all affect the state of a given cache line. RCP is the first coherence protocol that enables the commit and squash of the speculative load among distributed cache components in a general memory hierarchy. RCP incurs an average slowdown of (3.0%,8.3%,7.4%) on (SPEC2006,SPEC2017,PARSEC), which is lower compared to (26.5%,12%,18.3%) in InvisiSpec and (3.2%,9.4%,24.2%) in CleanupSpec. The coherence traffic overhead is on average 46%, compared to 40% and 27% of InvisiSpec and CleanupSpec, respectively. Even with higher traffic overhead (~46%), the performance overhead of RCP is lower than InvisiSpec and comparable to CleanupSpec. It reveals a key advantage of RCP: the coherence actions triggered by the merge and purge operations are not in the critical path of the execution and can be performed in the cache hierarchy concurrently with processor execution

Hardware Architecture Cryptography and Security

Neat: Low-Complexity, Efficient On-Chip Cache Coherence

82 - Rui Zhang , Swarnendu Biswas , Vignesh Balaji 2021

Cache coherence protocols such as MESI that use writer-initiated invalidation have high complexity and sometimes have poor performance and energy usage, especially under false sharing. Such protocols require numerous transient states, a shared directory, and support for core-to-core communication, while also suffering under false sharing. An alternative to MESIs writer-initiated invalidation is self-invalidation, which achieves lower complexity than MESI but adds high performance costs or relies on programmer annotations or specific data access patterns. This paper presents Neat, a low-complexity, efficient cache coherence protocol. Neat uses self-invalidation, thus avoiding MESIs transient states, directory, and core-to-core communication requirements. Neat uses novel mechanisms that effectively avoid many unnecessary self-invalidations. An evaluation shows that Neat is simple and has lower verification complexity than the MESI protocol. Neat not only outperforms state-of-the-art self-invalidation protocols, but its performance and energy consumption are comparable to MESIs, and it outperforms MESI under false sharing.

Hardware Architecture

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

135 - Saiful A. Mojumder , Yifan Sun , Leila Delshadtehrani 2020

While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestamp-based coherence protocol, HALCONE, for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HALCONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6$times$ and 3$times$ better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks.

Hardware Architecture

The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

576 - Andrea Biagioni , Francesca Lo Cicero , Alessandro Lonardo 2012

One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge for an efficient on-chip interconnection network between processors tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

Hardware Architecture Networking and Internet Architecture

A Case for Fine-grain Coherence Specialization in Heterogeneous Systems

70 - Johnathan Alsop , Weon Taek Na , Matthew D. Sinclair 2021

Hardware specialization is becoming a key enabler of energyefficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands. Traditionally, communication between accelerators has been inefficient, typically orchestrated through explicit DMA transfers between different address spaces. More recently, industry has proposed unified coherent memory which enables implicit data movement and more data reuse, but often these interfaces limit the coherence flexibility available to heterogeneous systems. This paper demonstrates the benefits of fine-grained coherence specialization for heterogeneous systems. We propose an architecture that enables low-complexity independent specialization of each individual coherence request in heterogeneous workloads by building upon a simple and flexible baseline coherence interface, Spandex. We then describe how to optimize individual memory requests to improve cache reuse and performance-critical memory latency in emerging heterogeneous workloads. Collectively, our techniques enable significant gains, reducing execution time by up to 61% or network traffic by up to 99% while adding minimal complexity to the Spandex protocol.

Hardware Architecture