ﻻ يوجد ملخص باللغة العربية
The use of multi-chip modules (MCM) and/or multi-socket boards is the most suitable approach to increase the computation density of servers while keep chip yield attained. This paper introduces a new coherence protocol suitable, in terms of complexity and scalability, for this class of systems. The proposal uses two complementary ideas: (1) A mechanism that dissociates complexity from performance by means of colored-token counting, (2) A construct that optimizes performance and cost by means of two functionally symmetrical modules working in the last level cache of each chip (D|F-LLC) and each memory controller (D|F-MEM). Each of these structures is divided into two parts: (2.1) The first one consists of a small loosely inclusive sparse directory where only the most actively shared data are tracked in the chip (D-LLC) from each memory controller (D-MEM) and, (2.2) The second is a d-left Counting Bloom Filter which stores approximate information about the blocks allocated, either inside the chip (F-LLC) or in the home memory controller (F-MEM). The coordinated work of both structures minimizes the coherence-related effects on the average memory latency perceived by the processor. Our proposal is able to improve on the performance of a HyperTransport-like coherence protocol by from 25%-to-60%.
We propose the first Reversible Coherence Protocol (RCP), a new protocol designed from ground up that enables invisible speculative load. RCP takes a bold approach by including the speculative loads and merge/purge operation in the interface between
Cache coherence protocols such as MESI that use writer-initiated invalidation have high complexity and sometimes have poor performance and energy usage, especially under false sharing. Such protocols require numerous transient states, a shared direct
While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the
One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communica
Hardware specialization is becoming a key enabler of energyefficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands. Traditionally, co