ﻻ يوجد ملخص باللغة العربية
Putting the DRAM on the same package with a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs mainly optimize for hit latency and do not consider off-chip bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM. We propose a new DRAM cache design, Banshee, that optimizes for both in- and off-package DRAM bandwidth efficiency without degrading access latency. The key ideas are to eliminate the in-package DRAM bandwidth overheads due to costly tag accesses through virtual memory mechanism and to incorporate a bandwidth-aware frequency-based replacement policy that is biased to reduce unnecessary traffic to off-package DRAM. Our extensive evaluation shows that Banshee provides significant performance improvement and traffic reduction over state-of-the-art latency-optimized DRAM cache designs.
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consump
DRAM Main memory is a performance bottleneck for many applications due to the high access latency. In-DRAM caches work to mitigate this latency by augmenting regular-latency DRAM with small-but-fast regions of DRAM that serve as a cache for the data
Hardware flaws are permanent and potent: hardware cannot be patched once fabricated, and any flaws may undermine any software executing on top. Consequently, verification time dominates implementation time. The gold standard in hardware Design Verifi
Tensor computations overwhelm traditional general-purpose computing devices due to the large amounts of data and operations of the computations. They call for a holistic solution composed of both hardware acceleration and software mapping. Hardware/s
Tiled spatial architectures have proved to be an effective solution to build large-scale DNN accelerators. In particular, interconnections between tiles are critical for high performance in these tile-based architectures. In this work, we identify th