ترغب بنشر مسار تعليمي؟ اضغط هنا

FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads

90   0   0.0 ( 0 )
 نشر من قبل Myoungsoo Jung
 تاريخ النشر 2019
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPUs multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts a read-level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks over a small portion of SRAM in the L1D cache. To further reduce the off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating the associativity with the limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references by 32% across the interconnection network, thereby improving the overall performance by 217% and reducing energy cost by 53%.



قيم البحث

اقرأ أيضاً

252 - K. K. Chang , D. Lee , Z. Chishti 2018
This article summarizes the idea of refresh-access parallelism, which was published in HPCA 2014, and examines the works significance and future potential. The overarching objective of our HPCA 2014 paper is to reduce the significant negative perform ance impact of DRAM refresh with intelligent memory controller mechanisms. To mitigate the negative performance impact of DRAM refresh, our HPCA 2014 paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of state-of-the-art per-bank refresh mechanism by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as it is done today, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP proactively schedules refreshes during intervals when a batch of writes are draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Our extensive evaluations on a wide variety of workloads and systems show that our mechanisms improve system performance (and energy efficiency) compared to three state-of-the-art refresh policies, and their performance bene ts increase as DRAM density increases.
One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communica tions between processors, recent multi-tile (i.e. multi-core) architectures face the challenge for an efficient on-chip interconnection network between processors tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.
This paper summarizes the idea of ChargeCache, which was published in HPCA 2016 [51], and examines the works significance and future potential. DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low- cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.
As a unique mechanism for MRAMs, magnetic coupling needs to be accounted for when designing memory arrays. This paper models both intra- and inter-cell magnetic coupling analytically for STT-MRAMs and investigates their impact on the write performanc e and retention of MTJ devices, which are the data-storing elements of STT-MRAMs. We present magnetic measurement data of MTJ devices with diameters ranging from 35nm to 175nm, which we use to calibrate our intra-cell magnetic coupling model. Subsequently, we extrapolate this model to study inter-cell magnetic coupling in memory arrays. We propose the inter-cell magnetic coupling factor Psi to indicate coupling strength. Our simulation results show that Psi=2% maximizes the array density under the constraint that the magnetic coupling has negligible impact on the devices performance. Higher array densities show significant variations in average switching time, especially at low switching voltages, caused by inter-cell magnetic coupling, and dependent on the data pattern in the cells neighborhood. We also observe a marginal degradation of the data retention time under the influence of inter-cell magnetic coupling.
GPUs offer orders-of-magnitude higher memory bandwidth than traditional CPU-only systems. However, GPU device memory tends to be relatively small and the memory capacity can not be increased by the user. This paper describes Buddy Compression, a sche me to increase both the effective GPU memory capacity and bandwidth while avoiding the downsides of conventional memory-expanding strategies. Buddy Compression compresses GPU memory, splitting each compressed memory entry between high-speed device memory and a slower-but-larger disaggregated memory pool (or system memory). Highly-compressible memory entries can thus be accessed completely from device memory, while incompressible entries source their data using both on and off-device accesses. Increasing the effective GPU memory capacity enables us to run larger-memory-footprint HPC workloads and larger batch-sizes or models for DL workloads than current memory capacities would allow. We show that our solution achieves an average compression ratio of 2.2x on HPC workloads and 1.5x on DL workloads, with a slowdown of just 1~2%.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا