ﻻ يوجد ملخص باللغة العربية
Commodity memory interfaces have difficulty in scaling memory capacity to meet the needs of modern multicore and big data systems. DRAM device density and maximum device count are constrained by technology, package, and signal in- tegrity issues that limit total memory capacity. Synchronous DRAM protocols require data to be returned within a fixed latency, and thus memory extension methods over commodity DDRx interfaces fail to support scalable topologies. Current extension approaches either use slow PCIe interfaces, or require expensive changes to the memory interface, which limits commercial adoptability. Here we propose twin-load, a lightweight asynchronous memory access mechanism over the synchronous DDRx interface. Twin-load uses two special loads to accomplish one access request to extended memory, the first serves as a prefetch command to the DRAM system, and the second asynchronously gets the required data. Twin-load requires no hardware changes on the processor side and only slight soft- ware modifications. We emulate this system on a prototype to demonstrate the feasibility of our approach. Twin-load has comparable performance to NUMA extended memory and outperforms a page-swapping PCIe-based system by several orders of magnitude. Twin-load thus enables instant capacity increases on commodity platforms, but more importantly, our architecture opens opportunities for the design of novel, efficient, scalable, cost-effective memory subsystems.
While reduction in feature size makes computation cheaper in terms of latency, area, and power consumption, performance of emerging data-intensive applications is determined by data movement. These trends have introduced the concept of scalability as
Memory system is often the main bottleneck in chipmultiprocessor (CMP) systems in terms of latency, bandwidth and efficiency, and recently additionally facing capacity and power problems in an era of big data. A lot of research works have been done t
Deep Convolutional Neural Networks (CNNs) have become state-of-the art for computer vision and other signal processing tasks due to their superior accuracy. In recent years, large efforts have been made to reduce the computational costs of CNNs in or
High Bandwidth Memory (HBM) provides massive aggregated memory bandwidth by exposing multiple memory channels to the processing units. To achieve high performance, an accelerator built on top of an FPGA configured with HBM (i.e., FPGA-HBM platform) n
There are numerous approaches to building analysis applications across the high-energy physics community. Among them are Python-based, or at least Python-driven, analysis workflows. We aim to ease the adoption of a Python-based analysis toolkit by ma