Memories that exploit three-dimensional (3D)-stacking technology, which integrates memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), use a network-on-chip (NoC) design to connect their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs on the characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) investigating the implications of such behaviors on system and software designs.
Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small area. Although several studies have taken advantage of the novel architecture of HMC, its characteristics in terms of latency and bandwidth, and their correlation with temperature and power consumption, have not been fully explored. This paper is the first, to the best of our knowledge, to characterize the thermal behavior of HMC in a real environment using the AC-510 accelerator and to identify temperature as a new limitation for this state-of-the-art design space. Moreover, besides bandwidth studies, we deconstruct factors that contribute to latency and reveal their sources for high- and low-load accesses. The results of this paper demonstrate essential behaviors and performance bottlenecks for future explorations of packet-switched and 3D-stacked memories.
Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) have been proposed to exploit both the capacity advantage of NVM and the latency and dynamic energy advantages of DRAM. An important problem for such systems is how to place data between DRAM and NVM to improve system performance. In this paper, we devise the first mechanism, called UBM (page Utility Based hybrid Memory management), that systematically estimates the system performance benefit of placing a page in DRAM versus NVM and uses this estimate to guide data placement. UBM's estimation method consists of two major components. First, it estimates how much an application's stall time can be reduced if the accessed page is placed in DRAM. To do this, UBM comprehensively considers access frequency, row buffer locality, and memory-level parallelism (MLP) to estimate the application's stall time reduction. Second, UBM estimates how much each application's stall time reduction contributes to overall system performance. Based on this estimation method, UBM can determine and place the most critical data in DRAM to directly optimize system performance. Experimental results show that UBM improves system performance by 14% on average (and up to 39%) compared to the best of three state-of-the-art mechanisms for a large number of data-intensive workloads from the SPEC CPU2006 and Yahoo Cloud Serving Benchmark (YCSB) suites.
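The two-step estimate described in the UBM abstract can be illustrated with a minimal sketch. It is not the authors' formula. It assumes, for illustration only, that row-buffer hits cost the same in DRAM and NVM, so that only row-buffer misses benefit from DRAM placement, and that overlapping requests divide the saved latency by the MLP. The names `PageStats`, `stall_reduction_ns`, `utility`, `pick_dram_pages`, the `app_weight` field, and the latency constants are all hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass

# Placeholder row-buffer-miss latencies in nanoseconds (assumed, not measured).
NVM_MISS_NS = 300.0
DRAM_MISS_NS = 100.0


@dataclass
class PageStats:
    page_id: int
    accesses: int        # access frequency observed for the page
    row_hit_rate: float  # row buffer locality, in [0, 1]
    mlp: float           # average number of overlapping outstanding requests
    app_weight: float    # assumed weight of the owning application's stall
                         # time in overall system performance


def stall_reduction_ns(p: PageStats) -> float:
    """Step 1: estimated stall time saved if the page lives in DRAM."""
    misses = p.accesses * (1.0 - p.row_hit_rate)
    saved_per_miss = NVM_MISS_NS - DRAM_MISS_NS
    # Requests served in parallel overlap, so each hides part of the latency.
    return misses * saved_per_miss / max(p.mlp, 1.0)


def utility(p: PageStats) -> float:
    """Step 2: scale the saving by the application's system-level weight."""
    return p.app_weight * stall_reduction_ns(p)


def pick_dram_pages(pages, dram_slots):
    """Place the highest-utility pages in the available DRAM slots."""
    ranked = sorted(pages, key=utility, reverse=True)
    return [p.page_id for p in ranked[:dram_slots]]


pages = [
    PageStats(page_id=0, accesses=20, row_hit_rate=0.5, mlp=1.0, app_weight=1.0),
    PageStats(page_id=1, accesses=50, row_hit_rate=0.0, mlp=2.0, app_weight=1.0),
    PageStats(page_id=2, accesses=30, row_hit_rate=0.0, mlp=1.0, app_weight=0.5),
]
print(pick_dram_pages(pages, dram_slots=2))
```

In this toy input, page 1 ranks first despite its higher MLP because it has the most row-buffer misses, and page 2 outranks page 0 even though its application carries half the weight.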
We introduce ratatoskr, an open-source framework for in-depth power, performance and area (PPA) analysis in NoCs for 3D-integrated and heterogeneous System-on-Chips (SoCs). It covers all layers of abstraction by providing a NoC hardware implementation at RT level, a NoC simulator at cycle-accurate level and an application model at transaction level. With this comprehensive approach, ratatoskr can provide the following specific PPA analyses: Dynamic power of links can be measured within 2.4% accuracy of bit-level simulations while maintaining cycle-accurate simulation speed. Router power is determined from RT level synthesis combined with cycle-accurate simulations. The performance of the whole NoC can be measured via both cycle-accurate and RT level simulations. The performance of individual routers is obtained from RT level, including gate-level verification. The NoC area is calculated from RT level. Despite these manifold features, ratatoskr offers easy two-step user interaction: First, a single point of entry allows setting design parameters, and second, PPA reports are generated automatically. For both the input and the output, different levels of abstraction can be chosen for high-level rapid network analysis or low-level improvement of architectural details. The synthesized NoC model reduces router power and area by up to 32% and 3%, respectively, in comparison to a conventional standard router. As a forward-thinking and unique feature not found in other NoC PPA-measurement tools, ratatoskr supports heterogeneous 3D integration, which is one of the most promising integration paradigms for upcoming SoCs. Thereby, ratatoskr lays the groundwork for designing their communication architectures.
Heterogeneous 3D System-on-Chips (3D SoCs) are the most promising design paradigm to combine sensing and computing within a single chip. A special characteristic of communication networks in heterogeneous 3D SoCs is the varying latency and throughput in each layer. As shown in this work, this variance drastically degrades the network performance. We contribute a co-design of routing algorithms and router microarchitecture that overcomes these performance limitations. First, we analyze the challenges of heterogeneity: we propose technology-aware models for communication and thereby identify layers in which packets are transmitted more slowly. The communication models are precise for latency and throughput under zero load. The technology model has an area error and a timing error of less than 7.4% for various commercial technologies from 90 to 28 nm. Second, we demonstrate how to overcome limitations of heterogeneity by proposing two novel routing algorithms called Z+(XY)Z- and ZXYZ that improve latency by up to 6.5x compared to conventional dimension order routing. Furthermore, we propose a high vertical-throughput router microarchitecture that is adjusted to the routing algorithms and that fully overcomes the limitations of slower layers. We achieve an increased throughput of 2 to 4x compared to a conventional router. Thereby, the dynamic power of routers is reduced by up to 41.1% and we achieve improved flit latency of up to 2.26x at small total router area costs between 2.1% and 10.4% for realistic technologies and application scenarios.
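For context, the conventional dimension order routing used as the baseline above can be sketched for a 3D mesh as follows. This is a generic XYZ variant written for illustration; it is not the proposed Z+(XY)Z- or ZXYZ algorithm, whose details the abstract does not give. The function names `xyz_next_hop` and `route` and the coordinate convention `(x, y, z)` are assumptions.

```python
def xyz_next_hop(cur, dst):
    """Return the next router on an XYZ dimension-order route, or None at dst.

    A packet first corrects its X offset, then Y, then Z (the layer).
    """
    for axis in range(3):
        if cur[axis] != dst[axis]:
            step = 1 if dst[axis] > cur[axis] else -1
            nxt = list(cur)
            nxt[axis] += step
            return tuple(nxt)
    return None  # already at the destination


def route(src, dst):
    """Collect every router visited from src to dst, both inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(xyz_next_hop(path[-1], dst))
    return path


print(route((0, 0, 0), (2, 1, 1)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1)]
```

Because the order of dimensions is fixed, all horizontal hops of this route happen in the source layer, regardless of whether that layer is a fast or a slow one. That is the kind of layer-dependent behavior the abstract identifies as a problem in heterogeneous stacks.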
The von Neumann bottleneck is a clear limitation for data-intensive applications, bringing in-memory computing (IMC) solutions to the fore. Since large data sets are usually stored in nonvolatile memory (NVM), various solutions have been proposed based on emerging memories, such as OxRAM, that rely mainly on an area-hungry one-transistor (1T), one-OxRAM (1R) bit-cell. To tackle this area issue while keeping the programming control provided by the 1T1R bit-cell, we propose to combine gate-all-around stacked junctionless nanowires (1JL) and OxRAM (1R) technology to create a 3-D memory pillar with ultrahigh density. Nanowire junctionless transistors have been fabricated, characterized, and simulated to define current conditions for the whole pillar. Finally, based on Simulation Program with Integrated Circuit Emphasis (SPICE) simulations, we successfully demonstrated scouting logic operations with up to three pillar layers, with one operand per layer.