No Arabic abstract
Heterogeneous manycore architectures are the key to efficiently execute compute- and data-intensive applications. Through silicon via (TSV)-based 3D manycore system is a promising solution in this direction as it enables integration of disparate computing cores on a single system. However, the achievable performance of conventional through-silicon-via (TSV)-based 3D systems is ultimately bottlenecked by the horizontal wires (wires in each planar die). Moreover, current TSV 3D architectures suffer from thermal limitations. Hence, TSV-based architectures do not realize the full potential of 3D integration. Monolithic 3D (M3D) integration, a breakthrough technology to achieve - More Moore and More Than Moore - and opens up the possibility of designing cores and associated network routers using multiple layers by utilizing monolithic inter-tier vias (MIVs) and hence, reducing the effective wire length. Compared to TSV-based 3D ICs, M3D offers the true benefits of vertical dimension for system integration: the size of a MIV used in M3D is over 100x smaller than a TSV. In this work, we demonstrate how M3D-enabled vertical core and uncore elements offer significant performance and thermal improvements in manycore heterogeneous architectures compared to its TSV-based counterpart. To overcome the difficult optimization challenges due to the large design space and complex interactions among the heterogeneous components (CPU, GPU, Last Level Cache, etc.) in an M3D-based manycore chip, we leverage novel design-space exploration algorithms to trade-off different objectives. The proposed M3D-enabled heterogeneous architecture, called HeM3D, outperforms its state-of-the-art TSV-equivalent counterpart by up to 18.3% in execution time while being up to 19 degrees Celcius cooler.
Manycore System-on-Chip include an increasing amount of processing elements and have become an important research topic for improvements of both hardware and software. While research can be conducted using system simulators, prototyping requires a variety of components and is very time consuming. With the Open Tiled Manycore System-on-Chip (OpTiMSoC) we aim at building such an environment for use in our and other research projects as prototyping platform. This paper describes the project goals and aspects of OpTiMSoC and summarizes the current status and ideas.
Graph Neural Network (GNN) is a variant of Deep Neural Networks (DNNs) operating on graphs. However, GNNs are more complex compared to traditional DNNs as they simultaneously exhibit features of both DNN and graph applications. As a result, architectures specifically optimized for either DNNs or graph applications are not suited for GNN training. In this work, we propose a 3D heterogeneous manycore architecture for on-chip GNN training to address this problem. The proposed architecture, ReGraphX, involves heterogeneous ReRAM crossbars to fulfill the disparate requirements of both DNN and graph computations simultaneously. The ReRAM-based architecture is complemented with a multicast-enabled 3D NoC to improve the overall achievable performance. We demonstrate that ReGraphX outperforms conventional GPUs by up to 3.5X (on an average 3X) in terms of execution time, while reducing energy consumption by as much as 11X.
Ultra-fast & low-power superconductor single-flux-quantum (SFQ)-based CNN systolic accelerators are built to enhance the CNN inference throughput. However, shift-register (SHIFT)-based scratchpad memory (SPM) arrays prevent a SFQ CNN accelerator from exceeding 40% of its peak throughput, due to the lack of random access capability. This paper first documents our study of a variety of cryogenic memory technologies, including Vortex Transition Memory (VTM), Josephson-CMOS SRAM, MRAM, and Superconducting Nanowire Memory, during which we found that none of the aforementioned technologies made a SFQ CNN accelerator achieve high throughput, small area, and low power simultaneously. Second, we present a heterogeneous SPM architecture, SMART, composed of SHIFT arrays and a random access array to improve the inference throughput of a SFQ CNN systolic accelerator. Third, we propose a fast, low-power and dense pipelined random access CMOS-SFQ array by building SFQ passive-transmission-line-based H-Trees that connect CMOS sub-banks. Finally, we create an ILP-based compiler to deploy CNN models on SMART. Experimental results show that, with the same chip area overhead, compared to the latest SHIFT-based SFQ CNN accelerator, SMART improves the inference throughput by $3.9times$ ($2.2times$), and reduces the inference energy by $86%$ ($71%$) when inferring a single image (a batch of images).
For a system-level design of Networks-on-Chip for 3D heterogeneous System-on-Chip (SoC), the locations of components, routers and vertical links are determined from an application model and technology parameters. In conventional methods, the two inputs are accounted for separately; here, we define an integrated problem that considers both application model and technology parameters. We show that this problem does not allow for exact solution in reasonable time, as common for many design problems. Therefore, we contribute a heuristic by proposing design steps, which are based on separation of intralayer and interlayer communication. The advantage is that this new problem can be solved with well-known methods. We use 3D Vision SoC case studies to quantify the advantages and the practical usability of the proposed optimization approach. We achieve up to 18.8% reduced white space and up to 12.4% better network performance in comparison to conventional approaches.
3D integration, i.e., stacking of integrated circuit layers using parallel or sequential processing is gaining rapid industry adoption with the slowdown of Moores law scaling. 3D stacking promises potential gains in performance, power and cost but the actual magnitude of gains varies depending on end-application, technology choices and design. In this talk, we will discuss some key challenges associated with 3D design and how design-for-3D will require us to break traditional silos of micro-architecture, circuit/physical design and manufacturing technology to work across abstractions to enable the gains promised by 3D technologies.