The ever-increasing computational complexity of fast-growing Deep Neural Networks (DNNs) has called for new computing paradigms to overcome the memory wall of conventional Von Neumann architectures. The emerging Computing-In-Memory (CIM) architecture is a promising candidate for accelerating neural network computing. However, data movement between CIM arrays may still dominate the total power consumption in conventional designs. This paper proposes a flexible CIM processor architecture named Domino that enables stream computing and local data access to significantly reduce data movement energy. In addition, Domino employs tailored distributed instruction scheduling within a Network-on-Chip (NoC) to implement inter-memory computing and attain mapping flexibility. Evaluation with prevailing CNN models shows that Domino achieves 1.15-to-9.49$\times$ power efficiency over several state-of-the-art CIM accelerators and improves throughput by 1.57-to-12.96$\times$.
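To illustrate why the abstract emphasizes inter-array data movement, the sketch below is a hypothetical back-of-envelope energy model, not a method from the paper: every function name and energy constant is an assumed placeholder, and it only contrasts a shared-bus dataflow with a stream-computing dataflow where each layer's outputs are consumed by a neighboring array.

```python
# Hypothetical back-of-envelope model (not from the paper): it only illustrates
# why reducing inter-array data movement matters in a CIM pipeline.
# All energy constants are assumed placeholder values, not measured numbers.

def layer_energy_pj(activations, e_mac, macs, e_move):
    """Energy of one layer: in-array MAC energy plus moving its outputs onward."""
    return macs * e_mac + activations * e_move

def pipeline_energy_pj(layers, e_mac, e_move_bus, e_move_local):
    """Compare a shared-bus dataflow with a streaming dataflow in which each
    layer's activations are passed locally to the adjacent CIM array."""
    bus = sum(layer_energy_pj(a, e_mac, m, e_move_bus) for a, m in layers)
    stream = sum(layer_energy_pj(a, e_mac, m, e_move_local) for a, m in layers)
    return bus, stream

# Assumed workload: (output activations, MAC operations) per layer.
layers = [(64 * 56 * 56, 1.2e8), (128 * 28 * 28, 2.3e8), (256 * 14 * 14, 2.3e8)]

bus, stream = pipeline_energy_pj(
    layers,
    e_mac=0.005,       # pJ per in-memory MAC (placeholder)
    e_move_bus=10.0,   # pJ per activation moved over a shared bus/buffer path (placeholder)
    e_move_local=1.0,  # pJ per activation streamed to an adjacent array (placeholder)
)
print(f"shared-bus: {bus / 1e6:.1f} uJ, streaming: {stream / 1e6:.1f} uJ, "
      f"saving: {100 * (1 - stream / bus):.0f}%")
```

Under these assumed constants, activation movement accounts for roughly half of the shared-bus energy, which is the kind of breakdown that motivates Domino's local-access, streaming design.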