The ever-increasing computational complexity of fast-growing Deep Neural Networks (DNNs) has called for new computing paradigms to overcome the memory wall of conventional Von Neumann architectures. The emerging Computing-In-Memory (CIM) architecture is a promising candidate for accelerating neural network computing. However, data movement between CIM arrays may still dominate the total power consumption in conventional designs. This paper proposes a flexible CIM processor architecture named Domino that enables stream computing and local data access to significantly reduce data movement energy. In addition, Domino employs tailored distributed instruction scheduling within a Network-on-Chip (NoC) to implement inter-memory computing and attain mapping flexibility. Evaluation with prevailing CNN models shows that Domino achieves 1.15-to-9.49$\times$ power efficiency over several state-of-the-art CIM accelerators and improves throughput by 1.57-to-12.96$\times$.
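To illustrate why the abstract emphasizes inter-array data movement, the sketch below is a hypothetical back-of-envelope energy model, not a method from the paper: every function name and energy constant is an assumed placeholder, and it only contrasts a shared-bus dataflow with a stream-computing dataflow where each layer's outputs are consumed by a neighboring array.

```python
# Hypothetical back-of-envelope model (not from the paper): it only illustrates
# why reducing inter-array data movement matters in a CIM pipeline.
# All energy constants are assumed placeholder values, not measured numbers.

def layer_energy_pj(activations, e_mac, macs, e_move):
    """Energy of one layer: in-array MAC energy plus moving its outputs onward."""
    return macs * e_mac + activations * e_move

def pipeline_energy_pj(layers, e_mac, e_move_bus, e_move_local):
    """Compare a shared-bus dataflow with a streaming dataflow in which each
    layer's activations are passed locally to the adjacent CIM array."""
    bus = sum(layer_energy_pj(a, e_mac, m, e_move_bus) for a, m in layers)
    stream = sum(layer_energy_pj(a, e_mac, m, e_move_local) for a, m in layers)
    return bus, stream

# Assumed workload: (output activations, MAC operations) per layer.
layers = [(64 * 56 * 56, 1.2e8), (128 * 28 * 28, 2.3e8), (256 * 14 * 14, 2.3e8)]

bus, stream = pipeline_energy_pj(
    layers,
    e_mac=0.005,       # pJ per in-memory MAC (placeholder)
    e_move_bus=10.0,   # pJ per activation moved over a shared bus/buffer path (placeholder)
    e_move_local=1.0,  # pJ per activation streamed to an adjacent array (placeholder)
)
print(f"shared-bus: {bus / 1e6:.1f} uJ, streaming: {stream / 1e6:.1f} uJ, "
      f"saving: {100 * (1 - stream / bus):.0f}%")
```

Under these assumed constants, activation movement accounts for roughly half of the shared-bus energy, which is the kind of breakdown that motivates Domino's local-access, streaming design.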