MRAM Co-designed Processing-in-Memory CNN Accelerator for Mobile and IoT Applications


Published by: Baohua Sun

Publication date: 2018

Research field: Electronic engineering

Language: English

Authors: Baohua Sun - Daniel Liu - Leo Yu

Signal processing


Abstract

We designed a device for Convolutional Neural Network (CNN) applications that combines non-volatile MRAM with a co-designed computing-in-memory architecture. It has been successfully fabricated in a 22nm CMOS silicon process. The chip provides more than 40MB of MRAM density and an efficiency of 9.9 TOPS/W. This enables multiple models to run on a single chip for mobile and IoT device applications.
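For a sense of scale, the 9.9 TOPS/W figure can be converted into an energy per operation. The snippet below is a minimal sketch of that arithmetic, assuming 1 TOPS/W means 10^12 operations per joule. The function name `picojoules_per_op` is hypothetical and does not come from the paper.

```python
def picojoules_per_op(tops_per_watt: float) -> float:
    """Convert an efficiency in TOPS/W to energy per operation in pJ.

    Assumption: 1 TOPS/W = 1e12 operations per joule
    (one watt is one joule per second).
    """
    if tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    joules_per_op = 1.0 / (tops_per_watt * 1e12)
    return joules_per_op * 1e12  # 1 J = 1e12 pJ


energy = picojoules_per_op(9.9)
print(f"{energy:.3f} pJ per operation")  # prints "0.101 pJ per operation"
```

Under this assumption, the reported efficiency corresponds to roughly 0.1 pJ per operation. What counts as one "operation" is not specified in the abstract.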


74 - A. Rios-Navarro , R. Tapiador-Morales , A. Jimenez-Fernandez 2018

Many FPGA vendors have recently included embedded processors in their devices, like Xilinx with ARM Cortex-A cores, together with programmable logic cells. These devices are known as Programmable System on Chip (PSoC). Their ARM cores (embedded in the processing system, or PS) communicate with the programmable logic cells (PL) using ARM-standard AXI buses. In this paper we analyze the performance of exhaustive data transfers between PS and PL for a Xilinx Zynq FPGA in a real co-design scenario for a Convolutional Neural Network (CNN) accelerator, which processes, in dedicated hardware, a stream of visual information from a neuromorphic visual sensor for classification. On the PS side, a Linux operating system is running, which collects visual events from the neuromorphic sensor into a normalized frame, then transfers these frames to the multi-layer CNN accelerator and reads back the results, using an AXI-DMA bus on a per-layer basis. As this kind of accelerator tries to process information as quickly as possible, data bandwidth becomes critical, and maintaining a well-balanced data throughput rate requires some considerations. We present and evaluate several data partitioning techniques to improve the balance between RX and TX transfers, and two different ways of managing transfers: through a polling routine at the user level of the OS, and through a dedicated interrupt-based kernel-level driver. We demonstrate that for sufficiently long packets, the kernel-level driver solution achieves better timing in computing a CNN classification example. The main advantage of using a kernel-level driver is to have safer solutions and to let the OS schedule tasks to manage other important processes for our application, such as collecting frames from sensors and normalizing them.

Distributed, parallel, and cluster computing

Context-Aware Wireless Connectivity and Processing Unit Optimization for IoT Networks

81 - Metin Ozturk , Attai Ibrahim Abubakar , Rao Naveed Bin Rais 2020

A novel approach is presented in this work for context-aware connectivity and processing optimization of Internet of Things (IoT) networks. Unlike state-of-the-art approaches, the proposed approach simultaneously selects the best connectivity and processing unit (e.g., device, fog, and cloud) along with the percentage of data to be offloaded by jointly optimizing energy consumption, response time, security, and monetary cost. The proposed scheme employs a reinforcement learning algorithm and achieves significant gains compared to deterministic solutions. In particular, the requirements of IoT devices in terms of response time and security are taken as inputs along with the remaining battery level of the devices, and the developed algorithm returns an optimized policy. The results show that only our method is able to meet the holistic multi-objective optimisation criteria, although the benchmark approaches may achieve better results on a particular metric at the cost of failing to reach the other targets. Thus, the proposed approach is a device-centric and context-aware solution that accounts for the monetary and battery constraints.

Signal processing, Machine learning, Systems and control

Integrating Sensing and Communications for Ubiquitous IoT: Applications, Trends and Challenges

86 - Yuanhao Cui , Fan Liu , Xiaojun Jing 2021

Recent advances in wireless communication and solid-state circuits, together with the enormous demand for sensing ability, have given rise to a new enabling technology, integrated sensing and communications (ISAC). ISAC captures two main advantages over dedicated sensing and communication functionalities: 1) integration gain, to efficiently utilize congested resources, and 2) coordination gain, to balance dual-functional performance and/or perform mutual assistance. Meanwhile, triggered by ISAC, we are also witnessing a paradigm shift in the ubiquitous IoT architecture, in which the sensing and communication layers are tending to converge into a new layer, namely, the signaling layer. In this paper, we first attempt to introduce a definition of ISAC, analyze the various influencing forces, and present several novel use cases. Then, we complement the understanding of the signaling layer by presenting several key benefits in the IoT era. We classify existing dominant ISAC solutions based on the layers in which integration is applied. Finally, several challenges and opportunities are discussed. We hope that this overview article will serve as a primary starting point for new researchers and offer a bird's-eye view of the existing ISAC-related advances from academia and industry, ranging from solid-state circuitry, signal processing, and wireless communication to mobile computing.

Signal processing

Memory System Designed for Multiply-Accumulate (MAC) Engine Based on Stochastic Computing

100 - Xinyue Zhang , Yuan Wang , Yawen Zhang 2019

Convolutional neural networks (CNNs) achieve excellent performance on tasks such as image recognition and natural language processing, at the cost of high power consumption. Stochastic computing (SC) is an attractive paradigm for low-power applications that performs arithmetic operations with simple logic and low hardware cost. However, conventional memory structures designed and optimized for binary computing lead to extra data conversion costs, which significantly decrease energy efficiency. Therefore, this paper proposes a new memory system for an SC-based multiply-accumulate (MAC) engine applied in CNNs, which is compatible with conventional memory systems. As a result, the overall energy consumption of our new computing structure is 0.91pJ, which is 82.1% lower than that of the conventional structure, and the energy efficiency reaches 164.8 TOPS/W.
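To illustrate what "arithmetic operations with simple logic" means in stochastic computing, here is a minimal sketch of unipolar SC multiplication. A value in [0, 1] is encoded as a random bitstream whose fraction of ones approximates the value, and the product of two independent streams is obtained with a bitwise AND. This is a generic textbook illustration, not the memory system or MAC engine proposed in the paper. All function names, the stream length, and the seed are assumptions chosen for the example.

```python
import random


def to_bitstream(p: float, length: int, rng: random.Random) -> list[int]:
    """Encode a probability p in [0, 1] as a bitstream whose mean is about p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a: list[int], b: list[int]) -> list[int]:
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [x & y for x, y in zip(a, b)]


def decode(stream: list[int]) -> float:
    """Recover the encoded value as the fraction of ones in the stream."""
    return sum(stream) / len(stream)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000             # longer streams give a more accurate estimate
product = sc_multiply(to_bitstream(0.5, n, rng), to_bitstream(0.4, n, rng))
estimate = decode(product)
print(f"estimate = {estimate:.3f}, exact = {0.5 * 0.4:.3f}")
```

The estimate only approaches the exact product as the stream length grows. Turning binary operands into streams and back is the kind of data conversion cost the abstract refers to.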

Signal processing

Clio: A Hardware-Software Co-Designed Disaggregated Memory System

164 - Zhiyuan Guo , Yizhou Shan , Xuhao Luo 2021

Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, memory disaggregation research has taken one of two approaches: building or emulating memory nodes with either regular servers or raw memory devices with no processing power. The former incurs higher monetary cost and faces tail latency and scalability limitations, while the latter introduces performance, security, and management problems. Server-based memory nodes and memory nodes with no processing power are two extreme approaches. We seek a sweet spot in the middle by proposing a hardware-based memory disaggregation solution that has the right amount of processing power at memory nodes. Furthermore, we take a clean-slate approach by starting from the requirements of memory disaggregation and designing a memory-disaggregation-native system. We propose a hardware-based disaggregated memory system, Clio, that virtualizes and manages disaggregated memory at the memory node. Clio includes a new hardware-based virtual memory system, a customized network system, and a framework for computation offloading. In building Clio, we not only co-design OS functionalities, hardware architecture, and the network system, but also co-design the compute node and memory node. We prototyped Clio's memory node with an FPGA and implemented its client-node functionalities in a user-space library. Clio achieves 100 Gbps throughput and an end-to-end latency of 2.5 us at the median and 3.2 us at the 99th percentile. Clio scales much better and has orders of magnitude lower tail latency than RDMA, achieves 1.1x to 3.4x energy savings compared to CPU-based and SmartNIC-based disaggregated memory systems, and is 2.7x faster than software-based SmartNIC solutions.

Distributed, parallel, and cluster computing
