
Edge AI without Compromise: Efficient, Versatile and Accurate Neurocomputing in Resistive Random-Access Memory

Added by: Weier Wan
Publication date: 2021
Language: English
Authors: Weier Wan





Realizing today's cloud-level artificial intelligence functionalities directly on devices distributed at the edge of the internet calls for edge hardware capable of processing multiple modalities of sensory data (e.g. video, audio) at unprecedented energy efficiency. AI hardware architectures today cannot meet the demand due to a fundamental memory wall: data movement between separate compute and memory units consumes large energy and incurs long latency. Resistive random-access memory (RRAM) based compute-in-memory (CIM) architectures promise orders-of-magnitude improvement in energy efficiency by performing computation directly within memory. However, conventional approaches to CIM hardware design limit the functional flexibility necessary for processing diverse AI workloads, and must overcome hardware imperfections that degrade inference accuracy. Such trade-offs between efficiency, versatility and accuracy cannot be addressed by isolated improvements on any single level of the design. By co-optimizing across all hierarchies of the design, from algorithms and architecture to circuits and devices, we present NeuRRAM, the first multimodal edge AI chip using RRAM CIM to simultaneously deliver a high degree of versatility for diverse model architectures, record energy efficiency $5\times$ to $8\times$ better than prior art across various computational bit-precisions, and inference accuracy comparable to software models with 4-bit weights on all measured standard AI benchmarks, including 99.0% on MNIST and 85.7% on CIFAR-10 image classification, 84.7% on Google speech command recognition, and a 70% reduction in image-reconstruction error on a Bayesian image-recovery task. This work paves the way towards building highly efficient and reconfigurable edge AI hardware platforms for the more demanding and heterogeneous AI applications of the future.
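
As an illustration of the compute-in-memory principle behind chips like NeuRRAM, the minimal NumPy sketch below maps 4-bit weights onto differential pairs of conductances and performs an analog matrix-vector multiply via Ohm's and Kirchhoff's laws. The conductance unit and device-noise level are illustrative assumptions, not measured NeuRRAM parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_weights(w, bits=4):
    """Uniformly quantize weights to signed `bits`-bit integers."""
    levels = 2 ** (bits - 1) - 1          # 7 levels per polarity at 4 bits
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale).astype(int), scale

def crossbar_mvm(w_int, v_in, g_unit=1e-6, noise_sigma=0.05):
    """Analog matrix-vector multiply on a differential RRAM crossbar.

    Positive and negative weight parts map to two conductance columns;
    Ohm's law gives per-cell currents and Kirchhoff's current law sums
    them along each bitline. Noise level is an illustrative assumption.
    """
    g_pos = np.clip(w_int, 0, None) * g_unit    # conductances for +weights
    g_neg = np.clip(-w_int, 0, None) * g_unit   # conductances for -weights
    g_pos = g_pos * (1 + noise_sigma * rng.standard_normal(g_pos.shape))
    g_neg = g_neg * (1 + noise_sigma * rng.standard_normal(g_neg.shape))
    i_out = g_pos @ v_in - g_neg @ v_in         # bitline currents (amperes)
    return i_out / g_unit                       # back to weight units

w = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
w_int, scale = quantize_weights(w)
print("analog :", crossbar_mvm(w_int, x) * scale)
print("digital:", w @ x)
```

Comparing the two printed vectors shows the cost of analog CIM directly: the outputs agree up to quantization and programming-noise error, which is why algorithm-level hardening against device imperfections matters for inference accuracy.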



Related research

Artificial intelligence (AI) technologies have advanced dramatically in recent years, bringing revolutionary changes to people's lives. Empowered by edge computing, AI workloads are migrating from centralized cloud architectures to distributed edge systems, introducing a new paradigm called edge AI. While edge AI promises to bring significant gains in autonomy and intelligence to everyday life through common edge devices, it also raises new challenges, especially for the development of its algorithms and the deployment of its services, which call for novel design methodologies catered to these unique challenges. In this paper, we provide a comprehensive survey of the latest enabling design methodologies spanning the entire edge AI development stack. We suggest that the key methodologies for effective edge AI development are single-layer specialization and cross-layer co-design. We discuss representative methodologies in each category in detail, including on-device training methods, specialized software design, dedicated hardware design, benchmarking and design automation, software/hardware co-design, software/compiler co-design, and compiler/hardware co-design. Moreover, we attempt to reveal hidden cross-layer design opportunities that can further boost the solution quality of future edge AI, and provide insights into future directions and emerging areas that require increased research focus.

Virtual memory has been a standard hardware feature for more than three decades. At the price of increased hardware complexity, it has simplified software and promised strong isolation among colocated processes. In modern computing systems, however, the costs of virtual memory have increased significantly. With large memory workloads, virtualized environments, data center computing, and chips with multiple DMA devices, virtual memory can degrade performance and increase power usage. We therefore explore the implications of building applications and operating systems without relying on hardware support for address translation. Primarily, we investigate the implications of removing the abstraction of large contiguous memory segments. Our experiments show that the overhead to remove this reliance is surprisingly small for real programs. We expect this small overhead to be worth the benefit of reducing the complexity and energy usage of address translation. In fact, in some cases, performance can even improve when address translation is avoided.
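
To make concrete what the address-translation hardware does on every memory access, here is a toy Python sketch of a single-level page-table walk. The page size and table layout are simplified assumptions for illustration, not the paper's proposal; real MMUs perform this lookup in hardware with TLB caching, which is exactly the machinery the paper asks whether we can remove.

```python
PAGE_SIZE = 4096  # 4 KiB pages, a common default

# Toy single-level page table: virtual page number -> physical frame number.
page_table = {0: 7, 1: 3, 2: 9}

def translate(vaddr: int) -> int:
    """Translate a virtual address to a physical one via the page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise RuntimeError(f"page fault at 0x{vaddr:x}")  # unmapped page
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))  # vpn 1 -> frame 3, so prints 0x3abc
```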
A promising candidate for universal memory, which would combine the most favourable properties of both high-speed dynamic random access memory (DRAM) and non-volatile flash memory, is resistive random access memory (ReRAM). ReRAM is based on switching back and forth between a high-resistance state (HRS) and a low-resistance state (LRS). ReRAM cells are small, allowing for the creation of memory on the scale of terabits. One of the most promising materials for use as the active medium in resistive memory is hafnia (HfO$_2$). However, the nature of the defects and traps responsible for charge transport in the HRS remains an unresolved question. In this study, we demonstrated experimentally and theoretically that oxygen vacancies are responsible for HRS charge transport in resistive memory elements based on HfO$_2$. We also demonstrated that LRS transport occurs through a mechanism described by percolation theory. Based on a model of multiphonon tunneling between traps, and assuming that the electron traps are oxygen vacancies, good quantitative agreement between the experimental and theoretical current-voltage characteristics was achieved. The thermal excitation energy of the traps in hafnia was determined from the excitation and luminescence spectra of the oxygen vacancies. The findings demonstrate that, in hafnia-based resistive memory elements, oxygen vacancies are the key defects mediating HRS charge transport.
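
To make the percolation picture concrete, the following minimal 2-D site-percolation sketch illustrates the general mechanism (it is not the paper's quantitative model): LRS conduction appears once randomly occupied conducting sites, e.g. vacancy clusters, form a spanning path across the film. The lattice size and occupation probabilities are arbitrary.

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(1)

def spans(grid):
    """True if occupied sites form a cluster connecting top to bottom."""
    labels, _ = label(grid)            # 4-connected cluster labeling
    return bool((set(labels[0]) & set(labels[-1])) - {0})

n, trials = 64, 50
for p in (0.3, 0.5, 0.7):              # site occupation probability
    hits = sum(spans(rng.random((n, n)) < p) for _ in range(trials))
    print(f"p = {p:.1f}: spanning-path probability ~ {hits / trials:.2f}")
```

Below the 2-D site-percolation threshold (about 0.593) essentially no spanning path forms, while above it one almost always does, mirroring the abrupt onset of LRS conduction as conducting defects accumulate.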
Embedded non-volatile memory technologies such as resistive random access memory (RRAM) and spin-transfer torque magnetic RAM (STT-MRAM) are increasingly being researched for application in neuromorphic computing and hardware accelerators for AI. However, the stochastic write processes in these memory technologies affect their yield and need to be studied alongside process variations, which drastically increases the complexity of yield analysis using the Monte Carlo approach. Therefore, we propose an approach based on the Fokker-Planck equation for modeling the stochastic write processes in STT-MRAM and RRAM devices. Moreover, we show that our proposed approach reproduces the experimental results for both STT-MRAM and RRAM devices.
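
The sketch below shows the kind of calculation a Fokker-Planck approach replaces Monte Carlo sampling with: evolving the probability density $P(x,t)$ of a normalized switching variable $x$ under drift and diffusion, then reading the switching yield off the final distribution. The drift term $A(x)$, diffusion constant $D$, and pulse duration here are illustrative assumptions, not fitted device parameters from the paper.

```python
import numpy as np

# Minimal explicit finite-difference solver for a 1-D Fokker-Planck equation
#   dP/dt = -d(A(x) P)/dx + D d^2P/dx^2
# over a normalized switching variable x in [0, 1] (illustrative model).
nx, steps, dt = 200, 20000, 1e-6
dx = 1.0 / nx
x = np.linspace(0, 1, nx)
A = 50.0 * (1 - x)                      # drift slows near the switched state
D = 0.5                                 # diffusion from write stochasticity

P = np.exp(-((x - 0.1) ** 2) / 1e-3)    # cells start near unswitched x = 0.1
P /= P.sum() * dx                       # normalize the density

for _ in range(steps):
    flux = A * P - D * np.gradient(P, dx)   # probability flux J(x)
    P = P - dt * np.gradient(flux, dx)      # dP/dt = -dJ/dx (explicit Euler)
    P = np.clip(P, 0, None)
    P /= P.sum() * dx                       # renormalize after boundary loss

print("switching yield P(x > 0.8):", P[x > 0.8].sum() * dx)
```

One deterministic PDE solve like this yields the full distribution, whereas Monte Carlo would need many thousands of sampled write trajectories to resolve the distribution tails that determine yield.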
Customized hardware accelerators have been developed to provide improved performance and efficiency for DNN inference and training. However, existing hardware accelerators may not always be suitable for handling various DNN models, as their architecture paradigms and configuration trade-offs are highly application-specific. It is important to benchmark accelerator candidates at the earliest stage to gather comprehensive performance metrics and locate potential bottlenecks. Further demands also emerge after benchmarking, requiring adequate solutions to address the bottlenecks and improve current designs for targeted workloads. To achieve these goals, in this paper we leverage an automation tool called DNNExplorer for benchmarking customized DNN hardware accelerators and exploring novel accelerator designs with improved performance and efficiency. Key features include (1) direct support for popular machine learning frameworks for DNN workload analysis, together with accurate analytical models for fast accelerator benchmarking; (2) a novel accelerator design paradigm with high-dimensional design-space support and fine-grained adjustability to overcome existing design drawbacks; and (3) a design space exploration (DSE) engine that generates optimized accelerators by considering targeted AI workloads and available hardware resources. Results show that accelerators adopting the proposed paradigm can deliver up to 4.2X higher throughput (GOP/s) than the state-of-the-art pipeline design in DNNBuilder and up to 2.0X better efficiency than the recently published generic design in HybridDNN, given the same DNN model and resource budgets. With DNNExplorer's benchmarking and exploration features, designers can stay ahead in building and optimizing customized AI accelerators and enable more efficient AI applications.
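
As a sketch of what a DSE engine automates, the loop below exhaustively scores accelerator configurations with a toy analytical model and keeps the best one under a resource budget. The throughput model, parameter grid, and DSP/BRAM cost functions are hypothetical placeholders for illustration, not DNNExplorer's actual models.

```python
from itertools import product

# Hypothetical FPGA resource budget (illustrative numbers).
DSP_BUDGET, BRAM_BUDGET = 2048, 1024

def throughput_gops(pe_rows, pe_cols, freq_mhz):
    """Peak throughput: MACs/cycle * frequency, 2 ops per MAC, in GOP/s."""
    return 2 * pe_rows * pe_cols * freq_mhz / 1e3

def resources(pe_rows, pe_cols, buf_kb):
    dsps = pe_rows * pe_cols            # one DSP per MAC unit (assumption)
    brams = buf_kb // 18 + pe_rows      # 18 Kb BRAM blocks (assumption)
    return dsps, brams

best = None
for rows, cols, buf, freq in product([8, 16, 32], [8, 16, 32],
                                     [256, 512, 1024], [200, 300]):
    dsps, brams = resources(rows, cols, buf)
    if dsps > DSP_BUDGET or brams > BRAM_BUDGET:
        continue                        # prune configurations over budget
    perf = throughput_gops(rows, cols, freq)
    if best is None or perf > best[0]:
        best = (perf, rows, cols, buf, freq)

print("best config (GOP/s, rows, cols, buf KB, MHz):", best)
```

Real DSE engines replace the exhaustive loop with smarter search and replace the one-line throughput formula with layer-by-layer analytical models, but the structure, namely enumerate, prune by resources, rank by modeled performance, is the same.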
