We designed a device for convolutional neural network (CNN) applications that combines non-volatile MRAM with a co-designed computing-in-memory architecture. It has been successfully fabricated in a 22 nm CMOS silicon process. The device provides more than 40 MB of MRAM and an energy efficiency of 9.9 TOPS/W. It allows multiple models to reside on a single chip for mobile and IoT applications.
Many FPGA vendors have recently included embedded processors in their devices, such as Xilinx with ARM Cortex-A cores, together with programmable logic cells. These devices are known as Programmable System on Chip (PSoC). Their ARM cores (embedded in the processing system, or PS) communicate with the programmable logic cells (PL) using ARM-standard AXI buses. In this paper we analyze the performance of exhaustive data transfers between PS and PL on a Xilinx Zynq FPGA in a real co-design scenario for a Convolutional Neural Network (CNN) accelerator, which processes, in dedicated hardware, a stream of visual information from a neuromorphic visual sensor for classification. On the PS side, a Linux operating system collects visual events from the neuromorphic sensor into a normalized frame, transfers these frames to the multi-layered CNN accelerator, and reads back the results, using an AXI-DMA bus on a per-layer basis. As this kind of accelerator tries to process information as quickly as possible, data bandwidth becomes critical, and maintaining a well-balanced data throughput rate requires some considerations. We present and evaluate several data partitioning techniques to improve the balance between RX and TX transfers, and two different ways of managing transfers: through a polling routine at the user level of the OS, and through a dedicated interrupt-based kernel-level driver. We demonstrate that, for sufficiently long packets, the kernel-level driver solution achieves better timing when computing a CNN classification example. The main advantage of using a kernel-level driver is that it yields safer solutions and lets the OS task scheduler manage other processes that are important for our application, such as frame collection from the sensors and frame normalization.
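The abstract does not spell out its partitioning techniques, so the following is only a minimal sketch of the general idea, assuming a frame is split into fixed-size packets and RX reads are spread evenly between TX writes. The names `partition` and `schedule`, and the proportional interleaving rule, are hypothetical and are not taken from the paper.

```python
def partition(buffer, max_packet_len):
    """Split a buffer into consecutive packets of at most max_packet_len bytes."""
    if max_packet_len <= 0:
        raise ValueError("max_packet_len must be positive")
    return [buffer[i:i + max_packet_len]
            for i in range(0, len(buffer), max_packet_len)]


def schedule(n_tx, n_rx):
    """Interleave n_tx send operations with n_rx receive operations.

    Assumption for illustration: after each TX packet, issue as many RX
    reads as keeps the RX progress proportional to the TX progress.
    """
    ops = []
    rx_done = 0
    for i in range(n_tx):
        ops.append(("tx", i))
        target = (i + 1) * n_rx // n_tx  # RX reads that should be issued by now
        while rx_done < target:
            ops.append(("rx", rx_done))
            rx_done += 1
    # Issue any remaining RX reads (only happens when n_tx is 0).
    while rx_done < n_rx:
        ops.append(("rx", rx_done))
        rx_done += 1
    return ops


if __name__ == "__main__":
    frame = bytes(range(10))
    packets = partition(frame, 4)
    print([len(p) for p in packets])
    print(schedule(len(packets), 2))
```

In a real system each `("tx", i)` or `("rx", j)` entry would correspond to one DMA transfer, completed either by polling a status flag from user space or by waiting for an interrupt in a kernel driver; the sketch models only the ordering.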
This work presents a novel approach for context-aware connectivity and processing optimization of Internet of Things (IoT) networks. Unlike state-of-the-art approaches, the proposed approach simultaneously selects the best connectivity and processing unit (e.g., device, fog, or cloud) along with the percentage of data to be offloaded, by jointly optimizing energy consumption, response time, security, and monetary cost. The proposed scheme employs a reinforcement learning algorithm and achieves significant gains compared to deterministic solutions. In particular, the response-time and security requirements of the IoT devices are taken as inputs along with the remaining battery level of the devices, and the developed algorithm returns an optimized policy. The results show that only our method meets the holistic multi-objective optimization criteria, although the benchmark approaches may achieve better results on a particular metric at the cost of failing to reach the other targets. Thus, the proposed approach is a device-centric and context-aware solution that accounts for monetary and battery constraints.
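To make the joint selection concrete, here is a minimal sketch that swaps in a context-free epsilon-greedy bandit for the paper's reinforcement learning algorithm, which the abstract does not specify. The link names, cost figures, and objective weights in `toy_cost` are invented purely for illustration, and the sketch ignores the battery level and per-device requirements that the actual method takes as inputs.

```python
import itertools
import random

# Hypothetical action space: (link, processing unit, fraction of data offloaded).
ACTIONS = [("none", "device", 0.0)] + list(
    itertools.product(["wifi", "cellular"], ["fog", "cloud"], [0.5, 1.0])
)


def toy_cost(action, rng):
    """Made-up noisy cost: weighted sum of energy, latency, risk, and money."""
    link, unit, frac = action
    tx_energy = {"none": 0.0, "wifi": 0.2, "cellular": 0.4}[link]
    energy = (1 - frac) * 1.0 + frac * tx_energy
    latency = (1 - frac) * 0.9 + frac * {"fog": 0.3, "cloud": 0.5}.get(unit, 0.9)
    risk = frac * {"fog": 0.2, "cloud": 0.4}.get(unit, 0.0)
    money = frac * {"fog": 0.2, "cloud": 0.5}.get(unit, 0.0)
    weights = (0.4, 0.3, 0.2, 0.1)  # assumed relative importance of objectives
    base = sum(w * c for w, c in zip(weights, (energy, latency, risk, money)))
    return base + rng.gauss(0.0, 0.02)


def learn_policy(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: estimate the mean reward of each action."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}  # running mean reward per action
    n = {a: 0 for a in ACTIONS}    # visit count per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)      # explore
        else:
            action = max(ACTIONS, key=q.get)  # exploit the current estimate
        reward = -toy_cost(action, rng)
        n[action] += 1
        q[action] += (reward - q[action]) / n[action]  # incremental mean
    return max(ACTIONS, key=q.get)


if __name__ == "__main__":
    print(learn_policy())
```

Collapsing the four objectives into a single weighted sum is itself a design assumption; a method that must satisfy hard response-time or security requirements would more plausibly treat those as constraints rather than as weights.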
Recent advances in wireless communication and solid-state circuits, together with the enormous demand for sensing ability, have given rise to a new enabling technology: integrated sensing and communications (ISAC). ISAC offers two main advantages over dedicated sensing and communication functionalities: 1) integration gain, to efficiently utilize congested resources, and 2) coordination gain, to balance dual-functional performance and/or perform mutual assistance. Meanwhile, triggered by ISAC, we are also witnessing a paradigm shift in the ubiquitous IoT architecture, in which the sensing and communication layers are tending to converge into a new layer, namely the signaling layer. In this paper, we first introduce a definition of ISAC, analyze the various influencing forces, and present several novel use cases. Then, we complement the understanding of the signaling layer by presenting several key benefits in the IoT era. We classify the dominant existing ISAC solutions based on the layers in which integration is applied. Finally, several challenges and opportunities are discussed. We hope that this overview article will serve as a primary starting point for new researchers and offer a bird's-eye view of existing ISAC-related advances from academia and industry, ranging from solid-state circuitry, signal processing, and wireless communication to mobile computing.
Convolutional neural networks (CNNs) achieve excellent performance on tasks such as image recognition and natural language processing, at the cost of high power consumption. Stochastic computing (SC) is an attractive paradigm for low-power applications that performs arithmetic operations with simple logic and low hardware cost. However, conventional memory structures, designed and optimized for binary computing, lead to extra data conversion costs, which significantly decrease energy efficiency. Therefore, this paper proposes a new memory system designed for an SC-based multiply-accumulate (MAC) engine for CNNs, which is compatible with conventional memory systems. As a result, the overall energy consumption of our new computing structure is 0.91 pJ, a reduction of 82.1% compared with the conventional structure, and the energy efficiency reaches 164.8 TOPS/W.
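The "simple logic" behind stochastic computing can be illustrated with the standard unipolar encoding, in which a value in [0, 1] is represented by the fraction of ones in a random bitstream and multiplication reduces to a bitwise AND of two independent streams. The sketch below is a software model of that textbook technique only; it is not the paper's MAC engine or memory system, and the function names and stream length are assumptions.

```python
import random


def to_stream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_stream(bits):
    """Decode a bitstream back to a value: the fraction of ones."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_mac(weights, activations, length=4096, seed=0):
    """Approximate sum(w * x) by decoding and accumulating AND-ed streams."""
    rng = random.Random(seed)
    total = 0.0
    for w, x in zip(weights, activations):
        product = sc_multiply(to_stream(w, length, rng),
                              to_stream(x, length, rng))
        total += from_stream(product)
    return total


if __name__ == "__main__":
    # Exact result would be 0.5 * 0.5 + 0.25 * 0.8 = 0.45; SC gives an estimate.
    print(sc_mac([0.5, 0.25], [0.5, 0.8]))
```

The result is only an estimate whose error shrinks as the streams get longer, which is the usual accuracy-versus-latency trade-off in SC. Converting between binary words and bitstreams (here `to_stream` and `from_stream`) is the kind of data conversion cost the abstract refers to.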
Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, memory disaggregation research has taken one of two approaches: building or emulating memory nodes with either regular servers or raw memory devices that have no processing power. The former incurs higher monetary cost and faces tail-latency and scalability limitations, while the latter introduces performance, security, and management problems. Server-based memory nodes and memory nodes with no processing power are two extreme approaches. We seek a sweet spot in the middle by proposing a hardware-based memory disaggregation solution that has the right amount of processing power at memory nodes. Furthermore, we take a clean-slate approach by starting from the requirements of memory disaggregation and designing a memory-disaggregation-native system. We propose a hardware-based disaggregated memory system, Clio, that virtualizes and manages disaggregated memory at the memory node. Clio includes a new hardware-based virtual memory system, a customized network system, and a framework for computation offloading. In building Clio, we not only co-design OS functionalities, hardware architecture, and the network system, but also co-design the compute node and memory node. We prototyped Clio's memory node with an FPGA and implemented its client-node functionalities in a user-space library. Clio achieves 100 Gbps throughput and an end-to-end latency of 2.5 µs at the median and 3.2 µs at the 99th percentile. Clio scales much better and has orders of magnitude lower tail latency than RDMA; it delivers 1.1x to 3.4x energy savings compared to CPU-based and SmartNIC-based disaggregated memory systems and is 2.7x faster than software-based SmartNIC solutions.