
LightOn Optical Processing Unit: Scaling-up AI and HPC with a Non von Neumann co-processor

Added by Laurent Daudet
Publication date: 2021
Language: English





We introduce LightOn's Optical Processing Unit (OPU), the first photonic AI accelerator chip available on the market for at-scale Non von Neumann computations, reaching 1500 TeraOPS. It relies on a combination of free-space optics and off-the-shelf components, together with a software API allowing seamless integration into Python-based processing pipelines. We discuss a variety of use cases and hybrid network architectures, with the OPU used in combination with CPUs/GPUs, and draw a pathway towards optical advantage.
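The software API mentioned above targets Python pipelines. As a rough illustration of how such a co-processor slots into an ordinary workflow, the sketch below emulates the optical transform on CPU as a complex random projection followed by intensity detection; this is a hypothetical stand-in with made-up names (simulated_opu_transform), not LightOn's actual API.

import numpy as np

def simulated_opu_transform(x_binary, n_features_out, seed=0):
    """CPU stand-in for the optical transform: random projection
    followed by intensity (squared-modulus) detection."""
    rng = np.random.default_rng(seed)
    n_features_in = x_binary.shape[1]
    # A complex Gaussian matrix plays the role of the scattering medium.
    r = rng.standard_normal((n_features_in, n_features_out)) \
        + 1j * rng.standard_normal((n_features_in, n_features_out))
    return np.abs(x_binary @ r) ** 2

# Toy usage: binarize the input features, project them, then feed the
# resulting random features to any downstream CPU/GPU model.
x = (np.random.rand(128, 784) > 0.5).astype(np.float32)
features = simulated_opu_transform(x, n_features_out=10_000)
print(features.shape)  # (128, 10000)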



Related research

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
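The 92 TOPS peak figure follows from the size of the matrix unit and the TPU's 700 MHz clock reported in the paper, counting each 8-bit multiply-accumulate as two operations; a quick back-of-the-envelope check in Python:

# Peak throughput of the TPU matrix unit (one MAC = 2 ops: multiply + add).
macs = 256 * 256          # 65,536 8-bit MACs in the systolic array
clock_hz = 700e6          # 700 MHz clock, per the TPU paper
peak_tops = macs * clock_hz * 2 / 1e12
print(f"{peak_tops:.1f} TOPS")  # ~91.8 TOPS, matching the ~92 TOPS quoted above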
Stefan Hollands (2021)
We prove a version of the data-processing inequality for the relative entropy for general von Neumann algebras, with an explicit lower bound involving the measured relative entropy. The inequality, which generalizes previous work by Sutter et al. on finite-dimensional density matrices, yields a bound on how well a quantum state can be recovered after it has been passed through a channel. The natural applications of our results are in quantum field theory, where the von Neumann algebras are known to be of type III. Along the way, we generalize various multi-trace inequalities to general von Neumann algebras.
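Schematically, in the finite-dimensional setting of Sutter et al. that this result generalizes, the strengthened inequality reads as follows, with \(\mathcal{N}\) a quantum channel, \(D_{\mathbb{M}}\) the measured relative entropy, and \(\mathcal{R}_{\sigma,\mathcal{N}}\) a recovery channel; the exact von Neumann-algebraic statement carries additional technical hypotheses, so this display is only an orientation:

\[
D(\rho \,\|\, \sigma) \;-\; D\big(\mathcal{N}(\rho) \,\big\|\, \mathcal{N}(\sigma)\big)
\;\ge\; D_{\mathbb{M}}\Big(\rho \,\Big\|\, \big(\mathcal{R}_{\sigma,\mathcal{N}} \circ \mathcal{N}\big)(\rho)\Big).
\]

In words: a small loss of relative entropy under the channel forces the state \(\rho\) to be approximately recoverable from \(\mathcal{N}(\rho)\).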
A non-volatile SRAM cell is proposed for low-power applications using Spin Transfer Torque-Magnetic Tunnel Junction (STT-MTJ) devices. This novel cell offers non-volatile storage, thus allowing selected blocks of SRAM to be switched off during standby operation. To further increase the power savings, a write-termination circuit is designed which detects the completion of an MTJ write and closes the bidirectional current path for the MTJ. A reduction of 25.81% in the number of transistors and a reduction of 2.95% in power consumption are achieved in comparison to prior work on write-termination circuits.
Nanopore genome sequencing is the key to enabling personalized medicine, global food security, and virus surveillance. The state-of-the-art base-callers adopt deep neural networks (DNNs) to translate electrical signals generated by nanopore sequencers into digital DNA symbols. A DNN-based base-caller consumes 44.5% of the total execution time of a nanopore sequencing pipeline. However, it is difficult to quantize a base-caller and build a power-efficient processing-in-memory (PIM) accelerator to run the quantized base-caller. In this paper, we propose a novel algorithm/architecture co-designed PIM, Helix, to power-efficiently and accurately accelerate nanopore base-calling. From the algorithm perspective, we present systematic-error-aware training to minimize the number of systematic errors in a quantized base-caller. From the architecture perspective, we propose a low-power SOT-MRAM-based ADC array to process analog-to-digital conversion operations and improve the power efficiency of prior DNN PIMs. Moreover, we revise a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and create a SOT-MRAM binary comparator array to process read voting. Compared to state-of-the-art PIMs, Helix improves base-calling throughput by 6x, throughput per Watt by 11.9x, and throughput per mm^2 by 7.5x without degrading base-calling accuracy.
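In its simplest greedy (best-path) form, the CTC decoding step that Helix accelerates amounts to taking the most likely symbol at each time step, collapsing consecutive repeats, and dropping blanks. The sketch below is a generic software illustration of that operation, not the paper's in-memory implementation, and the function name is hypothetical.

import numpy as np

def ctc_greedy_decode(logits, blank=0, alphabet="ACGT"):
    """Greedy (best-path) CTC decode: argmax per time step,
    collapse consecutive repeats, then remove blank symbols."""
    best_path = np.argmax(logits, axis=1)  # (T,) class index per time step
    collapsed = [k for i, k in enumerate(best_path)
                 if i == 0 or k != best_path[i - 1]]
    return "".join(alphabet[k - 1] for k in collapsed if k != blank)

# Toy usage: T=6 time steps, classes = [blank, A, C, G, T].
logits = np.random.rand(6, 5)
print(ctc_greedy_decode(logits))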
Artificial intelligence (AI) technologies have dramatically advanced in recent years, resulting in revolutionary changes in people's lives. Empowered by edge computing, AI workloads are migrating from centralized cloud architectures to distributed edge systems, introducing a new paradigm called edge AI. While edge AI has the promise of bringing significant increases in autonomy and intelligence into everyday lives through common edge devices, it also raises new challenges, especially for the development of its algorithms and the deployment of its services, which call for novel design methodologies tailored to these unique challenges. In this paper, we provide a comprehensive survey of the latest enabling design methodologies that span the entire edge AI development stack. We suggest that the key methodologies for effective edge AI development are single-layer specialization and cross-layer co-design. We discuss representative methodologies in each category in detail, including on-device training methods, specialized software design, dedicated hardware design, benchmarking and design automation, software/hardware co-design, software/compiler co-design, and compiler/hardware co-design. Moreover, we attempt to reveal hidden cross-layer design opportunities that can further boost the solution quality of future edge AI and provide insights into future directions and emerging areas that require increased research focus.
