We present a DevIce-to-System Performance EvaLuation (DISPEL) workflow that integrates transistor and interconnect modeling, parasitic extraction, standard cell library characterization, logic synthesis, cell placement and routing, and timing analysis to evaluate the system-level performance of new CMOS technologies. As the impact of parasitic resistances and capacitances continues to increase with dimensional downscaling, component-level optimization alone becomes insufficient, calling for a holistic assessment and optimization methodology that crosses the boundaries between devices, interconnects, circuits, and systems. The physical implementation flow in DISPEL enables realistic analysis of the complex wires and vias in VLSI systems and their impact on chip power, speed, and area, which simple circuit simulations cannot capture. To demonstrate the use of DISPEL, a 32-bit commercial processor core is implemented using theoretical n-type MoS2 and p-type black phosphorus (BP) planar FETs at a projected 5-nm node, and its performance is benchmarked against Si FinFETs. While the superior gate control of the MoS2/BP FETs can theoretically provide a 51% reduction in iso-frequency energy consumption, the actual performance can be greatly limited by the source/drain contact resistances. With the large amount of data generated by DISPEL, a neural network is trained to predict the key performance metrics of the 32-bit processor core from the characteristics of the transistors and interconnects as input features, without the need to go through the time-consuming physical implementation flow. Such machine learning algorithms show great potential as a means of evaluating and optimizing new CMOS technologies and identifying the most significant technology design parameters.
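As an illustration of the surrogate-modeling step described above, the following minimal sketch trains a small regression network on a synthetic stand-in for a DISPEL-generated dataset. The feature set (on-current, off-current, contact resistance, wire RC), units, value ranges, and network size are assumptions for illustration only; they are not specified by the paper.

# Illustrative sketch only (not the paper's model): a small regression
# network mapping technology parameters to predicted core metrics.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical training set: one row per physical-implementation run.
# Features: [Ion (uA/um), Ioff (nA/um), Rcontact (ohm*um),
#            Rwire (ohm/um), Cwire (fF/um)] -- assumed, not from the paper.
X = rng.uniform([500, 0.1, 100, 10, 0.1],
                [1500, 10.0, 1000, 100, 0.3], size=(200, 5))
# Targets: [power (mW), max frequency (GHz), area (mm^2)] -- synthetic
# placeholders standing in for real DISPEL outputs.
y = rng.uniform([10, 0.5, 0.01], [100, 3.0, 0.10], size=(200, 3))

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0),
)
model.fit(X, y)

# Predict metrics for a new technology point without re-running the
# synthesis / place-and-route / timing flow.
print(model.predict([[900, 1.0, 300, 40, 0.2]]))

In practice the network would be trained on real DISPEL outputs rather than these placeholders, so that a single inference replaces hours of synthesis, placement, routing, and timing analysis per technology point.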
Numerous neural network circuits and architectures are under active investigation for artificial intelligence and machine learning applications, and their physical performance metrics (area, time, energy) are estimated here. Various types of neural networks (artificial, cellular, spiking, and oscillator) are implemented with multiple CMOS and beyond-CMOS (spintronic, ferroelectric, resistive memory) devices. A consistent and transparent methodology is proposed and used to benchmark this comprehensive set of options across several application cases. Promising architecture/device combinations are identified.
Optical Network-on-Chip (ONoC) is an emerging technology considered one of the key solutions for future-generation on-chip interconnects. However, silicon photonic devices in ONoC are highly sensitive to temperature variation, which lowers the efficiency of Vertical-Cavity Surface-Emitting Lasers (VCSELs), shifts the resonant wavelength of Microring Resonators (MRs), and ultimately degrades the Signal-to-Noise Ratio (SNR). In this paper, we propose a methodology enabling thermal-aware design for optical interconnects relying on CMOS-compatible VCSELs. Thermal simulations allow designing ONoC interfaces with low temperature gradients, and analytical models allow evaluating the SNR.
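For context, analytical evaluations of this kind typically build on standard first-order relations such as the following textbook forms (illustrative only, not the paper's specific models): the SNR in decibels and the thermo-optic drift of an MR resonance with temperature,

\mathrm{SNR}\,[\mathrm{dB}] = 10 \log_{10}\!\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right),
\qquad
\Delta\lambda_{\mathrm{res}} \approx \frac{\lambda_{\mathrm{res}}}{n_g}\,
\frac{\partial n_{\mathrm{eff}}}{\partial T}\,\Delta T,

where n_g is the group index and \partial n_{\mathrm{eff}}/\partial T is the thermo-optic coefficient of the waveguide. The second relation is what pushes MRs off resonance under on-chip temperature gradients, motivating the thermal-aware interface design above.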
We propose a technology-independent method, referred to as adjacent connection matrix (ACM), to efficiently map signed weight matrices to non-negative crossbar arrays. When compared to same-hardware-overhead mapping methods, using ACM leads to improvements of up to 20% in training accuracy for ResNet-20 with the CIFAR-10 dataset when training with 5-bit precision crossbar arrays or lower. When compared with strategies that use two elements to represent a weight, ACM achieves comparable training accuracies, while also offering area and read energy reductions of 2.3x and 7x, respectively. ACM also has a mild regularization effect that improves inference accuracy in crossbar arrays without any retraining or costly device/variation-aware training.
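To make the underlying mapping problem concrete, the sketch below implements the conventional two-elements-per-weight differential scheme that ACM is compared against (ACM itself is not reproduced here); each signed weight w is realized as g_pos - g_neg using two non-negative, quantized conductances. Parameter names and values are illustrative assumptions.

# Illustrative sketch of the two-elements-per-weight differential
# baseline (not ACM itself): each signed weight is split across two
# non-negative, quantized conductances so that w ~ g_pos - g_neg.
import numpy as np

def differential_map(W, g_max=1.0, bits=5):
    """Map a signed weight matrix to two non-negative conductance
    matrices (G_pos, G_neg) quantized to 2**bits - 1 levels."""
    levels = 2 ** bits - 1
    scale = g_max / np.max(np.abs(W))
    G_pos = np.clip(W, 0, None) * scale   # positive parts
    G_neg = np.clip(-W, 0, None) * scale  # magnitudes of negative parts
    quantize = lambda G: np.round(G / g_max * levels) / levels * g_max
    return quantize(G_pos), quantize(G_neg), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
G_pos, G_neg, scale = differential_map(W)
W_hat = (G_pos - G_neg) / scale   # weights as the array pair realizes them
print(np.abs(W - W_hat).max())    # residual 5-bit quantization error

Because every weight consumes two crossbar cells, this baseline roughly doubles the array footprint and read energy, which is the overhead that a single-array mapping such as ACM aims to avoid.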
Low-latency, high-throughput inference with Convolutional Neural Networks (CNNs) remains a challenge, especially for applications requiring large inputs or large kernel sizes. 4F optics provides a solution for accelerating CNNs by converting convolutions into Fourier-domain point-wise multiplications that are computationally free in the optical domain. However, existing 4F CNN systems suffer from an all-positive sensor readout issue that makes multi-channel, multi-layer CNN implementations poorly scalable or even impractical. In this paper we propose a simple channel tiling scheme for 4F CNN systems that utilizes the high resolution of the 4F system to perform channel summation inherently in the optical domain before sensor detection, so that the outputs of different channels can be correctly accumulated. Compared to the state of the art, channel tiling gives similar accuracy, significantly better robustness to sensing quantization error (33% improvement in required sensing precision) and noise (10 dB reduction in tolerable sensing noise), 0.5x the total filters required, 10-50x+ throughput improvement, and as much as a 3x reduction in required output camera resolution/bandwidth. Requiring no additional optical hardware, the proposed channel tiling approach addresses an important throughput and precision bottleneck of high-speed, massively parallel optical 4F computing systems.
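The numerical sketch below (a pure software stand-in; no optics are simulated) illustrates the two facts the abstract relies on: the convolution theorem that lets a 4F system compute convolutions as Fourier-domain point-wise products, and why channels must be summed before an all-positive intensity readout rather than after. Array shapes and values are arbitrary assumptions.

# Illustrative sketch: convolution via the Fourier domain, and the
# ordering of channel summation vs. all-positive detection.
import numpy as np

def conv2d_fourier(x, k):
    """Circular 2-D convolution as a point-wise product in the Fourier
    domain (what the 4F optical path computes for free)."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=x.shape)))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8))   # 2 input channels
k = rng.standard_normal((2, 3, 3))   # matching kernels

per_channel = np.stack([conv2d_fourier(x[c], k[c]) for c in range(2)])

# Correct multi-channel output: signed sum across channels, then readout.
summed_then_detected = np.abs(per_channel.sum(axis=0))

# Naive per-channel detection destroys sign information before summation.
detected_then_summed = np.abs(per_channel).sum(axis=0)

print(np.allclose(summed_then_detected, detected_then_summed))  # False

The final comparison prints False: detecting each channel separately discards sign information, whereas summing before detection, which channel tiling performs optically, accumulates the channels correctly.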
Uncertainty plays a key role in real-time machine learning. In a significant shift from standard deep networks, which do not consider any uncertainty formulation during training or inference, Bayesian deep networks are currently being investigated, in which the network is envisaged as an ensemble of plausible models learnt via the Bayes formulation in response to uncertainties in the sensory data. Bayesian deep networks consider each synaptic weight as a sample drawn from a probability distribution with a learnt mean and variance. This paper elaborates on a hardware design that exploits the cycle-to-cycle variability of oxide-based Resistive Random Access Memories (RRAMs) as a means to realize such a probabilistic sampling function, instead of viewing it as a disadvantage.
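A minimal software sketch of the sampling scheme described above: each weight is drawn as w = mu + sigma * eps from its learnt Gaussian. Here eps comes from a pseudo-random generator, whereas the proposed hardware would draw it from RRAM cycle-to-cycle read variability; the layer sizes, sigma value, and sample count are illustrative assumptions.

# Illustrative sketch (software analogue, not the proposed circuit):
# a Bayesian layer samples each weight from a learnt Gaussian.
import numpy as np

rng = np.random.default_rng(0)

def bayesian_linear(x, mu, sigma, n_samples=16):
    """Monte-Carlo forward pass: average over sampled weight matrices."""
    outs = []
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)  # hardware: stochastic RRAM read
        W = mu + sigma * eps                 # one plausible model of the ensemble
        outs.append(x @ W)
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.std(axis=0)  # prediction and its spread

mu = 0.1 * rng.standard_normal((4, 3))  # learnt means (placeholder values)
sigma = np.full((4, 3), 0.05)           # learnt per-weight std (placeholder)
x = rng.standard_normal((1, 4))
mean, spread = bayesian_linear(x, mu, sigma)
print(mean, spread)

The spread of the Monte-Carlo outputs is what gives such networks their uncertainty estimate, and it is exactly this repeated stochastic draw that the RRAM variability is proposed to supply for free.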