No Arabic abstract
Truly polymorphic circuits, whose functionality/circuit behavior can be altered using a control variable, can provide tremendous benefits in multi-functional system design and resource sharing. For secure and fault tolerant hardware designs these can be crucial as well. Polymorphic circuits work in literature so far either rely on environmental parameters such as temperature, variation etc. or on special devices such as ambipolar FET, configurable magnetic devices, etc., that often result in inefficiencies in performance and/or realization. In this paper, we introduce a novel polymorphic circuit design approach where deterministic interference between nano-metal lines is leveraged for logic computing and configuration. For computing, the proposed approach relies on nano-metal lines, their interference and commonly used FETs, and for polymorphism, it requires only an extra metal line that carries the control signal. In this paper, we show a wide range of crosstalk polymorphic (CT-P) logic gates and their evaluation results. We also show an example of a large circuit that performs both the functionalities of multiplier and sorter depending on the configuration signal. Our benchmarking results are presented in this paper. For CT-P, the transistor count was found to be significantly less compared to other existing approaches, ranging from 25% to 83%. For example, CT-P AOI21-OA21 cell show 83%, 85% and 50% transistor count reduction, and MultiplierSorter circuit show 40%, 36% and 28% transistor count reduction with respect to CMOS, genetically evolved, and ambipolar transistor based polymorphic circuits respectively.
When a computational task tolerates a relaxation of its specification or when an algorithm tolerates the effects of noise in its execution, hardware, programming languages, and system software can trade deviations from correct behavior for lower resource usage. We present, for the first time, a synthesis of research results on computing systems that only make as many errors as their users can tolerate, from across the disciplines of computer aided design of circuits, digital system design, computer architecture, programming languages, operating systems, and information theory. Rather than over-provisioning resources at each layer to avoid errors, it can be more efficient to exploit the masking of errors occurring at one layer which can prevent them from propagating to a higher layer. We survey tradeoffs for individual layers of computing systems from the circuit level to the operating system level and illustrate the potential benefits of end-to-end approaches using two illustrative examples. To tie together the survey, we present a consistent formalization of terminology, across the layers, which does not significantly deviate from the terminology traditionally used by research communities in their layer of focus.
We study the problem of sparse nonlinear model recovery of high dimensional compositional functions. Our study is motivated by emerging opportunities in neuroscience to recover fine-grained models of biological neural circuits using collected measurement data. Guided by available domain knowledge in neuroscience, we explore conditions under which one can recover the underlying biological circuit that generated the training data. Our results suggest insights of both theoretical and practical interests. Most notably, we find that a sign constraint on the weights is a necessary condition for system recovery, which we establish both theoretically with an identifiability guarantee and empirically on simulated biological circuits. We conclude with a case study on retinal ganglion cell circuits using data collected from mouse retina, showcasing the practical potential of this approach.
Ultra-fast & low-power superconductor single-flux-quantum (SFQ)-based CNN systolic accelerators are built to enhance the CNN inference throughput. However, shift-register (SHIFT)-based scratchpad memory (SPM) arrays prevent a SFQ CNN accelerator from exceeding 40% of its peak throughput, due to the lack of random access capability. This paper first documents our study of a variety of cryogenic memory technologies, including Vortex Transition Memory (VTM), Josephson-CMOS SRAM, MRAM, and Superconducting Nanowire Memory, during which we found that none of the aforementioned technologies made a SFQ CNN accelerator achieve high throughput, small area, and low power simultaneously. Second, we present a heterogeneous SPM architecture, SMART, composed of SHIFT arrays and a random access array to improve the inference throughput of a SFQ CNN systolic accelerator. Third, we propose a fast, low-power and dense pipelined random access CMOS-SFQ array by building SFQ passive-transmission-line-based H-Trees that connect CMOS sub-banks. Finally, we create an ILP-based compiler to deploy CNN models on SMART. Experimental results show that, with the same chip area overhead, compared to the latest SHIFT-based SFQ CNN accelerator, SMART improves the inference throughput by $3.9times$ ($2.2times$), and reduces the inference energy by $86%$ ($71%$) when inferring a single image (a batch of images).
Maintaining and updating shortest paths information in a graph is a fundamental problem with many applications. As computations on dense graphs can be prohibitively expensive, and it is preferable to perform the computations on a sparse skeleton of the given graph that roughly preserves the shortest paths information. Spanners and emulators serve this purpose. This paper develops fast dynamic algorithms for sparse spanner and emulator maintenance and provides evidence from fine-grained complexity that these algorithms are tight. Under the popular OMv conjecture, we show that there can be no decremental or incremental algorithm that maintains an $n^{1+o(1)}$ edge (purely additive) $+n^{delta}$-emulator for any $delta<1/2$ with arbitrary polynomial preprocessing time and total update time $m^{1+o(1)}$. Also, under the Combinatorial $k$-Clique hypothesis, any fully dynamic combinatorial algorithm that maintains an $n^{1+o(1)}$ edge $(1+epsilon,n^{o(1)})$-spanner or emulator must either have preprocessing time $mn^{1-o(1)}$ or amortized update time $m^{1-o(1)}$. Both of our conditional lower bounds are tight. As the above fully dynamic lower bound only applies to combinatorial algorithms, we also develop an algebraic spanner algorithm that improves over the $m^{1-o(1)}$ update time for dense graphs. For any constant $epsilonin (0,1]$, there is a fully dynamic algorithm with worst-case update time $O(n^{1.529})$ that whp maintains an $n^{1+o(1)}$ edge $(1+epsilon,n^{o(1)})$-spanner. Our new algebraic techniques and spanner algorithms allow us to also obtain (1) a new fully dynamic algorithm for All-Pairs Shortest Paths (APSP) with update and path query time $O(n^{1.9})$; (2) a fully dynamic $(1+epsilon)$-approximate APSP algorithm with update time $O(n^{1.529})$; (3) a fully dynamic algorithm for near-$2$-approximate Steiner tree maintenance.
DRAM Main memory is a performance bottleneck for many applications due to the high access latency. In-DRAM caches work to mitigate this latency by augmenting regular-latency DRAM with small-but-fast regions of DRAM that serve as a cache for the data held in the regular-latency region of DRAM. While an effective in-DRAM cache can allow a large fraction of memory requests to be served from a fast DRAM region, the latency savings are often hindered by inefficient mechanisms for relocating copies of data into and out of the fast regions. Existing in-DRAM caches have two sources of inefficiency: (1) the data relocation granularity is an entire multi-kilobyte row of DRAM; and (2) because the relocation latency increases with the physical distance between the slow and fast regions, multiple fast regions are physically interleaved among slow regions to reduce the relocation latency, resulting in increased hardware area and manufacturing complexity. We propose a new substrate, FIGARO, that uses existing shared global buffers among subarrays within a DRAM bank to provide support for in-DRAM data relocation across subarrays at the granularity of a single cache block. FIGARO has a distance-independent latency within a DRAM bank, and avoids complex modifications to DRAM. Using FIGARO, we design a fine-grained in-DRAM cache called FIGCache. The key idea of FIGCache is to cache only small, frequently-accessed portions of different DRAM rows in a designated region of DRAM. By caching only the parts of each row that are expected to be accessed in the near future, we can pack more of the frequently-accessed data into FIGCache, and can benefit from additional row hits in DRAM. Our evaluations show that FIGCache improves the average performance of a system using DDR4 DRAM by 16.3% and reduces average DRAM energy consumption by 7.8% for 8-core workloads, over a conventional system without in-DRAM caching.