ترغب بنشر مسار تعليمي؟ اضغط هنا

With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near- bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging. To address these issues, we propose MPU (Memory-centric Processing Unit), the first SIMT processor based on 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads while leveraging bank-level bandwidth, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute-logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple activated row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPUs hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate 3.46x speedup and 2.57x energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.
As the key advancement of the convolutional neural networks (CNNs), depthwise separable convolutions (DSCs) are becoming one of the most popular techniques to reduce the computations and parameters size of CNNs meanwhile maintaining the model accurac y. It also brings profound impact to improve the applicability of the compute- and memory-intensive CNNs to a broad range of applications, such as mobile devices, which are generally short of computation power and memory. However, previous research in DSCs are largely focusing on compositing the limited existing DSC designs, thus, missing the opportunities to explore more potential designs that can achieve better accuracy and higher computation/parameter reduction. Besides, the off-the-shelf convolution implementations offer limited computing schemes, therefore, lacking support for DSCs with different convolution patterns. To this end, we introduce, DSXplore, the first optimized design for exploring DSCs on CNNs. Specifically, at the algorithm level, DSXplore incorporates a novel factorized kernel -- sliding-channel convolution (SCC), featured with input-channel overlapping to balance the accuracy performance and the reduction of computation and memory cost. SCC also offers enormous space for design exploration by introducing adjustable kernel parameters. Further, at the implementation level, we carry out an optimized GPU-implementation tailored for SCC by leveraging several key techniques, such as the input-centric backward design and the channel-cyclic optimization. Intensive experiments on different datasets across mainstream CNNs show the advantages of DSXplore in balancing accuracy and computation/parameter reduction over the standard convolution and the existing DSCs.
153 - Gushu Li , Yufei Ding , Yuan Xie 2019
More computational resources (i.e., more physical qubits and qubit connections) on a superconducting quantum processor not only improve the performance but also result in more complex chip architecture with lower yield rate. Optimizing both of them s imultaneously is a difficult problem due to their intrinsic trade-off. Inspired by the application-specific design principle, this paper proposes an automatic design flow to generate simplified superconducting quantum processor architecture with negligible performance loss for different quantum programs. Our architecture-design-oriented profiling method identifies program components and patterns critical to both the performance and the yield rate. A follow-up hardware design flow decomposes the complicated design procedure into three subroutines, each of which focuses on different hardware components and cooperates with corresponding profiling results and physical constraints. Experimental results show that our design methodology could outperform IBMs general-purpose design schemes with better Pareto-optimal results.
181 - Gushu Li , Yufei Ding , Yuan Xie 2019
To bridge the gap between limited hardware access and the huge demand for experiments for Noisy Intermediate-Scale Quantum (NISQ) computing system study, a simulator which can capture the modeling of both the quantum processor and its classical contr ol system to realize early-stage evaluation and design space exploration, is naturally invoked but still missing. This paper presents SANQ, a Simulation framework for Architecting NISQ computing system. SANQ consists of two components, 1) an optimized noisy quantum computing (QC) simulator with flexible error modeling accelerated by eliminating redundant computation, and 2) an architectural simulation infrastructure to construct behavior models for evaluating the control systems. SANQ is validated with existing NISQ quantum processor and control systems to ensure simulation accuracy. It can capture the variance on the QC device and simulate the timing behavior precisely (<1% and 10% error for various real control systems). Several potential applications are proposed to show that SANQ could benefit the future design of NISQ compiler, architecture, etc.
120 - Gushu Li , Yufei Ding , Yuan Xie 2018
Due to little consideration in the hardware constraints, e.g., limited connections between physical qubits to enable two-qubit gates, most quantum algorithms cannot be directly executed on the Noisy Intermediate-Scale Quantum (NISQ) devices. Dynamica lly remapping logical qubits to physical qubits in the compiler is needed to enable the two-qubit gates in the algorithm, which introduces additional operations and inevitably reduces the fidelity of the algorithm. Previous solutions in finding such remapping suffer from high complexity, poor initial mapping quality, and limited flexibility and controllability. To address these drawbacks mentioned above, this paper proposes a SWAP-based BidiREctional heuristic search algorithm SABRE, which is applicable to NISQ devices with arbitrary connections between qubits. By optimizing every search attempt,globally optimizing the initial mapping using a novel reverse traversal technique, introducing the decay effect to enable the trade-off between the depth and the number of gates of the entire algorithm, SABRE outperforms the best known algorithm with exponential speedup and comparable or better results on various benchmarks.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا