ترغب بنشر مسار تعليمي؟ اضغط هنا

Image processing and machine learning applications benefit tremendously from hardware acceleration, but existing compilers target either FPGAs, which sacrifice power and performance for flexible hardware, or ASICs, which rapidly become obsolete as ap plications change. Programmable domain-specific accelerators have emerged as a promising middle-ground between these two extremes, but such architectures have traditionally been difficult compiler targets. The main obstacle is that these accelerators often use a different memory abstraction than CPUs and GPUs: push memories that send a data stream from one computation kernel to other kernels, possibly reordered. To address the compilation challenges caused by push memories, we propose that the representation of memory in the middle and backend of the compiler be altered to combine storage with address generation and control logic in a single structure -- a unified buffer. We show that this compiler abstraction can be implemented efficiently on a programmable accelerator, and design a memory mapping algorithm that combines polyhedral analysis and software vectorization techniques to target our accelerator. Our evaluation shows that the compiler supports programmability while maintaining high performance. It can compile a wide range of image processing and machine learning applications to our accelerator with 4.7x better runtime and 4.3x better energy-efficiency as compared to an FPGA.
44 - Zijian Wang , Qiaoyi Li , Wei Li 2021
Symmetry-protected topological edge modes are one of the most remarkable phenomena in topological physics. Here, we formulate and quantitatively examine the effect of a quantum bath on these topological edge modes. Using the density matrix renormaliz ation group method, we study the ground state of a composite system of spin-1 quantum chain, where the system and the bath degrees of freedom are treated on the same footing. We focus on the dependence of these edge modes on the global/partial symmetries of system-bath coupling and on the features of the quantum bath. It is shown that the time-reversal symmetry(TRS) plays a special role for an open quantum system, where an emergent partial TRS breaking will result in a TRS-protected topological mode diffusing from the system edge into the bath, thus make it useless for quantum computation.
We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing d ense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halides scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا