ترغب بنشر مسار تعليمي؟ اضغط هنا

F-CAD: A Framework to Explore Hardware Accelerators for Codec Avatar Decoding

106   0   0.0 ( 0 )
 نشر من قبل Xiaofan Zhang
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Creating virtual avatars with realistic rendering is one of the most essential and challenging tasks to provide highly immersive virtual reality (VR) experiences. It requires not only sophisticated deep neural network (DNN) based codec avatar decoders to ensure high visual quality and precise motion expression, but also efficient hardware accelerators to guarantee smooth real-time rendering using lightweight edge devices, like untethered VR headsets. Existing hardware accelerators, however, fail to deliver sufficient performance and efficiency targeting such decoders which consist of multi-branch DNNs and require demanding compute and memory resources. To address these problems, we propose an automation framework, called F-CAD (Facebook Codec avatar Accelerator Design), to explore and deliver optimized hardware accelerators for codec avatar decoding. Novel technologies include 1) a new accelerator architecture to efficiently handle multi-branch DNNs; 2) a multi-branch dynamic design space to enable fine-grained architecture configurations; and 3) an efficient architecture search for picking the optimized hardware design based on both application-specific demands and hardware resource constraints. To the best of our knowledge, F-CAD is the first automation tool that supports the whole design flow of hardware acceleration of codec avatar decoders, allowing joint optimization on decoder designs in popular machine learning frameworks and corresponding customized accelerator design with cycle-accurate evaluation. Results show that the accelerators generated by F-CAD can deliver up to 122.1 frames per second (FPS) and 91.6% hardware efficiency when running the latest codec avatar decoder. Compared to the state-of-the-art designs, F-CAD achieves 4.0X and 2.8X higher throughput, 62.5% and 21.2% higher efficiency than DNNBuilder and HybridDNN by targeting the same hardware device.



قيم البحث

اقرأ أيضاً

Tiled spatial architectures have proved to be an effective solution to build large-scale DNN accelerators. In particular, interconnections between tiles are critical for high performance in these tile-based architectures. In this work, we identify th e inefficiency of the widely used traditional on-chip networks and the opportunity of software-hardware co-design. We propose METRO with the basic idea of decoupling the traffic scheduling policies from hardware fabrics and moving them to the software level. METRO contains two modules working in synergy: METRO software scheduling framework to coordinate the traffics and METRO hardware facilities to deliver the data based on software configurations. We evaluate the co-design using different flit sizes for synthetic study, illustrating its effectiveness under various hardware resource constraints, in addition to a wide range of DNN models selected from real-world workloads. The results show that METRO achieves 56.3% communication speedup on average and up to 73.6% overall processing time reduction compared with traditional on-chip network designs.
Customized hardware accelerators have been developed to provide improved performance and efficiency for DNN inference and training. However, the existing hardware accelerators may not always be suitable for handling various DNN models as their archit ecture paradigms and configuration tradeoffs are highly application-specific. It is important to benchmark the accelerator candidates in the earliest stage to gather comprehensive performance metrics and locate the potential bottlenecks. Further demands also emerge after benchmarking, which require adequate solutions to address the bottlenecks and improve the current designs for targeted workloads. To achieve these goals, in this paper, we leverage an automation tool called DNNExplorer for benchmarking customized DNN hardware accelerators and exploring novel accelerator designs with improved performance and efficiency. Key features include (1) direct support to popular machine learning frameworks for DNN workload analysis and accurate analytical models for fast accelerator benchmarking; (2) a novel accelerator design paradigm with high-dimensional design space support and fine-grained adjustability to overcome the existing design drawbacks; and (3) a design space exploration (DSE) engine to generate optimized accelerators by considering targeted AI workloads and available hardware resources. Results show that accelerators adopting the proposed novel paradigm can deliver up to 4.2X higher throughput (GOP/s) than the state-of-the-art pipeline design in DNNBuilder and up to 2.0X improved efficiency than the recently published generic design in HybridDNN given the same DNN model and resource budgets. With DNNExplorers benchmarking and exploration features, we can be ahead at building and optimizing customized AI accelerators and enable more efficient AI applications.
The advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors has brought on new opportunities for applying both Deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can fa cilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies including emerging memristive devices, Field Programmable Gate Arrays (FPGAs), and Complementary Metal Oxide Semiconductor (CMOS) can be used to develop efficient DL accelerators to solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. The tutorial is augmented with case studies of the vast literature on neural network and neuromorphic hardware as applied to the healthcare domain. We benchmark various hardware platforms by performing a sensor fusion signal processing task combining electromyography (EMG) signals with computer vision. Comparisons are made between dedicated neuromorphic processors and embedded AI accelerators in terms of inference latency and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that various accelerators and neuromorphic processors introduce to healthcare and biomedical domains.
Virtual memory (VM) is critical to the usability and programmability of hardware accelerators. Unfortunately, implementing accelerator VM efficiently is challenging because the area and power constraints make it difficult to employ the large multi-le vel TLBs used in general-purpose CPUs. Recent research proposals advocate a number of restrictions on virtual-to-physical address mappings in order to reduce the TLB size or increase its reach. However, such restrictions are unattractive because they forgo many of the original benefits of traditional VM, such as demand paging and copy-on-write. We propose SPARTA, a divide and conquer approach to address translation. SPARTA splits the address translation into accelerator-side and memory-side parts. The accelerator-side translation hardware consists of a tiny TLB covering only the accelerators cache hierarchy (if any), while the translation for main memory accesses is performed by shared memory-side TLBs. Performing the translation for memory accesses on the memory side allows SPARTA to overlap data fetch with translation, and avoids the replication of TLB entries for data shared among accelerators. To further improve the performance and efficiency of the memory-side translation, SPARTA logically partitions the memory space, delegating translation to small and efficient per-partition translation hardware. Our evaluation on index-traversal accelerators shows that SPARTA virtually eliminates translation overhead, reducing it by over 30x on average (up to 47x) and improving performance by 57%. At the same time, SPARTA requires minimal accelerator-side translation hardware, reduces the total number of TLB entries in the system, gracefully scales with memory size, and preserves all key VM functionalities.
To speedup Deep Neural Networks (DNN) accelerator design and enable effective implementation, we propose HybridDNN, a framework for building high-performance hybrid DNN accelerators and delivering FPGA-based hardware implementations. Novel techniques include a highly flexible and scalable architecture with a hybrid Spatial/Winograd convolution (CONV) Processing Engine (PE), a comprehensive design space exploration tool, and a complete design flow to fully support accelerator design and implementation. Experimental results show that the accelerators generated by HybridDNN can deliver 3375.7 and 83.3 GOPS on a high-end FPGA (VU9P) and an embedded FPGA (PYNQ-Z1), respectively, which achieve a 1.8x higher performance improvement compared to the state-of-art accelerator designs. This demonstrates that HybridDNN is flexible and scalable and can target both cloud and embedded hardware platforms with vastly different resource constraints.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا