COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators

82 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Luca Piccolboni

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Luca Piccolboni - Paolo Mantovani - Giuseppe Di Guglielmo

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Hardware accelerators are key to the efficiency and performance of system-on-chip (SoC) architectures. With high-level synthesis (HLS), designers can easily obtain several performance-cost trade-off implementations for each component of a complex hardware accelerator. However, navigating this design space in search of the Pareto-optimal implementations at the system level is a hard optimization task. We present COSMOS, an automatic methodology for the design-space exploration (DSE) of complex accelerators, that coordinates both HLS and memory optimization tools in a compositional way. First, thanks to the co-design of datapath and memory, COSMOS produces a large set of Pareto-optimal implementations for each component of the accelerator. Then, COSMOS leverages compositional design techniques to quickly converge to the desired trade-off point between cost and performance at the system level. When applied to the system-level design (SLD) of an accelerator for wide-area motion imagery (WAMI), COSMOS explores the design space as completely as an exhaustive search, but it reduces the number of invocations to the HLS tool by up to 14.6x.

قيم البحث

219 - Taekyung Heo , Seunghyo Kang , Sanghyeon Lee 2021

Memory disaggregation provides efficient memory utilization across network-connected systems. It allows a node to use part of memory in remote nodes in the same cluster. Recent studies have improved RDMA-based memory disaggregation systems, supportin g lower latency and higher bandwidth than the prior generation of disaggregated memory. However, the current disaggregated memory systems manage remote memory only at coarse granularity due to the limitation of the access validation mechanism of RDMA. In such systems, to support fine-grained remote page allocation, the trustworthiness of all participating systems needs to be assumed, and thus a security breach in a node can propagate to the entire cluster. From the security perspective, the memory-providing node must protect its memory from memory-requesting nodes. On the other hand, the memory-requesting node requires the confidentiality and integrity protection of its memory contents even if they are stored in remote nodes. To address the weak isolation support in the current system, this study proposes a novel hardware-assisted memory disaggregation system. Based on the security features of FPGA, the logic in each per-node FPGA board provides a secure memory disaggregation engine. With its own networks, a set of FPGA-based engines form a trusted memory disaggregation system, which is isolated from the privileged software of each participating node. The secure memory disaggregation system allows fine-grained memory management in memory-providing nodes, while the access validation is guaranteed with the hardware-hardened mechanism. In addition, the proposed system hides the memory access patterns observable from remote nodes, supporting obliviousness. Our evaluation with FPGA implementation shows that such fine-grained secure disaggregated memory is feasible with comparable performance to the latest software-based techniques.

النظم الموزعة والتوازية والحوسبة العنقودية التشفير والأمن

On the Optimization of Behavioral Logic Locking for High-Level Synthesis

97 - Christian Pilato , Luca Collini , Luca Cassano 2021

The globalization of the electronics supply chain is requiring effective methods to thwart reverse engineering and IP theft. Logic locking is a promising solution but there are still several open concerns. Even when applied at high level of abstracti on, logic locking leads to large overhead without guaranteeing that the obfuscation metric is actually maximized. We propose a framework to optimize the use of behavioral logic locking for a given security metric. We explore how to apply behavioral logic locking techniques during the HLS of IP cores. Operating on the chip behavior, our method is compatible with commercial HLS tools, complementing existing industrial design flows. We offer a framework where the designer can implement different meta-heuristics to explore the design space and select where to apply logic locking. Our method optimizes a given security metric better than complete obfuscation, allows us to 1) obtain better protection, 2) reduce the obfuscation cost.

هندسة العتاد التشفير والأمن

Application-level Studies of Cellular Neural Network-based Hardware Accelerators

118 - Qiuwen Lou , Indranil Palit , Tang Li 2019

As cost and performance benefits associated with Moores Law scaling slow, researchers are studying alternative architectures (e.g., based on analog and/or spiking circuits) and/or computational models (e.g., convolutional and recurrent neural network s) to perform application-level tasks faster, more energy efficiently, and/or more accurately. We investigate cellular neural network (CeNN)-based co-processors at the application-level for these metrics. While it is well-known that CeNNs can be well-suited for spatio-temporal information processing, few (if any) studies have quantified the energy/delay/accuracy of a CeNN-friendly algorithm and compared the CeNN-based approach to the best von Neumann algorithm at the application level. We present an evaluation framework for such studies. As a case study, a CeNN-friendly target-tracking algorithm was developed and mapped to an array architecture developed in conjunction with the algorithm. We compare the energy, delay, and accuracy of our architecture/algorithm (assuming all overheads) to the most accurate von Neumann algorithm (Struck). Von Neumann CPU data is measured on an Intel i5 chip. The CeNN approach is capable of matching the accuracy of Struck, and can offer approximately 1000x improvements in energy-delay product.

التقنيات الناشئة الرؤية الحاسوبية وتمييز الأنماط النظم الموزعة والتوازية والحوسبة العنقودية

MELOPPR: Software/Hardware Co-design for Memory-efficient Low-latency Personalized PageRank

180 - Lixiang Li , Yao Chen , Zacharie Zirnheld 2021

Personalized PageRank (PPR) is a graph algorithm that evaluates the importance of the surrounding nodes from a source node. Widely used in social network related applications such as recommender systems, PPR requires real-time responses (latency) for a better user experience. Existing works either focus on algorithmic optimization for improving precision while neglecting hardware implementations or focus on distributed global graph processing on large-scale systems for improving throughput rather than response time. Optimizing low-latency local PPR algorithm with a tight memory budget on edge devices remains unexplored. In this work, we propose a memory-efficient, low-latency PPR solution, namely MeLoPPR, with largely reduced memory requirement and a flexible trade-off between latency and precision. MeLoPPR is composed of stage decomposition and linear decomposition and exploits the node score sparsity: Through stage and linear decomposition, MeLoPPR breaks the computation on a large graph into a set of smaller sub-graphs, that significantly saves the computation memory; Through sparsity exploitation, MeLoPPR selectively chooses the sub-graphs that contribute the most to the precision to reduce the required computation. In addition, through software/hardware co-design, we propose a hardware implementation on a hybrid CPU and FPGA accelerating platform, that further speeds up the sub-graph computation. We evaluate the proposed MeLoPPR on memory-constrained devices including a personal laptop and Xilinx Kintex-7 KC705 FPGA using six real-world graphs. First, MeLoPPR demonstrates significant memory saving by 1.5x to 13.4x on CPU and 73x to 8699x on FPGA. Second, MeLoPPR allows flexible trade-offs between precision and execution time: when the precision is 80%, the speedup on CPU is up to 15x and up to 707x on FPGA; when the precision is around 90%, the speedup is up to 70x on FPGA.

النظم الموزعة والتوازية والحوسبة العنقودية هندسة العتاد

Clio: A Hardware-Software Co-Designed Disaggregated Memory System

164 - Zhiyuan Guo , Yizhou Shan , Xuhao Luo 2021

Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, memory disaggregation research has all taken one of two approaches, building/emulating memory nodes w ith either regular servers or raw memory devices with no processing power. The former incurs higher monetary cost and face tail latency and scalability limitations, while the latter introduce performance, security, and management problems. Server-based memory nodes and memory nodes with no processing power are two extreme approaches. We seek a sweet spot in the middle by proposing a hardware-based memory disaggregation solution that has the right amount of processing power at memory nodes. Furthermore, we take a clean-slate approach by starting from the requirements of memory disaggregation and designing a memory-disaggregation-native system. We propose a hardware-based disaggregated memory system, Clio, that virtualizes and manages disaggregated memory at the memory node. Clio includes a new hardware-based virtual memory system, a customized network system, and a framework for computation offloading. In building Clio, we not only co-design OS functionalities, hardware architecture, and the network system, but also co-design the compute node and memory node. We prototyped Clios memory node with FPGA and implemented its client-node functionalities in a user-space library. Clio achieves 100 Gbps throughput and an end-to-end latency of 2.5 us at median and 3.2 us at the 99th percentile. Clio scales much better and has orders of magnitude lower tail latency than RDMA, and it has 1.1x to 3.4x energy saving compared to CPU-based and SmartNIC-based disaggregated memory systems and is 2.7x faster than software-based SmartNIC solutions.

النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات