Performance Evaluation of Advanced Features in CUDA Unified Memory

300 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Steven W. D. Chien

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Steven W. D. Chien - Ivy B. Peng - Stefano Markidis

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

CUDA Unified Memory improves the GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to evaluate both in-memory and oversubscription performance. The results show that memory advises on the Intel-Volta/Pascal-PCIe platform bring negligible improvement for in-memory executions. However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to the basic CUDA Unified Memory. In contrast, the Power9-Volta-NVLink platform can substantially benefit from memory advises, achieving up to 34% performance gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases GPU page faults and results in considerable performance loss. The CUDA prefetch also shows different performance impact on the two platforms. It improves performance by up to 50% on the Intel-Volta/Pascal-PCI-E platform but brings little benefit to the Power9-Volta-NVLink platform.

قيم البحث

118 - Riccardo Spezialetti , Samuele Salti , Luigi Di Stefano 2019

Matching surfaces is a challenging 3D Computer Vision problem typically addressed by local features. Although a variety of 3D feature detectors and descriptors has been proposed in literature, they have seldom been proposed together and it is yet not clear how to identify the most effective detector-descriptor pair for a specific application. A promising solution is to leverage machine learning to learn the optimal 3D detector for any given 3D descriptor [15]. In this paper, we report a performance evaluation of the detector-descriptor pairs obtained by learning a paired 3D detector for the most popular 3D descriptors. In particular, we address experimental settings dealing with object recognition and surface registration.

الرؤية الحاسوبية وتمييز الأنماط

Performance evaluation over HW/SW co-design SoC memory transfers for a CNN accelerator

74 - A. Rios-Navarro , R. Tapiador-Morales , A. Jimenez-Fernandez 2018

Many FPGAs vendors have recently included embedded processors in their devices, like Xilinx with ARM-Cortex A cores, together with programmable logic cells. These devices are known as Programmable System on Chip (PSoC). Their ARM cores (embedded in t he processing system or PS) communicates with the programmable logic cells (PL) using ARM-standard AXI buses. In this paper we analyses the performance of exhaustive data transfers between PS and PL for a Xilinx Zynq FPGA in a co-design real scenario for Convolutional Neural Networks (CNN) accelerator, which processes, in dedicated hardware, a stream of visual information from a neuromorphic visual sensor for classification. In the PS side, a Linux operating system is running, which recollects visual events from the neuromorphic sensor into a normalized frame, and then it transfers these frames to the accelerator of multi-layered CNNs, and read results, using an AXI-DMA bus in a per-layer way. As these kind of accelerators try to process information as quick as possible, data bandwidth becomes critical and maintaining a good balanced data throughput rate requires some considerations. We present and evaluate several data partitioning techniques to improve the balance between RX and TX transfer and two different ways of transfers management: through a polling routine at the userlevel of the OS, and through a dedicated interrupt-based kernellevel driver. We demonstrate that for longer enough packets, the kernel-level driver solution gets better timing in computing a CNN classification example. Main advantage of using kernel-level driver is to have safer solutions and to have tasks scheduling in the OS to manage other important processes for our application, like frames collection from sensors and their normalization.

النظم الموزعة والتوازية والحوسبة العنقودية

CRUM: Checkpoint-Restart Support for CUDAs Unified Memory

152 - Rohan Garg , Apoore Mohan , Michael Sullivan 2018

Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming s tyle is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.

النظم الموزعة والتوازية والحوسبة العنقودية

Optimizing Memory Performance of Xilinx FPGAs under Vitis

291 - Ruoshi Li , Hongjing Huang , Zeke Wang 2020

Plenty of research efforts have been devoted to FPGA-based acceleration, due to its low latency and high energy efficiency. However, using the original low-level hardware description languages like Verilog to program FPGAs requires generally good kno wledge of hardware design details and hand-on experiences. Fortunately, the FPGA community intends to address this low programmability issues. For example, , with the intention that programming FPGAs is just as easy as programming GPUs. Even though Vitis is proven to increase programmability, we cannot directly obtain high performance without careful design regarding hardware pipeline and memory subsystem.In this paper, we focus on the memory subsystem, comprehensively and systematically benchmarking the effect of optimization methods on memory performance. Upon benchmarking, we quantitatively analyze the typical memory access patterns for a broad range of applications, including AI, HPC, and database. Further, we also provide the corresponding optimization direction for each memory access pattern so as to improve overall performance.

النظم الموزعة والتوازية والحوسبة العنقودية هندسة العتاد

Mitigating the Performance-Efficiency Tradeoff in Resilient Memory Disaggregation

105 - Youngmoon Lee , Hasan Al Maruf , Mosharaf Chowdhury 2019

We present the design and implementation of a low-latency, low-overhead, and highly available resilient disaggregated cluster memory. Our proposed framework can access erasure-coded remote memory within a single-digit {mu}s read/write latency, signif icantly improving the performance-efficiency tradeoff over the state-of-the-art - it performs similar to in-memory replication with 1.6x lower memory overhead. We also propose a novel coding group placement algorithm for erasure-coded data, that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude.

النظم الموزعة والتوازية والحوسبة العنقودية بنية الشبكات والإنترنت

سجل دخول لتتمكن من نشر تعليقات