A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

303 0 0.0 ( 0 )

Download Cite

Added by Yida Wang

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Leyuan Wang - Zhi Chen - Yizhi Liu

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Modern deep learning applications urge to push the model inference taking place at the edge devices for multiple reasons such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user privacy. The Convolutional Neural Network (emph{CNN}) is one of the most widely used model family in the applications. Given the high computational complexity of the CNN models, it is favorable to execute them on the integrated GPUs at the edge devices, which are ubiquitous and have more power and better energy efficiency than the accompanying CPUs. However, programming on integrated GPUs efficiently is challenging due to the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution to execute CNN model inference on the integrated GPUs at the edge, which uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, as well as leverages machine learning-based scheduling search schemes to optimize computationally-intensive operators like convolution. Our solution even provides a fallback mechanism for operators not suitable or convenient to run on GPUs. The evaluation results suggest that compared to state-of-the-art solutions backed up by the vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar, or even better (up to 1.62$times$), performance on a number of popular image classification and object detection models. In addition, our solution has a wider model coverage and is more flexible to embrace new models. Our solution has been adopted in production services in AWS and is open-sourced.

rate research

Distributed CNN Inference on Resource-Constrained UAVs for Surveillance Systems: Design and Optimization

118 - Mohammed Jouhari , Abdulla Al-Ali , Emna Baccour 2021

Unmanned Aerial Vehicles (UAVs) have attracted great interest in the last few years owing to their ability to cover large areas and access difficult and hazardous target zones, which is not the case of traditional systems relying on direct observations obtained from fixed cameras and sensors. Furthermore, thanks to the advancements in computer vision and machine learning, UAVs are being adopted for a broad range of solutions and applications. However, Deep Neural Networks (DNNs) are progressing toward deeper and complex models that prevent them from being executed on-board. In this paper, we propose a DNN distribution methodology within UAVs to enable data classification in resource-constrained devices and avoid extra delays introduced by the server-based solutions due to data communication over air-to-ground links. The proposed method is formulated as an optimization problem that aims to minimize the latency between data collection and decision-making while considering the mobility model and the resource constraints of the UAVs as part of the air-to-air communication. We also introduce the mobility prediction to adapt our system to the dynamics of UAVs and the network variation. The simulation conducted to evaluate the performance and benchmark the proposed methods, namely Optimal UAV-based Layer Distribution (OULD) and OULD with Mobility Prediction (OULD-MP), were run in an HPC cluster. The obtained results show that our optimization solution outperforms the existing and heuristic-based approaches.

Distributed Parallel and Cluster Computing Machine Learning Systems and Control

A Performance Study of the 2D Ising Model on GPUs

107 - Joshua Romero , Mauro Bisson , Massimiliano Fatica 2019

The simulation of the two-dimensional Ising model is used as a benchmark to show the computational capabilities of Graphic Processing Units (GPUs). The rich programming environment now available on GPUs and flexible hardware capabilities allowed us to quickly experiment with several implementation ideas: a simple stencil-based algorithm, recasting the stencil operations into matrix multiplies to take advantage of Tensor Cores available on NVIDIA GPUs, and a highly optimized multi-spin coding approach. Using the managed memory API available in CUDA allows for simple and efficient distribution of these implementations across a multi-GPU NVIDIA DGX-2 server. We show that even a basic GPU implementation can outperform current results published on TPUs and that the optimized multi-GPU implementation can simulate very large lattices faster than custom FPGA solutions.

Distributed Parallel and Cluster Computing

A Study of Mixed Precision Strategies for GMRES on GPUs

149 - Jennifer A. Loe , Christian A. Glusa , Ichitaro Yamazaki 2021

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of mixed precision strategies for accelerating this kernel on an NVIDIA V$100$ GPU with a Power 9 CPU. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when mixed precision GMRES will be effective and to choose parameters for a mixed precision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of mixed precision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

Optimized Password Recovery for Encrypted RAR on GPUs

268 - Xiaojing An , Haojun Zhao , Lulu Ding 2015

RAR uses classic symmetric encryption algorithm SHA-1 hashing and AES algorithm for encryption, and the only method of password recovery is brute force, which is very time-consuming. In this paper, we present an approach using GPUs to speed up the password recovery process. However, because the major calculation and time-consuming part, SHA-1 hashing, is hard to be parallelized, so this paper adopts coarse granularity parallel. That is, one GPU thread is responsible for the validation of one password. We mainly use three optimization methods to optimize this parallel version: asynchronous parallel between CPU and GPU, redundant calculations and conditional statements reduction, and the usage of registers optimization. Experiment result shows that the final version reaches 43~57 times speedup on an AMD FirePro W8000 GPU, compared to a well-optimized serial version on Intel Core i5 CPU.

Distributed Parallel and Cluster Computing Cryptography and Security

Accelerating Concurrent Heap on GPUs

90 - Yanhao Chen 2019

Priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known applications like Dijkstras shortest path algorithm, Prims minimum spanning tree, Huffman encoding, and the branch-and-bound algorithm. However, it is challenging to exploit the parallelism of the heap on GPUs since the control divergence and memory irregularity must be taken into account. In this paper, we present a parallel generalized heap model that works effectively on GPUs. We also prove the linearizability of our generalized heap model which enables us to reason about the expected results. We evaluate our concurrent heap thoroughly and show a maximum 19.49X speedup compared to the sequential CPU implementation and 2.11X speedup compared with the existing GPU implementation. We also apply our heap to single source shortest path with up to 1.23X speedup and 0/1 knapsack problem with up to 12.19X speedup.

Distributed Parallel and Cluster Computing