An Adaptive Load Balancer For Graph Analytical Applications on GPUs

169 0 0.0 ( 0 )

Download Cite

Added by Vishwesh Jatala

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Vishwesh Jatala - Loc Hoang - Roshan Dathathri

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel load balancing scheme to address this problem. Our scheme is implemented in the IrGL compiler to allow users to generate efficient load balanced code for a GPU from high-level sequential programs. We evaluated several graph analytics applications on up to 16 distributed GPUs using IrGL to compile the code and the Gluon substrate for inter-GPU communication. Our experiments show that this scheme can achieve an average speed-up of 2.2x on inputs that suffer from severe load imbalance problems when previous state-of-the-art load-balancing schemes are used.

rate research

GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs

179 - Yuke Wang , Boyuan Feng , Gushu Li 2020

As the emerging trend of graph-based deep learning, Graph Neural Networks (GNNs) excel for their capability to generate high-quality node feature vectors (embeddings). However, the existing one-size-fits-all GNN implementations are insufficient to catch up with the evolving GNN architectures, the ever-increasing graph sizes, and the diverse node embedding dimensionalities. To this end, we propose textbf{GNNAdvisor}, an adaptive and efficient runtime system to accelerate various GNN workloads on GPU platforms. First, GNNAdvisor explores and identifies several performance-relevant features from both the GNN model and the input graph, and uses them as a new driving force for GNN acceleration. Second, GNNAdvisor implements a novel and highly-efficient 2D workload management, tailored for GNN computation to improve GPU utilization and performance under different application settings. Third, GNNAdvisor capitalizes on the GPU memory hierarchy for acceleration by gracefully coordinating the execution of GNNs according to the characteristics of the GPU memory structure and GNN workloads. Furthermore, to enable automatic runtime optimization, GNNAdvisor incorporates a lightweight analytical model for an effective design parameter search. Extensive experiments show that GNNAdvisor outperforms the state-of-the-art GNN computing frameworks, such as Deep Graph Library ($3.02times$ faster on average) and NeuGraph (up to $4.10times$ faster), on mainstream GNN architectures across various datasets.

Distributed Parallel and Cluster Computing

A Statistical Mechanical Load Balancer for the Web

176 - Jesse S. A. Bridgewater , P. Oscar Boykin , Vwani P. Roychowdhury 2004

The maximum entropy principle from statistical mechanics states that a closed system attains an equilibrium distribution that maximizes its entropy. We first show that for graphs with fixed number of edges one can define a stochastic edge dynamic that can serve as an effective thermalization scheme, and hence, the underlying graphs are expected to attain their maximum-entropy states, which turn out to be Erdos-Renyi (ER) random graphs. We next show that (i) a rate-equation based analysis of node degree distribution does indeed confirm the maximum-entropy principle, and (ii) the edge dynamic can be effectively implemented using short random walks on the underlying graphs, leading to a local algorithm for the generation of ER random graphs. The resulting statistical mechanical system can be adapted to provide a distributed and local (i.e., without any centralized monitoring) mechanism for load balancing, which can have a significant impact in increasing the efficiency and utilization of both the Internet (e.g., efficient web mirroring), and large-scale computing infrastructure (e.g., cluster and grid computing).

Disordered Systems and Neural Networks

C-SAW: A Framework for Graph Sampling and Random Walk on GPUs

567 - Santosh Pandey , Lingda Li , Adolfy Hoisie 2020

Many applications require to learn, mine, analyze and visualize large-scale graphs. These graphs are often too large to be addressed efficiently using conventional graph processing technologies. Many applications have requirements to analyze, transform, visualize and learn large scale graphs. These graphs are often too large to be addressed efficiently using conventional graph processing technologies. Recent literatures convey that graph sampling/random walk could be an efficient solution. In this paper, we propose, to the best of our knowledge, the first GPU-based framework for graph sampling/random walk. First, our framework provides a generic API which allows users to implement a wide range of sampling and random walk algorithms with ease. Second, offloading this framework on GPU, we introduce warp-centric parallel selection, and two optimizations for collision migration. Third, towards supporting graphs that exceed GPU memory capacity, we introduce efficient data transfer optimizations for out-of-memory sampling, such as workload-aware scheduling and batched multi-instance sampling. In its entirety, our framework constantly outperforms the state-of-the-art projects. First, our framework provides a generic API which allows users to implement a wide range of sampling and random walk algorithms with ease. Second, offloading this framework on GPU, we introduce warp-centric parallel selection, and two novel optimizations for collision migration. Third, towards supporting graphs that exceed the GPU memory capacity, we introduce efficient data transfer optimizations for out-of-memory and multi-GPU sampling, such as workload-aware scheduling and batched multi-instance sampling. Taken together, our framework constantly outperforms the state of the art projects in addition to the capability of supporting a wide range of sampling and random walk algorithms.

Distributed Parallel and Cluster Computing

An Efficient Cell List Implementation for Monte Carlo Simulation on GPUs

984 - Loren Schwiebert , Eyad Hailat , Kamel Rushaidat 2014

Maximizing the performance potential of the modern day GPU architecture requires judicious utilization of available parallel resources. Although dramatic reductions can often be obtained through straightforward mappings, further performance improvements often require algorithmic redesigns to more closely exploit the target architecture. In this paper, we focus on efficient molecular simulations for the GPU and propose a novel cell list algorithm that better utilizes its parallel resources. Our goal is an efficient GPU implementation of large-scale Monte Carlo simulations for the grand canonical ensemble. This is a particularly challenging application because there is inherently less computation and parallelism than in similar applications with molecular dynamics. Consistent with the results of prior researchers, our simulation results show traditional cell list implementations for Monte Carlo simulations of molecular systems offer effectively no performance improvement for small systems [5, 14], even when porting to the GPU. However for larger systems, the cell list implementation offers significant gains in performance. Furthermore, our novel cell list approach results in better performance for all problem sizes when compared with other GPU implementations with or without cell lists.

Distributed Parallel and Cluster Computing Computational Physics

C-Balancer: A System for Container Profiling and Scheduling

98 - Akshay Dhumal 2020

Linux containers have gained high popularity in recent times. This popularity is significantly due to various advantages of containers over Virtual Machines (VM). The containers are lightweight, occupy lesser storage, have fast boot-up time, easy to deploy and have faster auto-scaling. The key reason behind the popularity of containers is that they leverage the mechanism of micro-service style software development, where applications are designed as independently deployable services. There are various container orchestration tools for deploying and managing the containers in the cluster. The prominent among them are Docker Swarm and Kubernetes. However, they do not address the effects of resource contention when multiple containers are deployed on a node. Moreover, they do not provide support for container migration in the event of an attack or increased resource contention. To address such issues, we propose C-Balancer, a scheduling framework for efficient placement of containers in the cluster environment. C-Balancer works by periodically profiling the containers and deciding the optimal container to node placement. Our proposed approach improves the performance of containers in terms of resource utilization and throughput. Experiments using a workload mix of Stress-NG and iPerf benchmark shows that our proposed approach achieves a maximum performance improvement of 58% for the workload mix. Our approach also reduces the variance in resource utilization across the cluster by 60% on average.

Distributed Parallel and Cluster Computing