This paper describes a massively parallel code for a state-of-the-art thermal lattice-Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and by experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
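The abstract's emphasis on hiding node-to-node communication overheads suggests the bulk/boundary split commonly used in multi-GPU lattice-Boltzmann codes: update the slabs adjacent to sub-domain boundaries first, push their halos off the GPU while the interior is updated concurrently. The following is a minimal CUDA sketch of that pattern under those assumptions, not the paper's actual code; all names (update_sites, NX, HALO) and the toy per-site update are illustrative.

```cuda
// Minimal sketch (assumed pattern, not the paper's code): overlap halo
// transfers with bulk computation using two CUDA streams.
#include <cuda_runtime.h>
#include <cstdio>

#define NX   1024   // sites along the decomposed direction (illustrative)
#define HALO 1      // halo width for nearest-neighbor streaming

// Toy per-site update standing in for a full LB propagate+collide step.
__global__ void update_sites(double *f, int first, int count) {
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < first + count)
        f[i] = 0.5 * f[i];   // placeholder for the real collision operator
}

int main() {
    double *f, *halo_host;
    cudaMalloc(&f, NX * sizeof(double));
    cudaMallocHost(&halo_host, 2 * HALO * sizeof(double)); // pinned, for async copies
    cudaMemset(f, 0, NX * sizeof(double));

    cudaStream_t bulk, edge;
    cudaStreamCreate(&bulk);
    cudaStreamCreate(&edge);

    // 1. Update the two boundary slabs first, on their own stream.
    update_sites<<<1, HALO, 0, edge>>>(f, 0, HALO);
    update_sites<<<1, HALO, 0, edge>>>(f, NX - HALO, HALO);

    // 2. Start the halo copies to the host (these would feed the node-to-node
    //    exchange, e.g. MPI, in a multi-GPU run) while step 3 runs concurrently.
    cudaMemcpyAsync(halo_host, f, HALO * sizeof(double),
                    cudaMemcpyDeviceToHost, edge);
    cudaMemcpyAsync(halo_host + HALO, f + NX - HALO, HALO * sizeof(double),
                    cudaMemcpyDeviceToHost, edge);

    // 3. Update the bulk on the other stream; its runtime hides the transfers.
    int bulkN = NX - 2 * HALO;
    update_sites<<<(bulkN + 255) / 256, 256, 0, bulk>>>(f, HALO, bulkN);

    cudaStreamSynchronize(edge);   // halos ready: the MPI exchange would go here
    cudaStreamSynchronize(bulk);

    printf("step done; halo[0] = %f\n", halo_host[0]);
    cudaFree(f); cudaFreeHost(halo_host);
    cudaStreamDestroy(bulk); cudaStreamDestroy(edge);
    return 0;
}
```

The key design point this sketch illustrates is that communication cost becomes negligible only while the bulk update takes longer than the halo transfer, which is why strong-scaling studies of such codes eventually hit a communication-bound regime as the per-GPU sub-domain shrinks.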