Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations

Added by Markus Wittmann

Publication date 2011

Field: Informatics Engineering

Research language: English

Authors Markus Wittmann - Thomas Zeiser - Georg Hager

Distributed Parallel and Cluster Computing

We present a simple, parallel and distributed algorithm for setting up and partitioning a sparse representation of a regular discretized simulation domain. This method scales to a large number of processes even for complex geometries and ensures load balance between the domains, reasonable communication interfaces, and good data locality within each domain. Applied to a list-based lattice Boltzmann flow solver, this scheme can achieve similar or even higher flow solver performance than partitioning with widely used standard graph-partitioning tools such as METIS and PT-SCOTCH.
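
To make the idea of a list-based sparse representation concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how cells are ordered or cut, so the lexicographic ordering, the contiguous equal-size split, and the names `build_cell_list` and `partition_cells` are all assumptions made for illustration. It shows only the load-balance property: every part receives the same number of fluid cells, up to one.

```python
from typing import Callable, List, Tuple

Cell = Tuple[int, int, int]


def build_cell_list(dims: Tuple[int, int, int],
                    is_fluid: Callable[[int, int, int], bool]) -> List[Cell]:
    """Sparse representation: keep only fluid cells of a regular grid.

    Cells are listed in lexicographic (x, y, z) order, which is an
    assumption of this sketch, not something stated in the abstract.
    """
    nx, ny, nz = dims
    return [(x, y, z)
            for x in range(nx)
            for y in range(ny)
            for z in range(nz)
            if is_fluid(x, y, z)]


def partition_cells(cells: List[Cell], num_parts: int) -> List[List[Cell]]:
    """Cut the cell list into contiguous chunks whose sizes differ by at most one."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(len(cells), num_parts)
    parts: List[List[Cell]] = []
    start = 0
    for rank in range(num_parts):
        # The first `extra` parts take one additional cell each.
        size = base + (1 if rank < extra else 0)
        parts.append(cells[start:start + size])
        start += size
    return parts


if __name__ == "__main__":
    # Hypothetical geometry: an 8 x 4 x 4 box with a solid obstacle inside.
    def fluid(x: int, y: int, z: int) -> bool:
        return not (2 <= x < 4 and 1 <= y < 3)

    cells = build_cell_list((8, 4, 4), fluid)
    parts = partition_cells(cells, 5)
    print([len(p) for p in parts])
```

Because only fluid cells enter the list, solid regions of a complex geometry cost nothing in the balance. Communication interfaces and data locality depend on the chosen ordering, which this sketch does not try to optimize.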

Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers

E. Calore, A. Gabbana, S. F. Schifano, 2017

High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper we consider exactly this problem for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid-dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit efficiently the different parallel and vector options of the various accelerators, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using as testbeds HPC clusters incorporating different accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.
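
The workload-partitioning idea in this abstract can be illustrated with a very simple balance model. This is a sketch under stated assumptions, not the model from the paper: it assumes host and accelerator each sustain a constant update rate, run fully concurrently, and incur no communication cost. The names `balanced_split` and `step_time` are hypothetical. Under these assumptions the time per step is the maximum of the two partial times, and it is smallest when both sides finish together.

```python
def balanced_split(rate_host: float, rate_acc: float) -> float:
    """Fraction of lattice sites to assign to the host so both sides finish together.

    Rates are sustained site updates per unit time (assumed constant).
    """
    if rate_host <= 0 or rate_acc <= 0:
        raise ValueError("rates must be positive")
    return rate_host / (rate_host + rate_acc)


def step_time(n_sites: float, host_fraction: float,
              rate_host: float, rate_acc: float) -> float:
    """Time for one step when host and accelerator work concurrently.

    Communication and synchronization overheads are ignored (assumption).
    """
    if not 0.0 <= host_fraction <= 1.0:
        raise ValueError("host_fraction must lie in [0, 1]")
    t_host = host_fraction * n_sites / rate_host
    t_acc = (1.0 - host_fraction) * n_sites / rate_acc
    return max(t_host, t_acc)


if __name__ == "__main__":
    # Hypothetical rates: the accelerator is three times faster than the host.
    f = balanced_split(1.0, 3.0)
    print(f, step_time(1000, f, 1.0, 3.0), step_time(1000, 0.0, 1.0, 3.0))
```

In this idealized model the balanced split reaches the sum of the two rates, so using the host as well as the accelerator helps most when their rates are comparable.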

Distributed Parallel and Cluster Computing

Massively parallel lattice-Boltzmann codes on large GPU clusters

E. Calore, A. Gabbana, J. Kraus, 2017

This paper describes a massively parallel code for a state-of-the-art thermal lattice-Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have been already used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally we compare the results of our GPU code with those measured on other currently available high performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics.

Distributed Parallel and Cluster Computing

Harnessing the Power of Serverless Runtimes for Large-Scale Optimization

Arda Aytekin, Mikael Johansson, 2019

The event-driven and elastic nature of serverless runtimes makes them a very efficient and cost-effective alternative for scaling up computations. So far, they have mostly been used for stateless, data parallel and ephemeral computations. In this work, we propose using serverless runtimes to solve generic, large-scale optimization problems. Specifically, we build a master-worker setup using AWS Lambda as the source of our workers, implement a parallel optimization algorithm to solve a regularized logistic regression problem, and show that relative speedups up to 256 workers and efficiencies above 70% up to 64 workers can be expected. We also identify possible algorithmic and system-level bottlenecks, propose improvements, and discuss the limitations and challenges in realizing these improvements.

Distributed Parallel and Cluster Computing Machine Learning Optimization and Control

Enabling Distributed Optimization in Large-Scale Power Systems

Junyao Guo, Gabriela Hug, 2016

Distributed optimization for solving non-convex Optimal Power Flow (OPF) problems in power systems has attracted tremendous attention in the last decade. Most studies are based on the geographical decomposition of IEEE test systems for verifying the feasibility of the proposed approaches. However, it is not clear if one can extrapolate from these studies that those approaches can be applied to very large-scale real-world systems. In this paper, we show, for the first time, that distributed optimization can be effectively applied to a large-scale real transmission network, namely, the Polish 2383-bus system for which no pre-defined partitions exist, by using a recently developed partitioning technique. More specifically, the problem solved is the AC OPF problem with geographical decomposition of the network using the Alternating Direction Method of Multipliers (ADMM) method in conjunction with the partitioning technique. Through extensive experimental results and analytical studies, we show that with the presented partitioning technique the convergence performance of ADMM can be improved substantially, which enables the application of distributed approaches on very large-scale systems.

Distributed Parallel and Cluster Computing Optimization and Control

Large Scale GPU Accelerated PPMLR-MHD Simulations for Space Weather Forecast

Xiangyu Guo, Binbin Tang, Jian Tao, 2016

PPMLR-MHD is a new magnetohydrodynamics (MHD) model used to simulate the interactions of the solar wind with the magnetosphere, which has been proved to be the key element of the space weather cause-and-effect chain process from the Sun to Earth. Compared to existing MHD methods, PPMLR-MHD has the advantage of high-order spatial accuracy and low numerical dissipation. However, the accuracy comes at a cost. On one hand, this method requires more intensive computation. On the other hand, more boundary data must be transferred during the simulation. In this work, we present a parallel hybrid solution of the PPMLR-MHD model implemented using the computing capabilities of both CPUs and GPUs. We demonstrate that our optimized implementation alleviates the data transfer overhead by using GPU Direct technology and can scale up to 151 processes and achieve significant performance gains by distributing the workload among the CPUs and GPUs on Titan at Oak Ridge National Laboratory. The performance results show that our implementation is fast enough to carry out highly accurate MHD simulations in real time.

Distributed Parallel and Cluster Computing