We demonstrate the first implementation of recently developed fast explicit kinetic integration algorithms on modern graphics processing unit (GPU) accelerators. Taking as a generic test case a Type Ia supernova explosion with an extremely stiff thermonuclear network having 150 isotopic species and 1604 reactions coupled to hydrodynamics using operator splitting, we demonstrate the capability to solve on the order of 100 realistic kinetic networks in parallel in the same time that standard implicit methods can solve a single such network on a CPU. This orders-of-magnitude decrease in compute time for solving systems of realistic kinetic networks implies that important coupled, multiphysics problems in various scientific and technical fields that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible.
This paper presents a graphics processing unit (GPU) acceleration method for an iterative scheme for gas-kinetic model equations. Unlike previous GPU parallelizations of explicit kinetic schemes, this work features a fast-converging iterative scheme. The memory reduction techniques in this method enable full three-dimensional (3D) solution of kinetic model equations on contemporary GPUs, which usually have limited memory capacity, for problems that would otherwise need terabytes of memory. The GPU algorithm is validated against DSMC simulations of the 3D lid-driven cavity flow and the supersonic rarefied gas flow past a cube, with grid sizes of up to 0.7 trillion points in the phase space. The performance of the GPU algorithm is assessed by comparison with the corresponding parallel CPU program using the Message Passing Interface (MPI). Profiling on several GPU models shows that the algorithm achieves a medium to high level of utilization of the GPU's computing and memory resources. A $190\times$ speedup can be achieved on the Tesla K40 GPUs against a single core of an Intel Xeon-E5-2680v3 CPU for the 3D lid-driven cavity flow.
We report an efficient algorithm for calculating momentum-space integrals in solid state systems on modern graphics processing units (GPUs). Our algorithm is based on the tetrahedron method, which we demonstrate to be ideally suited for execution in a GPU framework. In order to achieve maximum performance, all floating point operations are executed in single precision. To benchmark our implementation within the CUDA programming framework, we calculate the orbital-resolved density of states in an iron-based superconductor. However, our algorithm is general enough for the achieved improvements to carry over to the calculation of other momentum integrals such as susceptibilities. If our program code is integrated into an existing program for the central processing unit (CPU), i.e., when data transfer overheads exist, speedups of up to a factor of $\sim 130$ compared to a pure CPU implementation can be achieved, largely depending on the problem size. If our program code is integrated into an existing GPU program, speedups over a CPU implementation of up to a factor of $\sim 165$ are possible, even for moderately sized workloads.
The Reynolds-Averaged Navier-Stokes equations and the Large-Eddy Simulation equations can be coupled using a transition function to switch from one set of equations applied in some regions of a domain to the other set in the remaining regions. Following this idea, different time integration schemes can be coupled. In this context, we developed a hybrid time integration scheme that spatially couples the explicit scheme of Heun and the implicit scheme of Crank and Nicolson using a dedicated transition function. This scheme is linearly stable and second-order accurate. In this paper, an extension of this hybrid scheme is introduced to deal with a temporal adaptive procedure. The idea is to treat the time integration procedure with unstructured grids as it is performed with Cartesian grids with local mesh refinement. Depending on its characteristic size, each mesh cell is assigned a rank. For two cells from two consecutive ranks, the ratio of the associated time steps for time marching the solutions is $2$. As a consequence, the cells with the lowest rank iterate more than the other ones to reach the same physical time. In a finite-volume context, a key ingredient is to preserve the conservation property at the interfaces that separate two cells of different ranks. After introducing the different schemes, the paper briefly recalls the coupling procedure and details the extension to the temporal adaptive procedure. The new time integration scheme is validated with the propagation of a 1D wave packet, Sod's tube, and the transport of a two-dimensional vortex in a uniform flow.
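The rank-based time stepping described above can be illustrated with a minimal Python sketch. The abstract states only that ranks depend on the characteristic cell size and that consecutive ranks differ by a time-step ratio of 2; the specific rule used here (rank equals the floor of the base-2 logarithm of the size ratio to the smallest cell) and the function names `assign_ranks`, `local_time_steps`, and `substeps_per_cycle` are assumptions made for illustration, not the authors' implementation.

```python
import math


def assign_ranks(cell_sizes):
    """Assign rank 0 to the smallest cells; a higher rank means a larger cell.

    Assumed rule (not from the paper): rank = floor(log2(h / h_min)).
    """
    h_min = min(cell_sizes)
    return [int(math.floor(math.log2(h / h_min))) for h in cell_sizes]


def local_time_steps(ranks, dt_min):
    """Time step per cell: consecutive ranks differ by a factor of 2."""
    return [dt_min * 2 ** r for r in ranks]


def substeps_per_cycle(ranks):
    """Number of local steps each cell needs to reach the common physical time.

    The common time is one step of the highest rank, so the lowest-rank
    cells iterate the most.
    """
    r_max = max(ranks)
    return [2 ** (r_max - r) for r in ranks]


cell_sizes = [1.0, 2.0, 4.5, 1.9]
ranks = assign_ranks(cell_sizes)
dts = local_time_steps(ranks, dt_min=0.01)
steps = substeps_per_cycle(ranks)
```

In this sketch every cell covers the same physical time per cycle, since the product of its local time step and its number of substeps is identical for all ranks. The flux treatment at interfaces between ranks, which the abstract identifies as the key ingredient for conservation, is not modeled here.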
This paper addresses how two time integration schemes, Heun's scheme for explicit time integration and the second-order Crank-Nicolson scheme for implicit time integration, can be coupled spatially. This coupling is the prerequisite for performing a coupled Large Eddy Simulation / Reynolds Averaged Navier-Stokes computation in an industrial context, using the implicit time integration procedure in the boundary layer (RANS) and the explicit time integration procedure in the LES region. The coupling procedure is designed to switch from explicit to implicit time integration as fast as possible while maintaining stability. After introducing the different schemes, the paper presents the initial coupling procedure adapted from a published reference and shows that it can amplify some numerical waves. An alternative procedure, studied in a coupled time/space framework, is shown to be stable and to have spectral properties in agreement with the requirements of industrial applications. The coupling technique is validated with standard test cases, ranging from one-dimensional to three-dimensional flows.
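For readers unfamiliar with the two base schemes, the following Python sketch applies Heun's scheme and the Crank-Nicolson scheme separately to the scalar test equation $y' = \lambda y$ and checks that both are second-order accurate. It does not reproduce the spatial coupling or the transition function of the paper; the test equation, the function names, and the step sizes are assumptions chosen for illustration.

```python
import math


def heun_step(f, t, y, dt):
    """One explicit Heun step (predictor-corrector, second order)."""
    k1 = f(t, y)
    k2 = f(t + dt, y + dt * k1)
    return y + 0.5 * dt * (k1 + k2)


def crank_nicolson_step_linear(lam, y, dt):
    """One Crank-Nicolson step for y' = lam * y.

    The implicit relation y_new = y + dt/2 * (lam*y + lam*y_new)
    is solved in closed form because the problem is linear.
    """
    return y * (1.0 + 0.5 * dt * lam) / (1.0 - 0.5 * dt * lam)


def integrate(step, dt, t_end=1.0, y0=1.0):
    """March from t = 0 to t_end with a fixed step and return y(t_end)."""
    n = round(t_end / dt)
    y = y0
    for i in range(n):
        y = step(i * dt, y, dt)
    return y


lam = -1.0
exact = math.exp(lam)


def heun(t, y, dt):
    return heun_step(lambda _t, _y: lam * _y, t, y, dt)


def cn(t, y, dt):
    return crank_nicolson_step_linear(lam, y, dt)


# Halving the step should reduce the error by about 4 for a second-order scheme.
heun_ratio = abs(integrate(heun, 0.1) - exact) / abs(integrate(heun, 0.05) - exact)
cn_ratio = abs(integrate(cn, 0.1) - exact) / abs(integrate(cn, 0.05) - exact)
```

Both error ratios come out close to 4, which is consistent with second-order accuracy of each scheme on its own.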
In this paper, we use graphics processing units (GPUs) to accelerate sparse and arbitrarily structured neural networks. Sparse networks have nodes that are not fully connected with the nodes in the preceding and following layers, and arbitrarily structured neural networks have different numbers of nodes in each layer. Sparse neural networks with arbitrary structures are generally created by processes such as neural network pruning and evolutionary machine learning strategies. We show that we can gain significant speedup for full activation of such neural networks using graphics processing units. We perform a preprocessing step to determine dependency groups for all the nodes in a network, and use that information to guide the progression of activation in the neural network. Then we compute the activation of each node in its own separate thread on the GPU, which allows for massive parallelization. We use the CUDA framework to implement our approach and compare the results of sequential and GPU implementations. Our results show that the activation of sparse neural networks lends itself very well to GPU acceleration and can help speed up machine learning strategies that generate such networks, or other processes that have a similar structure.
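The preprocessing idea can be sketched on the CPU in Python. The abstract does not define how dependency groups are built, so this sketch assumes they are the levels of a level-by-level topological sort (Kahn's algorithm): every node in a group depends only on nodes in earlier groups, so all nodes of a group could be activated concurrently, one GPU thread per node. The function names, the edge-list representation, and the `tanh` activation are hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def dependency_groups(num_nodes, edges):
    """Split nodes into groups whose inputs all lie in earlier groups.

    edges is a list of (src, dst) pairs describing a directed acyclic graph.
    """
    successors = defaultdict(list)
    indegree = [0] * num_nodes
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    frontier = [n for n in range(num_nodes) if indegree[n] == 0]
    groups = []
    while frontier:
        groups.append(sorted(frontier))
        next_frontier = []
        for node in frontier:
            for succ in successors[node]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    next_frontier.append(succ)
        frontier = next_frontier
    return groups


def activate(num_nodes, weighted_edges, input_values):
    """Activate the network group by group (sequential stand-in for the GPU).

    weighted_edges is a list of (src, dst, weight); input_values maps every
    node without incoming edges to its value.
    """
    incoming = defaultdict(list)
    for src, dst, w in weighted_edges:
        incoming[dst].append((src, w))
    groups = dependency_groups(num_nodes, [(s, d) for s, d, _ in weighted_edges])

    values = dict(input_values)
    for group in groups[1:]:
        # Nodes within a group are independent of each other; on a GPU
        # each would be computed by its own thread.
        for node in group:
            total = sum(w * values[src] for src, w in incoming[node])
            values[node] = math.tanh(total)
    return values


edges = [(0, 2, 1.0), (1, 2, 1.0), (0, 3, 1.0), (2, 3, 1.0), (1, 4, 1.0)]
groups = dependency_groups(5, [(s, d) for s, d, _ in edges])
values = activate(5, edges, {0: 1.0, 1: 0.5})
```

For this small example the nodes fall into three groups: the two input nodes, then nodes 2 and 4, then node 3, which needs the output of node 2.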