Do you want to publish a course? Click here

Hierarchical $mathcal{H}^2$-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their $O(N)$ complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed operations on these matrices to allow large-scale problems to be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the $mathcal{H}^2$ format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of $mathcal{H}^2$-matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflops/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability up to 16M degrees of freedom problems on 64 GPUs.
Tile low rank representations of dense matrices partition them into blocks of roughly uniform size, where each off-diagonal tile is compressed and stored as its own low rank factorization. They offer an attractive representation for many data-sparse dense operators that appear in practical applications, where substantial compression and a much smaller memory footprint can be achieved. TLR matrices are a compromise between the simplicity of a regular perfectly-strided data structure and the optimal complexity of the unbalanced trees of hierarchically low rank matrices, and provide a convenient performance-tuning parameter through their tile size that can be proportioned to take into account the cache size where the tiles reside in the memory hierarchy. There are currently no high-performance algorithms that can generate Cholesky and $LDL^T$ factorizations, particularly on GPUs. The difficulties in achieving high performance when factoring TLR matrices come from the expensive compression operations that must be performed during the factorization process and the adaptive rank distribution of the tiles that causes an irregular work pattern for the processing cores. In this work, we develop a dynamic batching operation and combine it with batched adaptive randomized approximations to achieve high performance both on GPUs and CPUs. Our implementation attains over 1.2 TFLOP/s in double precision on the V100 GPU, and is limited by the performance of batched GEMM operations. The Cholesky factorization of covariance matrix of size $N = 131K$ arising in spatial statistics can be factored to an accuracy $epsilon=10^{-2}$ in just a few seconds. We believe the proposed GEMM-centric algorithm allows it to be readily ported to newer hardware such as the tensor cores that are optimized for small GEMM operations.
We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner.
We present Accelerated Cyclic Reduction (ACR), a distributed-memory fast direct solver for rank-compressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has $O(k~N log N~(log N + k^2))$ arithmetic complexity and $O(k~N log N)$ memory footprint, where $N$ is the number of degrees of freedom and $k$ is the rank of a typical off-diagonal block, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method where hierarchical semi-separable matrices are used for compressing the fronts, and with algebraic multigrid. Over a set of large-scale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and memory footprint. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple right-hand sides.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا