We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm.
We study the use of Krylov subspace recycling for the solution of a sequence of slowly-changing families of linear systems, where each family consists of shifted linear systems that differ in the coefficient matrix only by multiples of the identity. Our aim is to explore the simultaneous solution of each family of shifted systems within the framework of subspace recycling, using one augmented subspace to extract candidate solutions for all the shifted systems. The ideal method would use the same augmented subspace for all systems and have fixed storage requirements, independent of the number of shifted systems per family. We show that a method satisfying both requirements cannot exist in this framework. As an alternative, we introduce two schemes. One constructs a separate deflation space for each shifted system but solves each family of shifted systems simultaneously. The other builds only one recycled subspace and constructs approximate corrections to the solutions of the shifted systems at each cycle of the iterative linear solver while only minimizing the base system residual. At convergence of the base system solution, we apply the method recursively to the remaining unconverged systems. We present numerical examples involving systems arising in lattice quantum chromodynamics.
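As background (a standard fact, not stated in the abstract itself), shifted families are attractive for Krylov methods because an unaugmented Krylov subspace is unchanged by a shift of the matrix by a multiple of the identity. Here $\sigma$ is a hypothetical shift and $r$ a starting vector, both introduced only for illustration:

```latex
\[
(A + \sigma I)^j r \;=\; \sum_{i=0}^{j} \binom{j}{i}\, \sigma^{\,j-i} A^i r
\quad\Longrightarrow\quad
\mathcal{K}_m(A + \sigma I,\, r) \;=\; \mathcal{K}_m(A,\, r)
\;=\; \operatorname{span}\{\, r,\; Ar,\; \dots,\; A^{m-1} r \,\}.
\]
```

A subspace augmented with arbitrary recycled vectors does not, in general, inherit this invariance, which may help explain why a single augmented subspace with fixed storage is hard to obtain for all shifts at once.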
We introduce a randomized algorithm, namely RCHOL, to construct an approximate Cholesky factorization for a given Laplacian matrix (a.k.a. graph Laplacian). From a graph perspective, the exact Cholesky factorization introduces a clique in the underlying graph after eliminating a row/column. By randomization, RCHOL only retains a sparse subset of the edges in the clique using a random sampling developed by Spielman and Kyng. We prove RCHOL is breakdown-free and apply it to solving large sparse linear systems with symmetric diagonally dominant matrices. In addition, we parallelize RCHOL based on the nested dissection ordering for shared-memory machines. We report numerical experiments that demonstrate the robustness and the scalability of RCHOL. For example, our parallel code scaled up to 64 threads on a single node for solving the 3D Poisson equation, discretized with the 7-point stencil on a $1024 \times 1024 \times 1024$ grid, a problem that has one billion unknowns.
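The clique fill-in described above can be seen on a tiny example. The following is a minimal sketch of one exact elimination step only; it does not implement RCHOL's random sampling, and the helper names `laplacian` and `eliminate` are hypothetical, chosen for this illustration. Eliminating the center of a star graph leaves a Schur complement that is again a graph Laplacian, now of a clique on the former neighbors.

```python
from fractions import Fraction


def laplacian(n, weighted_edges):
    """Build the dense graph Laplacian of an n-vertex weighted graph."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        w = Fraction(w)
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L


def eliminate(L, k):
    """Schur complement after eliminating vertex k (one exact Cholesky step)."""
    keep = [i for i in range(len(L)) if i != k]
    return [[L[i][j] - L[i][k] * L[k][j] / L[k][k] for j in keep] for i in keep]


# Star graph: vertex 0 is joined to vertices 1, 2, 3 with unit weights.
# Vertices 1, 2, 3 share no edges before the elimination.
star = laplacian(4, [(0, 1, 1), (0, 2, 1), (0, 3, 1)])
schur = eliminate(star, 0)

# Every pair of former neighbors is now connected: a 3-clique with weight 1/3.
fill_edges = [(i + 1, j + 1, -schur[i][j])
              for i in range(3) for j in range(i + 1, 3)
              if schur[i][j] != 0]
```

In this example each new edge has weight $w_i w_j / d = 1/3$, where $d = 3$ is the weighted degree of the eliminated vertex. A sampling scheme of the kind the abstract describes would keep only a sparse subset of such edges rather than all of them.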
The linear equations that arise in interior methods for constrained optimization are sparse symmetric indefinite and become extremely ill-conditioned as the interior method converges. These linear systems present a challenge for existing solver frameworks based on sparse LU or $LDL^T$ decompositions. We benchmark five well-known direct linear solver packages using matrices extracted from power grid optimization problems. The achieved solution accuracy varies greatly among the packages. None of the tested packages delivers significant GPU acceleration for our test cases.
We present Accelerated Cyclic Reduction (ACR), a distributed-memory fast direct solver for rank-compressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has $O(k~N \log N~(\log N + k^2))$ arithmetic complexity and $O(k~N \log N)$ memory footprint, where $N$ is the number of degrees of freedom and $k$ is the rank of a typical off-diagonal block, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method where hierarchical semi-separable matrices are used for compressing the fronts, and with algebraic multigrid. Over a set of large-scale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and memory footprint. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple right-hand sides.
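For readers unfamiliar with the underlying recurrence, the following is a minimal sketch of classical cyclic reduction on a scalar tridiagonal system. It is not ACR: ACR applies the same idea to block tridiagonal systems with hierarchical matrix arithmetic, which is not attempted here. The function name `cyclic_reduction` and the array layout (`a` sub-diagonal, `b` diagonal, `c` super-diagonal, `d` right-hand side, with `a[0] == c[-1] == 0`) are assumptions made for this example, and no pivoting is performed, so all pivots are assumed nonzero.

```python
from fractions import Fraction


def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic reduction."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Eliminate the even-indexed unknowns, leaving a tridiagonal
    # system of half the size in the odd-indexed unknowns.
    odd = range(1, n, 2)
    ra, rb, rc, rd = [], [], [], []
    for i in odd:
        has_next = i + 1 < n
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1] if has_next else 0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - (gamma * a[i + 1] if has_next else 0))
        rc.append(-gamma * c[i + 1] if has_next else 0)
        rd.append(d[i] - alpha * d[i - 1] - (gamma * d[i + 1] if has_next else 0))

    x_odd = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitute: each even unknown depends only on its odd neighbors.
    x = [None] * n
    for k, i in enumerate(odd):
        x[i] = x_odd[k]
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0
        right = c[i] * x[i + 1] if i + 1 < n else 0
        x[i] = (d[i] - left - right) / b[i]
    return x


def poisson_1d(n):
    """Coefficients of the 1D Poisson matrix tridiag(-1, 2, -1), as exact fractions."""
    a = [Fraction(0)] + [Fraction(-1)] * (n - 1)
    b = [Fraction(2)] * n
    c = [Fraction(-1)] * (n - 1) + [Fraction(0)]
    return a, b, c


def apply_tridiagonal(a, b, c, x):
    """Compute the product of the tridiagonal matrix with the vector x."""
    n = len(b)
    return [(a[i] * x[i - 1] if i > 0 else 0) + b[i] * x[i]
            + (c[i] * x[i + 1] if i + 1 < n else 0) for i in range(n)]
```

Each level halves the number of unknowns, and the eliminations within a level are independent of one another, which is the source of the concurrency mentioned in the abstract.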
Fast and accurate solution of time-dependent partial differential equations (PDEs) is of key interest in many research fields including physics, engineering, and biology. Generally, implicit schemes are preferred over explicit ones for better stability and correctness. Existing implicit schemes are usually iterative and employ a general-purpose solver, which may be sub-optimal for a specific class of PDEs. In this paper, we propose a neural solver that learns an optimal iterative scheme for a class of PDEs in a data-driven fashion. We attain this objective by modifying an iteration of an existing semi-implicit solver using a deep neural network. Further, we prove theoretically that our approach preserves the correctness and convergence guarantees provided by the existing iterative solvers. We also demonstrate that our model generalizes to a parameter setting different from the one seen during training and achieves faster convergence than the semi-implicit schemes.