No Arabic abstract
We present Accelerated Cyclic Reduction (ACR), a distributed-memory fast direct solver for rank-compressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has $O(k~N log N~(log N + k^2))$ arithmetic complexity and $O(k~N log N)$ memory footprint, where $N$ is the number of degrees of freedom and $k$ is the rank of a typical off-diagonal block, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method where hierarchical semi-separable matrices are used for compressing the fronts, and with algebraic multigrid. Over a set of large-scale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and memory footprint. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple right-hand sides.
We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm.
We consider several methods for generating initial guesses when iteratively solving sequences of linear systems, showing that they can be implemented efficiently in GPU-accelerated PDE solvers, specifically solvers for incompressible flow. We propose new initial guess methods based on stabilized polynomial extrapolation and compare them to the projection method of Fischer [15], showing that they are generally competitive with projection schemes despite requiring only half the storage and performing considerably less data movement and communication. Our implementations of these algorithms are freely available as part of the libParanumal collection of GPU-accelerated flow solvers.
Structured population models are a class of general evolution equations which are widely used in the study of biological systems. Many theoretical methods are available for establishing existence and stability of steady states of general evolution equations. However, except for very special cases, finding an analytical form of stationary solutions for evolution equations is a challenging task. In the present paper, we develop a numerical framework for computing approximations to stationary solutions of general evolution equations, which can also be used to produce existence and stability regions for steady states. In particular, we use the Trotter-Kato Theorem to approximate the infinitesimal generator of an evolution equation on a finite dimensional space, which in turn reduces the evolution equation into a system of ordinary differential equations. Consequently, we approximate and study the asymptotic behavior of stationary solutions. We illustrate the convergence of our numerical framework by applying it to a linear Sinko-Streifer structured population model for which the exact form of the steady state is known. To further illustrate the utility of our approach, we apply our framework to nonlinear population balance equation, which is an extension of well-known Smoluchowksi coagulation-fragmentation model to biological populations. We also demonstrate that our numerical framework can be used to gain insight about the theoretical stability of the stationary solutions of the evolution equations. Furthermore, the open source Python program that we have developed for our numerical simulations is freely available from our Github repository (github.com/MathBioCU).
We study the mathematical properties of a general model of cell division structured with several internal variables. We begin with a simpler and specific model with two variables, we solve the eigenvalue problem with strong or weak assumptions, and deduce from it the long-time convergence. The main difficulty comes from natural degeneracy of birth terms that we overcome with a regularization technique. We then extend the results to the case with several parameters and recall the link between this simplified model and the one presented in cite{CBBP1}; an application to the non-linear problem is also given, leading to robust subpolynomial growth of the total population.
With this work, we release CLAIRE, a distributed-memory implementation of an effective solver for constrained large deformation diffeomorphic image registration problems in three dimensions. We consider an optimal control formulation. We invert for a stationary velocity field that parameterizes the deformation map. Our solver is based on a globalized, preconditioned, inexact reduced space Gauss--Newton--Krylov scheme. We exploit state-of-the-art techniques in scientific computing to develop an effective solver that scales to thousands of distributed memory nodes on high-end clusters. We present the formulation, discuss algorithmic features, describe the software package, and introduce an improved preconditioner for the reduced space Hessian to speed up the convergence of our solver. We test registration performance on synthetic and real data. We demonstrate registration accuracy on several neuroimaging datasets. We compare the performance of our scheme against different flavors of the Demons algorithm for diffeomorphic image registration. We study convergence of our preconditioner and our overall algorithm. We report scalability results on state-of-the-art supercomputing platforms. We demonstrate that we can solve registration problems for clinically relevant data sizes in two to four minutes on a standard compute node with 20 cores, attaining excellent data fidelity. With the present work we achieve a speedup of (on average) 5$times$ with a peak performance of up to 17$times$ compared to our former work.