We present an efficient, linear-scaling implementation for building the (screened) Hartree-Fock exchange (HFX) matrix for periodic systems within the framework of numerical atomic orbital (NAO) basis functions. Our implementation is based on the localized resolution of the identity approximation, by which two-electron Coulomb repulsion integrals can be obtained by computing only two-center quantities -- a feature that is highly beneficial to NAOs. By exploiting the locality of basis functions and efficient prescreening of the intermediate three- and two-index tensors, one can achieve linear scaling of the computational cost of building the HFX matrix with respect to the system size. Our implementation is massively parallel, thanks to an MPI/OpenMP hybrid parallelization strategy for distributing the computational load and memory storage. Together, these factors enable highly efficient hybrid functional calculations for large-scale periodic systems. In this work we describe the key algorithms and implementation details for the HFX build as implemented in the ABACUS code package. The performance and scalability of our implementation with respect to the system size and the number of CPU cores are demonstrated for selected benchmark systems of up to 4096 atoms.
Imaginary-time time-dependent density functional theory (it-TDDFT) has been proposed as an alternative method for obtaining the ground state within density functional theory (DFT) which avoids some of the convergence difficulties encountered by the self-consistent-field (SCF) iterative method. It-TDDFT was previously applied to clusters of atoms, where it was demonstrated to converge in select cases where SCF had difficulty converging. In the present work we implement it-TDDFT propagation for {\it periodic systems} by modifying the Quantum ESPRESSO package, which uses a plane-wave basis with multiple $\boldsymbol{k}$ points and has the options of non-collinear and DFT+U calculations using ultra-soft or norm-conserving pseudopotentials. We demonstrate that our implementation of it-TDDFT propagation with multiple $\boldsymbol{k}$ points is correct for DFT+U non-collinear calculations and for DFT+U calculations with ultra-soft pseudopotentials. Our implementation of it-TDDFT propagation converges to the exact SCF energy (up to the decimal guaranteed by double precision) in all but one case, where it converged to a slightly lower value than SCF, suggesting a useful alternative for systems where SCF has difficulty reaching the Kohn-Sham ground state. In addition, we demonstrate that rapid convergence can be achieved if we use adaptive-size imaginary-time steps for different kinetic-energy plane waves.
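The imaginary-time idea can be illustrated with a minimal sketch. This is a toy model, not the Quantum ESPRESSO implementation: it assumes a fixed, small symmetric matrix in place of the density-dependent Kohn-Sham Hamiltonian, uses a first-order (Euler) step, and the function name and parameters are hypothetical. Repeatedly applying the step and renormalizing damps the excited-state components, so the vector relaxes towards the lowest eigenstate.

```python
import numpy as np

def imaginary_time_ground_state(H, dtau=0.1, n_steps=2000, seed=0):
    """Toy imaginary-time propagation for a fixed symmetric matrix H.

    Hypothetical helper: in real it-TDDFT the Hamiltonian depends on the
    density and would be rebuilt as the orbitals evolve.
    """
    rng = np.random.default_rng(seed)
    psi = rng.standard_normal(H.shape[0])
    psi /= np.linalg.norm(psi)
    for _ in range(n_steps):
        # First-order approximation of exp(-dtau * H) applied to psi.
        psi = psi - dtau * (H @ psi)
        # Imaginary-time propagation is not norm-conserving, so renormalize.
        psi /= np.linalg.norm(psi)
    energy = psi @ H @ psi
    return energy, psi

# Small tridiagonal test matrix (illustrative only).
H = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
energy, psi = imaginary_time_ground_state(H)
exact = np.linalg.eigvalsh(H)[0]  # reference from direct diagonalization
```

The step size must be small enough that `1 - dtau * eigenvalue` stays largest in magnitude for the lowest eigenvalue; choosing different step sizes for different components is the kind of freedom the adaptive time steps mentioned above exploit.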
Real-time time-dependent density functional theory (rt-TDDFT) with hybrid exchange-correlation functionals has wide-ranging applications in chemistry and materials science simulations. However, it can be thousands of times more expensive than a conventional ground-state DFT simulation and is hence limited to small systems. In this paper, we accelerate hybrid functional rt-TDDFT calculations using the parallel transport gauge formalism and a GPU implementation on Summit. Our implementation scales efficiently to 786 GPUs for a large system with 1536 silicon atoms, and the wall clock time is only 1.5 hours per femtosecond. This unprecedented speed enables the simulation of large systems with more than 1000 atoms using rt-TDDFT with hybrid functionals.
Localized basis sets in the projector augmented wave formalism allow for computationally efficient calculations within density functional theory (DFT). However, achieving high numerical accuracy requires an extensive basis set, which also poses a fundamental problem for the interpretation of the results. We present a way to obtain a reduced basis set of atomic orbitals through the subdiagonalization of each atomic block of the Hamiltonian. The resulting local orbitals (LOs) inherit the information of the local crystal field. In the LO basis, it becomes apparent that the Hamiltonian is nearly block-diagonal, and we demonstrate that it is possible to keep only a subset of relevant LOs which provide an accurate description of the physics around the Fermi level. This reduces to some extent the redundancy of the original basis set, and at the same time it allows one to perform post-processing of DFT calculations, ranging from the interpretation of electron transport to extracting effective tight-binding Hamiltonians, very efficiently and without sacrificing the accuracy of the results.
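The subdiagonalization step can be sketched as follows. This is an illustrative example only: it assumes an orthonormal basis (a non-orthogonal basis would require a generalized eigenproblem with the block overlap matrix), and the function name and the `blocks` layout are hypothetical. Each atomic block of the Hamiltonian is diagonalized separately, and the resulting eigenvectors form a block-diagonal rotation to the local-orbital basis.

```python
import numpy as np

def subdiagonalize(H, blocks):
    """Rotate H so that every atomic block becomes diagonal.

    H      : symmetric Hamiltonian in an orthonormal basis (assumption).
    blocks : list of index lists, one per atom (hypothetical layout).
    """
    U = np.zeros_like(H)
    local_energies = []
    for idx in blocks:
        idx = np.asarray(idx)
        # Diagonalize only the on-site block of this atom.
        eps, vecs = np.linalg.eigh(H[np.ix_(idx, idx)])
        U[np.ix_(idx, idx)] = vecs  # block-diagonal rotation matrix
        local_energies.append(eps)
    # Hamiltonian expressed in the local-orbital (LO) basis.
    H_lo = U.T @ H @ U
    return H_lo, U, local_energies

# Random symmetric matrix standing in for a two-atom Hamiltonian.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2
blocks = [[0, 1, 2], [3, 4]]
H_lo, U, local_energies = subdiagonalize(H, blocks)
```

Because the rotation is unitary, the spectrum of the full Hamiltonian is unchanged; a reduced model would then keep only the rows and columns of `H_lo` belonging to the local orbitals of interest.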
This work presents a dynamic parallel distribution scheme for Hartree-Fock exchange (HFX) calculations based on the real-space NAO2GTO framework. The most time-consuming step, the calculation of the electron repulsion integrals (ERIs), is perfectly load-balanced with a 2-level master-worker dynamic parallel scheme. The density matrix and the HFX matrix are both stored in sparse format, and the network communication time is minimized by communicating only the indices of the batched ERIs and the final sparse form of the HFX matrix. The performance of this dynamic, scalable distributed algorithm is demonstrated by several examples of large-scale hybrid density-functional calculations on the Tianhe-2 supercomputer, including both molecular and solid-state systems of multiple dimensions, which show good scalability.
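The dynamic distribution idea can be illustrated with a simplified sketch. It is a single-level, shared-memory stand-in for the 2-level master-worker scheme described above, using Python threads and a queue instead of message passing; `run_dynamic` and `compute` are hypothetical names. Workers request the next batch index only when they are idle, so expensive batches do not stall the others.

```python
import queue
import threading

def run_dynamic(batches, n_workers, compute):
    """Hand out batch indices on demand (hypothetical single-level sketch)."""
    tasks = queue.Queue()
    for batch in batches:
        tasks.put(batch)  # only the batch index is handed out
    results = [[] for _ in range(n_workers)]

    def worker(wid):
        while True:
            try:
                batch = tasks.get_nowait()  # ask for the next batch
            except queue.Empty:
                return  # no work left
            results[wid].append((batch, compute(batch)))

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Stand-in workload: the "integral batch" is replaced by a cheap function.
batches = list(range(20))
results = run_dynamic(batches, 4, lambda b: b * b)
done = sorted(b for per_worker in results for b, _ in per_worker)
```

Which worker processes which batch varies from run to run; the only guarantee is that every batch is processed exactly once.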
Hybrid density-functional calculation is one of the most commonly adopted electronic structure methods in computational chemistry and materials science because of its balance between accuracy and computational cost. Recently, we have developed a novel scheme called NAO2GTO to achieve linear-scaling (Order-N) calculations for hybrid density functionals. In our scheme, the most time-consuming step is the calculation of the electron repulsion integrals (ERIs), so distributing these ERIs evenly in a parallel implementation is an issue of particular importance. Here, we present two static scalable distributed algorithms for the ERI computation. In the first, the ERIs are distributed over ERI shell pairs. In the second, the ERIs are distributed over ERI shell quartets. In both algorithms, the ERI calculations are independent of each other, so the communication time is minimized. We show speedup results to demonstrate the performance of these static parallel distributed algorithms in the Hefei Order-N packages for \textit{ab initio} simulations (HONPAS).
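The difference between the two static distributions can be sketched as follows. This is an illustration under stated assumptions, not the HONPAS implementation: the round-robin assignment, the use of permutational symmetry to enumerate unique pairs and quartets, and all function names are hypothetical choices made for the example.

```python
from itertools import combinations_with_replacement

def shell_pairs(n_shells):
    """Unique shell pairs (i, j) with i <= j."""
    return list(combinations_with_replacement(range(n_shells), 2))

def distribute_by_pair(n_shells, n_ranks):
    """Assign each bra shell pair, with all of its quartets, to one rank."""
    pairs = shell_pairs(n_shells)
    work = [[] for _ in range(n_ranks)]
    for p_index, bra in enumerate(pairs):
        rank = p_index % n_ranks  # round-robin over shell pairs (assumption)
        for ket in pairs[p_index:]:
            work[rank].append((bra, ket))
    return work

def distribute_by_quartet(n_shells, n_ranks):
    """Assign individual shell quartets to ranks."""
    pairs = shell_pairs(n_shells)
    all_quartets = list(combinations_with_replacement(pairs, 2))
    work = [[] for _ in range(n_ranks)]
    for q_index, quartet in enumerate(all_quartets):
        work[q_index % n_ranks].append(quartet)  # round-robin over quartets
    return work

by_pair = distribute_by_pair(4, 3)
by_quartet = distribute_by_quartet(4, 3)
loads_pair = [len(w) for w in by_pair]        # [22, 18, 15]
loads_quartet = [len(w) for w in by_quartet]  # [19, 18, 18]
```

In this toy case both schemes cover the same 55 unique quartets, but the finer quartet granularity gives more even counts per rank. Real ERI costs vary from quartet to quartet, so equal counts do not by themselves guarantee equal run times.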