We introduce QM7-X, a comprehensive dataset of 42 physicochemical properties for $\approx$ 4.2 M equilibrium and non-equilibrium structures of small organic molecules with up to seven non-hydrogen (C, N, O, S, Cl) atoms. To span this fundamentally important region of chemical compound space (CCS), QM7-X includes an exhaustive sampling of (meta-)stable equilibrium structures, comprising constitutional/structural isomers and stereoisomers, e.g., enantiomers and diastereomers (including cis-/trans- and conformational isomers), as well as 100 non-equilibrium structural variations thereof to reach a total of $\approx$ 4.2 M molecular structures. Computed at the tightly converged quantum-mechanical PBE0+MBD level of theory, QM7-X contains global (molecular) and local (atom-in-a-molecule) properties ranging from ground-state quantities (such as atomization energies and dipole moments) to response quantities (such as polarizability tensors and dispersion coefficients). By providing a systematic, extensive, and tightly converged dataset of quantum-mechanically computed physicochemical properties, we expect that QM7-X will play a critical role in the development of next-generation machine-learning-based models for exploring greater swaths of CCS and performing in silico design of molecules with targeted properties.
A key challenge in automated explorations of chemical compound space is ensuring the veracity of minimum-energy geometries, so that the intended bonding connectivities are preserved. We discuss an iterative high-throughput workflow for connectivity-preserving geometry optimizations that exploits the nearness between quantum mechanical models. The methodology is benchmarked on the QM9 dataset, which comprises DFT-level properties of 133,885 small molecules, of which 3,054 have questionable geometric stability. We successfully troubleshoot 2,988 molecules and ensure a bijective mapping between desired Lewis formulae and final geometries. Our workflow, based on DFT and post-DFT methods, identifies 66 molecules as unstable; 52 contain $-{\rm NNO}-$, and the rest are strained due to pyramidal sp$^2$ C. In the curated dataset, we inspect molecules with long CC bonds and identify ultralong candidates ($r>1.70$~\AA{}) supported by topological analysis of the electron density. We hope the proposed strategy will play a role in big-data quantum chemistry initiatives.
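The connectivity test at the heart of such a workflow can be illustrated with a minimal sketch. It is not the authors' implementation: the distance criterion (two atoms are bonded if they are closer than 1.3 times the sum of their covalent radii), the function names, and the water example are all assumptions made here for illustration.

```python
from itertools import combinations
from math import dist

# Approximate single-bond covalent radii in angstrom (illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def bonds_from_geometry(symbols, coords, scale=1.3):
    """Infer bonded atom pairs from interatomic distances.

    Assumed heuristic: atoms i and j are bonded when their distance is
    below scale * (r_i + r_j). The value scale=1.3 is an arbitrary choice.
    """
    bonds = set()
    for i, j in combinations(range(len(symbols)), 2):
        cutoff = scale * (COVALENT_RADII[symbols[i]] + COVALENT_RADII[symbols[j]])
        if dist(coords[i], coords[j]) < cutoff:
            bonds.add((i, j))
    return bonds


def connectivity_preserved(symbols, coords, expected_bonds):
    """Return True if the geometry reproduces exactly the intended bonds."""
    expected = {tuple(sorted(bond)) for bond in expected_bonds}
    return bonds_from_geometry(symbols, coords) == expected


# Hypothetical example: a water molecule and a dissociated variant.
symbols = ["O", "H", "H"]
intended = [(0, 1), (0, 2)]
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
broken = [(0.0, 0.0, 0.0), (3.00, 0.0, 0.0), (-0.24, 0.93, 0.0)]

water_ok = connectivity_preserved(symbols, water, intended)
broken_ok = connectivity_preserved(symbols, broken, intended)
```

In an iterative setting, a geometry that fails this test after optimization would be flagged and re-optimized with a different model or starting structure; how that retry is organized is left open here.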
We report on the largest dataset of optimized molecular geometries and electronic properties calculated by the PM6 method for 92.9% of the 91.2 million molecules cataloged in PubChem Compounds retrieved on Aug. 29, 2016. In addition to the neutral states, we also calculated the cationic, anionic, and spin-flipped electronic states of 56.2%, 49.7%, and 41.3% of the molecules, respectively. The grand total calculated is thus 221 million molecules. The dataset is available at http://pubchemqc.riken.jp/pm6_dataset.html under the Creative Commons Attribution 4.0 International license.
(Semi)-local density functional approximations (DFAs) suffer from self-interaction error (SIE). When the first ionization energy (IE) is computed as the negative of the highest-occupied orbital (HO) eigenvalue, DFAs notoriously underestimate it compared to quasi-particle calculations. The inaccuracy of the HO eigenvalue is attributed to the SIE inherent in DFAs. We assessed the IE based on Perdew-Zunger self-interaction corrections for 14 small to moderate-sized organic molecules relevant to organic electronics and polymer donor materials. Although self-interaction corrected DFAs were found to significantly improve the IE relative to the uncorrected DFAs, they overestimate it. However, when the self-interaction correction is interiorly scaled using a function of the iso-orbital indicator $z_\sigma$, only the regions where SIE is significant receive a correction. We discuss these approaches and show how they significantly improve the description of the HO eigenvalue for the organic molecules.
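For reference, the quantities named above can be written out in standard notation. This is a sketch rather than the working equations of the study: the symbols $\varepsilon_{\rm HO}$, $U$, $E_{xc}$, $\rho_{i\sigma}$, $\tau_\sigma$, and $\tau^W_\sigma$ are introduced here, and the form $z_\sigma = \tau^W_\sigma/\tau_\sigma$ is a commonly used definition of the iso-orbital indicator that is assumed, not taken from the text.

```latex
\begin{align}
\mathrm{IE} &\approx -\varepsilon_{\rm HO}, \\
E^{\rm PZ\text{-}SIC} &= E^{\rm DFA}[\rho_\uparrow,\rho_\downarrow]
  - \sum_{i\sigma}\Bigl( U[\rho_{i\sigma}] + E_{xc}^{\rm DFA}[\rho_{i\sigma},0] \Bigr), \\
z_\sigma(\mathbf{r}) &= \frac{\tau^W_\sigma(\mathbf{r})}{\tau_\sigma(\mathbf{r})},
\qquad
\tau^W_\sigma = \frac{|\nabla\rho_\sigma|^2}{8\rho_\sigma},
\qquad
\tau_\sigma = \frac{1}{2}\sum_i |\nabla\psi_{i\sigma}|^2 .
\end{align}
```

With this definition, $z_\sigma$ approaches 1 in one-electron-like regions and 0 in regions of uniform density, which is why scaling the orbital-by-orbital correction by a function of $z_\sigma$ confines it to the regions where SIE is significant.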
Recent studies illustrate how machine learning (ML) can be used to bypass a core challenge of molecular modeling: the tradeoff between accuracy and computational cost. Here, we assess multiple ML approaches for predicting the atomization energy of organic molecules. Our resulting models learn the difference between low-fidelity (B3LYP) and high-accuracy (G4MP2) atomization energies, and predict the G4MP2 atomization energy to 0.005 eV (mean absolute error) for molecules with fewer than 9 heavy atoms and 0.012 eV for a small set of molecules with between 10 and 14 heavy atoms. Our two best models, which have different accuracy/speed tradeoffs, enable the efficient prediction of G4MP2-level energies for large molecules and are available through a simple web interface.
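The idea of learning the difference between a cheap and an expensive method can be illustrated with a toy sketch. It is not one of the models assessed above: the single descriptor (heavy-atom count), the least-squares line used as the learner, and the synthetic energies are all assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ a * x + b for a single descriptor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def train_delta_model(descriptors, e_low, e_high):
    """Fit the correction (high-level minus low-level energy), not the energy itself."""
    deltas = [high - low for high, low in zip(e_high, e_low)]
    return fit_line(descriptors, deltas)


def predict_high(model, descriptor, e_low):
    """High-level estimate = low-level energy + learned correction."""
    slope, intercept = model
    return e_low + slope * descriptor + intercept


# Synthetic data (hypothetical numbers, in eV): the "high-level" energies
# differ from the "low-level" ones by 0.1 eV per heavy atom plus 0.05 eV.
n_heavy = [1, 2, 3, 4, 5]
e_low = [-10.0, -20.5, -30.2, -41.0, -50.7]
e_high = [low + 0.1 * n + 0.05 for n, low in zip(n_heavy, e_low)]

model = train_delta_model(n_heavy, e_low, e_high)
estimate = predict_high(model, descriptor=6, e_low=-60.0)
```

Because the correction is typically smoother and smaller in magnitude than the total energy, a model of the difference can reach a given accuracy with less training data than a model of the high-level energy itself; the low-level calculation is still required for every prediction.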
Radical pair recombination reactions are known to be sensitive to extremely weak magnetic fields, and can therefore be said to function as molecular magnetoreceptors. The classic example is a carotenoid-porphyrin-fullerene (C$^{+}$PF$^{-}$) radical pair that has been shown to provide a proof-of-principle for the operation of a chemical compass [K. Maeda et al., Nature 453, 387 (2008)]. Previous simulations of this radical pair have employed semiclassical approximations, which are routinely applicable to its 47 coupled electronic and nuclear spins. However, calculating the exact quantum mechanical spin dynamics presents a significant challenge, and has not been possible before now. Here we use a recently developed method to perform numerically converged simulations of the C$^{+}$PF$^{-}$ quantum mechanical spin dynamics, including all coupled spins. Comparison of these quantum mechanical simulations with various semiclassical approximations reveals that, while it is not perfect, the best semiclassical approximation does capture essentially all of the relevant physics in this problem.