Determining the aqueous solubility of molecules is a vital step in many pharmaceutical, environmental, and energy storage applications. Despite decades of effort, developing a solubility prediction model with satisfactory accuracy for many of these applications remains challenging. The goal of this study is to develop a general model capable of predicting the solubility of a broad range of organic molecules. Using the largest currently available solubility dataset, we implement deep learning-based models to predict solubility from molecular structure. We explore several molecular representations, including molecular descriptors, simplified molecular-input line-entry system (SMILES) strings, molecular graphs, and three-dimensional (3D) atomic coordinates, using four neural network architectures: fully connected neural networks (FCNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and SchNet. We find that models using molecular descriptors achieve the best performance, with GNN models also performing well. We carry out extensive error analysis to understand the molecular properties that influence model performance, feature analysis to identify which information about molecular structure is most valuable for prediction, and a transfer learning and data size study to understand the impact of data availability on model performance.
We present a molecular dynamics simulation method for computing the solubility of organic crystals in solution. The solubility is calculated from the equilibrium free energy difference between the solvated solute and its crystallized state at the crystal surface kink site. To sample the growth and dissolution process efficiently, we carried out well-tempered metadynamics simulations with a collective variable that captures the slow degrees of freedom, namely the diffusion of the solute to the kink site and its adsorption there, together with the desolvation of the kink site. Simulations were performed at different solution concentrations using constant chemical potential molecular dynamics, and the solubility was identified as the concentration at which the free energies of the grown and dissolved kink states were equal. The effectiveness of this method is demonstrated by its success in reproducing the experimental trends in the solubility of urea and naphthalene in a variety of solvents.
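The final step described above, locating the concentration at which the grown and dissolved kink states have equal free energy, can be illustrated with a minimal sketch. It assumes the simulations yield a free energy difference (grown minus dissolved) at each sampled concentration and that linear interpolation between neighbouring points is adequate. The function name, the sign convention, and any input values are hypothetical and are not taken from the paper.

```python
def solubility_from_free_energy(concentrations, delta_f):
    """Estimate the concentration at which delta_f crosses zero.

    delta_f[i] is assumed to be F_grown - F_dissolved at concentrations[i]
    (hypothetical sign convention). The zero crossing is located by linear
    interpolation between the two neighbouring points of opposite sign.
    """
    if len(concentrations) != len(delta_f) or len(concentrations) < 2:
        raise ValueError("need at least two matching (concentration, delta_f) points")

    # Sort the points by concentration so neighbours are adjacent.
    pairs = sorted(zip(concentrations, delta_f))

    for (c0, f0), (c1, f1) in zip(pairs, pairs[1:]):
        if f0 == 0.0:
            return c0  # a sampled point sits exactly at equilibrium
        if f0 * f1 < 0.0:
            # Linear interpolation to the point where delta_f = 0.
            return c0 - f0 * (c1 - c0) / (f1 - f0)

    if pairs[-1][1] == 0.0:
        return pairs[-1][0]
    raise ValueError("delta_f does not change sign over the sampled concentrations")
```

In practice the free energy differences carry statistical uncertainty, so a fit through all sampled points may be preferable to a two-point interpolation.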
We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal-organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparison with a hand-labeled subset. Machine learning (ML, i.e., artificial neural network) models trained on these data using graph- and pore-geometry-based representations enable prediction of the stability of new MOFs with quantified uncertainty. Our web interface, MOFSimplify, gives users access to our curated data and lets them use those data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions, supporting community-based active learning to improve MOF stability models.
Deep generative models have emerged as a powerful tool for learning informative molecular representations and designing novel molecules with desired properties, with applications in drug discovery and material design. Deep generative auto-encoders defined over molecular SMILES strings have been a popular choice for that purpose. However, capturing salient molecular properties such as quantum-chemical energies remains challenging and requires sophisticated neural network models of molecular graphs or geometry-based information. As a simpler and more efficient alternative, we present a SMILES Variational Auto-Encoder (VAE) augmented with topological data analysis (TDA) representations of molecules, known as persistence images. Our experiments show that this TDA augmentation enables a SMILES VAE to capture the complex relation between 3D geometry and electronic properties. It also allows generation of novel, diverse, and valid molecules whose geometric features are consistent with the training data and which exhibit a range of global electronic structural properties, such as a small HOMO-LUMO gap, a critical property for designing organic solar cells. We demonstrate that our TDA augmentation yields better success in downstream tasks than models trained without these representations and can assist in targeted molecule discovery.
Understanding the star formation properties of galaxies as a function of cosmic epoch is critical to studies of galaxy evolution. Traditionally, stellar population synthesis models have been used to obtain best-fit parameters that characterise star formation in galaxies. As multiband flux measurements become available for thousands of galaxies, an alternative approach to characterising star formation using machine learning becomes feasible. In this work, we present the use of deep learning techniques to predict three important star formation properties: stellar mass, star formation rate, and dust luminosity. We characterise the performance of our deep learning models through comparisons with outputs from a standard stellar population synthesis code.
Although aqueous electrolytes are among the most important solutions, the molecular simulation of their intertwined properties of chemical potentials, solubility, and activity coefficients has remained a challenging problem and has attracted considerable recent interest. In this perspectives review, we focus on the simplest case of aqueous sodium chloride at ambient conditions and discuss the two main factors that have impeded progress. The first is the lack of consensus on the appropriate methodology for force field (FF) development. We examine how the most commonly used FFs have been developed, and emphasize the importance of distinguishing between Training Set Properties, which are used to fit the FF parameters, and Test Set Properties, which are pure predictions of additional properties. The second is disagreement among the solubility results obtained, even when identical FFs and thermodynamic conditions are used. Solubility calculations have been approached using both thermodynamic-based methods and direct molecular dynamics-based methods implementing coexisting solution and solid phases. Although convergence has very recently been achieved among results based on the former approach, there is as yet no general agreement with simulation results based on the latter methodology. We also propose a new method to directly calculate the electrolyte standard chemical potential in the Henry-law ideality model. We conclude by making recommendations for calculating solubility, chemical potentials, and activity coefficients, and outline a potential path for future progress.
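For orientation, the standard textbook relations linking the three properties named above can be written for a 1:1 electrolyte such as NaCl on the molality scale. This is a sketch in assumed notation, not the review's own formulation: $\mu^{\dagger}$ denotes the Henry-law standard chemical potential, $\gamma_{\pm}$ the mean ionic activity coefficient, $m^{0} = 1$ mol/kg the reference molality, and $m_{s}$ the solubility.

```latex
\mu_{\mathrm{NaCl}}(T,P,m) = \mu^{\dagger}_{\mathrm{NaCl}}(T,P)
  + 2RT \ln\!\left(\frac{m\,\gamma_{\pm}}{m^{0}}\right),
\qquad
\mu_{\mathrm{NaCl}}(T,P,m_{s}) = \mu^{\mathrm{solid}}_{\mathrm{NaCl}}(T,P)
```

The second equality is the phase-equilibrium condition that thermodynamic-based solubility calculations solve for $m_{s}$. An error in either the standard chemical potential or the activity coefficient therefore carries over into the predicted solubility.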