
Independent Vector Analysis for Data Fusion Prior to Molecular Property Prediction with Machine Learning

Posted by Zois Boukouvalas
Publication date: 2018
Research language: English





Due to its high computational speed and accuracy compared to ab initio quantum chemistry and force-field modeling, the prediction of molecular properties using machine learning has received great attention in the fields of materials design and drug discovery. A main ingredient required for machine learning is a training dataset consisting of molecular features (for example, fingerprint bits, chemical descriptors, etc.) that adequately characterize the corresponding molecules. However, choosing features for any application is highly non-trivial, and no universal method for feature selection exists. In this work, we propose a data fusion framework that uses Independent Vector Analysis (IVA) to exploit underlying complementary information contained in different molecular featurization methods, bringing us a step closer to automated feature generation. Our approach takes an arbitrary number of individual feature vectors and automatically generates a single, compact (low-dimensional) set of molecular features that can be used to enhance the prediction performance of regression models. At the same time, our methodology retains the possibility of interpreting the generated features to discover relationships between molecular structures and properties. We demonstrate this on the QM7b dataset for the prediction of several properties such as atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. In addition, we show how our method helps improve the prediction of experimental binding affinities for a set of human BACE-1 inhibitors. To encourage more widespread use of IVA, we have developed the PyIVA Python package, an open-source code available for download on GitHub.
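As a rough illustration of the kind of fusion-then-regression workflow described above, the sketch below builds two synthetic featurizations of the same molecules, applies scikit-learn's FastICA to each one as a crude stand-in for IVA (genuine IVA, e.g. as implemented in PyIVA, estimates the unmixing matrices jointly by exploiting dependence between corresponding sources across featurizations), concatenates the resulting low-dimensional components, and feeds them to a kernel ridge regressor. All data, dimensions, and parameter choices here are illustrative assumptions, not the paper's actual setup or the PyIVA API.

# Minimal sketch of featurization fusion followed by property regression.
# FastICA per featurization is only a stand-in for IVA, which would couple the
# unmixing across featurizations; all data below are synthetic placeholders.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_molecules = 500

# Two hypothetical featurizations of the same molecules, e.g. fingerprint bits
# and continuous chemical descriptors.
fingerprints = rng.integers(0, 2, size=(n_molecules, 128)).astype(float)
descriptors = rng.normal(size=(n_molecules, 40))
y = descriptors[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n_molecules)  # toy property

n_components = 10
fused_blocks = []
for X in (fingerprints, descriptors):
    Xc = X - X.mean(axis=0)                       # center each featurization
    ica = FastICA(n_components=n_components, random_state=0, max_iter=1000)
    fused_blocks.append(ica.fit_transform(Xc))    # estimated source components

# Fuse: concatenate the compact source estimates from each featurization.
fused_features = np.hstack(fused_blocks)

model = KernelRidge(alpha=1.0, kernel="rbf")
scores = cross_val_score(model, fused_features, y, cv=5, scoring="r2")
print("cross-validated R^2 on the toy property:", scores.mean())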


Read also

In the last two decades, unsupervised latent variable models, blind source separation (BSS) especially, have enjoyed a strong reputation for the interpretable features they produce. Seldom do these models combine the rich diversity of information available in multiple datasets. Multiple datasets, on the other hand, yield joint solutions otherwise unavailable in isolation, with a potential for pivotal insights into complex systems. To take advantage of the complex multidimensional subspace structures that capture underlying modes of shared and unique variability across and within datasets, we present a direct, principled approach to multidataset combination. We design a new method called multidataset independent subspace analysis (MISA) that leverages joint information from multiple heterogeneous datasets in a flexible and synergistic fashion. Methodological innovations exploiting the Kotz distribution for subspace modeling, in conjunction with a novel combinatorial optimization for evasion of local minima, enable MISA to produce a robust generalization of independent component analysis (ICA), independent vector analysis (IVA), and independent subspace analysis (ISA) in a single unified model. We highlight the utility of MISA for multimodal information fusion, including sample-poor regimes and low signal-to-noise ratio scenarios, promoting novel applications in both unimodal and multimodal brain imaging data.
As artificial intelligence is increasingly affecting all parts of society and life, there is growing recognition that human interpretability of machine learning models is important. It is often argued that accuracy or other similar generalization performance metrics must be sacrificed in order to gain interpretability. Such arguments, however, fail to acknowledge that the overall decision-making system is composed of two entities: the learned model and a human who fuses together model outputs with his or her own information. As such, the relevant performance criteria should be for the entire system, not just for the machine learning component. In this work, we characterize the performance of such two-node tandem data fusion systems using the theory of distributed detection. In doing so, we work in the population setting and model interpretable learned models as multi-level quantizers. We prove that under our abstraction, the overall system of a human with an interpretable classifier outperforms one with a black box classifier.
Data analyses based on linear methods constitute the simplest, most robust, and transparent approaches to the automatic processing of large amounts of data for building supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an underappreciated method that interpolates between principal component analysis and linear regression, and can be used to conveniently reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. Here we provide a pedagogic overview of these data analysis schemes, including the use of the kernel trick to introduce an element of non-linearity, while maintaining most of the convenience and the simplicity of linear approaches. We then introduce a kernelized version of PCovR and a sparsified extension, and demonstrate the performance of this approach in revealing and predicting structure-property relations in chemistry and materials science, showing a variety of examples including elemental carbon, porous silicate frameworks, organic molecules, amino acid conformers, and molecular materials.
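For readers who want to see the interpolation concretely, here is a minimal numpy sketch of linear PCovR in the sample space, following the commonly used formulation in which a mixing parameter alpha blends the Gram matrix of the inputs with the Gram matrix of the least-squares property predictions; the data are synthetic placeholders, the trace normalization and regularization are illustrative choices, and the kernelized and sparsified variants discussed in the paper are not shown.

# Linear PCovR sketch (sample-space formulation); synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
Y = (y - y.mean()).reshape(-1, 1)

alpha = 0.5      # 1.0 recovers PCA-like behavior, 0.0 approaches pure regression
n_latent = 2

# Ridge-regularized least-squares prediction of Y from X.
lam = 1e-6
W_ls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
Y_hat = X @ W_ls

# Modified Gram matrix blending structure (X X^T) and property (Y_hat Y_hat^T).
K = alpha * (X @ X.T) / np.trace(X @ X.T) \
    + (1 - alpha) * (Y_hat @ Y_hat.T) / np.trace(Y_hat @ Y_hat.T)

# Latent structure-property map: top eigenvectors of the modified Gram matrix.
evals, evecs = np.linalg.eigh(K)
top = np.argsort(evals)[::-1][:n_latent]
T = evecs[:, top] * np.sqrt(np.clip(evals[top], 0, None))

# Properties can be regressed back from the low-dimensional latent coordinates.
W_t = np.linalg.lstsq(T, Y, rcond=None)[0]
print("latent-space R^2:", 1 - np.sum((Y - T @ W_t) ** 2) / np.sum(Y ** 2))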
The nonlinear vector autoregressive (NVAR) model provides an appealing framework to analyze multivariate time series obtained from a nonlinear dynamical system. However, the innovation (or error), which plays a key role by driving the dynamics, is almost always assumed to be additive. Additivity greatly limits the generality of the model, hindering analysis of general NVAR processes which have nonlinear interactions between the innovations. Here, we propose a new general framework called independent innovation analysis (IIA), which estimates the innovations from completely general NVAR. We assume mutual independence of the innovations as well as their modulation by an auxiliary variable (which is often taken as the time index and simply interpreted as nonstationarity). We show that IIA guarantees the identifiability of the innovations with arbitrary nonlinearities, up to a permutation and component-wise invertible nonlinearities. We also propose three estimation frameworks depending on the type of the auxiliary variable. We thus provide the first rigorous identifiability result for general NVAR, as well as very general tools for learning such models.
In this paper a data analytical approach featuring support vector machines (SVM) is employed to train a predictive model over an experimental dataset, which consists of the most relevant studies for two-phase flow pattern prediction. The database for this study consists of flow patterns or flow regimes in gas-liquid two-phase flow. The term flow pattern refers to the geometrical configuration of the gas and liquid phases in the pipe. When gas and liquid flow simultaneously in a pipe, the two phases can distribute themselves in a variety of flow configurations. Gas-liquid two-phase flow occurs ubiquitously in various major industrial fields: the petroleum, chemical, nuclear, and geothermal industries. The flow configurations differ from each other in the spatial distribution of the interface, resulting in different flow characteristics. Experimental results obtained by applying the presented methodology to different combinations of flow patterns demonstrate that the proposed approach is a state-of-the-art alternative, achieving 97% correct classification. The results suggest that machine learning could be used as an effective tool for automatic detection and classification of gas-liquid flow patterns.
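As a toy illustration of the SVM classification step, the sketch below fits an RBF-kernel SVM to synthetic stand-in data; the two input features (superficial gas and liquid velocities) and the crude rule used to generate the regime labels are assumptions for illustration only and do not reproduce the paper's experimental database or its 97% result.

# SVM flow-pattern classification sketch on synthetic stand-in data.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
v_gas = rng.uniform(0.01, 20.0, n)      # superficial gas velocity (m/s), synthetic
v_liquid = rng.uniform(0.01, 5.0, n)    # superficial liquid velocity (m/s), synthetic
X = np.column_stack([v_gas, v_liquid])

# Crude synthetic labels standing in for flow regimes (0: bubbly, 1: slug, 2: annular).
labels = np.where(v_gas > 10.0, 2, np.where(v_liquid < 1.0, 1, 0))

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
scores = cross_val_score(clf, X, labels, cv=5)
print("cross-validated accuracy:", scores.mean())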
