Retrieval of Coloured Dissolved Organic Matter with Machine Learning Methods


Published by Ana Belen Ruescas Orient

Publication date: 2021

Research field: Physics, Informatics Engineering

Language: English

Authors: Ana B. Ruescas - Martin Hieronymi - Sampsa Koponen

Geophysics, Machine Learning


The coloured dissolved organic matter (CDOM) concentration is the standard measure of humic substance in natural waters. In remote sensing, CDOM is quantified through the absorption coefficient (a) at a given wavelength (e.g. 440 nm). This paper presents a comparison of four machine learning methods for the retrieval of CDOM from remote sensing signals: regularized linear regression (RLR), random forest (RF), kernel ridge regression (KRR) and Gaussian process regression (GPR). Results are compared with the established polynomial regression algorithms. RLR is revealed as the simplest and most efficient method, followed closely by its nonlinear counterpart KRR.
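As a rough illustration of what regularized linear regression does, the sketch below fits a one-feature ridge model in closed form. It is not the authors' implementation: the feature (an invented band-ratio value), the target (an invented log-absorption value), the data points and the function names are all hypothetical, and the paper's actual inputs and regularization settings are not given in the abstract.

```python
def fit_ridge_1d(x, y, lam):
    """Fit y ~ slope * x + intercept, penalising slope**2 with weight lam.

    The intercept is left unpenalised, so the closed-form solution is
    slope = Sxy / (Sxx + lam) on mean-centred data.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a list of feature values."""
    slope, intercept = model
    return [slope * xi + intercept for xi in x]


# Hypothetical data: x is an invented band-ratio feature,
# y an invented log-transformed CDOM absorption value.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [-1.9, -1.1, 0.1, 0.9, 2.0]

ols = fit_ridge_1d(x, y, lam=0.0)    # no regularization: ordinary least squares
ridge = fit_ridge_1d(x, y, lam=1.0)  # regularization shrinks the slope

print("OLS model:  ", ols)
print("Ridge model:", ridge)
print("Ridge prediction at x = 3.0:", predict(ridge, [3.0]))
```

With `lam=0.0` the fit reduces to ordinary least squares; increasing `lam` pulls the slope towards zero. Kernel ridge regression applies the same penalty after replacing the linear feature with a kernel expansion.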


271 - Davide Piras, Alessio Spurio Mancini, Benjamin Joachimi 2021

Bayesian inference applied to microseismic activity monitoring allows for principled estimation of the coordinates of microseismic events from recorded seismograms, and their associated uncertainties. However, forward modelling of these microseismic events, necessary to perform Bayesian source inversion, can be prohibitively expensive in terms of computational resources. A viable solution is to train a surrogate model based on machine learning techniques, to emulate the forward model and thus accelerate Bayesian inference. In this paper, we improve on previous work, which considered only sources with isotropic moment tensor. We train a machine learning algorithm on the power spectrum of the recorded pressure wave and show that the trained emulator allows for the complete and fast retrieval of the event coordinates for $\textit{any}$ source mechanism. Moreover, we show that our approach is computationally inexpensive, as it can be run in less than 1 hour on a commercial laptop, while yielding accurate results using less than $10^4$ training seismograms. We additionally demonstrate how the trained emulators can be used to identify the source mechanism through the estimation of the Bayesian evidence. This work lays the foundations for the efficient localisation and characterisation of any recorded seismogram, thus helping to quantify human impact on seismic activity and mitigate seismic hazard.

Geophysics, Machine Learning, Data Analysis, Statistics and Probability

MOFSimplify: Machine Learning Models with Extracted Stability Data of Three Thousand Metal-Organic Frameworks

284 - A. Nandy, G. Terrones, N. Arunachalam 2021

We report a workflow and the output of a natural language processing (NLP)-based procedure to mine the extant metal-organic framework (MOF) literature describing structurally characterized MOFs and their solvent removal and thermal stabilities. We obtain over 2,000 solvent removal stability measures from text mining and 3,000 thermal decomposition temperatures from thermogravimetric analysis data. We assess the validity of our NLP methods and the accuracy of our extracted data by comparing to a hand-labeled subset. Machine learning (ML, i.e. artificial neural network) models trained on this data using graph- and pore-geometry-based representations enable prediction of stability on new MOFs with quantified uncertainty. Our web interface, MOFSimplify, provides users access to our curated data and enables them to harness that data for predictions on new MOFs. MOFSimplify also encourages community feedback on existing data and on ML model predictions for community-based active learning for improved MOF stability models.

Materials Science, Machine Learning, Chemical Physics

A Data-driven feature selection and machine-learning model benchmark for the prediction of longitudinal dispersion coefficient

108 - Yifeng Zhao, Pei Zhang, S.A. Galindo-Torres 2021

Longitudinal dispersion (LD) is the dominant process of scalar transport in natural streams. An accurate prediction of the LD coefficient (Dl) can produce a performance leap in related simulations. Emerging machine learning (ML) techniques provide a self-adaptive tool for this problem. However, most existing studies use an unproven quaternion feature set, obtained through simple theoretical deduction. Few studies have examined its reliability and rationality. Besides, due to the lack of comparative studies, the proper choice of ML models in different scenarios remains unknown. In this study, the Feature Gradient selector was first adopted to distill the local optimal feature sets directly from multivariable data. Then, a global optimal feature set (the channel width, the flow velocity, the channel slope and the cross-sectional area) was proposed through numerical comparison of the performance of the distilled local optimums with representative ML models. The channel slope is identified as the key parameter for the prediction of the LD coefficient. Further, we designed a weighted evaluation metric which enables comprehensive model comparison. With the simple linear model as the baseline, a benchmark of single and ensemble learning models was provided. Advantages and disadvantages of the methods involved were also discussed. Results show that the support vector machine performs significantly better than the other models. The decision tree is not suitable for this problem due to its poor generalization ability. Notably, simple models show superiority over complicated models on this low-dimensional problem, owing to their better balance between regression and generalization.

Geophysics, Machine Learning

MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning

465 - Tariq Alkhalifah, Hanchen Wang, Oleg Ovcharenko 2021

Among the biggest challenges we face in utilizing neural networks trained on waveform data (i.e., seismic, electromagnetic, or ultrasound) is their application to real data. The requirement for accurate labels forces us to develop solutions using synthetic data, where labels are readily available. However, synthetic data often do not capture the reality of the field/real experiment, and we end up with poor performance of the trained neural network (NN) at the inference stage. We describe a novel approach to enhance supervised training on synthetic data with real data features (domain adaptation). Specifically, for tasks in which the absolute values of the vertical axis (time or depth) of the input data are not crucial, like classification, or can be corrected afterward, like velocity model building using a well-log, we suggest a series of linear operations on the input so the training and application data have similar distributions. This is accomplished by applying two operations on the input data to the NN model: 1) The crosscorrelation of the input data (i.e., shot gather, seismic image, etc.) with a fixed reference trace from the same dataset. 2) The convolution of the resulting data with the mean (or a random sample) of the autocorrelated data from another domain. In the training stage, the input data are from the synthetic domain and the autocorrelated data are from the real domain, and random samples from real data are drawn at every training epoch. In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain. Example applications on passive seismic data for microseismic event source location determination and active seismic data for predicting low frequencies are used to demonstrate the power of this approach in improving the applicability of trained models to real data.
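The two linear operations described above can be sketched on plain Python lists as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, all traces within a domain are assumed to have equal length, and only the "mean of the autocorrelations" variant is shown (the abstract also mentions drawing a random sample).

```python
def convolve(a, b):
    """Full discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def cross_correlate(a, b):
    """Full cross-correlation of a with b (convolution with b time-reversed)."""
    return convolve(a, b[::-1])


def mean_autocorrelation(traces):
    """Average the autocorrelations of equal-length traces, sample by sample."""
    autos = [cross_correlate(t, t) for t in traces]
    return [sum(vals) / len(autos) for vals in zip(*autos)]


def mlreal_transform(traces, reference, other_domain_traces):
    """Apply the two operations to every input trace.

    1) cross-correlate each trace with a fixed reference trace taken
       from the same dataset;
    2) convolve the result with the mean autocorrelation of traces
       from the other domain.
    """
    kernel = mean_autocorrelation(other_domain_traces)
    return [convolve(cross_correlate(t, reference), kernel) for t in traces]


# Toy example with made-up numbers: one "synthetic" trace, which also
# serves as the reference, and one "real" trace from the other domain.
synthetic = [[1.0, 2.0]]
real = [[1.0, 0.0]]
transformed = mlreal_transform(synthetic, synthetic[0], real)
print(transformed)
```

For training, `traces` would come from the synthetic domain and `other_domain_traces` from the real domain; at inference the roles are swapped, as the abstract describes.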

Geophysics, Machine Learning, Signal Processing

Directivity Modes of Earthquake Populations with Unsupervised Learning

362 - Zachary E. Ross, Daniel T. Trugman, Kamyar Azizzadenesheli 2019

We present a novel approach for resolving modes of rupture directivity in large populations of earthquakes. A seismic spectral decomposition technique is used to first produce relative measurements of radiated energy for earthquakes in a spatially-compact cluster. The azimuthal distribution of energy for each earthquake is then assumed to result from one of several distinct modes of rupture propagation. Rather than fitting a kinematic rupture model to determine the most likely mode of rupture propagation, we instead treat the modes as latent variables and learn them with a Gaussian mixture model. The mixture model simultaneously determines the number of events that best identify with each mode. The technique is demonstrated on four datasets in California with several thousand earthquakes. We show that the datasets naturally decompose into distinct rupture propagation modes that correspond to different rupture directions, and the fault plane is unambiguously identified for all cases. We find that these small earthquakes exhibit unilateral ruptures 53-74% of the time on average. The results provide important observational constraints on the physics of earthquakes and faults.
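To illustrate how a Gaussian mixture model treats modes as latent variables, the sketch below runs expectation-maximization on a one-dimensional toy dataset. It is only a simplified stand-in: the paper works with azimuthal energy distributions, whereas the scalar feature, the data values, the initial means and the function names here are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var_j)  # floor avoids a collapsed component
    # Hard assignment: the most responsible component for each point.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Toy data with two well-separated groups (made-up values).
data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
weights, means, variances, labels = fit_gmm_1d(data, init_means=[-1.0, 1.0])
print("weights:", weights)
print("means:  ", means)
print("labels: ", labels)
```

The fitted weights, multiplied by the number of data points, give the expected number of points associated with each component, which mirrors the abstract's statement that the mixture model determines how many events identify with each mode.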

Geophysics, Machine Learning