A Data-driven feature selection and machine-learning model benchmark for the prediction of longitudinal dispersion coefficient

Published by Yifeng Zhao

Publication date: 2021

Research field: Physics, Informatics Engineering

Paper language: English

Authors: Yifeng Zhao, Pei Zhang, S.A. Galindo-Torres

Geophysics, Machine Learning

Abstract

Longitudinal dispersion (LD) is the dominant process of scalar transport in natural streams. An accurate prediction of the LD coefficient (Dl) can substantially improve the performance of related simulations. Emerging machine learning (ML) techniques provide a self-adaptive tool for this problem. However, most existing studies use an unproven four-parameter feature set obtained through simple theoretical deduction, and few have examined its reliability and rationality. Moreover, because comparative studies are lacking, the proper choice of ML model in different scenarios remains unknown. In this study, the Feature Gradient selector was first adopted to distill locally optimal feature sets directly from multivariable data. Then, a globally optimal feature set (the channel width, the flow velocity, the channel slope and the cross-sectional area) was proposed by numerically comparing the performance of the distilled local optima with representative ML models. The channel slope is identified as the key parameter for predicting the LD coefficient. Further, we designed a weighted evaluation metric that enables comprehensive model comparison. With the simple linear model as the baseline, a benchmark of single and ensemble learning models is provided, and the advantages and disadvantages of the methods involved are discussed. Results show that the support vector machine performs significantly better than the other models. The decision tree is not suitable for this problem because of its poor generalization ability. Notably, simple models outperform complicated models on this low-dimensional problem because they strike a better balance between regression and generalization.

78 - Yifeng Zhao, Zicheng Liu, Pei Zhang 2021

A better understanding of dispersion in natural streams requires knowledge of the longitudinal dispersion coefficient (LDC). Various methods have been proposed for predicting the LDC. These studies can be grouped into three types: analytical, statistical and ML-driven (implicit and explicit). However, a comprehensive evaluation of them is still lacking. In this paper, we first present an in-depth analysis of those methods and identify their defects. The analysis is carried out on an extensive database of 660 samples of hydraulic and channel properties from around the world. The reliability and representativeness of the data are enhanced by using Subset Selection of Maximum Dissimilarity (SSMD) for testing-set selection and the interquartile range (IQR) for outlier removal. The evaluation ranks the methods as: ML-driven methods > statistical methods > analytical methods. Whereas implicit ML-driven methods are black boxes in nature, explicit ML-driven methods have more potential for predicting the LDC. Besides, overfitting is a universal problem in existing models, which also suffer from a fixed parameter combination. To establish an interpretable, higher-performing model for LDC prediction, we then design a novel symbolic regression method called the evolutionary symbolic regression network (ESRN), which combines genetic algorithms and neural networks. Strategies are introduced to avoid overfitting and to explore more parameter combinations. Results show that the ESRN model outperforms existing symbolic models. The proposed model is suitable for practical engineering problems because it requires few parameters (only w and U*). It can provide convincing solutions for situations where field tests cannot be carried out or only limited field information is available.

Machine Learning, Data Analysis, Statistics and Probability
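
The IQR-based outlier removal mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the fence multiplier or the quartile convention, so the common choice k = 1.5 and the "inclusive" quartile method are assumptions here, and the function name `iqr_filter` and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional fence width, assumed here;
    the paper may use a different setting.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical measurements with one obviously extreme sample.
samples = [12.0, 15.0, 14.0, 13.0, 16.0, 15.5, 14.5, 900.0]
cleaned = iqr_filter(samples)
print(cleaned)
```

In this toy input only the extreme value falls outside the fences, so the other seven samples are kept in their original order.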

Learning Dynamic Feature Selection for Fast Sequential Prediction

390 - Emma Strubell, Luke Vilnis, Kate Silverstein 2015

We present paired learning and inference algorithms for significantly reducing computation and increasing speed of the vector dot products in the classifiers that are at the heart of many NLP components. This is accomplished by partitioning the features into a sequence of templates which are ordered such that high confidence can often be reached using only a small fraction of all features. Parameter estimation is arranged to maximize accuracy and early confidence in this sequence. Our approach is simpler and better suited to NLP than other related cascade methods. We present experiments in left-to-right part-of-speech tagging, named entity recognition, and transition-based dependency parsing. On the typical benchmarking datasets we can preserve POS tagging accuracy above 97% and parsing LAS above 88.5% both with over a five-fold reduction in run-time, and NER F1 above 88 with more than 2x increase in speed.

Computation and Language, Machine Learning
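
The idea of scoring with an ordered sequence of feature templates and stopping once the prediction is confident can be sketched as follows. This is only an illustration under stated assumptions: the stopping rule (margin between the two best label scores), the function names `dot` and `cascade_predict`, and the toy weights are all hypothetical and are not taken from the paper.

```python
def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def cascade_predict(templates, weights, labels, margin):
    """Add template scores one at a time; stop once the top label leads by `margin`.

    templates: list of feature dicts, ordered from first to last template.
    weights:   weights[label] is a dict mapping feature name to weight.
    Assumes at least two labels. Returns (best_label, templates_used).
    """
    scores = {label: 0.0 for label in labels}
    used = 0
    for features in templates:
        used += 1
        for label in labels:
            scores[label] += dot(weights[label], features)
        ranked = sorted(scores.values(), reverse=True)
        if ranked[0] - ranked[1] >= margin:  # assumed confidence test
            break
    return max(scores, key=scores.get), used


# Hypothetical two-label tagger with two templates.
labels = ["NOUN", "VERB"]
weights = {
    "NOUN": {"word=dog": 2.0, "suffix=g": 0.3},
    "VERB": {"word=dog": 0.2, "suffix=g": 0.4},
}
templates = [{"word=dog": 1.0}, {"suffix=g": 1.0}]

print(cascade_predict(templates, weights, labels, margin=1.0))
print(cascade_predict(templates, weights, labels, margin=5.0))
```

With the smaller margin the first template already separates the labels, so the second dot product is skipped; with the larger margin every template is evaluated.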

Dispersion Characterization and Pulse Prediction with Machine Learning

385 - Sanjaya Lohani, Erin M. Knutson, Wenlei Zhang 2019

In this work we demonstrate the efficacy of neural networks in the characterization of dispersive media. We also develop a neural network to make predictions for input probe pulses which propagate through a nonlinear dispersive medium, which may be applied to predicting optimal pulse shapes for a desired output. The setup requires only a single pulse for the probe, providing considerable simplification of the current method of dispersion characterization that requires frequency scanning across the entirety of the gain and absorption features. We show that the trained networks are able to predict pulse profiles as well as dispersive features that are nearly identical to their experimental counterparts. We anticipate that the use of machine learning in conjunction with optical communication and sensing methods, both classical and quantum, can provide signal enhancement and experimental simplifications even in the face of highly complex, layered nonlinear light-matter interactions.

Optics, Machine Learning, Signal Processing

Earthquake Detection in 1-D Time Series Data with Feature Selection and Dictionary Learning

296 - Zheng Zhou, Youzuo Lin, Zhongping Zhang 2018

Earthquakes can be detected by matching spatial patterns or phase properties from 1-D seismic waves. Current earthquake detection methods, such as waveform correlation and template matching, have difficulty detecting anomalous earthquakes that are not similar to other earthquakes. In recent years, machine-learning techniques for earthquake detection have been emerging as a new active research direction. In this paper, we develop a novel earthquake detection method based on dictionary learning. Our detection method first generates rich features via signal processing and statistical methods and further employs feature selection techniques to choose features that carry the most significant information. Based on these selected features, we build a dictionary for classifying earthquake events from non-earthquake events. To evaluate the performance of our dictionary-based detection methods, we test our method on a labquake dataset from Penn State University, which contains 3,357,566 time series data points with a 400 MHz sampling rate. 1,000 earthquake events are manually labeled in total, and the length of these earthquake events varies from 74 to 7151 data points. Through comparison to other detection methods, we show that our feature selection and dictionary learning incorporated earthquake detection method achieves an 80.1% prediction accuracy and outperforms the baseline methods in earthquake detection, including Template Matching (TM) and Support Vector Machine (SVM).

Geophysics

Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform

78 - Zhenyu Zhao, Radhika Anand, Mallory Wang 2019

In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is one essential method in such applications for multiple objectives: improving the prediction accuracy by eliminating irrelevant features, accelerating the model training and prediction speed, reducing the monitoring and maintenance workload for feature data pipeline, and providing better model interpretation and diagnosis capability. However, selecting an optimal feature subset from a large feature space is considered as an NP-complete problem. The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework solves this problem by selecting the relevant features while controlling for the redundancy within the selected features. This paper describes the approach to extend, evaluate, and implement the mRMR feature selection methods for classification problem in a marketing machine learning platform at Uber that automates creation and deployment of targeting and personalization models at scale. This study first extends the existing mRMR methods by introducing a non-linear feature redundancy measure and a model-based feature relevance measure. Then an extensive empirical evaluation is performed for eight different feature selection methods, using one synthetic dataset and three real-world marketing datasets at Uber to cover different use cases. Based on the empirical results, the selected mRMR method is implemented in production for the marketing machine learning platform. A description of the production implementation is provided and an online experiment deployed through the platform is discussed.

Machine Learning
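
The greedy mRMR idea described in the abstract above (pick relevant features while penalizing redundancy with those already chosen) can be sketched as below. This is a simplified illustration, not the paper's extended variants: it assumes absolute Pearson correlation for both relevance and redundancy, combines them by subtraction, and uses a numeric toy target rather than a classification label. The names `pearson` and `mrmr` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def mrmr(features, target, k):
    """Greedily select k feature names by relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_near_copy" is almost identical to "a"; "b" is weakly related to "a".
features = {
    "a": [1, 2, 3, 4, 5, 7],
    "a_near_copy": [1, 2, 3, 4, 5, 6],
    "b": [1, -1, 0, 0, -1, 1],
}
target = [5, 1, 6, 8, 7, 15]
print(mrmr(features, target, k=2))
```

On this toy data the near-duplicate column is passed over in the second step in favor of the less correlated one, even though its individual relevance is higher.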