Semi-supervised Wrapper Feature Selection by Modeling Imperfect Labels


Published by Massih-Reza Amini

Publication date: 2019

Research field: Informatics Engineering, Mathematical Statistics

Paper language: English

Authors: Vasilii Feofanov, Emilie Devijver, Massih-Reza Amini

Machine learning


In this paper, we propose a new wrapper feature selection approach with partially labeled training examples where unlabeled observations are pseudo-labeled using the predictions of an initial classifier trained on the labeled training set. The wrapper is composed of a genetic algorithm for proposing new feature subsets, and an evaluation measure for scoring the different feature subsets. The selection of feature subsets is done by assigning weights to characteristics and recursively eliminating those that are irrelevant. The selection criterion is based on a new multi-class $\mathcal{C}$-bound that explicitly takes into account the mislabeling errors induced by the pseudo-labeling mechanism, using a probabilistic error model. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised feature selection approaches.
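The overall loop described in the abstract (pseudo-label the unlabeled points with an initial classifier, then let a genetic algorithm propose feature subsets that a criterion scores) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: the nearest-centroid classifier, the fitness function (accuracy on the labeled set minus a per-feature penalty), and every function name and parameter are hypothetical stand-ins. In particular, the fitness here replaces the paper's $\mathcal{C}$-bound criterion, which the abstract does not spell out, and the sketch omits the feature-weighting and recursive-elimination step.

```python
import random

def centroids(X, y, mask):
    """Per-class mean vector over the features kept by the 0/1 mask."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        vec = [v for v, keep in zip(row, mask) if keep]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(cents, row, mask):
    """Nearest-centroid prediction; ties go to the smallest class label."""
    vec = [v for v, keep in zip(row, mask) if keep]
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, cents[c]))
    return min(sorted(cents), key=dist)

def fitness(mask, X_lab, y_lab, X_unl, pseudo, penalty=0.01):
    """Stand-in criterion (NOT the C-bound): labeled accuracy minus a size penalty."""
    if not any(mask):
        return 0.0
    cents = centroids(X_lab + X_unl, y_lab + pseudo, mask)
    hits = sum(predict(cents, r, mask) == t for r, t in zip(X_lab, y_lab))
    return hits / len(y_lab) - penalty * sum(mask)

def genetic_search(X_lab, y_lab, X_unl, n_gen=20, pop_size=12, seed=0):
    rng = random.Random(seed)
    d = len(X_lab[0])
    full = [1] * d
    # Initial classifier trained on labeled data only, used to pseudo-label.
    base = centroids(X_lab, y_lab, full)
    pseudo = [predict(base, r, full) for r in X_unl]
    # Population of feature masks; the full mask is always included.
    pop = [full] + [[rng.randint(0, 1) for _ in range(d)]
                    for _ in range(pop_size - 1)]
    score = lambda m: fitness(m, X_lab, y_lab, X_unl, pseudo)
    for _ in range(n_gen):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, d)           # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:              # occasional bit-flip mutation
                j = rng.randrange(d)
                child[j] = 1 - child[j]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X_lab = [[0.0, 1.0, -1.0], [0.2, -1.0, 1.0], [1.0, 1.0, -1.0], [1.2, -1.0, 1.0]]
y_lab = [0, 0, 1, 1]
X_unl = [[0.1, 1.0, 1.0], [0.1, -1.0, -1.0], [1.1, 1.0, 1.0], [1.1, -1.0, -1.0]]
best = genetic_search(X_lab, y_lab, X_unl)
print("selected mask:", best)
```

On this toy data, any mask that drops feature 0 cannot beat the full mask, so the returned subset always keeps the informative feature. A faithful implementation would replace `fitness` with a criterion that models the pseudo-labeling errors explicitly.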


Shiming Chen, Yisong Wang, Chin-Teng Lin (2018)

Data augmentation is usually used by supervised learning approaches for offline writer identification, but such approaches require extra training data and potentially lead to overfitting errors. In this study, a semi-supervised feature learning pipeline was proposed to improve the performance of writer identification by training with extra unlabeled data and the original labeled data simultaneously. Specifically, we proposed a weighted label smoothing regularization (WLSR) method for data augmentation, which assigned the weighted uniform label distribution to the extra unlabeled data. The WLSR method could regularize the convolutional neural network (CNN) baseline to allow more discriminative features to be learned to represent the properties of different writing styles. The experimental results on well-known benchmark datasets (ICDAR2013 and CVL) showed that our proposed semi-supervised feature learning approach could significantly improve the baseline measurement and perform competitively with existing writer identification approaches. Our findings provide new insights into offline writer identification.
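The idea of giving unlabeled samples a uniform target distribution can be illustrated with a small loss computation. The abstract does not give the WLSR formula, so the sketch below assumes one plausible reading: labeled samples use ordinary cross-entropy, while unlabeled samples use cross-entropy against a uniform distribution over the K classes, scaled by a hypothetical `weight` parameter. The function names are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def labeled_loss(logits, label):
    """Standard cross-entropy against a one-hot target."""
    return -log_softmax(logits)[label]

def uniform_target_loss(logits, weight=1.0):
    """Cross-entropy against a uniform target, scaled by an assumed weight."""
    logp = log_softmax(logits)
    return -weight * sum(logp) / len(logp)

# A confident prediction on an unlabeled sample is penalized more than
# an undecided one, which is the regularizing effect of the uniform target.
print(uniform_target_loss([5.0, 0.0, 0.0]))
print(uniform_target_loss([0.0, 0.0, 0.0]))
```

With equal logits the uniform-target loss reduces to log K times the weight, its minimum; peaked predictions on unlabeled data raise it.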

Machine learning

Supervised Feature Subset Selection and Feature Ranking for Multivariate Time Series without Feature Extraction

Shuchu Han, Alexandru Niculescu-Mizil (2020)

We introduce supervised feature ranking and feature subset selection algorithms for multivariate time series (MTS) classification. Unlike most existing supervised/unsupervised feature selection algorithms for MTS, our techniques do not require a feature extraction step to generate a one-dimensional feature vector from the time series. Instead, they are based on directly computing similarity between individual time series and assessing how well the resulting cluster structure matches the labels. The techniques are amenable to heterogeneous MTS data, where the time series measurements may have different sampling resolutions, and to multi-modal data.

Machine learning

Semi-Automatic Data Annotation guided by Feature Space Projection

Barbara Caroline Benato, Jancarlo Ferreira Gomes, Alexandru Cristian Telea (2020)

Data annotation using visual inspection (supervision) of each training sample can be laborious. Interactive solutions alleviate this by helping experts propagate labels from a few supervised samples to unlabeled ones based solely on the visual analysis of their feature space projection (with no further sample supervision). We present a semi-automatic data annotation approach based on suitable feature space projection and semi-supervised label estimation. We validate our method on the popular MNIST dataset and on images of human intestinal parasites with and without fecal impurities, a large and diverse dataset that makes classification very hard. We evaluate two approaches for semi-supervised learning from the latent and projection spaces, to choose the one that best reduces user annotation effort and also increases classification accuracy on unseen data. Our results demonstrate the added value of visual analytics tools that combine complementary abilities of humans and machines for more effective machine learning.

Machine learning

Feature Selection Methods for Uplift Modeling

Zhenyu Zhao, Yumin Zhang, Totte Harinen (2020)

Uplift modeling is a predictive modeling technique that estimates the user-level incremental effect of a treatment using machine learning models. It is often used for targeting promotions and advertisements, as well as for the personalization of product offerings. In these applications, there are often hundreds of features available to build such models. Keeping all the features in a model can be costly and inefficient. Feature selection is an essential step in the modeling process for multiple reasons: improving the estimation accuracy by eliminating irrelevant features, accelerating model training and prediction speed, reducing the monitoring and maintenance workload for the feature data pipeline, and providing better model interpretation and diagnostics capability. However, feature selection methods for uplift modeling have been rarely discussed in the literature. Although there are various feature selection methods for standard machine learning models, we will demonstrate that those methods are sub-optimal for solving the feature selection problem for uplift modeling. To address this problem, we introduce a set of feature selection methods designed specifically for uplift modeling, including both filter methods and embedded methods. To evaluate the effectiveness of the proposed feature selection methods, we use different uplift models and measure the accuracy of each model with a different number of selected features. We use both synthetic and real data to conduct these experiments. We also implemented the proposed filter methods in an open source Python package (CausalML).
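To make the notion of a filter method for uplift concrete, here is a minimal sketch of one possible score. It is an assumption for illustration, not one of the paper's filters and not the CausalML API: the feature is split into equal-width bins, the uplift (treated outcome rate minus control outcome rate) is computed per bin, and the score is the size-weighted squared deviation of per-bin uplift from the overall uplift. A feature that merely predicts the outcome, without changing the treatment effect, scores zero. The function name and `n_bins` parameter are hypothetical.

```python
def uplift_filter_score(feature, treated, outcome, n_bins=2):
    """Size-weighted variance of per-bin uplift around the overall uplift."""
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_bins or 1.0          # constant feature: single bin
    bins = {}
    tot = [0, 0, 0, 0]                         # n_t, sum_y_t, n_c, sum_y_c
    for x, t, y in zip(feature, treated, outcome):
        b = min(int((x - lo) / width), n_bins - 1)
        stats = bins.setdefault(b, [0, 0, 0, 0])
        offset = 0 if t else 2
        for s in (stats, tot):
            s[offset] += 1
            s[offset + 1] += y

    def uplift(s):
        # Undefined when a group is empty in this bin.
        if s[0] == 0 or s[2] == 0:
            return None
        return s[1] / s[0] - s[3] / s[2]

    overall = uplift(tot)
    if overall is None:
        return 0.0
    score = 0.0
    for s in bins.values():
        u = uplift(s)
        if u is not None:
            score += (s[0] + s[2]) / len(feature) * (u - overall) ** 2
    return score

# Toy data: the treatment only works when a == 1; b is unrelated to uplift.
rows = [(a, b, t, int(a == 1 and t == 1))
        for a in (0, 1) for b in (0, 1) for t in (0, 1)]
a_col = [r[0] for r in rows]
b_col = [r[1] for r in rows]
t_col = [r[2] for r in rows]
y_col = [r[3] for r in rows]
print(uplift_filter_score(a_col, t_col, y_col))  # feature that drives uplift
print(uplift_filter_score(b_col, t_col, y_col))  # feature unrelated to uplift
```

Ranking features by such a score and keeping the top few is the general shape of a filter method; the specific divergence measure is the design choice that distinguishes one filter from another.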

Machine learning, Methodology

Semi-supervised Neural Networks solve an inverse problem for modeling Covid-19 spread

Alessandro Paticchio, Tommaso Scarlatti, Marios Mattheakis (2020)

Studying the dynamics of COVID-19 is of paramount importance to understanding the efficiency of restrictive measures and developing strategies to defend against upcoming contagion waves. In this work, we study the spread of COVID-19 using a semi-supervised neural network and assuming a passive part of the population remains isolated from the virus dynamics. We start with an unsupervised neural network that learns solutions of differential equations for different modeling parameters and initial conditions. A supervised method then solves the inverse problem by estimating the optimal conditions that generate functions to fit the data for those infected by, recovered from, and deceased due to COVID-19. This semi-supervised approach incorporates real data to determine the evolution of the spread, the passive population, and the basic reproduction number for different countries.

Machine learning