Missing Data Imputation for Supervised Learning

Published by Jason Poulos

Publication date: 2016

Research field: Mathematical Statistics, Informatics Engineering

Paper language: English

Authors: Jason Poulos, Rafael Valle

Machine learning

Abstract

Missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information. This paper compares methods for imputing missing categorical data for supervised classification tasks. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on non-imputed (i.e., one-hot encoded) or imputed data with different levels of additional missing-data perturbation. We show imputation methods can increase predictive accuracy in the presence of missing-data perturbation, which can actually improve prediction accuracy by regularizing the classifier. We achieve the state-of-the-art on the Adult dataset with missing-data perturbation and k-nearest-neighbors (k-NN) imputation.
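The three ingredients named in the abstract (one-hot encoding of non-imputed data, additional missing-data perturbation, and k-NN imputation) can be illustrated with a minimal pure-Python sketch. This is not the authors' code. The `MISSING` marker, the function names, the all-zeros encoding of a missing entry, and the Hamming distance over jointly observed columns are all assumptions made for illustration.

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a missing categorical value


def perturb(rows, rate, seed=0):
    """Copy rows, setting each observed entry to MISSING with probability `rate`."""
    rng = random.Random(seed)
    return [
        [MISSING if (v is not MISSING and rng.random() < rate) else v for v in row]
        for row in rows
    ]


def one_hot(rows):
    """One-hot encode every column; a missing entry becomes all zeros (assumption)."""
    n_cols = len(rows[0])
    levels = [
        sorted({row[j] for row in rows if row[j] is not MISSING})
        for j in range(n_cols)
    ]
    encoded = []
    for row in rows:
        vec = []
        for j, value in enumerate(row):
            vec.extend(1 if value == level else 0 for level in levels[j])
        encoded.append(vec)
    return encoded, levels


def hamming(a, b):
    """Mismatch fraction over columns observed in both rows (1.0 if none are shared)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing entry by majority vote among the k nearest donor rows."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not MISSING]
            donors.sort(key=lambda r: hamming(row, r))
            votes = Counter(r[j] for r in donors[:k])
            if votes:
                imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


rows = [["a", "x"], ["a", "x"], ["b", "y"], ["a", MISSING]]
encoded, levels = one_hot(rows)      # non-imputed representation
completed = knn_impute(rows, k=2)    # imputed representation
noisy = perturb(rows, rate=0.2)      # extra missingness applied before training
```

Under these assumptions, a classifier would be trained either on `encoded` (non-imputed) or on the one-hot encoding of `completed` (imputed), with `perturb` applied first at the chosen rate.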

207 - Boris Muzellec, Julie Josse, Claire Boyer 2020

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, which may or may not exploit parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.

Machine learning

Imputation estimators for unnormalized models with missing data

103 - Masatoshi Uehara, Takeru Matsuda, Jae Kwang Kim 2019

Several statistical models are given in the form of unnormalized densities, and calculation of the normalization constant is intractable. We propose estimation methods for such unnormalized models with missing data. The key concept is to combine imputation techniques with estimators for unnormalized models, including noise contrastive estimation and score matching. In addition, we derive asymptotic distributions of the proposed estimators and construct confidence intervals. Simulation results with truncated Gaussian graphical models and an application to real wind direction data reveal that the proposed methods effectively enable statistical inference with unnormalized models from missing data.

Machine learning, Methodology

Imputation and low-rank estimation with Missing Not At Random data

89 - Aude Sportisse 2018

Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions is a very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values; a low-rank structure for exponential family is assumed on this new matrix, in order to encode links between variables and missing mechanisms. The methodology, which has the great advantage of handling different missing value mechanisms, is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict whether doctors should administer tranexamic acid, which would limit excessive bleeding, to patients with traumatic brain injury.

Machine learning

Scalable Low-Rank Tensor Learning for Spatiotemporal Traffic Data Imputation

84 - Xinyu Chen, Yixian Chen, Nicolas Saunier 2020

The missing value problem in spatiotemporal traffic data has long been a challenging topic, in particular for large-scale and high-dimensional data with complex missing mechanisms and diverse degrees of missingness. Recent studies based on the tensor nuclear norm have demonstrated the superiority of tensor learning in imputation tasks by effectively characterizing the complex correlations/dependencies in spatiotemporal data. However, despite the promising results, these approaches do not scale well to large data tensors. In this paper, we focus on addressing the missing data imputation problem for large-scale spatiotemporal traffic data. To achieve both high accuracy and efficiency, we develop a scalable tensor learning model -- Low-Tubal-Rank Smoothing Tensor Completion (LSTC-Tubal) -- based on the existing framework of Low-Rank Tensor Completion, which is well-suited for spatiotemporal traffic data that is characterized by a multidimensional structure of location $\times$ time of day $\times$ day. In particular, the proposed LSTC-Tubal model involves a scalable tensor nuclear norm minimization scheme that integrates a linear unitary transformation. Therefore, tensor nuclear norm minimization can be solved by singular value thresholding on the transformed matrix of each day, while the day-to-day correlation can be effectively preserved by the unitary transform matrix. We compare LSTC-Tubal with state-of-the-art baseline models, and find that LSTC-Tubal can achieve competitive accuracy with a significantly lower computational cost. In addition, LSTC-Tubal will also benefit other tasks in modeling large-scale spatiotemporal traffic data, such as network-level traffic forecasting.
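The singular value thresholding step mentioned above can be sketched on a single matrix. The code below is not LSTC-Tubal: it is a plain matrix soft-thresholding imputation loop, swapped in to show the thresholding operator in isolation. The function names, the `tau` value and the iteration count are assumptions.

```python
import numpy as np


def svt(matrix, tau):
    """Singular value thresholding: shrink each singular value by tau, clipping at zero."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt


def impute_low_rank(x, observed, tau=1.0, n_iter=200):
    """Fill unobserved entries by repeatedly applying svt and restoring observed entries."""
    filled = np.where(observed, x, 0.0)  # start with zeros in the missing cells
    for _ in range(n_iter):
        low_rank = svt(filled, tau)
        filled = np.where(observed, x, low_rank)  # observed entries are never changed
    return filled


# Hypothetical rank-1 matrix (for example, sensors by time slots) with one hidden entry.
x = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])
observed = np.ones_like(x, dtype=bool)
observed[2, 1] = False
filled = impute_low_rank(x, observed, tau=0.1)
```

In the tensor setting described in the abstract, an operator like `svt` would be applied to the transformed matrix of each day rather than to one flat matrix.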

Machine learning

On the consistency of supervised learning with missing values

129 - Julie Josse, Nicolas Prost (CMAP) 2019

In many application settings, the data have missing entries, which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we further analyze decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing different missing-value strategies in trees, both theoretically and empirically, we recommend using the missing incorporated in attribute (MIA) method, as it can handle both non-informative and informative missing values.
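Two of the strategies discussed above can be sketched in pure Python: imputing with a constant learned on the training data, and enumerating the candidate splits that the missing incorporated in attribute method considers for one threshold. The helper names are hypothetical, and the split enumeration follows the usual description of MIA (missing values sent left, sent right, or separated from observed values); it is not code from the paper.

```python
import math


def is_missing(v):
    """Treat None and NaN as missing (assumed encoding)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def fit_mean(train_values):
    """Mean of the observed training values, to be reused unchanged at test time."""
    observed = [v for v in train_values if not is_missing(v)]
    return sum(observed) / len(observed)


def impute_constant(values, constant):
    """Replace every missing entry by the given constant."""
    return [constant if is_missing(v) else v for v in values]


def mia_candidate_splits(values, threshold):
    """Return the three candidate (left, right) index partitions for one threshold."""
    miss = [i for i, v in enumerate(values) if is_missing(v)]
    low = [i for i, v in enumerate(values) if not is_missing(v) and v <= threshold]
    high = [i for i, v in enumerate(values) if not is_missing(v) and v > threshold]
    return [
        (sorted(low + miss), high),   # missing values go left
        (low, sorted(high + miss)),   # missing values go right
        (miss, sorted(low + high)),   # missing versus observed
    ]


train = [1.0, None, 3.0]
mean = fit_mean(train)                       # learned on training data only
test_filled = impute_constant([None, 5.0], mean)
splits = mia_candidate_splits(train, threshold=2.0)
```

A tree learner using this scheme would score each of the three partitions with its usual impurity criterion and keep the best one, which is how missingness itself can become informative for the split.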

Machine learning, Statistics theory