Ease.ml/snoopy: Towards Automatic Feasibility Study for Machine Learning Applications

Published by Cedric Renggli in 2020 in the field of Informatics Engineering. The paper is written in English.

Abstract

In our experience of working with domain experts who use today's AutoML systems, a common problem we encounter is what we call unrealistic expectations: users face a very challenging task with a noisy data acquisition process, yet are expected to achieve startlingly high accuracy with machine learning (ML). Consequently, many computationally expensive AutoML runs and labour-intensive ML development processes are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present ease.ml/snoopy with the goal of performing an automatic feasibility study before building ML applications or collecting too many samples. A user provides inputs in the form of a dataset that is representative of the task and data acquisition process, and a quality target (e.g., expected accuracy > 0.8). The system returns its deduction on whether this target is achievable using ML given the input data. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error. The key technical contribution of this work is the design of a practical Bayes error estimator. We carefully evaluate the benefits and limitations of running ease.ml/snoopy prior to training ML models on datasets that are too noisy to reach the desired target accuracy. By including the automatic feasibility study in the iterative label cleaning process, users are able to save substantial labeling time and cost.
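To make the idea of a feasibility check concrete, the following is a minimal sketch and not the estimator designed in the paper. It assumes a classical approach: measure the leave-one-out 1-nearest-neighbour error on the representative dataset, invert the Cover-Hart bound to obtain an approximate lower bound on the Bayes error, and compare the implied best achievable accuracy with the user's target. The function names (`one_nn_error`, `bayes_error_lower_bound`, `is_target_feasible`) are hypothetical, and the bound holds only asymptotically, so on small samples the result is a rough indication at best.

```python
import numpy as np


def one_nn_error(X, y):
    """Leave-one-out 1-NN error rate under Euclidean distance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared distances between all samples.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # A sample must not be its own nearest neighbour.
    np.fill_diagonal(d2, np.inf)
    nearest = np.argmin(d2, axis=1)
    return float(np.mean(y[nearest] != y))


def bayes_error_lower_bound(err_1nn, num_classes):
    """Invert the Cover-Hart bound R_1NN <= R*(2 - C/(C-1) R*) for R*."""
    c = num_classes
    inner = max(0.0, 1.0 - c / (c - 1) * err_1nn)
    return float((c - 1) / c * (1.0 - np.sqrt(inner)))


def is_target_feasible(X, y, target_accuracy):
    """Return (feasible, estimated best achievable accuracy)."""
    num_classes = len(np.unique(y))
    err = one_nn_error(X, y)
    ber = bayes_error_lower_bound(err, num_classes)
    best_accuracy = 1.0 - ber
    return bool(best_accuracy >= target_accuracy), best_accuracy


if __name__ == "__main__":
    # Two well-separated clusters: the target of 0.8 should look feasible.
    X = [[0, 0], [0, 1], [10, 10], [10, 11]]
    y = [0, 0, 1, 1]
    print(is_target_feasible(X, y, target_accuracy=0.8))
```

In this sketch, raw feature vectors are compared directly. In practice the quality of a nearest-neighbour estimate depends heavily on the feature representation, so the choice of features is an assumption the user would need to revisit.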
