PMLB v1.0: An open source dataset collection for benchmarking machine learning methods


Posted by Trang Le

Publication date: 2020

Research field: Informatics Engineering

Paper language: English

Authors: Joseph D. Romano, Trang T. Le, William La Cava

Topics: Machine learning, Databases


Motivation: Novel machine learning and statistical modeling studies rely on standardized comparisons to existing methods using well-studied benchmark datasets. Few tools exist that provide rapid access to many of these datasets through a standardized, user-friendly interface that integrates well with popular data science workflows. Results: This release of PMLB provides the largest collection of diverse, public benchmark datasets for evaluating new machine learning and data science methods aggregated in one location. v1.0 introduces a number of critical improvements developed following discussions with the open-source community. Availability: PMLB is available at https://github.com/EpistasisLab/pmlb. Python and R interfaces for PMLB can be installed through the Python Package Index and Comprehensive R Archive Network, respectively.
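As a rough illustration of the Python interface mentioned in the abstract, the sketch below loads one benchmark dataset and computes a majority-class baseline accuracy. It assumes the `pmlb` package from the Python Package Index and its `fetch_data` function; the helper names `load_benchmark` and `majority_baseline_accuracy`, the `fetch` injection parameter, and the choice of the `mushroom` dataset are illustrative assumptions, not part of the paper.

```python
from collections import Counter


def load_benchmark(name, fetch=None, cache_dir=None):
    """Return (X, y) for a named benchmark dataset.

    `fetch` is a hypothetical injection point so the loader can be
    exercised offline; by default it uses pmlb.fetch_data, which
    downloads the dataset (network access required).
    """
    if fetch is None:
        # Assumes `pip install pmlb` has been run.
        from pmlb import fetch_data
        fetch = fetch_data
    return fetch(name, return_X_y=True, local_cache_dir=cache_dir)


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


if __name__ == "__main__":
    # 'mushroom' is used here as an example dataset name (assumption).
    X, y = load_benchmark("mushroom")
    print(len(X), "rows; majority baseline:", majority_baseline_accuracy(y))
```

A trivial baseline like this gives a floor against which a new method's score on the same dataset can be compared.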


Lyle Regenwetter, Brent Curry, Faez Ahmed (2021)

In this paper, we present BIKED, a dataset comprised of 4500 individually designed bicycle models sourced from hundreds of designers. We expect BIKED to enable a variety of data-driven design applications for bicycles and support the development of data-driven design methods. The dataset is comprised of a variety of design information including assembly images, component images, numerical design parameters, and class labels. In this paper, we first discuss the processing of the dataset, then highlight some prominent research questions that BIKED can help address. Of these questions, we further explore the following in detail: 1) Are there prominent gaps in the current bicycle market and design space? We explore the design space using unsupervised dimensionality reduction methods. 2) How does one identify the class of a bicycle and what factors play a key role in defining it? We address the bicycle classification task by training a multitude of classifiers using different forms of design data and identifying parameters of particular significance through permutation-based interpretability analysis. 3) How does one synthesize new bicycles using different representation methods? We consider numerous machine learning methods to generate new bicycle models as well as interpolate between and extrapolate from existing models using Variational Autoencoders. The dataset and code are available at http://decode.mit.edu/projects/biked/.

Topics: Machine learning, Databases

OpenFL: An open-source framework for Federated Learning

G Anthony Reina, Alexey Gruzdev, Patrick Foley (2021)

Federated learning (FL) is a computational paradigm that enables organizations to collaborate on machine learning (ML) projects without sharing sensitive data, such as patient records, financial data, or classified secrets. Open Federated Learning (OpenFL, https://github.com/intel/openfl) is an open-source framework for training ML algorithms using the data-private collaborative learning paradigm of FL. OpenFL works with training pipelines built with both TensorFlow and PyTorch, and can be easily extended to other ML and deep learning frameworks. Here, we summarize the motivation and development characteristics of OpenFL, with the intention of facilitating its application to existing ML model training in a production environment. Finally, we describe the first use of the OpenFL framework to train consensus ML models in a consortium of international healthcare organizations, as well as how it facilitates the first computational competition on FL.

Topics: Machine learning; Distributed, parallel, and cluster computing

hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

Farah Fahim, Benjamin Hawks, Christian Herwig (2021)

Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms for implementation with both FPGA and ASIC technologies. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends, including an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery.

Topics: Machine learning, Hardware architecture, Physics instrumentation and detectors

ProcK: Machine Learning for Knowledge-Intensive Processes

Tobias Jacobs, Jingyi Yu, Julia Gastinger (2021)

Process mining deals with extraction of knowledge from business process execution logs. Traditional process mining tasks, like process model generation or conformance checking, rely on a minimalistic feature set where each event is characterized only by its case identifier, activity type, and timestamp. In contrast, the success of modern machine learning is based on models that take any available data as direct input and build layers of features automatically during training. In this work, we introduce ProcK (Process & Knowledge), a novel pipeline to build business process prediction models that take into account both sequential data in the form of event logs and rich semantic information represented in a graph-structured knowledge base. The hybrid approach enables ProcK to flexibly make use of all information residing in the databases of organizations. Components to extract inter-linked event logs and knowledge bases from relational databases are part of the pipeline. We demonstrate the power of ProcK by training it for prediction tasks on the OULAD e-learning dataset, where we achieve state-of-the-art performance on the tasks of predicting student dropout from courses and predicting their success. We also apply our method on a number of additional machine learning tasks, including exam score prediction and early predictions that only take into account data recorded during the first weeks of the courses.

Topics: Machine learning, Databases, Neural and evolutionary computing

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

Joris Cosentino, Manuel Pariente, Samuele Cornell (2020)

In recent years, wsj0-2mix has become the reference dataset for single-channel speech separation. Most deep learning-based speech separation models today are benchmarked on it. However, recent studies have shown important performance drops when models trained on wsj0-2mix are evaluated on other, similar datasets. To address this generalization issue, we created LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension, WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we achieve competitive performance on all LibriM

Topics: Audio and speech processing