أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Yinjun Wu

CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties (Technical Report)

59 - Yinjun Wu , James Weimer , Susan B. Davidson 2021

High-quality labels are expensive to obtain for many machine learning tasks, such as medical image classification tasks. Therefore, probabilistic (weak) labels produced by weak supervision tools are used to seed a process in which influential samples with weak labels are identified and cleaned by several human annotators to improve the model performance. To lower the overall cost and computational overhead of this process, we propose a solution called CHEF (CHEap and Fast label cleaning), which consists of the following three components. First, to reduce the cost of human annotators, we use Infl, which prioritizes the most influential training samples for cleaning and provides cleaned labels to save the cost of one human annotator. Second, to accelerate the sample selector phase and the model constructor phase, we use Increm-Infl to incrementally produce influential samples, and DeltaGrad-L to incrementally update the model. Third, we redesign the typical label cleaning pipeline so that human annotators iteratively clean smaller batch of samples rather than one big batch of samples. This yields better over all model performance and enables possible early termination when the expected model performance has been achieved. Extensive experiments show that our approach gives good model prediction performance while achieving significant speed-ups.

قواعد البيانات التعلم الآلي

Dynamic Gaussian Mixture based Deep Generative Model For Robust Forecasting on Sparse Multivariate Time Series

137 - Yinjun Wu , Jingchao Ni , Wei Cheng 2021

Forecasting on sparse multivariate time series (MTS) aims to model the predictors of future values of time series given their incomplete past, which is important for many emerging applications. However, most existing methods process MTSs individually , and do not leverage the dynamic distributions underlying the MTSs, leading to sub-optimal results when the sparsity is high. To address this challenge, we propose a novel generative model, which tracks the transition of latent clusters, instead of isolated feature representations, to achieve robust modeling. It is characterized by a newly designed dynamic Gaussian mixture distribution, which captures the dynamics of clustering structures, and is used for emitting timeseries. The generative model is parameterized by neural networks. A structured inference network is also designed for enabling inductive analysis. A gating mechanism is further introduced to dynamically tune the Gaussian mixture distributions. Extensive experimental results on a variety of real-life datasets demonstrate the effectiveness of our method.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

DeltaGrad: Rapid retraining of machine learning models

162 - Yinjun Wu , Edgar Dobriban , Susan B. Davidson 2020

Machine learning models are not static and may need to be retrained on slightly changed datasets, for instance, with the addition or deletion of a set of data points. This has many applications, including privacy, robustness, bias reduction, and unce rtainty quantifcation. However, it is expensive to retrain models from scratch. To address this problem, we propose the DeltaGrad algorithm for rapid retraining machine learning models based on information cached during the training phase. We provide both theoretical and empirical support for the effectiveness of DeltaGrad, and show that it compares favorably to the state of the art.

التعلم الآلي التعلم الالي

Lessons learned from the early performance evaluation of Intel Optane DC Persistent Memory in DBMS

253 - Yinjun Wu , Kwanghyun Park , Rathijit Sen 2020

Non-volatile memory (NVM) is an emerging technology, which has the persistence characteristics of large capacity storage devices(e.g., HDDs and SSDs), while providing the low access latency and byte-addressablity of traditional DRAM memory. This uniq ue combination of features open up several new design considerations when building database management systems (DBMSs), such as replacing DRAM (as the main working space memory) or block devices (as the persistent storage), or complementing both at the same time for several DBMS components (such as access methods,storage engine, buffer management, logging/recovery, etc). However, interacting with NVM requires changes to application software to best use the device (e.g. mmap and clflush of small cache lines instead of write and fsync of large page buffers). Before introducing (potentially major) code changes to the DBMS for NVM, developers need a clear understanding of NVM performance in various conditions to help make better design choices. In this paper, we provide extensive performance evaluations conducted with a recently released NVM device, Intel Optane DC Persistent Memory (PMem), under different configurations with several micro-benchmark tools. Further, we evaluate OLTP and OLAP database workloads (i.e., TPC-C and TPC-H) with Microsoft SQL Server 2019 when using the NVM device as an in-memory buffer pool or persistent storage. From the lessons learned we share some recommendations for future DBMS design with PMem, e.g.simple hardware or software changes are not enough for the best use of PMem in DBMSs.

قواعد البيانات

PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models

59 - Yinjun Wu , Val Tannen , Susan B. Davidson 2020

The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put in better understanding and debugging machine learning models, as well as in identifyin g and repairing errors in training datasets. Our focus is on how to assist these activities when they have to retrain the machine learning model after removing problematic training samples in cleaning or selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate it experimentally. Experimental results show that up to two orders of magnitude speed-ups can be achieved by PrIU-opt compared to simply retraining the model from scratch, yet obtaining highly similar models.

التعلم الآلي قواعد البيانات التعلم الالي

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد