Synthetic Event Time Series Health Data Generation

Published by Saloni Dash

Publication date: 2019

Research field: Informatics Engineering

Paper language: English

Authors: Saloni Dash, Ritik Dutta, Isabelle Guyon

Abstract

Synthetic medical data which preserves privacy while maintaining utility can be used as an alternative to real medical data, which has privacy costs and resource constraints associated with it. At present, most models focus on generating cross-sectional health data which is not necessarily representative of real data. In reality, medical data is longitudinal in nature, with a single patient having multiple health events, non-uniformly distributed throughout their lifetime. These events are influenced by patient covariates such as comorbidities, age group, gender etc. as well as external temporal effects (e.g. flu season). While there exist seminal methods to model time series data, it becomes increasingly challenging to extend these methods to medical event time series data. Due to the complexity of the real data, in which each patient visit is an event, we transform the data by using summary statistics to characterize the events for a fixed set of time intervals, to facilitate analysis and interpretability. We then train a generative adversarial network to generate synthetic data. We demonstrate this approach by generating human sleep patterns, from a publicly available dataset. We empirically evaluate the generated data and show close univariate resemblance between synthetic and real data. However, we also demonstrate how stratification by covariates is required to gain a deeper understanding of synthetic data quality.
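
The transformation step described in the abstract (summarizing irregular events over a fixed set of time intervals) can be sketched as follows. This is only an illustration, not the authors' code: the function name `summarize_events`, the `(patient_id, day_offset, value)` tuple layout, the 7-day interval, and the choice of count and mean as summary statistics are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize_events(events, interval_days=7, n_intervals=4):
    """Turn irregular events into a fixed-length table of summary statistics.

    events: iterable of (patient_id, day_offset, value) tuples (hypothetical layout).
    Returns {patient_id: [(count, mean_value), ...]} with one pair per interval.
    """
    buckets = defaultdict(list)
    for patient_id, day, value in events:
        idx = int(day // interval_days)  # index of the interval containing this event
        if 0 <= idx < n_intervals:       # ignore events outside the observation window
            buckets[(patient_id, idx)].append(value)

    table = {}
    for patient_id in sorted({p for p, _, _ in events}):
        row = []
        for idx in range(n_intervals):
            values = buckets.get((patient_id, idx), [])
            # Empty intervals get a count of 0 and a placeholder mean of 0.0.
            row.append((len(values), mean(values) if values else 0.0))
        table[patient_id] = row
    return table


# Toy example: per-night sleep durations (hours) recorded on irregular days.
events = [("a", 0, 6.0), ("a", 3, 8.0), ("a", 15, 7.0), ("b", 8, 5.0)]
summary = summarize_events(events)
```

Each patient ends up with a row of the same length regardless of how many events they had, which is the kind of fixed-size input a generative adversarial network can be trained on.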

Yang Chen, Dustin J. Kempton, Azim Ahmadzadeh, 2021

One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest, resulting in an extreme imbalance in the data. Many methods have been introduced in the literature for overcoming this issue: simple data manipulation through undersampling and oversampling, cost-sensitive learning algorithms, and generation of synthetic data points following the distribution of the existing data. While synthetic data generation has recently received a great deal of attention, there are real challenges involved in doing so for high-dimensional data such as multivariate time series. In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling in order to balance a large dataset of multivariate time series. We utilize a flare forecasting benchmark dataset, named SWAN-SF, and design two verification methods to both quantitatively and qualitatively evaluate the similarity between the generated minority and the ground-truth samples. We further assess the quality of the generated samples by training a classical, supervised machine learning algorithm on synthetic data, and testing the trained model on the unseen, real data. The results show that the classifier trained on the data augmented with the synthetic multivariate time series achieves a significant improvement compared with the case where no augmentation is used. The popular flare forecasting evaluation metrics, TSS and HSS, report 20-fold and 5-fold improvements, respectively, indicating the remarkable statistical similarities, and the usefulness of CGAN-based data generation for complicated tasks such as flare forecasting.
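
For readers unfamiliar with the two metrics named in this abstract, here is a minimal sketch of how the true skill statistic (TSS) and the Heidke skill score (HSS) are commonly computed from a binary confusion matrix. The abstract does not say which HSS variant was used, so the formula below (the form usually called HSS2 in the flare forecasting literature) is an assumption, as are the function names.

```python
def tss(tp, fn, fp, tn):
    """True skill statistic: recall minus false-alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def hss(tp, fn, fp, tn):
    """Heidke skill score (HSS2 form): improvement over a random forecast."""
    numerator = 2 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator


# Toy confusion matrix for an imbalanced problem (10 positives, 90 negatives).
print(round(tss(tp=8, fn=2, fp=10, tn=80), 4))
print(round(hss(tp=8, fn=2, fp=10, tn=80), 4))
```

Both scores equal 1 for a perfect classifier and 0 for one with no skill. TSS does not depend on the class-imbalance ratio, which is one reason it is favored for rare-event problems.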

Machine learning

Generation of Synthetic Multi-Resolution Time Series Load Data

Andrea Pinceti, Lalitha Sankar, Oliver Kosut, 2021

The availability of large datasets is crucial for the development of new power system applications and tools; unfortunately, very few are publicly and freely available. We designed an end-to-end generative framework for the creation of synthetic bus-level time-series load data for transmission networks. The model is trained on a real dataset of over 70 terabytes of synchrophasor measurements spanning multiple years. Leveraging a combination of principal component analysis and conditional generative adversarial network models, the scheme we developed allows for the generation of data at varying sampling rates (up to a maximum of 30 samples per second) and ranging in length from seconds to years. The generative models are tested extensively to verify that they correctly capture the diverse characteristics of real loads. Finally, we develop an open-source tool called LoadGAN which gives researchers access to the fully trained generative models via a graphical interface.

Systems and control

Rapidly Personalizing Mobile Health Treatment Policies with Limited Data

Sabina Tomkins, Peng Liao, Predrag Klasnja, 2020

In mobile health (mHealth), reinforcement learning algorithms that adapt to one's context without learning personalized policies might fail to distinguish between the needs of individuals. Yet the high amount of noise due to the in situ delivery of mHealth interventions can cripple the ability of an algorithm to learn when given access to only a single user's data, making personalization challenging. We present IntelligentPooling, which learns personalized policies via an adaptive, principled use of other users' data. We show that IntelligentPooling achieves an average of 26% lower regret than state-of-the-art across all generative models. Additionally, we inspect the behavior of this approach in a live clinical trial, demonstrating its ability to learn from even a small group of users.

Machine learning, Computers and society

Process Model Forecasting Using Time Series Analysis of Event Sequence Data

Johannes De Smedt, Anton Yeshchenko, Artem Polyvyanyy, 2021

Process analytics is an umbrella of data-driven techniques which includes making predictions for individual process instances or overall process models. At the instance level, various novel techniques have been recently devised, tackling next activity, remaining time, and outcome prediction. At the model level, there is a notable void. It is the ambition of this paper to fill this gap. To this end, we develop a technique to forecast the entire process model from historical event data. A forecasted model is a will-be process model representing a probable future state of the overall process. Such a forecast helps to investigate the consequences of drift and emerging bottlenecks. Our technique builds on a representation of event data as multiple time series, each capturing the evolution of a behavioural aspect of the process model, such that corresponding forecasting techniques can be applied. Our implementation demonstrates the accuracy of our technique on real-world event log data.
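
One way to picture "event data as multiple time series" is to count, per time window, how often one activity is directly followed by another within a case. The abstract does not specify which behavioural aspects are tracked, so the use of directly-follows counts here is an assumption, and the function name `directly_follows_series` and the log layout are hypothetical.

```python
def directly_follows_series(log, window, n_windows):
    """Build one count series per (activity, next_activity) pair.

    log: {case_id: [(timestamp, activity), ...]} (hypothetical layout).
    window: width of each time window, in the same unit as the timestamps.
    Returns {(a, b): [count_in_window_0, ..., count_in_window_n-1]}.
    """
    series = {}
    for trace in log.values():
        ordered = sorted(trace)  # order each case's events by timestamp
        for (_, a), (t_next, b) in zip(ordered, ordered[1:]):
            idx = int(t_next // window)  # window in which the follow-up event occurs
            if 0 <= idx < n_windows:
                series.setdefault((a, b), [0] * n_windows)[idx] += 1
    return series


# Toy event log with two cases and two windows of width 10.
log = {
    "c1": [(0, "A"), (1, "B"), (12, "C")],
    "c2": [(10, "A"), (11, "B")],
}
series = directly_follows_series(log, window=10, n_windows=2)
```

Each resulting list is an ordinary univariate time series, so any standard forecasting method could then be applied to it independently.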

Machine learning, Databases

Reproducibility in Machine Learning for Health

Matthew B.A. McDermott, 2019

Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision. This requirement warrants a stricter attention to issues of reproducibility than other fields of machine learning. In this work, we conduct a systematic evaluation of over 100 recently published ML4H research papers along several dimensions related to reproducibility. We find that the field of ML4H compares poorly to more established machine learning fields, particularly concerning data and code accessibility. Finally, drawing from success in other fields of science, we propose recommendations to data providers, academic publishers, and the ML4H research community in order to promote reproducible research moving forward.

Machine learning, Computers and society