On the need for synthetic data and robust data simulators in the 2020s

Published by: Molly Peeples

Publication date: 2019

Research field: Physics

Paper language: English

Author: Molly S. Peeples

Instrumentation and Methods for Astrophysics

Abstract

As observational datasets become larger and more complex, so too are the questions being asked of these data. Data simulations, i.e., synthetic data with properties (pixelization, noise, PSF, artifacts, etc.) akin to real data, are therefore increasingly required for several purposes, including: (1) testing complicated measurement methods, (2) comparing models and astrophysical simulations to observations in a manner that requires as few assumptions about the data as possible, (3) predicting observational results based on models and astrophysical simulations for, e.g., proposal planning, and (4) mitigating risk for future observatories and missions by effectively priming and testing pipelines. We advocate for an increase in using synthetic data to plan for and interpret real observations as a matter of routine. This will require funding for (1) facilities to provide robust data simulators for their instruments, telescopes, and surveys, and (2) making synthetic data publicly available in archives (much like real data) so as to lower the barrier of entry to all.
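As a rough illustration of what "synthetic data with properties (pixelization, noise, PSF, ...) akin to real data" can mean in practice, the sketch below builds a toy image of a single point source. It is only a minimal example under stated assumptions, not a facility-grade simulator: the Gaussian PSF, the flat sky level, the Gaussian approximation to shot noise, and all function and parameter names (`model_image`, `add_noise`, `fwhm`, `sky`, `read_noise`) are hypothetical choices made for this illustration.

```python
import math
import random


def model_image(size=21, flux=1000.0, fwhm=3.0, sky=10.0):
    """Noiseless expected counts for one point source centred in the image.

    Assumes a circular Gaussian PSF and a flat sky background (both are
    simplifications chosen for this sketch).
    """
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    center = (size - 1) / 2.0

    def pixel_fraction(i):
        # Pixelization: integrate the 1-D Gaussian over pixel i exactly.
        lo = (i - 0.5 - center) / (sigma * math.sqrt(2.0))
        hi = (i + 0.5 - center) / (sigma * math.sqrt(2.0))
        return 0.5 * (math.erf(hi) - math.erf(lo))

    frac = [pixel_fraction(i) for i in range(size)]
    # The 2-D Gaussian is separable, so each pixel is a product of fractions.
    return [[flux * frac[x] * frac[y] + sky for x in range(size)]
            for y in range(size)]


def add_noise(model, read_noise=2.0, seed=0):
    """Add shot noise (Gaussian approximation) and read noise to a model."""
    rng = random.Random(seed)  # seeded so the synthetic image is reproducible
    noisy = []
    for row in model:
        noisy.append([rng.gauss(expected,
                                math.sqrt(expected + read_noise ** 2))
                      for expected in row])
    return noisy


if __name__ == "__main__":
    image = add_noise(model_image(), seed=42)
    print(len(image), "x", len(image[0]), "synthetic image generated")
```

Keeping the noiseless model separate from the noise step means the same "truth" can be rendered many times with different seeds, which is the property that makes such images useful for testing measurement methods against a known input.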

Hari Prasanna Das, Ryan Tran, Japjot Singh (2021)

Background: At the onset of a pandemic, such as COVID-19, data with proper labeling/attributes corresponding to the new disease might be unavailable or sparse. Machine Learning (ML) models trained with the available data, which is limited in quantity and poor in diversity, will often be biased and inaccurate. At the same time, ML algorithms designed to fight pandemics must have good performance and be developed in a time-sensitive manner. To tackle the challenges of limited data and label scarcity in the available data, we propose generating conditional synthetic data, to be used alongside real data for developing robust ML models. Methods: We present a hybrid model consisting of a conditional generative flow and a classifier for conditional synthetic data generation. The classifier decouples the feature representation for the condition, which is fed to the flow to extract the local noise. We generate synthetic data by manipulating the local noise with fixed conditional feature representation. We also propose a semi-supervised approach to generate synthetic samples in the absence of labels for a majority of the available data. Results: We performed conditional synthetic generation for chest computed tomography (CT) scans corresponding to normal, COVID-19, and pneumonia afflicted patients. We show that our method significantly outperforms existing models both on qualitative and quantitative performance, and our semi-supervised approach can efficiently synthesize conditional samples under label scarcity. As an example of downstream use of synthetic data, we show improvement in COVID-19 detection from CT scans with conditional synthetic data augmentation.

Machine Learning

Historical astronomical data: urgent need for preservation, digitization enabling scientific exploration

Alexei Pevtsov, Elizabeth Griffin, Jonathan Grindlay (2019)

Over the past decades and even centuries, the astronomical community has accumulated a significant heritage of recorded observations of a great many astronomical objects. Those records contain irreplaceable information about long-term evolutionary and non-evolutionary changes in our Universe, and their preservation and digitization is vital. Unfortunately, most of those data risk becoming degraded and thence totally lost. We hereby call upon the astronomical community and US funding agencies to recognize the gravity of the situation, and to commit to an international preservation and digitization effort through comprehensive long-term planning supported by adequate resources, prioritizing where the expected scientific gains, vulnerability of the originals and availability of relevant infrastructure so dictate. The importance and urgency of this issue have been recognized recently by General Assembly XXX of the International Astronomical Union (IAU) in its Resolution B3: on preservation, digitization and scientific exploration of historical astronomical data. We outline the rationale of this promotion, provide examples of new science through successful recovery efforts, and review the potential losses to science if nothing is done.

Instrumentation and Methods for Astrophysics; Solar and Stellar Astrophysics

On the Applicability of Synthetic Data for Face Recognition

Haoyu Zhang, Marcel Grimmer, Raghavendra Ramachandra (2021)

Face verification has come into increasing focus in various applications including the European Entry/Exit System, which integrates face recognition mechanisms. At the same time, the rapid advancement of biometric authentication requires extensive performance tests in order to inhibit the discriminatory treatment of travellers due to their demographic background. However, the use of face images collected as part of border controls is restricted by the European General Data Protection Law to be processed for no other reason than its original purpose. Therefore, this paper investigates the suitability of synthetic face images generated with StyleGAN and StyleGAN2 to compensate for the urgent lack of publicly available large-scale test data. Specifically, two deep learning-based (SER-FIQ, FaceQnet v1) and one standard-based (ISO/IEC TR 29794-5) face image quality assessment algorithms are utilized to compare the applicability of synthetic face images with that of real face images extracted from the FRGC dataset. Finally, based on the analysis of impostor score distributions and utility score distributions, our experiments reveal negligible differences between StyleGAN and StyleGAN2, and also only minor discrepancies compared to real face images.

Computer Vision and Pattern Recognition; Cryptography and Security

How much real data do we actually need: Analyzing object detection performance using synthetic and real data

Farzan Erlik Nowruzi, Prince Kapoor, Dhanvin Kolhatkar (2019)

In recent years, deep learning models have resulted in a huge amount of progress in various areas, including computer vision. By nature, the supervised training of deep models requires a large amount of data to be available. This ideal case is usually not tractable as the data annotation is a tremendously exhausting and costly task to perform. An alternative is to use synthetic data. In this paper, we take a comprehensive look into the effects of replacing real data with synthetic data. We further analyze the effects of having a limited amount of real data. We use multiple synthetic and real datasets along with a simulation tool to create large amounts of cheaply annotated synthetic data. We analyze the domain similarity of each of these datasets. We provide insights about designing a methodological procedure for training deep networks using these datasets.

Computer Vision and Pattern Recognition

The LSST DESC Data Challenge 1: Generation and Analysis of Synthetic Images for Next Generation Surveys

F. Javier Sanchez, Chris W. Walter, Humna Awan (2020)

Data Challenge 1 (DC1) is the first synthetic dataset produced by the Rubin Observatory Legacy Survey of Space and Time (LSST) Dark Energy Science Collaboration (DESC). DC1 is designed to develop and validate data reduction and analysis and to study the impact of systematic effects that will affect the LSST dataset. DC1 is comprised of $r$-band observations of 40 deg$^{2}$ to 10-year LSST depth. We present each stage of the simulation and analysis process: a) generation, by synthesizing sources from cosmological N-body simulations in individual sensor-visit images with different observing conditions; b) reduction using a development version of the LSST Science Pipelines; and c) matching to the input cosmological catalog for validation and testing. We verify that testable LSST requirements pass within the fidelity of DC1. We establish a selection procedure that produces a sufficiently clean extragalactic sample for clustering analyses and we discuss residual sample contamination, including contributions from inefficiency in star-galaxy separation and imperfect deblending. We compute the galaxy power spectrum on the simulated field and conclude that: i) survey properties have an impact of 50% of the statistical uncertainty for the scales and models used in DC1; ii) a selection to eliminate artifacts in the catalogs is necessary to avoid biases in the measured clustering; iii) the presence of bright objects has a significant impact (2- to 6-$\sigma$) in the estimated power spectra at small scales ($\ell > 1200$), highlighting the impact of blending in studies at small angular scales in LSST;

Instrumentation and Methods for Astrophysics
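The validation stage mentioned in the DC1 abstract above (matching reduced detections back to the input catalog) can be illustrated with a toy cross-match. This is a sketch under simplifying assumptions, not the procedure DC1 used: positions are treated as flat pixel coordinates, the search is brute force, both lists are assumed non-empty, and the names `match_to_truth`, `completeness_and_purity`, and `max_sep` are hypothetical.

```python
import math


def match_to_truth(detected, truth, max_sep):
    """Match each detected (x, y) to its nearest truth (x, y) within max_sep.

    Returns one entry per detection: the index of the matched truth source,
    or None when no truth source lies within max_sep.
    """
    matches = []
    for dx, dy in detected:
        best_index, best_sep = None, max_sep
        for i, (tx, ty) in enumerate(truth):
            sep = math.hypot(dx - tx, dy - ty)
            if sep <= best_sep:
                best_index, best_sep = i, sep
        matches.append(best_index)
    return matches


def completeness_and_purity(matches, n_truth):
    """Fraction of truth sources recovered, and fraction of real detections."""
    recovered = {m for m in matches if m is not None}
    completeness = len(recovered) / n_truth
    purity = sum(m is not None for m in matches) / len(matches)
    return completeness, purity


if __name__ == "__main__":
    truth = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
    detected = [(10.2, 9.9), (20.5, 20.5), (50.0, 50.0)]
    matches = match_to_truth(detected, truth, max_sep=1.0)
    print(matches, completeness_and_purity(matches, len(truth)))
```

Because the input catalog of a synthetic dataset is known exactly, quantities such as completeness and purity can be measured directly, which is not possible with real observations alone.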
