Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis

77 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Johannes Blaschke P

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية فيزياء

والبحث باللغة English

تأليف Johannes P. Blaschke

النظم الموزعة والتوازية والحوسبة العنقودية تحليل البيانات والإحصاءات والاحتمال

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSCs Cori XC40 system, using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using the software package CCTBX. We achieved real time data analysis with a turnaround time from data acquisition to full molecular reconstruction in as little as 10 min -- sufficient time for the experiments operators to make informed decisions. By hosting the data analysis on Cori, and by automating LCLS-NERSC interoperability, we achieved a data analysis rate which matches the data acquisition rate. Completing data analysis with 10 mins is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends.

قيم البحث

228 - Ralph Kube , R. Michael Churchill , CS Chang 2021

While experiments on fusion plasmas produce high-dimensional data time series with ever increasing magnitude and velocity, data analysis has been lagging behind this development. For example, many data analysis tasks are often performed in a manual, ad-hoc manner some time after an experiment. In this article we introduce the DELTA framework that facilitates near real-time streaming analysis of big and fast fusion data. By streaming measurement data from fusion experiments to a high-performance compute center, DELTA allows to perform demanding data analysis tasks in between plasma pulses. This article describe the modular and expandable software architecture of DELTA and presents performance benchmarks of its individual components as well as of entire workflows. Our focus is on the streaming analysis of ECEi data measured at KSTAR on NERSCs supercomputers and we routinely achieve data transfer rates of about 500 Megabyte per second. We show that a demanding turbulence analysis workload can be distributed among multiple GPUs and executes in under 5 minutes. We further discuss how DELTA uses modern database systems and container orchestration services to provide web-based real-time data visualization. For the case of ECEi data we demonstrate how data visualizations can be augmented with outputs from machine learning models. By providing session leaders and physics operators results of higher order data analysis using live visualization they may monitor the evolution of a long-pulse discharge in near real-time and may make more informed decision on how to configure the machine for the next shot.

فيزياء البلازما تحليل البيانات والإحصاءات والاحتمال

Preparing for Performance Analysis at Exascale

109 - Jonathon Anderson , Yumeng Liu , John Mellor-Crummey 2021

Performance tools for forthcoming heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measurement of extreme-scale executions generates large volumes of performance data. Second, perfor mance metrics for heterogeneous applications are significantly sparse across code regions. To address these challenges, we developed a novel streaming aggregation approach to post-mortem analysis that employs both shared and distributed memory parallelism to aggregate sparse performance measurements from every rank, thread and GPU stream of a large-scale application execution. Analysis results are stored in a pair of sparse formats designed for efficient access to related data elements, supporting responsive interactive presentation and scalable data analytics. Empirical analysis shows that our implementation of this approach in HPCToolkit effectively processes measurement data from thousands of threads using a fraction of the compute resources employed by the application itself. Our approach is able to perform analysis up to 9.4 times faster and store analysis results 23 times smaller than HPCToolkit, providing a key building block for scalable exascale performance tools.

النظم الموزعة والتوازية والحوسبة العنقودية الأداء

Real-time Data Infrastructure at Uber

180 - Yupeng Fu , Chinmay Soman 2021

Ubers business is highly real-time in nature. PBs of data is continuously being collected from the end users such as Uber drivers, riders, restaurants, eaters and so on everyday. There is a lot of valuable information to be processed and many decisio ns must be made in seconds for a variety of use cases such as customer incentives, fraud detection, machine learning model prediction. In addition, there is an increasing need to expose this ability to different user categories, including engineers, data scientists, executives and operations personnel which adds to the complexity. In this paper, we present the overall architecture of the real-time data infrastructure and identify three scaling challenges that we need to continuously address for each component in the architecture. At Uber, we heavily rely on open source technologies for the key areas of the infrastructure. On top of those open-source software, we add significant improvements and customizations to make the open-source solutions fit in Ubers environment and bridge the gaps to meet Ubers unique scale and requirements. We then highlight several important use cases and show their real-time solutions and tradeoffs. Finally, we reflect on the lessons we learned as we built, operated and scaled these systems.

النظم الموزعة والتوازية والحوسبة العنقودية قواعد البيانات

The Nested_fit data analysis program

136 - Martino Trassinelli 2019

We present here Nested_fit, a Bayesian data analysis code developed for investigations of atomic spectra and other physical data. It is based on the nested sampling algorithm with the implementation of an upgraded lawn mower robot method for finding new live points. For a given data set and a chosen model, the program provides the Bayesian evidence, for the comparison of different hypotheses/models, and the different parameter probability distributions. A large database of spectral profiles is already available (Gaussian, Lorentz, Voigt, Log-normal, etc.) and additional ones can easily added. It is written in Fortran, for an optimized parallel computation, and it is accompanied by a Python library for the results visualization.

الفيزياء الذرية تحليل البيانات والإحصاءات والاحتمال

Grain - A Java Analysis Framework for Total Data Readout

241 - P. Rahkila 2007

Grain is a data analysis framework developed to be used with the novel Total Data Readout data acquisition system. In Total Data Readout all the electronics channels are read out asynchronously in singles mode and each data item is timestamped. Event building and analysis has to be done entirely in the software post-processing the data stream. A flexible and efficient event parser and the accompanying software framework have been written entirely in Java. The design and implementation of the software are discussed along with experiences gained in running real-life experiments.

التجربة النووية تحليل البيانات والإحصاءات والاحتمال

سجل دخول لتتمكن من نشر تعليقات