Optimizing ROOT IO For Analysis

54 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zhe Zhang

تاريخ النشر 2017

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Brian Bockelman - Zhe Zhang - Jim Pivarski

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The ROOT I/O (RIO) subsystem is foundational to most HEP experiments - it provides a file format, a set of APIs/semantics, and a reference implementation in C++. It is often found at the base of an experiments framework and is used to serialize the experiments data; in the case of an LHC experiment, this may be hundreds of petabytes of files! Individual physicists will further use RIO to perform their end-stage analysis, reading from intermediate files they generate from experiment data. RIO is thus incredibly flexible: it must serve as a file format for archival (optimized for space) and for working data (optimized for read speed). To date, most of the technical work has focused on improving the former use case. We present work designed to help improve RIO for analysis. We analyze the real-world impact of LZ4 to decrease decompression times (and the corresponding cost in disk space). We introduce new APIs that read RIO data in bulk, removing the per-event overhead of a C++ function call. We compare the performance with the existing RIO APIs for simple structure data and show how this can be complimentary with efforts to improve the parallelism of the RIO stack.

قيم البحث

71 - Joseph Huber , Weile Wei , Giorgis Georgakoudis 2021

This paper presents a methodology for using LLVM-based tools to tune the DCA++ (dynamical clusterapproximation) application that targets the new ARM A64FX processor. The goal is to describethe changes required for the new architecture and generate ef ficient single instruction/multiple data(SIMD) instructions that target the new Scalable Vector Extension instruction set. During manualtuning, the authors used the LLVM tools to improve code parallelization by using OpenMP SIMD,refactored the code and applied transformation that enabled SIMD optimizations, and ensured thatthe correct libraries were used to achieve optimal performance. By applying these code changes, codespeed was increased by 1.98X and 78 GFlops were achieved on the A64FX processor. The authorsaim to automatize parts of the efforts in the OpenMP Advisor tool, which is built on top of existingand newly introduced LLVM tooling.

النظم الموزعة والتوازية والحوسبة العنقودية علم المواد هندسة العتاد

Big Data Staging with MPI-IO for Interactive X-ray Science

94 - Justin M. Wozniak , Hemant Sharma , Timothy G. Armstrong andn Michael Wilde 2020

New techniques in X-ray scattering science experiments produce large data sets that can require millions of high-performance processing hours per week of computation for analysis. In such applications, data is typically moved from X-ray detectors to a large parallel file system shared by all nodes of a petascale supercomputer and then is read repeatedly as different science application tasks proceed. However, this straightforward implementation causes significant contention in the file system. We propose an alternative approach in which data is instead staged into and cached in compute node memory for extended periods, during which time various processing tasks may efficiently access it. We describe here such a big data staging framework, based on MPI-IO and the Swift parallel scripting language. We discuss a range of large-scale data management issues involved in X-ray scattering science and measure the performance benefits of the new staging framework for high-energy diffraction microscopy, an important emerging application in data-intensive X-ray scattering. We show that our framework accelerates scientific processing turnaround from three months to under 10 minutes, and that our I/O technique reduces input overheads by a factor of 5 on 8K Blue Gene/Q nodes.

النظم الموزعة والتوازية والحوسبة العنقودية

Tuna: A Static Analysis Approach to Optimizing Deep Neural Networks

122 - Yao Wang , Xingyu Zhou , Yanming Wang 2021

We introduce Tuna, a static analysis approach to optimizing deep neural network programs. The optimization of tensor operations such as convolutions and matrix multiplications is the key to improving the performance of deep neural networks. Many deep learning model optimization mechanisms today use dynamic analysis, which relies on experimental execution on a target device to build a data-driven cost model of the program. The reliance on dynamic profiling not only requires access to target hardware at compilation time but also incurs significant cost in machine resources. We introduce an approach that profiles the program by constructing features based on the target hardware characteristics in order. We use static analysis of the relative performance of tensor operations to optimize the deep learning program. Experiments show that our approach can achieve up to 11x performance compared to dynamic profiling based methods with the same compilation time.

النظم الموزعة والتوازية والحوسبة العنقودية

Design and Evaluation of a Collective IO Model for Loosely Coupled Petascale Programming

154 - Zhao Zhang , Allan Espinosa , Kamil Iskra 2008

Loosely coupled programming is a powerful paradigm for rapidly creating higher-level applications from scientific programs on petascale systems, typically using scripting languages. This paradigm is a form of many-task computing (MTC) which focuses o n the passing of data between programs as ordinary files rather than messages. While it has the significant benefits of decoupling producer and consumer and allowing existing application programs to be executed in parallel with no recoding, its typical implementation using shared file systems places a high performance burden on the overall system and on the user who will analyze and consume the downstream data. Previous efforts have achieved great speedups with loosely coupled programs, but have done so with careful manual tuning of all shared file system access. In this work, we evaluate a prototype collective IO model for file-based MTC. The model enables efficient and easy distribution of input data files to computing nodes and gathering of output results from them. It eliminates the need for such manual tuning and makes the programming of large-scale clusters using a loosely coupled model easier. Our approach, inspired by in-memory approaches to collective operations for parallel programming, builds on fast local file systems to provide high-speed local file caches for parallel scripts, uses a broadcast approach to handle distribution of common input data, and uses efficient scatter/gather and caching techniques for input and output. We describe the design of the prototype model, its implementation on the Blue Gene/P supercomputer, and present preliminary measurements of its performance on synthetic benchmarks and on a large-scale molecular dynamics application.

النظم الموزعة والتوازية والحوسبة العنقودية

Optimizing Frameworks Performance Using C++ Modules Aware ROOT

98 - Yuka Takahashi 2018

ROOT is a data analysis framework broadly used in and outside of High Energy Physics (HEP). Since HEP software frameworks always strive for performance improvements, ROOT was extended with experimental support of runtime C++ Modules. C++ Modules are designed to improve the performance of C++ code parsing. C++ Modules offers a promising way to improve ROOTs runtime performance by saving the C++ header parsing time which happens during ROOT runtime. This paper presents the results and challenges of integrating C++ Modules into ROOT.

لغات البرمجة

سجل دخول لتتمكن من نشر تعليقات