Building a scalable python distribution for HEP data analysis

113 0 0.0 ( 0 )

Download Cite

Added by David Lange

Publication date 2018

fields Physics

and research's language is English

Authors David Lange

Computational Physics

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

There are numerous approaches to building analysis applications across the high-energy physics community. Among them are Python-based, or at least Python-driven, analysis workflows. We aim to ease the adoption of a Python-based analysis toolkit by making it easier for non-expert users to gain access to Python tools for scientific analysis. Experimental software distributions and individual user analysis have quite different requirements. Distributions tend to worry most about stability, usability and reproducibility, while the users usually strive to be fast and nimble. We discuss how we built and now maintain a python distribution for analysis while satisfying requirements both a large software distribution (in our case, that of CMSSW) and user, or laptop, level analysis. We pursued the integration of tools used by the broader data science community as well as HEP developed (e.g., histogrammar, root_numpy) Python packages. We discuss concepts we investigated for package integration and testing, as well as issues we encountered through this process. Distribution and platform support are important topics. We discuss our approach and progress towards a sustainable infrastructure for supporting this Python stack for the CMS user community and for the broader HEP user community.

rate research

PyCraters: A Python framework for crater function analysis

387 - Scott A. Norris 2014

We introduce a Python framework designed to automate the most common tasks associated with the extraction and upscaling of the statistics of single-impact crater functions to inform coefficients of continuum equations describing surface morphology evolution. Designed with ease-of-use in mind, the framework allows users to extract meaningful statistical estimates with very short Python programs. Wrappers to interface with specific simulation packages, routines for statistical extraction of output, and fitting and differentiation libraries are all hidden behind simple, high-level user-facing functions. In addition, the framework is extensible, allowing advanced users to specify the collection of specialized statistics or the creation of customized plots. The framework is hosted on the BitBucket service under an open-source license, with the aim of helping non-specialists easily extract preliminary estimates of relevant crater function results associated with a particular experimental system.

Computational Physics Materials Science Computational Engineering

HEP Software Foundation Community White Paper Working Group - Data Analysis and Interpretation

434 - Lothar Bauerdick , Riccardo Maria Bianchi , Brian Bockelman 2018

At the heart of experimental high energy physics (HEP) is the development of facilities and instrumentation that provide sensitivity to new phenomena. Our understanding of nature at its most fundamental level is advanced through the analysis and interpretation of data from sophisticated detectors in HEP experiments. The goal of data analysis systems is to realize the maximum possible scientific potential of the data within the constraints of computing and human resources in the least time. To achieve this goal, future analysis systems should empower physicists to access the data with a high level of interactivity, reproducibility and throughput capability. As part of the HEP Software Foundation Community White Paper process, a working group on Data Analysis and Interpretation was formed to assess the challenges and opportunities in HEP data analysis and develop a roadmap for activities in this area over the next decade. In this report, the key findings and recommendations of the Data Analysis and Interpretation Working Group are presented.

Computational Physics High Energy Physics - Experiment

ProMC: Input-output data format for HEP applications using varint encoding

366 - S.V. Chekanov , E. May , K. Strand 2013

A new data format for Monte Carlo (MC) events, or any structural data, including experimental data, is discussed. The format is designed to store data in a compact binary form using variable-size integer encoding as implemented in the Googles Protocol Buffers package. This approach is implemented in the ProMC library which produces smaller file sizes for MC records compared to the existing input-output libraries used in high-energy physics (HEP). Other important features of the proposed format are a separation of abstract data layouts from concrete programming implementations, self-description and random access. Data stored in ProMC files can be written, read and manipulated in a number of programming languages, such C++, JAVA, FORTRAN and PYTHON.

Computational Physics Other Computer Science High Energy Physics - Experiment

Using Big Data Technologies for HEP Analysis

88 - Matteo Cremonesi , Claudio Bellini , Bianny Bian 2019

The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results timely and efficiently. Recently, new technologies and new approaches have been developed in industry to answer to the necessity to retrieve information as quickly as possible to analyze PB and EB datasets. Providing the scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we are presenting the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the application of the new tools both quantitatively, by measuring the performances, and qualitatively, focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data in 5 hours, collected by the CMS experiment, to 1 TB of data in a format suitable for physics analysis. The second goal is achieved by implementing multiple physics use-cases in Apache Spark using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to the ones of a traditional ROOT-based workflow.

Distributed Parallel and Cluster Computing

HEP Software Foundation Community White Paper Working Group -- Conditions Data

144 - Paul Laycock , Marko Bracko , Marco Clemencic 2019

To produce the best physics results, high energy physics experiments require access to calibration and other non-event data during event data processing. These conditions data are typically stored in databases that provide versioning functionality, allowing physicists to make improvements while simultaneously guaranteeing the reproducibility of their results. With the increased complexity of modern experiments, and the evolution of computing models that demand large scale access to conditions data, the solutions for managing this access have evolved over time. In this white paper we give an overview of the conditions data access problem, present convergence on a common solution and present some considerations for the future.

Computational Physics High Energy Physics - Experiment