Online Anomaly Detection in HPC Systems

116 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Andrea Borghesi

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Andrea Borghesi - Antonio Libri - Luca Benini

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrator and final users have to discover it manually. Clearly this approach does not scale to large scale supercomputers and facilities: automated methods to detect faults and unhealthy conditions is needed. Our method uses a type of neural network called autoncoder trained to learn the normal behavior of a real, in-production HPC system and it is deployed on the edge of each computing node. We obtain a very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the computing units performance.

قيم البحث

123 - Yuping Fan , Paul Rich , William Allcock 2021

Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. The ever-growing workload demands and rapidly developing HPC infrastructure trigger the interest of converging these applications on a s ingle HPC system. Although allocating the hybrid workloads within one system could potentially improve system efficiency, it is difficult to balance the tradeoff between the responsiveness of on-demand requests, the incentive for malleable jobs, and the performance of rigid applications. In this study, we present several scheduling mechanisms to address the issues involved in co-scheduling on-demand, rigid, and malleable jobs on a single HPC system. We extensively evaluate and compare their performance under various configurations and workloads. Our experimental results show that our proposed mechanisms are capable of serving on-demand workloads with minimal delay, offering incentives for declaring malleability, and improving system performance.

النظم الموزعة والتوازية والحوسبة العنقودية

ROME: A Multi-Resource Job Scheduling Framework for Exascale HPC Systems

175 - Yuping Fan 2021

High-performance computing (HPC) is undergoing significant changes. Next generation HPC systems are equipped with diverse global and local resources, such as I/O burst buffer resources, memory resources (e.g., on-chip and off-chip RAM, external RAM/N VRA), network resources, and possibly other resources. Job schedulers play a crucial role in efficient use of resources. However, traditional job schedulers are single-objective and fail to efficient use of other resources. In this paper, we propose ROME, a novel multi-dimensional job scheduling framework to explore potential tradeoffs among multiple resources and provides balanced scheduling decision. Our design leverages genetic algorithm as the multi-dimensional optimization engine to generate fast scheduling decision and to support effective resource utilization.

النظم الموزعة والتوازية والحوسبة العنقودية

Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems

264 - Jonathan Vincent , Jing Gong , Martin Karp 2021

We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed t urbulent flow in a straight pipe, at two different Reynolds numbers $Re_tau=360$ and $Re_tau=550$, based on friction velocity and pipe radius. The strong scaling is tested on several GPU-enabled HPC systems, including the Swiss Piz Daint system, TACCs Longhorn, Julichs JUWELS Booster, and Berzelius in Sweden. The performance results show that speed-up between 3-5 can be achieved using the GPU accelerated version compared with the CPU version on these different systems. The run-time for 20 timesteps reduces from 43.5 to 13.2 seconds with increasing the number of GPUs from 64 to 512 for $Re_tau=550$ case on JUWELS Booster system. This illustrates the GPU accelerated version the potential for high throughput. At the same time, the strong scaling limit is significantly larger for GPUs, at about $2000-5000$ elements per rank; compared to about $50-100$ for a CPU-rank.

النظم الموزعة والتوازية والحوسبة العنقودية

Online Memory Leak Detection in the Cloud-based Infrastructures

96 - Anshul Jindal , Paul Staab , Jorge Cardoso 2021

A memory leak in an application deployed on the cloud can affect the availability and reliability of the application. Therefore, to identify and ultimately resolve it quickly is highly important. However, in the production environment running on the cloud, memory leak detection is a challenge without the knowledge of the application or its internal object allocation details. This paper addresses this challenge of online detection of memory leaks in cloud-based infrastructure without having any internal application knowledge by introducing a novel machine learning based algorithm Precog. This algorithm solely uses one metric i.e the systems memory utilization on which the application is deployed for the detection of a memory leak. The developed algorithms accuracy was tested on 60 virtual machines manually labeled memory utilization data provided by our industry partner Huawei Munich Research Center and it was found that the proposed algorithm achieves the accuracy score of 85% with less than half a second prediction time per virtual machine.

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي

Container solutions for HPC Systems: A Case Study of Using Shifter on Blue Waters

145 - Maxim Belkin , Roland Haas , Galen Wesley Arnold 2018

Software container solutions have revolutionized application development approaches by enabling lightweight platform abstractions within the so-called containers. Several solutions are being actively developed in attempts to bring the benefits of con tainers to high-performance computing systems with their stringent security demands on the one hand and fundamental resource sharing requirements on the other. In this paper, we discuss the benefits and short-comings of such solutions when deployed on real HPC systems and applied to production scientific applications.We highlight use cases that are either enabled by or significantly benefit from such solutions. We discuss the efforts by HPC system administrators and support staff to support users of these type of workloads on HPC systems not initially designed with these workloads in mind focusing on NCSAs Blue Waters system.

النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات