ﻻ يوجد ملخص باللغة العربية
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique for increasing HPC system utilization is to colocate multiple applications on the same server. When applications share critical resources, however, contention on shared resources may lead to reduced application performance. In this paper, we show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters, and then exploiting the model to determine an optimized mix of colocated applications. This paper presents a new intelligent resource manager and makes the following contributions: (1) a new machine learning model to predict the performance degradation of colocated applications based on hardware counters and (2) an intelligent scheduling scheme deployed on an existing resource manager to enable application co-scheduling with minimum performance degradation. Our results show that our approach achieves performance improvements of 7% (avg) and 12% (max) compared to the standard policy commonly used by existing job managers.
Energy consumption is increasingly becoming a limiting factor to the design of faster large-scale parallel systems, and development of energy-efficient and energy-aware applications is today a relevant issue for HPC code-developer communities. In thi
Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data m
The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations utilize expensive resources effectively,
With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows, heterogen
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial ben