Plan-based Job Scheduling for Supercomputers with Shared Burst Buffers

58 0 0.0 ( 0 )

Download Cite

Added by Jan Kopanski

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jan Kopanski - Krzysztof Rzadca

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept - an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Despite the development of real-world architectures as well as research concepts, resource and job management systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements, in particular ignoring burst buffers when backfilling. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for common algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. We evaluate the algorithms in a detailed simulation with I/O side effects. Our results indicate that the lack of burst buffer reservations in backfilling may significantly deteriorate scheduling. We also show that these algorithms can be easily extended to support burst buffers. Finally, we propose a burst-buffer-aware plan-based scheduling algorithm with simulated annealing optimisation, which improves the mean waiting time by over 20% and mean bounded slowdown by 27% compared to the burst-buffer-aware SJF-EASY-backfilling.

rate research

The Case for Task Sampling based Learning for Cluster Job Scheduling

329 - Akshay Jajoo , Y. Charlie Hu , Xiaojun Lin 2021

The ability to accurately estimate job runtime properties allows a scheduler to effectively schedule jobs. State-of-the-art online cluster job schedulers use history-based learning, which uses past job execution information to estimate the runtime properties of newly arrived jobs. However, with fast-paced development in cluster technology (in both hardware and software) and changing user inputs, job runtime properties can change over time, which lead to inaccurate predictions. In this paper, we explore the potential and limitation of real-time learning of job runtime properties, by proactively sampling and scheduling a small fraction of the tasks of each job. Such a task-sampling-based approach exploits the similarity among runtime properties of the tasks of the same job and is inherently immune to changing job behavior. Our study focuses on two key questions in comparing task-sampling-based learning (learning in space) and history-based learning (learning in time): (1) Can learning in space be more accurate than learning in time? (2) If so, can delaying scheduling the remaining tasks of a job till the completion of sampled tasks be more than compensated by the improved accuracy and result in improved job performance? Our analytical and experimental analysis of 3 production traces with different skew and job distribution shows that learning in space can be substantially more accurate. Our simulation and testbed evaluation on Azure of the two learning approaches anchored in a generic job scheduler using 3 production cluster job traces shows that despite its online overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x, and 1.32x compared to the prior-art history-based predictor.

Distributed Parallel and Cluster Computing

ROME: A Multi-Resource Job Scheduling Framework for Exascale HPC Systems

175 - Yuping Fan 2021

High-performance computing (HPC) is undergoing significant changes. Next generation HPC systems are equipped with diverse global and local resources, such as I/O burst buffer resources, memory resources (e.g., on-chip and off-chip RAM, external RAM/NVRA), network resources, and possibly other resources. Job schedulers play a crucial role in efficient use of resources. However, traditional job schedulers are single-objective and fail to efficient use of other resources. In this paper, we propose ROME, a novel multi-dimensional job scheduling framework to explore potential tradeoffs among multiple resources and provides balanced scheduling decision. Our design leverages genetic algorithm as the multi-dimensional optimization engine to generate fast scheduling decision and to support effective resource utilization.

Distributed Parallel and Cluster Computing

Bidding policies for market-based HPC workflow scheduling

101 - Andrew Burkimsher , Leandro Soares Indrusiak 2016

This paper considers the scheduling of jobs on distributed, heterogeneous High Performance Computing (HPC) clusters. Market-based approaches are known to be efficient for allocating limited resources to those that are most prepared to pay. This context is applicable to an HPC or cloud computing scenario where the platform is overloaded. In this paper, jobs are composed of dependent tasks. Each job has a non-increasing time-value curve associated with it. Jobs are submitted to and scheduled by a market-clearing centralised auctioneer. This paper compares the performance of several policies for generating task bids. The aim investigated here is to maximise the value for the platform provider while minimising the number of jobs that do not complete (or starve). It is found that the Projected Value Remaining bidding policy gives the highest level of value under a typical overload situation, and gives the lowest number of starved tasks across the space of utilisation examined. It does this by attempting to capture the urgency of tasks in the queue. At high levels of overload, some alternative algorithms produce slightly higher value, but at the cost of a hugely higher number of starved workflows.

Distributed Parallel and Cluster Computing

Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System

165 - Yiwen Han , Shihao Shen , Xiaofei Wang 2021

Kubernetes (k8s) has the potential to merge the distributed edge and the cloud but lacks a scheduling framework specifically for edge-cloud systems. Besides, the hierarchical distribution of heterogeneous resources and the complex dependencies among requests and resources make the modeling and scheduling of k8s-oriented edge-cloud systems particularly sophisticated. In this paper, we introduce KaiS, a learning-based scheduling framework for such edge-cloud systems to improve the long-term throughput rate of request processing. First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch and dynamic dispatch spaces within the edge cluster. Second, for diverse system scales and structures, we use graph neural networks to embed system state information, and combine the embedding results with multiple policy networks to reduce the orchestration dimensionality by stepwise scheduling. Finally, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration, and present the implementation design of deploying the above algorithms compatible with native k8s components. Experiments using real workload traces show that KaiS can successfully learn appropriate scheduling policies, irrespective of request arrival patterns and system scales. Moreover, KaiS can enhance the average system throughput rate by 14.3% while reducing scheduling cost by 34.7% compared to baselines.

Distributed Parallel and Cluster Computing Artificial Intelligence

A Case for Sampling Based Learning Techniques in Coflow Scheduling

101 - Akshay Jajoo , Y. Charlie Hu , Xiaojun Lin 2021

Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling. We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation using an Azure testbed, a publicly available production cluster trace from Facebook shows that compared to the prior art Aalo, Philae reduces the coflow completion time (CCT) in average (P90) cases by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philaes robustness to flow size skew.

Distributed Parallel and Cluster Computing