
GB-PANDAS: Throughput and heavy-traffic optimality analysis for affinity scheduling

 Added by Ali Yekkehkhany
 Publication date 2017
Language: English





Dynamic affinity scheduling has been an open problem for nearly three decades. The problem is to dynamically schedule multi-type tasks to multi-skilled servers such that the resulting queueing system is both stable throughout the capacity region (throughput optimality) and the mean delay of tasks is minimized at high loads near the boundary of the capacity region (heavy-traffic optimality). Data-intensive analytics frameworks such as MapReduce, Hadoop, and Dryad fit into this setting: the set of servers is heterogeneous for different task types, so the pair of task type and server determines the processing rate of the task. The load balancing algorithm used in such frameworks is an instance of affinity scheduling and should be both robust and delay optimal at high loads when hot spots occur. Fluid model planning, the MaxWeight algorithm, and the generalized $c\mu$-rule are among the first algorithms proposed for affinity scheduling with theoretical guarantees of optimality in different senses, which are discussed in the related work section. None of these algorithms is practical for data center applications because of their unrealistic assumptions. The join-the-shortest-queue-MaxWeight (JSQ-MaxWeight), JSQ-Priority, and weighted-workload algorithms are examples of load balancing policies for systems with two and three levels of data locality with a rack structure. In this work, we propose the Generalized-Balanced-Pandas algorithm (GB-PANDAS) for a system with multiple levels of data locality and prove its throughput optimality. We prove this result under an arbitrary distribution for service times, whereas most previous theoretical work assumes a geometric distribution for service times. Extensive simulation results show that GB-PANDAS reduces the mean delay and outperforms the JSQ-MaxWeight algorithm by a factor of two at high loads.
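
As a rough illustration of the weighted-workload routing idea underlying GB-PANDAS, the sketch below routes each arriving task to the server whose current backlog, plus the task's expected service time on that server, is smallest. This is a minimal sketch, not the paper's exact rule: the scoring formula, the three-level locality rates, and all names are illustrative assumptions.

```python
def route_task(task_type, workloads, service_rates):
    """Return the index of the server with the smallest weighted workload.

    workloads[m]        -- current backlog (in units of work time) at server m
    service_rates[t][m] -- service rate of task type t on server m (0 if unsupported)
    """
    best_server, best_score = None, float("inf")
    for m, backlog in enumerate(workloads):
        mu = service_rates[task_type][m]
        if mu <= 0:                      # server m cannot process this task type
            continue
        score = backlog + 1.0 / mu       # backlog plus the new task's expected service time
        if score < best_score:
            best_server, best_score = m, score
    return best_server


# Toy usage: two task types, three servers with locality-dependent rates.
service_rates = [
    [1.0, 0.5, 0.25],   # type 0: local, rack-local, remote
    [0.25, 0.5, 1.0],   # type 1: reversed locality
]
workloads = [0.0, 0.0, 0.0]
m = route_task(0, workloads, service_rates)
workloads[m] += 1.0 / service_rates[0][m]   # add the routed task's expected work
print("type-0 task routed to server", m)
```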



Related research


Dynamic affinity load balancing of multi-type tasks on multi-skilled servers, when the service rate of each task type on each of the servers is known and can possibly differ across servers, has been an open problem for over three decades. The goal is to assign tasks to servers in real time so that the system is stable, meaning the queue lengths do not diverge to infinity in steady state (throughput optimality), and the mean task completion time is minimized (delay optimality). The fluid model planning, MaxWeight, and $c\mu$-rule algorithms have theoretical guarantees on optimality in some respects for the affinity problem, but they consider a complicated queueing structure and require the task arrival rates, the service rates of tasks on servers, or both. In many cases discussed in the introduction section, both the task arrival rates and the service rates of different task types on different servers are unknown. In this work, we propose the Blind GB-PANDAS algorithm, which is completely blind to task arrival rates and service rates. Blind GB-PANDAS uses an exploration-exploitation approach for load balancing. We prove that Blind GB-PANDAS is throughput optimal under arbitrary and unknown distributions for the service times of different task types on different servers and unknown task arrival rates. Blind GB-PANDAS aims to route an incoming task to the server with the minimum weighted workload, but since the service rates are unknown, such routing is not guaranteed, which makes the throughput optimality analysis more complicated than in the case where service rates are known. Our extensive experimental results reveal that Blind GB-PANDAS significantly outperforms existing methods in terms of mean task completion time at high loads.
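
The exploration-exploitation flavor of Blind GB-PANDAS can be caricatured with an epsilon-greedy router that learns service times from completed tasks and otherwise routes to the server with the smallest estimated weighted workload. This is a minimal sketch under assumed bookkeeping, not the paper's algorithm; the epsilon-greedy switch, the running-mean estimates, and all names are illustrative.

```python
import random

class BlindRouter:
    """Epsilon-greedy caricature of exploration-exploitation routing when
    service rates are unknown; not the paper's algorithm."""

    def __init__(self, num_types, num_servers, epsilon=0.05):
        self.epsilon = epsilon                                   # exploration probability
        self.backlog = [0.0] * num_servers                       # estimated work queued at each server
        # Per (type, server) running totals: [sum of observed service times, #samples].
        self.obs = [[[0.0, 0] for _ in range(num_servers)] for _ in range(num_types)]

    def est_service_time(self, t, m):
        total, n = self.obs[t][m]
        return total / n if n else 1.0                           # optimistic default before any sample

    def record_completion(self, t, m, service_time):
        self.obs[t][m][0] += service_time
        self.obs[t][m][1] += 1
        self.backlog[m] = max(0.0, self.backlog[m] - service_time)

    def route(self, t):
        if random.random() < self.epsilon:                       # explore: sample an arbitrary server
            m = random.randrange(len(self.backlog))
        else:                                                    # exploit: minimum estimated weighted workload
            m = min(range(len(self.backlog)),
                    key=lambda s: self.backlog[s] + self.est_service_time(t, s))
        self.backlog[m] += self.est_service_time(t, m)
        return m


router = BlindRouter(num_types=2, num_servers=3)
m = router.route(0)
router.record_completion(0, m, service_time=1.7)   # feed back the measured service time
```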
We investigate the scheduling of a common resource between several concurrent users when the feasible transmission rate of each user varies randomly over time. Time is slotted, and users arrive and depart upon service completion. This may model, for example, the flow-level behavior of end-users in a narrowband HDR wireless channel (CDMA 1xEV-DO). As performance criteria we consider the stability of the system and the mean delay experienced by the users. Given the complexity of the problem, we investigate the fluid-scaled system, which allows us to obtain important results and insights for the original system: (1) We characterize the stability conditions for a large class of scheduling policies and identify a set of maximum stable policies, giving in each time slot preference to users in their best possible channel condition. We find in particular that many opportunistic scheduling policies such as Score-Based, Proportionally Best, or Potential Improvement are stable under the maximum stability conditions, whereas the Relative-Best opportunistic scheduler and the $c\mu$-rule are not. (2) We show that choosing the right tie-breaking rule is crucial for the performance (e.g. average delay) as perceived by a user. We prove that a policy is asymptotically optimal if it is maximum stable and the tie-breaking rule gives priority to the user with the highest departure probability. We refer to such a tie-breaking rule as myopic. (3) We derive the growth rates of the number of users in the system in overload settings under various policies, which give additional insights on the performance. (4) We conclude that simple priority-index policies with the myopic tie-breaking rule are stable and asymptotically optimal. All our findings are validated with extensive numerical experiments.
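
A minimal sketch of a priority-index policy with the myopic tie-breaking rule described above: in each slot, restrict attention to users currently in their best possible channel condition and, among those, serve the one most likely to depart. The data structure and field names are assumptions for illustration.

```python
def pick_user(users):
    """users: list of dicts with keys
         'rate'        -- feasible transmission rate this slot
         'best_rate'   -- the user's best possible rate over all channel states
         'depart_prob' -- probability the user completes service if served now
    Returns the index of the user to serve, or None if the system is empty."""
    if not users:
        return None
    # Users currently in their best possible channel condition.
    best = [i for i, u in enumerate(users) if u['rate'] == u['best_rate']]
    candidates = best if best else range(len(users))
    # Myopic tie-break: highest departure probability first.
    return max(candidates, key=lambda i: users[i]['depart_prob'])


users = [
    {'rate': 2.4, 'best_rate': 2.4, 'depart_prob': 0.3},
    {'rate': 2.4, 'best_rate': 2.4, 'depart_prob': 0.7},
    {'rate': 0.6, 'best_rate': 2.4, 'depart_prob': 0.9},
]
print(pick_user(users))   # 1: in best condition and most likely to depart
```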
Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements.
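
As a toy illustration of the critical-path bound (not OSACA's implementation), the sketch below models one loop iteration as a dependency DAG with instruction latencies as edge weights and returns the longest chain as an upper-bound estimate of the iteration runtime. The latency values and the tiny example kernel are assumptions.

```python
def critical_path(latency, deps):
    """latency[i] -- latency in cycles of instruction i
    deps[i]     -- list of instructions that must complete before i starts
    Returns the length (in cycles) of the longest dependency chain."""
    finish = {}
    def finish_time(i):
        if i not in finish:
            start = max((finish_time(d) for d in deps.get(i, [])), default=0)
            finish[i] = start + latency[i]
        return finish[i]
    return max(finish_time(i) for i in latency)


# Toy example: load -> fused multiply-add -> store, plus an independent add.
latency = {'load': 4, 'fma': 4, 'store': 1, 'add': 1}
deps = {'fma': ['load'], 'store': ['fma'], 'add': []}
print(critical_path(latency, deps))   # 9 cycles along load -> fma -> store
```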
An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally we give an outlook on how the method may be generalized to new architectures.
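
The complementary throughput estimate can be illustrated with a simple port-pressure calculation in the spirit of the optimistic full-throughput assumption: spread each instruction's cycles evenly over the execution ports that can issue it and report the busiest port's total as the cycles per iteration. The port map and cycle counts below are illustrative assumptions, not a real machine model.

```python
def throughput_bound(instructions):
    """instructions: list of (cycles, allowed_ports) tuples for one loop iteration."""
    pressure = {}
    for cycles, ports in instructions:
        share = cycles / len(ports)              # optimistic even split across capable ports
        for p in ports:
            pressure[p] = pressure.get(p, 0.0) + share
    return max(pressure.values())                # busiest port limits the iteration throughput


# Toy kernel: two loads (ports 2/3), one FMA (ports 0/1), one store (port 4).
kernel = [(1, ['P2', 'P3']), (1, ['P2', 'P3']), (1, ['P0', 'P1']), (1, ['P4'])]
print(throughput_bound(kernel))   # 1.0 cycle per iteration on the busiest port
```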
Priority-aware networks-on-chip (NoCs) are used in industry to achieve predictable latency under different workload conditions. These NoCs incorporate deflection routing to minimize queuing resources within routers and achieve low latency during low traffic load. However, deflected packets can exacerbate congestion during high traffic load since they consume the NoC bandwidth. State-of-the-art analytical models for priority-aware NoCs ignore deflected traffic despite its significant latency impact during congestion. This paper proposes a novel analytical approach to estimate end-to-end latency of priority-aware NoCs with deflection routing under bursty and heavy traffic scenarios. Experimental evaluations show that the proposed technique outperforms alternative approaches and estimates the average latency for real applications with less than 8% error compared to cycle-accurate simulations.
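
As a back-of-the-envelope illustration (not the paper's analytical model) of why deflected traffic matters, the toy sketch below inflates the load offered to each router by the deflection probability and feeds it into an M/M/1 waiting-time formula; the formula choice and parameter values are assumptions.

```python
def end_to_end_latency(inj_rate, service_rate, hops, deflect_prob):
    """Rough average end-to-end latency over a path of 'hops' routers."""
    # Each deflection re-traverses (on average) one extra router, so the
    # effective per-router load grows with the deflection probability.
    eff_rate = inj_rate / (1.0 - deflect_prob)
    if eff_rate >= service_rate:
        return float('inf')                       # unstable: latency diverges
    per_hop = 1.0 / (service_rate - eff_rate)     # M/M/1 sojourn time per router
    return hops * per_hop / (1.0 - deflect_prob)  # deflections also add router visits


print(end_to_end_latency(inj_rate=0.2, service_rate=1.0, hops=4, deflect_prob=0.0))  # 5.0
print(end_to_end_latency(inj_rate=0.2, service_rate=1.0, hops=4, deflect_prob=0.3))  # 8.0
```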
