Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Recent research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for performance and energy on heterogeneous processors results in a large number of Pareto-optimal solutions (workload distributions) even in the simple case of linear performance and energy profiles. We then study the performance and energy profiles of real-life data-parallel applications and find that their shapes are non-linear, complex and non-smooth. We therefore propose an efficient and exact global optimisation algorithm, which takes as input the most general discrete performance and dynamic energy profiles of the heterogeneous processors and solves the bi-objective optimisation problem. The algorithm is also used as a building block to solve the bi-objective optimisation problem for performance and total energy. We also propose a novel methodology to build the discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses the fundamental challenge of accurate component-level energy modelling of a hybrid data-parallel application running on a heterogeneous platform integrating CPUs and accelerators. We experimentally validate the proposed method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT).
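To make the optimisation problem concrete, the following minimal Python sketch (with hypothetical discrete profiles; this brute-force enumeration is not the paper's exact global algorithm, which is far more efficient) distributes a small workload over two heterogeneous processors and keeps the Pareto-optimal distributions in (parallel execution time, total dynamic energy):

```python
# A minimal brute-force sketch with hypothetical profiles, shown only to
# illustrate the problem; the paper's exact global algorithm is far more
# efficient than this enumeration.
from itertools import product

n = 4
# Discrete profiles: time[p][x] / energy[p][x] = cost of processor p
# computing x workload units. Processor 0 is slow but energy-efficient;
# processor 1 is fast but energy-hungry (all numbers are hypothetical).
time   = [[0, 5, 9, 14, 18], [0, 2, 4, 6, 8]]
energy = [[0, 2, 4, 6, 9],   [0, 5, 9, 15, 22]]

def objectives(dist):
    """Parallel time is the max over processors; dynamic energy is the sum."""
    t = max(time[p][x] for p, x in enumerate(dist))
    e = sum(energy[p][x] for p, x in enumerate(dist))
    return t, e

points = {d: objectives(d)
          for d in product(range(n + 1), repeat=len(time)) if sum(d) == n}

# A distribution is Pareto-optimal if no other distribution is at least as
# good in both objectives and strictly better in one.
pareto = {d: (t, e) for d, (t, e) in points.items()
          if not any(t2 <= t and e2 <= e and (t2, e2) != (t, e)
                     for t2, e2 in points.values())}
print(pareto)  # four of the five feasible distributions are Pareto-optimal
```

Even this toy instance yields several Pareto-optimal workload distributions, consistent with the abstract's observation that heterogeneity alone produces a large Pareto front.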
Serverless computing has grown rapidly following the launch of Amazon's Lambda platform. Function-as-a-Service (FaaS), a key enabler of serverless computing, allows an application to be decomposed into simple, standalone functions that are executed on a FaaS platform. The FaaS platform is responsible for deploying the functions and provisioning resources for them. Many of today's cloud applications spread over heterogeneous connected computing resources and are highly dynamic in their structure and resource requirements. However, FaaS platforms are limited to homogeneous clusters and homogeneous functions, and do not account for the data access behavior of functions before scheduling them. We extend FaaS to heterogeneous clusters and heterogeneous functions through a network of distributed heterogeneous target platforms called a Function Delivery Network (FDN), where a target platform is a combination of a cluster of homogeneous nodes and a FaaS platform on top of it. The FDN provides Function-Delivery-as-a-Service (FDaaS), delivering each function to the right target platform. We showcase the opportunities the FDN offers, such as varied target-platform characteristics, collaborative execution between multiple target platforms, and data locality, in fulfilling two objectives when scheduling functions: meeting Service Level Objective (SLO) requirements and energy efficiency. We evaluate the FDN over five distributed target platforms using FDNInspector, a tool we developed for benchmarking distributed target platforms. In our evaluation, scheduling functions on an edge target platform reduced overall energy consumption by 17x compared to scheduling on a high-end target platform, without violating SLO requirements.
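As an illustration of the kind of placement decision FDaaS makes, the minimal Python sketch below (platform names, latency and energy estimates are hypothetical; the actual FDN scheduler additionally exploits data locality and collaborative execution) picks the most energy-efficient target platform that still meets a function's SLO:

```python
# A minimal sketch of SLO- and energy-aware function placement, assuming
# hypothetical per-platform latency and energy estimates; not the FDN's
# actual scheduling algorithm.
from dataclasses import dataclass

@dataclass
class TargetPlatform:
    name: str
    est_latency_s: float   # estimated function latency on this platform
    est_energy_j: float    # estimated energy per invocation

def place(function_slo_s, platforms):
    """Pick the most energy-efficient platform that still meets the SLO."""
    feasible = [p for p in platforms if p.est_latency_s <= function_slo_s]
    if not feasible:
        return None  # no platform can satisfy the SLO
    return min(feasible, key=lambda p: p.est_energy_j)

platforms = [TargetPlatform("edge-cluster", 0.8, 1.5),
             TargetPlatform("hpc-cluster", 0.1, 25.0)]
print(place(function_slo_s=1.0, platforms=platforms).name)  # edge-cluster
```

This mirrors the evaluation's finding: when an edge platform is fast enough to meet the SLO, delivering the function there saves energy over a high-end platform.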
This paper investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions affects the strong-scaling performance significantly on GPUs. It is shown that, without proper performance optimizations, 1D domain decomposition scales poorly on multiple GPUs due to noncontiguous memory access. Regardless of the decomposition chosen, performance benefits from a series of optimizations presented in the paper. Since the buoyancy-driven cavity code is latency-bound on the clusters examined, a series of optimizations, both platform-agnostic and platform-specific, are designed to reduce the latency cost and improve memory throughput between hosts and devices. First, a parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves the overall performance by about a factor of 2. Second, transferring different amounts of data based on the stencil sizes of different variables further reduces the communication overhead. These two optimizations are general enough to benefit stencil computations with ghost exchanges on all of the clusters tested. Third, GPUDirect is used to improve communication on clusters that have the hardware and software support for direct communication between GPUs without staging through CPU memory. Finally, overlapping communication and computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC. Although we believe our implementation exposes sufficient overlap, the actual runs do not exploit it well due to a lack of asynchronous progress.
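The packing/unpacking idea can be illustrated with a minimal NumPy sketch (hypothetical array sizes; not the paper's OpenACC code): in row-major storage, halo faces along the last dimension are strided, so copying them into contiguous buffers lets one bulk host-device or MPI transfer replace many small strided ones.

```python
# A minimal NumPy sketch of halo packing/unpacking for noncontiguous faces;
# sizes and ghost width are hypothetical, and the real code performs these
# copies in parallel on the GPU via OpenACC.
import numpy as np

def pack_z_faces(u, ng):
    """Copy the strided z-direction faces (width ng) into contiguous buffers."""
    lo = np.ascontiguousarray(u[:, :, ng:2 * ng])    # inner face sent "down"
    hi = np.ascontiguousarray(u[:, :, -2 * ng:-ng])  # inner face sent "up"
    return lo, hi

def unpack_z_faces(u, lo_recv, hi_recv, ng):
    """Write the received faces into the ghost layers."""
    u[:, :, :ng] = lo_recv
    u[:, :, -ng:] = hi_recv

u = np.zeros((8, 8, 8))
lo, hi = pack_z_faces(u, ng=1)   # contiguous buffers, ready for one transfer
unpack_z_faces(u, lo, hi, ng=1)
```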
Energy proportionality is the key design goal followed by architects of modern multicore CPUs. One of its implications is that optimization of an application for performance also optimizes it for energy. In this work, we show that energy proportionality does not hold true for multicore CPUs. This finding creates the opportunity for bi-objective optimization of applications for performance and energy. We propose and study the first application-level method for bi-objective optimization of multithreaded data-parallel applications for performance and energy. The method uses two decision variables: the number of identical multithreaded kernels (threadgroups) executing the application, and the number of threads in each threadgroup, with the workload always partitioned equally between the threadgroups. We experimentally demonstrate the efficiency of the method using four highly optimized multithreaded data-parallel applications: 2D fast Fourier transform based on FFTW and Intel MKL, and dense matrix-matrix multiplication using OpenBLAS and Intel MKL. Four modern multicore CPUs are used in the experiments. The experiments show that optimization for performance alone increases dynamic energy consumption by up to 89%, while optimization for dynamic energy alone degrades performance by up to 49%. By solving the bi-objective optimization problem, the method determines up to 11 globally Pareto-optimal solutions. Finally, we propose a qualitative dynamic energy model employing performance monitoring counters as parameters, which we use to explain the discovered energy nonproportionality and the Pareto-optimal solutions determined by our method. The model shows that the energy nonproportionality in our case is due to the activity of the data translation lookaside buffer (dTLB), which is disproportionately energy-expensive.
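The method's search space can be sketched in a few lines of Python (measurements below are hypothetical; the real method measures each configuration's execution time and dynamic energy on the target CPU): enumerate configurations (g threadgroups, t threads per group) with the workload split equally between threadgroups, then keep the configurations not dominated in both objectives.

```python
# A minimal sketch of selecting globally Pareto-optimal (threadgroups,
# threads-per-group) configurations; the (time, energy) values are
# hypothetical stand-ins for real measurements.
measured = {
    # (threadgroups g, threads per group t): (time_s, dynamic_energy_j)
    (1, 8): (10.0, 120.0),
    (2, 4): (11.0,  95.0),
    (4, 2): (14.0,  80.0),
    (8, 1): (19.0,  90.0),
}

def pareto(points):
    """Keep configurations not dominated in (time, energy)."""
    return {c: v for c, v in points.items()
            if not any(o[0] <= v[0] and o[1] <= v[1] and o != v
                       for o in points.values())}

print(pareto(measured))  # (1, 8), (2, 4) and (4, 2) survive; (8, 1) is dominated
```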
Energy is now a first-class design constraint, along with performance, in all computing settings. Energy predictive modelling based on performance monitoring counters (PMCs) is the leading method for predicting the energy consumption of an application execution. We use a model-theoretic approach to formulate the assumed properties of existing models in a mathematical form. We extend the formalism by adding properties, heretofore unconsidered, that account for a limited form of the energy conservation law. The extended formalism defines our theory of energy of computing. By applying the basic practical implications of the theory, we improve the prediction accuracy of state-of-the-art energy models, reducing their average prediction error from 31% to 18%. We also demonstrate that the use of state-of-the-art measurement tools for energy optimisation may lead to significant losses of energy (ranging from 56% to 65% for the applications used in our experiments), since these tools do not take the energy conservation properties into account.
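One practical implication of such conservation properties is that a linear PMC-based model should not predict lower energy when a counter increases, which suggests constraining its coefficients to be non-negative. The sketch below (with hypothetical PMC counts and energy readings; a simplified illustration, not the paper's formalism) fits such a constrained model:

```python
# A minimal sketch of a linear PMC-based energy model with non-negative
# coefficients, one simple consequence of energy-conservation-style
# properties. All data is hypothetical.
import numpy as np
from scipy.optimize import nnls

# Rows: application runs; columns: selected PMC counts per run.
X = np.array([[1.2e9, 3.1e8, 4.0e7],
              [2.5e9, 6.0e8, 9.0e7],
              [0.8e9, 2.2e8, 2.5e7],
              [3.1e9, 7.4e8, 1.1e8]])
y = np.array([42.0, 88.0, 27.0, 110.0])  # measured dynamic energy (J)

coef, _ = nnls(X, y)                 # least squares subject to coef >= 0
pred = X @ coef
print(coef, np.abs(pred - y) / y)    # coefficients and relative errors
```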
This work proposes a methodology to find performance and energy trade-offs for parallel applications running on Heterogeneous Multi-Processing systems with a single instruction-set architecture. Such systems offer flexibility in the form of different core types and voltage/frequency pairings, defining a vast design space to explore. Therefore, for a given application, choosing a configuration that optimizes performance and energy consumption is not straightforward. Our method proposes novel analytical models for performance and power consumption whose parameters can be fitted using only a few strategically sampled offline measurements. These models are then used to estimate an application's performance and energy consumption over the whole configuration space. In turn, these offline predictions define the model's estimated Pareto-optimal configurations, which inform the selection of the configuration on which the application should be executed. The methodology was validated on an ODROID-XU3 board using eight programs from the PARSEC benchmark suite, the Phoronix Test Suite and Rodinia. The generated Pareto-optimal configuration space represented a 99% reduction of the universe of all available configurations. Energy savings of up to 59.77%, 61.38% and 17.7% were observed when compared to the performance, ondemand and powersave Linux governors, respectively, with higher or similar performance.
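The workflow can be sketched in Python (hypothetical samples and simplified model forms, not the paper's exact analytical models): fit performance and power models from a few offline measurements, predict over the whole frequency range, and retain only the estimated Pareto-optimal configurations.

```python
# A minimal sketch: fit simple t(f) and P(f) models from a few samples,
# predict (time, energy) over the frequency range, and keep the estimated
# Pareto-optimal points. Model forms and numbers are hypothetical.
import numpy as np

# A few strategically sampled offline measurements on one core type.
f_s = np.array([0.6, 1.0, 1.4, 1.8])    # frequency (GHz)
t_s = np.array([20.0, 12.5, 9.2, 7.4])  # execution time (s)
p_s = np.array([0.9, 1.6, 3.0, 5.4])    # power (W)

# Fit t(f) = a/f + b and P(f) = c + d*f**3 by linear least squares.
(a, b), *_ = np.linalg.lstsq(np.c_[1 / f_s, np.ones_like(f_s)], t_s, rcond=None)
(c, d), *_ = np.linalg.lstsq(np.c_[np.ones_like(f_s), f_s ** 3], p_s, rcond=None)

# Predict over the whole range; energy = power * time.
f = np.linspace(0.6, 2.0, 15)
t, e = a / f + b, (c + d * f ** 3) * (a / f + b)
pareto = [(fi, ti, ei) for fi, ti, ei in zip(f, t, e)
          if not any(tj <= ti and ej <= ei and (tj, ej) != (ti, ei)
                     for tj, ej in zip(t, e))]
print(len(pareto), "estimated Pareto-optimal frequencies out of", len(f))
```

As in the paper, the estimated Pareto set is typically a small fraction of the full configuration space, which is what makes the offline pruning worthwhile.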