Big data applications and analytics are employed in many sectors for a variety of goals: improving customer satisfaction, predicting market behavior, or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), which provides black-box solutions to model the relationship between application performance and system configuration without requiring detailed knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefits of using supervised ML models for predicting the performance of applications on Spark, one of today's most widely used frameworks for big data analysis. We compare our approach with Ernest (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating to larger data set sizes. Results show that our models match or exceed Ernest's performance, sometimes enabling us to reduce the prediction error from 126-187% to only 5-19%.
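To make the contrast concrete, the sketch below fits an Ernest-style analytical model via non-negative least squares alongside a generic black-box regressor on the same configuration-runtime observations, then queries both on a larger configuration. This is not the paper's actual pipeline: the data, the exact feature map, and the choice of random forest are illustrative assumptions.

```python
# Illustrative only: Ernest-style analytic model vs. a generic supervised
# regressor for Spark runtime prediction. Data and model choices are made up.
import numpy as np
from scipy.optimize import nnls
from sklearn.ensemble import RandomForestRegressor

# Training observations: (data_fraction, n_machines) -> runtime in seconds.
X_cfg = np.array([[0.125, 2], [0.25, 4], [0.5, 8], [1.0, 16]], dtype=float)
y = np.array([120.0, 130.0, 150.0, 170.0])

def ernest_features(cfg):
    """Ernest-style feature map: constant, data/machines, log(machines), machines."""
    data, machines = cfg[:, 0], cfg[:, 1]
    return np.column_stack([np.ones_like(data),
                            data / machines,
                            np.log(machines),
                            machines])

# Fit the analytic terms with non-negative least squares.
theta, _ = nnls(ernest_features(X_cfg), y)

# A black-box alternative: any supervised regressor over the raw configuration.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_cfg, y)

test = np.array([[2.0, 32]])          # extrapolate to a larger data set
print("Ernest-style prediction:", ernest_features(test) @ theta)
print("Random-forest prediction:", rf.predict(test)[0])
```

Note that tree-based learners cannot predict outside the range of runtimes seen during training, so the choice of black-box model matters when the goal is extrapolation to larger data sets.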
With the prevalence of big-data-driven applications, such as face recognition on smartphones and tailored recommendations from Google Ads, we are on the road to a lifestyle with significantly more intelligence than ever before. For example, Aipoly Vision [1] is an object and color recognizer that helps the blind, visually impaired, and color blind understand their surroundings. On the back end, various neural-network-powered models run to enable quick responses to users. Supporting those models requires substantial cloud-based computational resources, e.g., CPUs and GPUs. Cloud providers charge their clients by the amount of resources they occupy. From the clients' perspective, they have to balance their budget against the quality of experience (e.g., response time). The budget depends on the individual business owner, and the required Quality of Experience (QoE) depends on the usage scenario of each application; for instance, an autonomous vehicle requires real-time responses, whereas unlocking a smartphone can tolerate delays. However, cloud providers fail to offer a QoE-based option to their clients. In this paper, we propose DQoES, a differentiated quality-of-experience scheduler for deep learning applications. DQoES accepts clients' specifications of their targeted QoE and dynamically adjusts resources to approach those targets. Through extensive, cloud-based experiments, we demonstrate that DQoES can schedule multiple concurrent jobs with respect to various QoE targets and achieve up to 8x more satisfied models compared to the existing system.
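As a rough illustration of QoE-driven resource adjustment, a proportional feedback loop can grow or shrink a job's CPU quota depending on how far its observed latency is from the client's target. The abstract does not specify DQoES's control logic, so everything below, including the names, the gain, and the quota bounds, is a hypothetical sketch rather than the system's actual scheduler.

```python
# Hypothetical sketch of QoE-target-driven resource adjustment (not DQoES itself).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    target_latency_ms: float   # client-specified QoE target
    cpu_quota: float           # CPUs currently allocated to the job

def adjust_quota(job: Job, observed_latency_ms: float,
                 gain: float = 0.5, min_q: float = 0.1, max_q: float = 4.0) -> None:
    """Grow the quota when the job misses its target, shrink it when the job
    has slack, so freed resources can serve other tenants."""
    error = (observed_latency_ms - job.target_latency_ms) / job.target_latency_ms
    job.cpu_quota = min(max_q, max(min_q, job.cpu_quota * (1.0 + gain * error)))

job = Job("face-recognition", target_latency_ms=50.0, cpu_quota=1.0)
adjust_quota(job, observed_latency_ms=80.0)   # target missed -> quota grows
print(job.cpu_quota)
```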
Can cloud computing infrastructures provide HPC-competitive performance for scientific applications broadly? Despite prolific related literature, this question remains open. Answers are crucial for designing future systems and democratizing high-performance computing. We present a multi-level approach to investigate the performance gap between HPC and cloud computing, isolating different variables that contribute to this gap. Our experiments are divided into (i) hardware and system microbenchmarks and (ii) user application proxies. The results show that today's high-end cloud computing can deliver HPC-competitive performance not only for computationally intensive applications but also for memory- and communication-intensive applications, at least at modest scales, thanks to the high-speed memory systems and interconnects and dedicated batch scheduling now available on some cloud platforms.
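For context, a system microbenchmark in this setting is a small kernel that stresses one hardware subsystem and can be run identically on an HPC node or a cloud VM. The snippet below is a toy STREAM-like memory-bandwidth measurement for illustration only; it is not the paper's actual benchmark suite, and the array size is an arbitrary assumption.

```python
# Toy memory-bandwidth microbenchmark (illustrative, not the paper's suite).
import time
import numpy as np

N = 50_000_000                       # ~400 MB per float64 array (assumed size)
a = np.random.rand(N)
b = np.empty_like(a)

start = time.perf_counter()
np.copyto(b, a)                      # COPY kernel: b[i] = a[i]
elapsed = time.perf_counter() - start

bytes_moved = 2 * a.nbytes           # one read + one write per element
print(f"COPY bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
```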
Porting code from CPU to GPU is costly and time-consuming; unless much time is invested in development and optimization, it is not obvious, a priori, how much speed-up is achievable or how much room is left for improvement. Knowing the potential speed-up a priori can be very useful: it can save hundreds of engineering hours and help programmers with prioritization and algorithm selection. We aim to address this problem using machine learning in a supervised setting, using solely the single-threaded source code of the program, without having to run or profile the code. We propose a static analysis-based cross-architecture performance prediction framework (Static XAPP), which relies solely on program properties collected using static analysis of the CPU source code and predicts whether the potential speed-up is above or below a given threshold. We offer preliminary results showing that we can achieve 94% accuracy in binary classification, on average, across different thresholds.
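The following sketch conveys the general recipe of threshold-based speed-up classification from static features. The feature set, training data, and classifier are invented for illustration; they are not Static XAPP's actual design.

```python
# Illustrative sketch: classify whether GPU speed-up exceeds a threshold using
# only statically collected source-code features. Features and data are made up.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Each row: [loop_nest_depth, flop_count, memory_ops, branch_count]
X = np.array([[3,  900, 120,  10],
              [1,   50, 200,  80],
              [2,  400,  90,  15],
              [1,   30, 300, 120],
              [4, 1500, 100,   5],
              [2,  200, 250,  60]], dtype=float)
speedup = np.array([12.0, 1.2, 6.0, 0.8, 20.0, 2.5])

threshold = 3.0                         # "worth porting" cut-off
y = (speedup >= threshold).astype(int)  # binary label per program

clf = GradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=3).mean())   # estimated binary accuracy
```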
Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite the advantages of modularity and elasticity that microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service's QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine, we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations and improves performance predictability.
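As a heavily simplified illustration of label-free root cause localization (not Sage's actual models, which the abstract does not detail), one can score each microservice's current latency against its own history and follow anomalous dependencies to the deepest affected tier. The service names, call topology, and metrics below are made up.

```python
# Simplified, unsupervised root-cause localization over a microservice graph.
# Everything here (topology, metrics, thresholds) is hypothetical.
import numpy as np

# Each tier -> the tiers it calls (frontend depends on the services below it).
calls = {"frontend": ["cart", "catalog"], "cart": ["db"], "catalog": ["db"], "db": []}

history = {s: np.random.default_rng(0).normal(20, 2, 500) for s in calls}  # past latency, ms
current = {"frontend": 95.0, "cart": 80.0, "catalog": 21.0, "db": 75.0}    # current latency, ms

def z_score(service):
    h = history[service]
    return (current[service] - h.mean()) / h.std()

def root_cause(service, threshold=3.0):
    """Follow anomalous callees as deep as possible; the deepest anomalous tier
    with no anomalous callee is reported as the root-cause candidate."""
    anomalous = [c for c in calls[service] if z_score(c) > threshold]
    if not anomalous:
        return service
    return root_cause(max(anomalous, key=z_score), threshold)

print(root_cause("frontend"))   # -> 'db' in this toy example
```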
Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Recent research has demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for performance and energy on heterogeneous processors results in a large number of Pareto-optimal solutions (workload distributions) even in the simple case of linear performance and energy profiles. We then study the performance and energy profiles of real-life data-parallel applications and find that their shapes are non-linear, complex, and non-smooth. We therefore propose an efficient and exact global optimisation algorithm, which takes as input the most general discrete performance and dynamic energy profiles of the heterogeneous processors and solves the bi-objective optimisation problem. The algorithm is also used as a building block to solve the bi-objective optimisation problem for performance and total energy. We also propose a novel methodology to build the discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses the fundamental challenge of accurate component-level energy modelling of a hybrid data-parallel application running on a heterogeneous platform integrating CPUs and accelerators. We experimentally validate the proposed method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT).
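To see why even simple profiles yield many Pareto-optimal workload distributions, the toy enumeration below evaluates every split of a small workload between a CPU and a GPU and keeps the non-dominated (time, energy) points. This is a naive sketch for intuition, not the paper's exact global algorithm, and the discrete profiles are made up.

```python
# Naive enumeration of Pareto-optimal workload distributions over two devices.
# Profiles are hypothetical; real ones come from measurement.
from itertools import product

# Discrete profiles: workload size (chunks) -> (time in s, dynamic energy in J).
cpu_profile = {0: (0, 0), 1: (4, 10), 2: (7, 22), 3: (12, 30), 4: (20, 38)}
gpu_profile = {0: (0, 0), 1: (2, 18), 2: (3, 40), 3: (5, 55), 4: (8, 80)}

TOTAL = 4   # total chunks of work to distribute

candidates = []
for c, g in product(cpu_profile, gpu_profile):
    if c + g != TOTAL:
        continue
    time = max(cpu_profile[c][0], gpu_profile[g][0])       # devices run in parallel
    energy = cpu_profile[c][1] + gpu_profile[g][1]          # total dynamic energy
    candidates.append(((c, g), time, energy))

pareto = [p for p in candidates
          if not any(q[1] <= p[1] and q[2] <= p[2] and (q[1], q[2]) != (p[1], p[2])
                     for q in candidates)]
print(pareto)   # each entry: ((cpu_chunks, gpu_chunks), time, energy)
```

Even this four-chunk example leaves several non-dominated distributions, which is precisely the performance-energy trade-off that an exact bi-objective algorithm must navigate at realistic problem sizes.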