The pursuit of power efficiency is popularizing asymmetric multicore processors (AMPs) such as ARM big.LITTLE, Apple M1, and the recent Intel Alder Lake, which combine big and little cores. However, we find that existing scalable locks fail to scale on AMPs and collapse in throughput, latency, or both, because their implicit assumption of symmetric cores no longer holds. To address this issue, we propose the first asymmetry-aware scalable lock, LibASL. LibASL provides a new lock ordering guided by applications' latency requirements: big cores are allowed to reorder ahead of little cores for higher throughput, as long as the applications' latency requirements are preserved. Using LibASL only requires linking the application against it and, if the application is latency-critical, inserting a few lines of code to annotate its coarse-grained latency requirement. We evaluate LibASL on various benchmarks, including five popular databases, on Apple M1. Evaluation results show that LibASL improves throughput by up to 5x while precisely preserving the tail latency designated by applications.
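For intuition, the sketch below shows the usage pattern the abstract describes: link the application against the lock library and annotate a coarse-grained latency budget around the latency-critical section. The asl_* names, the stub bodies, and the 200-microsecond budget are illustrative placeholders, not the actual LibASL API.

    /* Hypothetical sketch only: asl_set_latency_budget_us() and its stub are
     * placeholders, NOT the real LibASL interface. They illustrate "link the
     * library and annotate a coarse-grained latency requirement". */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static __thread long asl_budget_us;                /* stub for the annotation */

    static void asl_set_latency_budget_us(long us) { asl_budget_us = us; }

    static void handle_request(void)                   /* application work */
    {
        printf("request served under a %ld us tail-latency budget\n", asl_budget_us);
    }

    void latency_critical_request(void)
    {
        asl_set_latency_budget_us(200);   /* hypothetical annotation call */
        pthread_mutex_lock(&lock);        /* lock calls interposed at link time */
        handle_request();
        pthread_mutex_unlock(&lock);
    }

    int main(void) { latency_critical_request(); return 0; }

With such an annotation in place, the lock can let big-core waiters bypass little-core waiters only while every waiter's latency budget still has slack, which is the trade-off the abstract describes.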
Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely coupled microservices. Despite the modularity and elasticity microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. We present Sage, a machine-learning-driven root cause analysis system for interactive cloud microservices. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service's QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine, we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations, and improves performance predictability.
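As a much simplified illustration of dependency-aware root cause localization (not Sage's unsupervised ML models), the sketch below walks a call chain from the front end and blames the deepest tier whose latency is itself anomalous, so that violations that merely propagate upstream through backpressure are not misattributed. All service names, latencies, and thresholds are hypothetical.

    /* Simplified illustration only -- not Sage's actual approach. */
    #include <stdio.h>

    struct svc { const char *name; double lat_ms, nominal_ms; int callee; /* -1 = leaf */ };

    static const char *root_cause(const struct svc *s, int frontend)
    {
        int cause = frontend;
        for (int i = frontend; i != -1; i = s[i].callee)
            if (s[i].lat_ms > 1.5 * s[i].nominal_ms)   /* tier is itself anomalous */
                cause = i;                             /* keep the deepest such tier */
        return s[cause].name;
    }

    int main(void)
    {
        struct svc chain[] = {
            { "frontend", 45.0, 10.0,  1 },   /* violates, but only by inheritance */
            { "cart",     38.0,  8.0,  2 },
            { "database", 30.0,  5.0, -1 },   /* the actual culprit */
        };
        printf("root cause: %s\n", root_cause(chain, 0));
        return 0;
    }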
The FFT of three-dimensional (3D) input data is an important computational kernel in numerical simulations and is widely used in High Performance Computing (HPC) codes running on large numbers of processors. The performance of many scientific applications, such as Molecular Dynamics simulations, depends on the underlying 3D parallel FFT library being used. In this paper, we present C-DAC's three-dimensional Fast Fourier Transform (CROFT) library, which implements 3D parallel FFT using pencil decomposition. To exploit the hyperthreading capability of processor cores without hurting performance, CROFT is designed to use multithreading along with MPI. CROFT's implementation has an innovative feature of overlapping compute and memory I/O with MPI communication using multithreading. As opposed to other 3D FFT implementations, CROFT uses only two threads, one of which is dedicated to communication so that it can be effectively overlapped with computation. Depending on the number of processes used, CROFT achieves a performance improvement of about 42% to 51% over the FFTW3 library.
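The core overlap idea can be sketched independently of CROFT: while a dedicated thread drives the all-to-all exchange needed for the next pencil transpose, the main thread keeps computing on data it already holds. The sketch below is a minimal stand-in (not CROFT's code); it assumes an MPI library built with MPI_THREAD_MULTIPLE support and replaces the per-pencil 1D FFT work with a dummy loop.

    /* Minimal sketch of compute/communication overlap with a dedicated
     * communication thread (not CROFT's actual implementation). */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>

    static double *send_buf, *recv_buf, *work_buf;
    static int chunk, nprocs;

    /* Communication thread: performs the transpose exchange for the next
     * pencil while the main thread is busy computing. */
    static void *comm_thread(void *arg)
    {
        (void)arg;
        MPI_Alltoall(send_buf, chunk, MPI_DOUBLE,
                     recv_buf, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        chunk = 1 << 16;                                /* doubles per peer */
        send_buf = calloc((size_t)chunk * nprocs, sizeof(double));
        recv_buf = calloc((size_t)chunk * nprocs, sizeof(double));
        work_buf = calloc((size_t)chunk * nprocs, sizeof(double));

        pthread_t t;
        pthread_create(&t, NULL, comm_thread, NULL);    /* start the exchange */

        /* Main thread: stand-in for the per-pencil 1D FFTs that proceed while
         * the exchange is in flight. */
        for (long i = 0; i < (long)chunk * nprocs; i++)
            work_buf[i] = work_buf[i] * 2.0 + 1.0;

        pthread_join(t, NULL);   /* exchange done: recv_buf holds the transposed data */

        free(send_buf); free(recv_buf); free(work_buf);
        MPI_Finalize();
        return 0;
    }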
Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be applied in addition, are less frequently discussed. In this paper, we present an analysis of several optimizations applied to both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and its GPU equivalent, explicit memory coalescing, are found to be critical to achieving good performance of this algorithm in both environments. The fully optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to the speedup from multi-threading, and is 2x faster than the fully optimized GPU version.
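The kind of loop involved can be illustrated with a small, hypothetical kernel (not the paper's code): storing particle coordinates in a structure-of-arrays layout keeps each field contiguous, which lets the CPU compiler vectorize the inner distance loop.

    /* Illustrative sketch (not the paper's code): a structure-of-arrays layout
     * keeps each coordinate contiguous, so the compiler can vectorize the hot
     * inner loop of a Metropolis acceptance test. */
    #include <stddef.h>

    typedef struct {                 /* SoA: x[], y[], z[] each contiguous */
        double *x, *y, *z;
        size_t  n;
    } Particles;

    /* Sum of squared distances from a trial position -- each iteration is
     * independent, so the loop auto-vectorizes with SIMD loads. */
    double energy_term(const Particles *p, double tx, double ty, double tz)
    {
        double acc = 0.0;
        const double *restrict x = p->x;
        const double *restrict y = p->y;
        const double *restrict z = p->z;
        for (size_t i = 0; i < p->n; i++) {
            double dx = x[i] - tx, dy = y[i] - ty, dz = z[i] - tz;
            acc += dx * dx + dy * dy + dz * dz;
        }
        return acc;
    }

The equivalent GPU change is to have consecutive threads read consecutive elements of the same array, which is exactly the coalesced access pattern the SoA layout provides.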
A parallel program can be represented as a directed acyclic graph. An important performance bound is the time to execute the critical path through the graph. We show how this performance metric is related to Amdahl's speedup and the degree of average parallelism. These bounds formally exclude superlinear performance.
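As a brief reminder of the standard bounds (in notation not used in the abstract: $T_1$ for the total work and $T_\infty$ for the critical-path time), the execution time on $n$ processors and the resulting speedup satisfy

    T_n \;\ge\; \max\!\left(\frac{T_1}{n},\; T_\infty\right),
    \qquad
    S(n) \;=\; \frac{T_1}{T_n} \;\le\; \min\!\left(n,\; \frac{T_1}{T_\infty}\right),

where $T_1/T_\infty$ is the average parallelism. Writing the critical path as a fraction $s = T_\infty/T_1$ of the total work gives the Amdahl-style limit $S(n) \le 1/s$, and $S(n) \le n$ is precisely the exclusion of superlinear speedup.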
We show how to quantify scalability with the Universal Scalability Law (USL) by applying it to performance measurements of memcached, J2EE, and WebLogic on multi-core platforms. Since commercial multicores are essentially black boxes, the accessible performance gains are primarily available at the application level. We also demonstrate how our methodology can identify the most significant performance tuning opportunities to optimize application scalability, as well as providing an easy means of exploring other aspects of the multi-core system design space.
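For reference, Gunther's USL models relative capacity (throughput at concurrency $N$, normalized to $N = 1$) as

    C(N) \;=\; \frac{N}{1 + \alpha\,(N-1) + \beta\,N\,(N-1)},

where $\alpha$ captures contention (serialization) and $\beta$ captures coherency (crosstalk) costs. Fitting $\alpha$ and $\beta$ to measured throughput yields the concurrency $N^{*} \approx \sqrt{(1-\alpha)/\beta}$ at which capacity peaks, which is what makes the model useful for locating tuning opportunities.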