WCET analysis of multi-level set-associative instruction caches

124 0 0.0 ( 0 )

Download Cite

Added by Damien Hardy

Publication date 2008

fields Informatics Engineering

and research's language is English

Authors Damien Hardy

Performance

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

With the advent of increasingly complex hardware in real-time embedded systems (processors with performance enhancing features such as pipelines, cache hierarchy, multiple cores), many processors now have a set-associative L2 cache. Thus, there is a need for considering cache hierarchies when validating the temporal behavior of real-time systems, in particular when estimating tasks worst-case execution times (WCETs). To the best of our knowledge, there is only one approach for WCET estimation for systems with cache hierarchies [Mueller, 1997], which turns out to be unsafe for set-associative caches. In this paper, we highlight the conditions under which the approach described in [Mueller, 1997] is unsafe. A safe static instruction cache analysis method is then presented. Contrary to [Mueller, 1997] our method supports set-associative and fully associative caches. The proposed method is experimented on medium-size and large programs. We show that the method is most of the time tight. We further show that in all cases WCET estimations are much tighter when considering the cache hierarchy than when considering only the L1 cache. An evaluation of the analysis time is conducted, demonstrating that analysing the cache hierarchy has a reasonable computation time.

rate research

Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations

326 - Markus Wittmann , Georg Hager , Thomas Zeiser 2013

Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip performance and power characteristics can be generalized to the highly parallel case. First we perform an analysis of a sparse-lattice LBM implementation for complex geometries. Using a single-core performance model, we predict the intra-chip saturation characteristics and the optimal operating point in terms of energy to solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single-core performance and a correct choice of the number of active cores per chip are the essential optimizations for lowest energy to solution at minimal performance degradation. Then we extrapolate to the MPI-parallel level and quantify the energy-saving potential of various optimizations and execution modes, where we find these guidelines to be even more important, especially when communication overhead is non-negligible. In our setup we could achieve energy savings of 35% in this case, compared to a naive approach. We also demonstrate that a simple non-reflective reduction of the clock speed leaves most of the energy saving potential unused.

Performance Distributed Parallel and Cluster Computing

GPA: A GPU Performance Advisor Based on Instruction Sampling

68 - Keren Zhou , Xiaozhu Meng , Ryuichi Sai 2020

Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained suggestions at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimization opportunities at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a programs structure and the GPU to match inefficiency patterns with suggestions for optimization. To quantify each suggestions potential benefits, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides an insightful report to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01$times$ to 3.53$times$, with a geometric mean of 1.22$times$.

Performance Distributed Parallel and Cluster Computing

Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

64 - Jan Laukemann , Julian Hammer , Johannes Hofmann 2018

An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally we give an outlook on how the method may be generalized to new architectures.

Performance Software Engineering

On Transformations of Load-Store Maurer Instruction Set Architecture

223 - Tie Hou 2009

In this paper, we study how certain conditions can affect the transformations on the states of the memory of a strict load-store Maurer ISA, when half of the data memory serves as the part of the operating unit.

Hardware Architecture

Acceleration of the tree method with SIMD instruction set

209 - Tetsushi Kodama , Tomoaki Ishiyama 2018

We have developed a highly-tuned software library that accelerates the calculation of quadrupole terms in the Barnes-Hut tree code by use of a SIMD instruction set on the x86 architecture, Advanced Vector eXtensions 2 (AVX2). Our code is implemented as an extension of Phantom-GRAPE software library (Tanikawa et al. 2012, 2013) that significantly accelerates the calculation of monopole terms. If the same accuracy is required, the calculation of quadrupole terms can accelerate the evaluation of forces than that of only monopole terms because we can approximate gravitational forces from closer particles by quadrupole moments than by only monopole moments. Our implementation can calculate gravitational forces about 1.1 times faster in any system than the combination of the pseudoparticle multipole method and Phantom-GRAPE. Our implementation allows simulating homogeneous systems up to 2.2 times faster than that with only monopole terms, however, speed up for clustered systems is not enough because the increase of approximated interactions is insufficient to negate the increased calculation cost by computing quadrupole terms. We have estimated that improvement in performance can be achieved by the use of a new SIMD instruction set, AVX-512. Our code is expected to be able to accelerate simulations of clustered systems up to 1.08 times faster on AVX-512 environment than that with only monopole terms.

Instrumentation and Methods for Astrophysics