Preparing for Performance Analysis at Exascale


Abstract in English

Performance tools for forthcoming heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measuring extreme-scale executions generates large volumes of performance data. Second, performance metrics for heterogeneous applications are highly sparse across code regions. To address these challenges, we developed a novel streaming aggregation approach to post-mortem analysis that employs both shared- and distributed-memory parallelism to aggregate sparse performance measurements from every rank, thread, and GPU stream of a large-scale application execution. Analysis results are stored in a pair of sparse formats designed for efficient access to related data elements, supporting responsive interactive presentation and scalable data analytics. Empirical analysis shows that our implementation of this approach in HPCToolkit effectively processes measurement data from thousands of threads using a fraction of the compute resources employed by the application itself. Our approach performs analysis up to 9.4 times faster and produces analysis results 23 times smaller than HPCToolkit, providing a key building block for scalable exascale performance tools.
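
The idea of streaming aggregation over sparse measurements can be illustrated with a minimal Python sketch. It is not HPCToolkit's implementation or file format: the function names, the (context, metric) keys, and the CSR-like layout are assumptions chosen for illustration, and the sketch is sequential rather than parallel. Each per-thread profile holds only its nonzero metric values, profiles are consumed one at a time, and the aggregated result is packed so that all metrics for one calling context sit next to each other.

```python
from collections import defaultdict

def aggregate_sparse(profiles):
    """Sum metric values across profiles, touching only nonzero entries.

    `profiles` is any iterable of dicts mapping (context_id, metric_name)
    to a value, so profiles can be streamed without holding all of them
    in memory at once.
    """
    totals = defaultdict(float)
    for profile in profiles:
        for key, value in profile.items():
            totals[key] += value
    return dict(totals)

def to_sparse_rows(totals):
    """Pack aggregated values into a CSR-like layout, one row per context.

    The entries for row i (context contexts[i]) are found at positions
    row_start[i] .. row_start[i + 1] - 1 of metric_ids and values.
    """
    by_context = defaultdict(list)
    for (context, metric), value in totals.items():
        by_context[context].append((metric, value))

    contexts = sorted(by_context)
    row_start, metric_ids, values = [0], [], []
    for context in contexts:
        for metric, value in sorted(by_context[context]):
            metric_ids.append(metric)
            values.append(value)
        row_start.append(len(values))
    return contexts, row_start, metric_ids, values

# Hypothetical measurements from three threads (illustrative data only).
thread_profiles = [
    {(0, "cpu_time"): 1.5, (2, "gpu_time"): 4.0},
    {(0, "cpu_time"): 0.5},
    {(2, "gpu_time"): 1.0, (3, "cpu_time"): 2.0},
]

totals = aggregate_sparse(iter(thread_profiles))
contexts, row_start, metric_ids, values = to_sparse_rows(totals)
```

In this layout, contexts with no measurements (context 1 in the example) take no space at all, which is the property that makes sparse storage attractive when most metrics are zero for most code regions.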
