
Automatic Timing-Coherent Transactor Generation for Mixed-level Simulations

Added by Ren-Song Tsay
Publication date: 2021
Language: English





In this paper we extend the concept of the traditional transactor, which focuses on correct content transfer, to a new timing-coherent transactor that also accurately aligns the timing of each transaction boundary. This alignment lets designers perform precise concurrent system behavior analysis in mixed-abstraction-level system simulations, which are essential for increasingly complex system designs. To streamline the process, we also developed an automatic approach for timing-coherent transactor generation. We applied our approach in mixed-level simulations, and the results show that it achieves 100% timing accuracy, whereas the conventional approach yields error rates of 25% to 44%.
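The boundary-alignment idea can be pictured with a small sketch. The following self-contained C++ fragment is illustrative only: it is not the authors' generator and deliberately ignores SystemC/TLM details. A hypothetical transactor forwards a transaction's content while snapping its boundary time to the clock grid of the cycle-accurate side.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// A transaction as seen by a high-level (transaction-level) model.
struct Transaction {
    uint64_t start_ns;  // when the high-level model issues the request
    uint64_t payload;   // content being transferred
};

// Hypothetical timing-coherent transactor: besides forwarding content,
// it snaps each transaction boundary to the RTL clock grid so both
// abstraction levels observe the same boundary times.
class TimingCoherentTransactor {
    uint64_t clock_period_ns_;
public:
    explicit TimingCoherentTransactor(uint64_t period_ns)
        : clock_period_ns_(period_ns) {}

    // Align a time stamp to the next clock edge (boundary alignment).
    uint64_t align(uint64_t t_ns) const {
        uint64_t r = t_ns % clock_period_ns_;
        return r == 0 ? t_ns : t_ns + (clock_period_ns_ - r);
    }

    // Convert one transaction into a cycle-accurate event whose timing
    // is coherent with the transaction boundary.
    void forward(const Transaction& tx) const {
        uint64_t aligned = align(tx.start_ns);
        std::printf("drive pins with payload %llu at cycle edge %llu ns\n",
                    (unsigned long long)tx.payload,
                    (unsigned long long)aligned);
    }
};

int main() {
    TimingCoherentTransactor xactor(10);  // assumed 100 MHz RTL clock
    std::vector<Transaction> txs = {{23, 0xA5}, {40, 0x5A}};
    for (const auto& tx : txs) xactor.forward(tx);
}
```

A content-only transactor would drop the `align` step; the timing skew it introduces at each boundary is exactly what makes concurrent behavior analysis imprecise in mixed-level simulation.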



Related research

Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip performance and power characteristics can be generalized to the highly parallel case. First we perform an analysis of a sparse-lattice LBM implementation for complex geometries. Using a single-core performance model, we predict the intra-chip saturation characteristics and the optimal operating point in terms of energy to solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single-core performance and a correct choice of the number of active cores per chip are the essential optimizations for lowest energy to solution at minimal performance degradation. Then we extrapolate to the MPI-parallel level and quantify the energy-saving potential of various optimizations and execution modes, where we find these guidelines to be even more important, especially when communication overhead is non-negligible. In our setup we could achieve energy savings of 35% in this case, compared to a naive approach. We also demonstrate that a simple non-reflective reduction of the clock speed leaves most of the energy saving potential unused.
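As a rough illustration of the energy-to-solution argument, the following C++ sketch sweeps the number of active cores under a simple bandwidth-saturation model. All machine numbers are invented for the example, not measurements from the paper.

```cpp
#include <algorithm>
#include <cstdio>

// Performance saturates at the memory bandwidth limit, so activating
// more cores than needed only adds power. All numbers are assumptions.
int main() {
    const double work_lups = 1.0e9;   // lattice updates to perform
    const double perf_core = 80.0e6;  // LUP/s per core (assumed)
    const double perf_mem  = 400.0e6; // LUP/s at saturation (assumed)
    const double p_static  = 25.0;    // W, chip baseline power (assumed)
    const double p_core    = 8.0;     // W per active core (assumed)

    for (int n = 1; n <= 8; ++n) {
        double perf = std::min(n * perf_core, perf_mem);
        double time = work_lups / perf;                 // seconds
        double energy = time * (p_static + n * p_core); // joules
        std::printf("cores=%d  time=%.2fs  energy=%.1fJ\n", n, time, energy);
    }
    // Energy is minimized near the saturation point (here n = 5),
    // matching the "correct choice of active cores" guideline.
}
```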
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is therefore ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the C++ library LLAMA, which provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Struct) and SoA (Struct of Array) layouts produce identical code with the same performance characteristics as manually written data structures. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. The library is fully extensible with third-party allocators and allows users to support their own memory layouts with custom mappings.
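The core idea, a layout-agnostic accessor that lets AoS and SoA be exchanged beneath unchanged kernel code, can be sketched as follows. This is a toy illustration in the spirit of LLAMA, not its actual API.

```cpp
#include <array>
#include <cstdio>

constexpr int N = 4;

struct AoS {  // array of structs
    struct P { float x, y; } p[N];
    float& x(int i) { return p[i].x; }
    float& y(int i) { return p[i].y; }
};

struct SoA {  // struct of arrays
    std::array<float, N> xs{}, ys{};
    float& x(int i) { return xs[i]; }
    float& y(int i) { return ys[i]; }
};

// The kernel is written once, against the accessor interface only;
// the memory layout behind it can be swapped freely.
template <typename Layout>
float sum_x(Layout& data) {
    float s = 0.0f;
    for (int i = 0; i < N; ++i) s += data.x(i);
    return s;
}

int main() {
    AoS a; SoA b;
    for (int i = 0; i < N; ++i) { a.x(i) = b.x(i) = float(i); }
    std::printf("AoS: %.1f  SoA: %.1f\n", sum_x(a), sum_x(b));
}
```

With optimization enabled, a compiler typically emits equivalent code for both instantiations, which is the zero-runtime-overhead property the abstract refers to.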
Automatic generation of level maps is a popular form of automatic content generation. In this study, a recently developed technique employing the "do what's possible" representation is used to create open-ended level maps. Generation of the map can continue indefinitely, yielding a highly scalable representation. A parameter study is performed to find good parameters for the evolutionary algorithm used to locate high-quality map generators. Variations on the technique are presented, demonstrating its versatility, and an algorithmic variant is given that both improves performance and changes the character of maps located. The ability of the maps to adapt to different regions where the map is permitted to occupy space is also tested.
Dominik Ernst, 2021
Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D 25-point stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
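As a simplified illustration of such data-volume reasoning, the C++ sketch below bounds the memory traffic of a range-four stencil sweep. The grid size and cache assumptions are invented for the example; the estimator described above models the full memory hierarchy rather than these two extremes.

```cpp
#include <cstdio>

// Back-of-the-envelope traffic bounds for a range-four 3D 25-point
// stencil (9 z-layers of the input are live at once).
int main() {
    const long nx = 512, ny = 512, nz = 512;  // grid size (assumed)
    const long cells = nx * ny * nz;
    const double bytes_per_value = 8.0;       // double precision

    // Best case: each input value loaded once, each output stored once.
    double v_ideal = cells * bytes_per_value * 2.0;
    // Worst case: the 9 live z-layers do not fit in cache, so every
    // input value is reloaded from memory 9 times, plus one store.
    double v_worst = cells * bytes_per_value * (9.0 + 1.0);

    std::printf("ideal traffic : %.1f GB\n", v_ideal / 1e9);
    std::printf("worst traffic : %.1f GB\n", v_worst / 1e9);
    // Ranking kernel variants by such estimated volumes lets a code
    // generator shortlist promising candidates before benchmarking.
}
```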
An ever-increasing number of configuration parameters is exposed to system users. Yet many users keep one configuration setting across different workloads, leaving the performance potential of their systems untapped. A good configuration setting can greatly improve the performance of a deployed system under certain workloads. But with tens or hundreds of parameters, deciding which configuration setting leads to the best performance becomes a highly costly task. While this task requires strong expertise in both the system and the application, users commonly lack such expertise. To help users tap the performance potential of systems, we present BestConfig, a system for automatically finding the best configuration setting within a resource limit for a deployed system under a given application workload. BestConfig is designed with an extensible architecture to automate the configuration tuning for general systems. To tune system configurations within a resource limit, we propose the divide-and-diverge sampling method and the recursive bound-and-search algorithm. BestConfig can improve the throughput of Tomcat by 75%, that of Cassandra by 63%, and that of MySQL by 430%, and it reduces the running time of a Hive join job by about 50% and that of a Spark join job by about 80%, solely by configuration adjustment.
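A minimal sketch of the divide-and-diverge sampling idea, assuming two numeric parameters and a budget of five samples, could look like the following. This simplifies BestConfig's actual implementation: each parameter range is divided into intervals, and intervals are combined via random permutations so that few samples still cover every interval of every parameter.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const int k = 5;  // samples, and intervals per parameter
    const std::vector<std::pair<double, double>> ranges = {
        {1.0, 64.0},    // e.g. thread pool size (assumed parameter)
        {16.0, 4096.0}, // e.g. cache size in MB (assumed parameter)
    };
    std::mt19937 rng(42);

    std::vector<std::vector<double>> samples(
        k, std::vector<double>(ranges.size()));
    for (size_t p = 0; p < ranges.size(); ++p) {
        // Random interval permutation: "diverge" across parameters.
        std::vector<int> perm(k);
        std::iota(perm.begin(), perm.end(), 0);
        std::shuffle(perm.begin(), perm.end(), rng);
        double width = (ranges[p].second - ranges[p].first) / k;
        for (int s = 0; s < k; ++s) {
            std::uniform_real_distribution<double> in_interval(
                ranges[p].first + perm[s] * width,
                ranges[p].first + (perm[s] + 1) * width);
            samples[s][p] = in_interval(rng);
        }
    }
    for (int s = 0; s < k; ++s)
        std::printf("sample %d: %.1f, %.1f\n",
                    s, samples[s][0], samples[s][1]);
    // A bound-and-search step would then shrink the ranges around the
    // best-performing sample and repeat, within the sample budget.
}
```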