Subscribe to the gold package and get unlimited access to Shamra Academy

LLAMA: The Low Level Abstraction For Memory Access

315 0 0.0 ( 0 )

Download Cite

Added by Bernhard Manfred Gruber

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Bernhard Manfred Gruber - Guilherme Amadio - Jakob Blomer

Performance

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is therefore ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the C++ library LLAMA, which provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Struct) and SoA (Struct of Array) layouts produce identical code with the same performance characteristics as manually written data structures. LLAMAs layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. The library is fully extensible with third-party allocators and allows users to support their own memory layouts with custom mappings.

rate research

Optimizing Memory-Access Patterns for Deep Learning Accelerators

86 - Hongbin Zheng , Sejong Oh , Huiqing Wang 2020

Deep learning (DL) workloads are moving towards accelerators for faster processing and lower cost. Modern DL accelerators are good at handling the large-scale multiply-accumulate operations that dominate DL workloads; however, it is challenging to make full use of the compute power of an accelerator since the data must be properly staged in a software-managed scratchpad memory. Failing to do so can result in significant performance loss. This paper proposes a systematic approach which leverages the polyhedral model to analyze all operators of a DL model together to minimize the number of memory accesses. Experiments show that our approach can substantially reduce the impact of memory accesses required by common neural-network models on a homegrown AWS machine-learning inference chip named Inferentia, which is available through Amazon EC2 Inf1 instances.

Performance Computation and Language

Pinpointing the Memory Behaviors of DNN Training

75 - Jiansong Li , Xiao Dong , Guangli Li 2021

The training of deep neural networks (DNNs) is usually memory-hungry due to the limited device memory capacity of DNN accelerators. Characterizing the memory behaviors of DNN training is critical to optimize the device memory pressures. In this work, we pinpoint the memory behaviors of each device memory block of GPU during training by instrumenting the memory allocators of the runtime system. Our results show that the memory access patterns of device memory blocks are stable and follow an iterative fashion. These observations are useful for the future optimization of memory-efficient training from the perspective of raw memory access patterns.

Performance Machine Learning

Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency

86 - Hasan Hassan , Gennady Pekhimenko , Nandita Vijaykumar 2018

This paper summarizes the idea of ChargeCache, which was published in HPCA 2016 [51], and examines the works significance and future potential. DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems.

Hardware Architecture

DJXPerf: Identifying Memory Inefficiencies via Object-centric Profiling for Java

74 - Bolun Li , Pengfei Su , Milind Chabbi 2021

Java is the go-to programming language choice for developing scalable enterprise cloud applications. In such systems, even a few percent CPU time savings can offer a significant competitive advantage and cost saving. Although performance tools abound in Java, those that focus on the data locality in the memory hierarchy are rare. In this paper, we present DJXPerf, a lightweight, object-centric memory profiler for Java, which associates memory-hierarchy performance metrics (e.g., cache/TLB misses) with Java objects. DJXPerf uses statistical sampling of hardware performance monitoring counters to attribute metrics to not only source code locations but also Java objects. DJXPerf presents Java object allocation contexts combined with their usage contexts and presents them ordered by the poor locality behaviors. DJXPerfs performance measurement, object attribution, and presentation techniques guide optimizing object allocation, layout, and access patterns. DJXPerf incurs only ~8% runtime overhead and ~5% memory overhead on average, requiring no modifications to hardware, OS, Java virtual machine, or application source code, which makes it attractive to use in production. Guided by DJXPerf, we study and optimize a number of Java and Scala programs, including well-known benchmarks and real-world applications, and demonstrate significant speedups.

Performance Programming Languages

Automatic Timing-Coherent Transactor Generation for Mixed-level Simulations

131 - Li-Chun Chen , Hsin-I Wu , Ren-Song Tsay 2021

In this paper we extend the concept of the traditional transactor, which focuses on correct content transfer, to a new timing-coherent transactor that also accurately aligns the timing of each transaction boundary so that designers can perform precise concurrent system behavior analysis in mixed-abstraction-level system simulations which are essential to increasingly complex system designs. To streamline the process, we also developed an automatic approach for timing-coherent transactor generation. Our approach is actually applied in mixed-level simulations and the results show that it achieves 100% timing accuracy while the conventional approach produces results of 25% to 44% error rate.

Performance

comments

Fetching comments

University of Mosul

Additional details More universities

LLAMA: The Low Level Abstraction For Memory Access

Ask ChatGPT about the research

No Arabic abstract

Read More