HMTT: A Hybrid Hardware/Software Tracing System for Bridging Memory Traces Semantic Gap

Added by Yungang Bao

Publication date 2011

Field: Informatics Engineering

Language: English

Authors Yungang Bao - Jinyong Zhang - Yan Zhu

Hardware Architecture, Performance

Memory trace analysis is an important technology for architecture research, system software (e.g., OS, compiler) optimization, and application performance improvement. Hardware snooping is an effective and efficient approach to monitoring and collecting memory traces. Compared with software-based approaches, however, memory traces collected by hardware-based approaches usually lack semantic information, such as process/function/loop identifiers, virtual addresses, and I/O accesses. In this paper we propose a hybrid hardware/software mechanism that is able to collect memory reference traces as well as semantic information. Based on this mechanism, we designed and implemented a prototype system called HMTT (Hybrid Memory Trace Tool), which adopts a DIMM-snooping mechanism to snoop on the memory bus and a software-controlled tracing mechanism to inject semantic information into the normal memory trace. To the best of our knowledge, the HMTT system is the first hardware tracing system capable of correlating memory traces with high-level events. Comprehensive validations and evaluations show that the HMTT system has the advantages of both hardware approaches (e.g., no distortion or pollution) and software approaches (e.g., flexibility and more information).
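As a rough illustration of what injecting semantic information into a raw trace can look like, the sketch below assumes that software signals a context change by touching a reserved address range, and that the snooper records those references inline with ordinary ones. The reserved range, the record layout, and the names `MARKER_BASE`, `TaggedRef`, and `annotate` are hypothetical and are not taken from HMTT itself.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical reserved range: references that land here are treated as
# software-injected markers rather than ordinary memory traffic.
MARKER_BASE = 0xFFFF_0000
MARKER_LIMIT = 0xFFFF_FFFF


@dataclass(frozen=True)
class TaggedRef:
    address: int
    is_write: bool
    context_id: Optional[int]  # None until the first marker has been seen


def is_marker(address: int) -> bool:
    """Return True if the address falls inside the assumed marker range."""
    return MARKER_BASE <= address <= MARKER_LIMIT


def annotate(trace: Iterable[Tuple[int, bool]]) -> Iterator[TaggedRef]:
    """Tag each ordinary reference with the most recent marker's identifier.

    Each input record is (physical_address, is_write). Marker records are
    consumed and not emitted; the offset from MARKER_BASE is used as the
    identifier (an assumption made for this sketch).
    """
    current: Optional[int] = None
    for address, is_write in trace:
        if is_marker(address):
            current = address - MARKER_BASE
            continue
        yield TaggedRef(address, is_write, current)


raw_trace = [
    (0x1000, False),          # seen before any marker: context unknown
    (MARKER_BASE + 7, True),  # software announces context 7
    (0x2000, True),
    (0x2008, False),
    (MARKER_BASE + 9, True),  # software announces context 9
    (0x3000, False),
]
tagged = list(annotate(raw_trace))
```

In this toy model the hardware side stays passive and only records bus traffic, while the software side decides which high-level events to expose by choosing when to touch the marker range.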

Hardware Memory Management for Future Mobile Hybrid Memory Systems

Fei Wen, Mian Qin, Paul Gratz (2020)

Current mobile applications have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory will incur frequent data swaps between memory and storage, a process that hurts performance, consumes energy, and deteriorates the write endurance of typical flash storage devices. Alternatively, a larger DRAM has higher leakage power and drains the battery faster. Further, DRAM scaling trends make further growth of DRAM in the mobile space prohibitive due to cost. Emerging non-volatile memory (NVM) has the potential to alleviate these issues due to its higher capacity per cost than DRAM and minimal static power. Recently, a wide spectrum of NVM technologies, including phase-change memories (PCM), memristors, and 3D XPoint, have emerged. Despite the mentioned advantages, NVM has longer access latency compared to DRAM, and NVM writes can incur higher latencies and wear costs. Therefore, integration of these new memory technologies in the memory hierarchy requires a fundamental rearchitecting of traditional system designs. In this work, we propose a hardware-accelerated memory manager (HMMU) that addresses both types of memory in a flat address space. We design a set of data placement and data migration policies within this memory manager, such that we may exploit the advantages of each memory technology. By augmenting the system with this HMMU, we reduce the overall memory latency while also reducing writes to the NVM. Experimental results show that our design achieves a 39% reduction in energy consumption with only a 12% performance degradation versus an all-DRAM baseline that is likely untenable in the future.

Hardware Architecture, Operating Systems

Memory Slices: A Modular Building Block for Scalable, Intelligent Memory Systems

Bahar Asgari, Saibal Mukhopadhyay, Sudhakar Yalamanchili (2018)

While reduction in feature size makes computation cheaper in terms of latency, area, and power consumption, performance of emerging data-intensive applications is determined by data movement. These trends have introduced the concept of scalability as reaching a desirable performance per unit cost by using as few units as possible. Many proposals have moved compute closer to the memory. However, these efforts ignored maintaining a balance between the bandwidth and compute rate of an architecture and those of applications, which is a key principle in designing scalable large systems. This paper proposes the use of memory slices, a modular building block for scalable memory systems integrated with compute, in which performance scales with memory size (and volume of data). The slice architecture utilizes a programmable memory interface feeding a systolic compute engine with a high reuse rate. The modularity of slice-based systems is exploited with a partitioning and data mapping strategy across allocated memory slices, where training performance scales with the data size. These features enable shifting most of the pressure to cheap compute units rather than expensive memory accesses or transfers via the interconnection network. An application of memory slices to a scale-out memory system is accelerating the training of recurrent, convolutional, and hybrid neural networks (RNNs and RNNs+CNN) that are forming cloud workloads. The results of our cycle-level simulations show that memory slices exhibit a superlinear speedup when the number of slices increases. Furthermore, memory slices improve power efficiency to 747 GFLOPs/J for training LSTMs. While our current evaluation uses memory slices with 3D packaging, a major value is that slices can also be constructed with a variety of packaging options, for example with DDR-based memory units.

Hardware Architecture, Performance

Fuzzing Hardware Like Software

Timothy Trippel, Kang G. Shin, Alex Chernyakhovsky (2021)

Hardware flaws are permanent and potent: hardware cannot be patched once fabricated, and any flaws may undermine any software executing on top. Consequently, verification time dominates implementation time. The gold standard in hardware Design Verification (DV) is concentrated at two extremes: random dynamic verification and formal verification. Both struggle to root out the subtle flaws in complex hardware that often manifest as security vulnerabilities. The root problem with random verification is its undirected nature, making it inefficient, while formal verification is constrained by the state-space explosion problem, making it infeasible against complex designs. What is needed is a solution that is directed, yet under-constrained. Instead of making incremental improvements to existing DV approaches, we leverage the observation that existing software fuzzers already provide such a solution, and adapt them for hardware DV. Specifically, we translate RTL hardware to a software model and fuzz that model. The central challenge we address is how best to mitigate the differences between the hardware execution model and software execution model. This includes: 1) how to represent test cases, 2) what is the hardware equivalent of a crash, 3) what is an appropriate coverage metric, and 4) how to create a general-purpose fuzzing harness for hardware. To evaluate our approach, we fuzz four IP blocks from Google's OpenTitan SoC. Our experiments reveal a two orders-of-magnitude reduction in run time to achieve Finite State Machine (FSM) coverage over traditional dynamic verification schemes. Moreover, with our design-agnostic harness, we achieve over 88% HDL line coverage in three out of four of our designs -- even without any initial seeds.

Hardware Architecture, Cryptography and Security

Clio: A Hardware-Software Co-Designed Disaggregated Memory System

Zhiyuan Guo, Yizhou Shan, Xuhao Luo (2021)

Memory disaggregation has attracted great attention recently because of its benefits in efficient memory utilization and ease of management. So far, memory disaggregation research has all taken one of two approaches, building/emulating memory nodes with either regular servers or raw memory devices with no processing power. The former incurs higher monetary cost and faces tail latency and scalability limitations, while the latter introduces performance, security, and management problems. Server-based memory nodes and memory nodes with no processing power are two extreme approaches. We seek a sweet spot in the middle by proposing a hardware-based memory disaggregation solution that has the right amount of processing power at memory nodes. Furthermore, we take a clean-slate approach by starting from the requirements of memory disaggregation and designing a memory-disaggregation-native system. We propose a hardware-based disaggregated memory system, Clio, that virtualizes and manages disaggregated memory at the memory node. Clio includes a new hardware-based virtual memory system, a customized network system, and a framework for computation offloading. In building Clio, we not only co-design OS functionalities, hardware architecture, and the network system, but also co-design the compute node and memory node. We prototyped Clio's memory node with FPGA and implemented its client-node functionalities in a user-space library. Clio achieves 100 Gbps throughput and an end-to-end latency of 2.5 us at median and 3.2 us at the 99th percentile. Clio scales much better and has orders of magnitude lower tail latency than RDMA, and it has 1.1x to 3.4x energy savings compared to CPU-based and SmartNIC-based disaggregated memory systems and is 2.7x faster than software-based SmartNIC solutions.

Distributed, Parallel, and Cluster Computing

Polynesia: Enabling Effective Hybrid Transactional/Analytical Databases with Specialized Hardware/Software Co-Design

Amirali Boroumand, Saugata Ghose, Geraldo F. Oliveira (2021)

An exponential growth in data volume, combined with increasing demand for real-time analysis (i.e., using the most recent data), has resulted in the emergence of database systems that concurrently support transactions and data analytics. These hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant drops in transactional (up to 74.6%) and/or analytical (up to 49.8%) throughput compared to performing only transactions or only analytics in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation, and (3) consistency costs. We propose Polynesia, a hardware-software co-designed system for in-memory HTAP databases. Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements custom algorithms and hardware to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement. Our evaluation shows that Polynesia outperforms three state-of-the-art HTAP systems, with average transactional/analytical throughput improvements of 1.70X/3.74X, and reduces energy consumption by 48% over the prior lowest-energy system.

Hardware Architecture, Databases
