
On Transformations of Load-Store Maurer Instruction Set Architecture

Authors: Tie Hou
Publication date: 2009
Language: English





In this paper, we study how certain conditions affect the transformations on the states of the memory of a strict load-store Maurer ISA when half of the data memory serves as part of the operating unit.
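To make the setting concrete, here is a minimal sketch (plain Python; the class and method names are hypothetical, and the model is a deliberate simplification of a strict load-store Maurer ISA). Memory holds 2n cells; the upper half plays the role of the operating unit, loads and stores move single words between the halves, and operations act on the operating unit only.

```python
# Minimal sketch of a strict load/store machine state (names hypothetical,
# not from the paper). Addresses 0..n-1 are plain data memory; addresses
# n..2n-1 double as the operating unit, the only place operations act.

class LoadStoreMachine:
    def __init__(self, n):
        self.n = n
        self.mem = [0] * (2 * n)  # lower half: data memory; upper half: operating unit

    def load(self, a, r):
        """Copy data-memory cell a into operating-unit cell r."""
        self.mem[self.n + r] = self.mem[a]

    def store(self, a, r):
        """Copy operating-unit cell r back to data-memory cell a."""
        self.mem[a] = self.mem[self.n + r]

    def operate(self, f, r1, r2, dst):
        """Apply an operation f to operating-unit cells only."""
        u = self.n
        self.mem[u + dst] = f(self.mem[u + r1], self.mem[u + r2])
```

Any transformation on memory states in this model is a composition of such load, store, and operate steps; the paper asks which transformations remain achievable under its stated conditions.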



Related research

Simple graph algorithms such as PageRank have recently been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this with a simple yet surprisingly powerful observation: operations on sets of vertices, such as intersection or union, form a large part of many complex graph mining algorithms, and can offer rich and simple parallelism at multiple levels. This observation drives our cross-layer design, in which we (1) expose set operations using a novel programming paradigm, (2) express and execute these operations efficiently with carefully designed set-centric ISA extensions called SISA, and (3) use PIM to accelerate SISA instructions. The key design idea is to alleviate the bandwidth needs of SISA instructions by mapping set operations to two types of PIM: in-DRAM bulk bitwise computing for bitvectors representing high-degree vertices, and near-memory logic layers for integer arrays representing low-degree vertices. Set-centric SISA-enhanced algorithms are efficient and outperform hand-tuned baselines, offering more than 10x speedup over the established Bron-Kerbosch algorithm for listing maximal cliques. We deliver more than 10 SISA set-centric algorithm formulations, illustrating SISA's wide applicability.
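The observation is easy to demonstrate in software. The sketch below (plain Python, not SISA itself) shows a set-centric triangle count in which all per-edge work is one set intersection; SISA's point is that exactly this operation can be lowered to one ISA instruction backed by PIM, as an in-DRAM bitwise operation for high-degree vertices or a near-memory integer-array intersection for low-degree ones.

```python
# Set-centric triangle counting: the kernel reduces to set intersections
# over adjacency sets. Plain Python sketch; SISA would execute each
# intersection as a single processing-in-memory-backed instruction.

def triangle_count(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbors of u and v close a triangle over edge (u, v).
                common = adj[u] & adj[v]   # the set operation SISA accelerates
                count += sum(1 for w in common if w > v)
    return count

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
assert triangle_count(adj) == 2   # triangles {0,1,2} and {1,2,3}
```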
We introduce a strict version of the concept of a load/store instruction set architecture in the setting of Maurer machines. We take the view that transformations on the states of a Maurer machine are achieved by applying threads as considered in thread algebra to the Maurer machine. We study how the transformations on the states of the main memory of a strict load/store instruction set architecture that can be achieved by applying threads depend on the operating unit size, the cardinality of the instruction set, and the maximal number of states of the threads.
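Building on the LoadStoreMachine sketch above, the following hedged illustration shows one way to read "applying a thread": a finite-state thread performs one basic instruction per state, receives a Boolean reply, and branches on it (postconditional composition), with 'S' denoting termination. All names are again hypothetical.

```python
# Hypothetical sketch of applying a finite-state thread to the
# LoadStoreMachine defined above. Each thread state holds a basic
# instruction and two successors: proceed left on reply True, right
# on reply False; the state 'S' terminates the thread.

def ld(a, r):
    def instr(m):
        m.load(a, r)
        return True   # loads always reply True in this toy model
    return instr

def st(a, r):
    def instr(m):
        m.store(a, r)
        return True
    return instr

def apply_thread(machine, thread, start):
    """thread: dict state -> (instruction, next_if_true, next_if_false)."""
    state = start
    while state != 'S':
        instr, on_true, on_false = thread[state]
        state = on_true if instr(machine) else on_false
    return machine.mem

# A two-state thread that copies data cell 0 to data cell 1 through
# operating-unit cell 0.
m = LoadStoreMachine(4)
m.mem[0] = 7
apply_thread(m, {0: (ld(0, 0), 1, 1), 1: (st(1, 0), 'S', 'S')}, 0)
assert m.mem[1] == 7
```

How many distinct memory transformations such threads can realize is bounded by exactly the parameters the abstract names: the operating unit size, the cardinality of the instruction set, and the maximal number of thread states.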
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to manually exploit such regularity in the data-flow graphs to generate high-quality ISEs. In this paper, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a number of MediaBench, EEMBC and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average 20x faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified by our technique exhibit 35% more speedup than the genetic solution on a large cryptographic application (AES) by effectively exploiting its regular structure.
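ISEGEN's actual gain function and legality checks are in the paper; the toy skeleton below (hypothetical names throughout) only illustrates the K-L structure of the search: each pass repeatedly toggles the unlocked node with the best gain into or out of the candidate ISE, locks it, and keeps the best prefix of moves seen.

```python
# Toy Kernighan-Lin-style iterative improvement (illustrative only; the
# real ISEGEN gain function and ISE legality constraints are omitted).

def kl_ise_search(nodes, gain, passes=3):
    """nodes: iterable of node ids; gain(selection, n): gain of toggling n."""
    selection = set()
    for _ in range(passes):
        locked, best, best_sel, running = set(), 0, set(selection), 0
        while len(locked) < len(nodes):
            # Greedily pick the unlocked node whose toggle gains the most.
            n = max((x for x in nodes if x not in locked),
                    key=lambda x: gain(selection, x))
            running += gain(selection, n)
            selection ^= {n}          # toggle n in/out of the candidate ISE
            locked.add(n)
            if running > best:        # remember the best prefix of moves
                best, best_sel = running, set(selection)
        selection = best_sel          # revert to pass start if no gain found
    return selection
```

A real gain function would weigh the hardware speedup of covering a node against I/O and convexity constraints; the skeleton shows only the pass/lock/best-prefix discipline that K-L contributes.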
Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIP's effectiveness is having a sufficiently large BTB to accommodate the application's branch working set. In this work, we introduce several optimizations that significantly extend the reach of the BTB within the available storage budget. Our optimizations target nearly every source of storage overhead in each BTB entry; namely, the tag, target address, and size fields. We observe that while most dynamic branch instances have short offsets, a large number of branches have longer offsets or require the use of full target addresses. Based on this insight, we break up the BTB into multiple smaller BTBs, each storing offsets of a different length. This enables a dramatic reduction in storage for target addresses. We further compress tags to 16 bits and avoid the use of the basic-block-oriented BTB advocated in prior FDIP variants. The latter optimization eliminates the need to store the basic block size in each BTB entry. Our final design, called FDIP-X, uses an ensemble of 4 BTBs and always outperforms conventional FDIP with a unified basic-block-oriented BTB for equal storage budgets.
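As a rough illustration of the multi-BTB idea, the sketch below routes each branch to the smallest BTB whose offset field can hold the branch's signed target offset (the field widths are invented for the example, not FDIP-X's actual ones).

```python
# Sketch of FDIP-X-style BTB partitioning (field widths hypothetical).
# Branches with short offsets go to BTBs with narrow offset fields, so
# only far branches pay for a full target address.

def offset_bits(pc, target):
    """Bits needed for the signed offset target - pc (conservative:
    magnitude bits plus one sign bit)."""
    off = target - pc
    return max(off.bit_length() + 1, 1)

BTB_CLASSES = [8, 16, 25, 46]   # offset-field widths per BTB (made up);
                                # the last class stores full addresses.

def pick_btb(pc, target):
    need = offset_bits(pc, target)
    for i, width in enumerate(BTB_CLASSES):
        if need <= width:
            return i            # smallest BTB class that fits the offset
    return len(BTB_CLASSES) - 1 # fall back to the full-address BTB

assert pick_btb(0x1000, 0x1040) == 0   # short forward branch
assert pick_btb(0x1000, 0x91000) == 2  # distant branch needs a wider field
```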
Basic Linear Algebra Subprograms (BLAS) play a key role in high-performance and scientific computing applications. Experimentally, existing multicores and General Purpose Graphics Processing Units (GPGPUs) achieve only 15 to 57% of theoretical peak performance, at 65 W to 240 W respectively, for compute-bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth-bound operations like Single/Double Precision Matrix-Vector Multiplication (XGEMV), the performance is merely 5 to 7% of theoretical peak in multicores and GPGPUs respectively. Achieving performance in BLAS requires moving away from conventional wisdom and evolving toward customized accelerators tailored for BLAS through algorithm-architecture co-design. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-Grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as the platform for our experiments, since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) and perform micro-architectural enhancements in the PE to achieve up to 74% of the PE's theoretical peak performance in DGEMM, 40% in DGEMV, and 20% in double-precision inner product (DDOT). We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a 3-140x performance improvement of the PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGAs, and Nvidia GPGPUs.
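The compute-bound versus bandwidth-bound split follows directly from arithmetic intensity, as the back-of-the-envelope sketch below estimates for double precision, assuming each operand array is streamed from memory exactly once.

```python
# Arithmetic intensity (flops per byte) of double-precision BLAS kernels,
# assuming each n x n operand is moved between memory and compute once.

def dgemm_intensity(n):
    flops = 2 * n**3               # n^2 dot products of length n
    bytes_moved = 3 * n**2 * 8     # read A and B, write C (8-byte doubles)
    return flops / bytes_moved     # grows as n/12: compute bound

def dgemv_intensity(n):
    flops = 2 * n**2
    bytes_moved = (n**2 + 2 * n) * 8   # the matrix dominates the traffic
    return flops / bytes_moved         # tends to 1/4: bandwidth bound

print(dgemm_intensity(1024))  # ~85 flops/byte
print(dgemv_intensity(1024))  # ~0.25 flops/byte
```

GEMM's intensity grows with n while GEMV's is pinned near 0.25 flops per byte, which is why XGEMV saturates memory bandwidth long before peak compute and why a co-designed PE pays off.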
