
On Transformations of Load-Store Maurer Instruction Set Architecture

Authors: Tie Hou
Publication date: 2009
Language: English





In this paper, we study how certain conditions affect the transformations on the states of the memory of a strict load-store Maurer ISA when half of the data memory serves as part of the operating unit.
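To make the setting concrete, here is a minimal sketch (plain Python; the class and method names are hypothetical, and the model is a deliberate simplification of a strict load-store Maurer ISA). Memory holds 2n cells; the upper half plays the role of the operating unit, loads and stores move single words between the halves, and operations act on the operating unit only.

```python
# Minimal sketch of a strict load/store machine state (names hypothetical,
# not from the paper). Addresses 0..n-1 are plain data memory; addresses
# n..2n-1 double as the operating unit, the only place operations act.

class LoadStoreMachine:
    def __init__(self, n):
        self.n = n
        self.mem = [0] * (2 * n)  # lower half: data memory; upper half: operating unit

    def load(self, a, r):
        """Copy data-memory cell a into operating-unit cell r."""
        self.mem[self.n + r] = self.mem[a]

    def store(self, a, r):
        """Copy operating-unit cell r back to data-memory cell a."""
        self.mem[a] = self.mem[self.n + r]

    def operate(self, f, r1, r2, dst):
        """Apply an operation f to operating-unit cells only."""
        u = self.n
        self.mem[u + dst] = f(self.mem[u + r1], self.mem[u + r2])
```

Any transformation on memory states in this model is a composition of such load, store, and operate steps; the paper asks which transformations remain achievable under its stated conditions.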



Related research

Simple graph algorithms such as PageRank have recently been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this with a simple yet surprisingly powerful observation: operations on sets of vertices, such as intersection or union, form a large part of many complex graph mining algorithms, and can offer rich and simple parallelism at multiple levels. This observation drives our cross-layer design, in which we (1) expose set operations using a novel programming paradigm, (2) express and execute these operations efficiently with carefully designed set-centric ISA extensions called SISA, and (3) use PIM to accelerate SISA instructions. The key design idea is to alleviate the bandwidth needs of SISA instructions by mapping set operations to two types of PIM: in-DRAM bulk bitwise computing for bitvectors representing high-degree vertices, and near-memory logic layers for integer arrays representing low-degree vertices. Set-centric SISA-enhanced algorithms are efficient and outperform hand-tuned baselines, offering more than 10x speedup over the established Bron-Kerbosch algorithm for listing maximal cliques. We deliver more than 10 SISA set-centric algorithm formulations, illustrating SISA's wide applicability.
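The observation is easy to demonstrate in software. The sketch below (plain Python, not SISA itself) shows a set-centric triangle count in which all per-edge work is one set intersection; SISA's point is that exactly this operation can be lowered to one ISA instruction backed by PIM, as an in-DRAM bitwise operation for high-degree vertices or a near-memory integer-array intersection for low-degree ones.

```python
# Set-centric triangle counting: the kernel reduces to set intersections
# over adjacency sets. Plain Python sketch; SISA would execute each
# intersection as a single processing-in-memory-backed instruction.

def triangle_count(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbors of u and v close a triangle over edge (u, v).
                common = adj[u] & adj[v]   # the set operation SISA accelerates
                count += sum(1 for w in common if w > v)
    return count

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
assert triangle_count(adj) == 2   # triangles {0,1,2} and {1,2,3}
```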
We introduce a strict version of the concept of a load/store instruction set architecture in the setting of Maurer machines. We take the view that transformations on the states of a Maurer machine are achieved by applying threads as considered in thread algebra to the Maurer machine. We study how the transformations on the states of the main memory of a strict load/store instruction set architecture that can be achieved by applying threads depend on the operating unit size, the cardinality of the instruction set, and the maximal number of states of the threads.
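Building on the LoadStoreMachine sketch above, the following hedged illustration shows one way to read "applying a thread": a finite-state thread performs one basic instruction per state, receives a Boolean reply, and branches on it (postconditional composition), with 'S' denoting termination. All names are again hypothetical.

```python
# Hypothetical sketch of applying a finite-state thread to the
# LoadStoreMachine defined above. Each thread state holds a basic
# instruction and two successors: proceed left on reply True, right
# on reply False; the state 'S' terminates the thread.

def ld(a, r):
    def instr(m):
        m.load(a, r)
        return True   # loads always reply True in this toy model
    return instr

def st(a, r):
    def instr(m):
        m.store(a, r)
        return True
    return instr

def apply_thread(machine, thread, start):
    """thread: dict state -> (instruction, next_if_true, next_if_false)."""
    state = start
    while state != 'S':
        instr, on_true, on_false = thread[state]
        state = on_true if instr(machine) else on_false
    return machine.mem

# A two-state thread that copies data cell 0 to data cell 1 through
# operating-unit cell 0.
m = LoadStoreMachine(4)
m.mem[0] = 7
apply_thread(m, {0: (ld(0, 0), 1, 1), 1: (st(1, 0), 'S', 'S')}, 0)
assert m.mem[1] == 7
```

How many distinct memory transformations such threads can realize is bounded by exactly the parameters the abstract names: the operating unit size, the cardinality of the instruction set, and the maximal number of thread states.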
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers are able to manually exploit such regularity in the data-flow graphs to generate high-quality ISEs. In this paper, we present ISEGEN, an approach that identifies high-quality ISEs by iterative improvement following the basic principles of the well-known Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a number of MediaBench, EEMBC and cryptographic applications show that our approach matches the quality of the optimal solution obtained by exhaustive search. We also show that our ISEGEN technique is on average 20x faster than a genetic formulation that generates equivalent solutions. Furthermore, the ISEs identified by our technique exhibit 35% more speedup than the genetic solution on a large cryptographic application (AES) by effectively exploiting its regular structure.
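ISEGEN's actual gain function and legality checks are in the paper; the toy skeleton below (hypothetical names throughout) only illustrates the K-L structure of the search: each pass repeatedly toggles the unlocked node with the best gain into or out of the candidate ISE, locks it, and keeps the best prefix of moves seen.

```python
# Toy Kernighan-Lin-style iterative improvement (illustrative only; the
# real ISEGEN gain function and ISE legality constraints are omitted).

def kl_ise_search(nodes, gain, passes=3):
    """nodes: iterable of node ids; gain(selection, n): gain of toggling n."""
    selection = set()
    for _ in range(passes):
        locked, best, best_sel, running = set(), 0, set(selection), 0
        while len(locked) < len(nodes):
            # Greedily pick the unlocked node whose toggle gains the most.
            n = max((x for x in nodes if x not in locked),
                    key=lambda x: gain(selection, x))
            running += gain(selection, n)
            selection ^= {n}          # toggle n in/out of the candidate ISE
            locked.add(n)
            if running > best:        # remember the best prefix of moves
                best, best_sel = running, set(selection)
        selection = best_sel          # revert to pass start if no gain found
    return selection
```

A real gain function would weigh the hardware speedup of covering a node against I/O and convexity constraints; the skeleton shows only the pass/lock/best-prefix discipline that K-L contributes.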
Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIP's effectiveness is having a sufficiently large BTB to accommodate the application's branch working set. In this work, we introduce several optimizations that significantly extend the reach of the BTB within the available storage budget. Our optimizations target nearly every source of storage overhead in each BTB entry; namely, the tag, target address, and size fields. We observe that while most dynamic branch instances have short offsets, a large number of branches have longer offsets or require the use of full target addresses. Based on this insight, we break up the BTB into multiple smaller BTBs, each storing offsets of a different length. This enables a dramatic reduction in storage for target addresses. We further compress tags to 16 bits and avoid the use of the basic-block-oriented BTB advocated in prior FDIP variants. The latter optimization eliminates the need to store the basic block size in each BTB entry. Our final design, called FDIP-X, uses an ensemble of 4 BTBs and always outperforms conventional FDIP with a unified basic-block-oriented BTB for equal storage budgets.
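As a rough illustration of the multi-BTB idea, the sketch below routes each branch to the smallest BTB whose offset field can hold the branch's signed target offset (the field widths are invented for the example, not FDIP-X's actual ones).

```python
# Sketch of FDIP-X-style BTB partitioning (field widths hypothetical).
# Branches with short offsets go to BTBs with narrow offset fields, so
# only far branches pay for a full target address.

def offset_bits(pc, target):
    """Bits needed for the signed offset target - pc (conservative:
    magnitude bits plus one sign bit)."""
    off = target - pc
    return max(off.bit_length() + 1, 1)

BTB_CLASSES = [8, 16, 25, 46]   # offset-field widths per BTB (made up);
                                # the last class stores full addresses.

def pick_btb(pc, target):
    need = offset_bits(pc, target)
    for i, width in enumerate(BTB_CLASSES):
        if need <= width:
            return i            # smallest BTB class that fits the offset
    return len(BTB_CLASSES) - 1 # fall back to the full-address BTB

assert pick_btb(0x1000, 0x1040) == 0   # short forward branch
assert pick_btb(0x1000, 0x91000) == 2  # distant branch needs a wider field
```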
Basic Linear Algebra Subprograms (BLAS) play a key role in high-performance and scientific computing applications. Experimentally, existing multicores and General Purpose Graphics Processing Units (GPGPUs) achieve only 15 to 57% of theoretical peak performance, at 65 W to 240 W respectively, for compute-bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth-bound operations like Single/Double Precision Matrix-Vector Multiplication (XGEMV), the performance is merely 5 to 7% of theoretical peak in multicores and GPGPUs respectively. Achieving performance in BLAS requires moving away from conventional wisdom and evolving toward customized accelerators tailored for BLAS through algorithm-architecture co-design. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-Grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as the platform for our experiments, since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) and perform micro-architectural enhancements in the PE to achieve up to 74% of the PE's theoretical peak performance in DGEMM, 40% in DGEMV, and 20% in double-precision inner product (DDOT). We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a 3-140x performance improvement of the PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGAs, and Nvidia GPGPUs.
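The compute-bound versus bandwidth-bound split follows directly from arithmetic intensity, as the back-of-the-envelope sketch below estimates for double precision, assuming each operand array is streamed from memory exactly once.

```python
# Arithmetic intensity (flops per byte) of double-precision BLAS kernels,
# assuming each n x n operand is moved between memory and compute once.

def dgemm_intensity(n):
    flops = 2 * n**3               # n^2 dot products of length n
    bytes_moved = 3 * n**2 * 8     # read A and B, write C (8-byte doubles)
    return flops / bytes_moved     # grows as n/12: compute bound

def dgemv_intensity(n):
    flops = 2 * n**2
    bytes_moved = (n**2 + 2 * n) * 8   # the matrix dominates the traffic
    return flops / bytes_moved         # tends to 1/4: bandwidth bound

print(dgemm_intensity(1024))  # ~85 flops/byte
print(dgemv_intensity(1024))  # ~0.25 flops/byte
```

GEMM's intensity grows with n while GEMV's is pinned near 0.25 flops per byte, which is why XGEMV saturates memory bandwidth long before peak compute and why a co-designed PE pays off.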
