ﻻ يوجد ملخص باللغة العربية
Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIPs effectiveness is having a sufficiently large BTB to accommodate the applications branch working set. In this work, we introduce several optimizations that significantly extend the reach of the BTB within the available storage budget. Our optimizations target nearly every source of storage overhead in each BTB entry; namely, the tag, target address, and size fields. We observe that while most dynamic branch instances have short offsets, a large number of branches has longer offsets or requires the use of full target addresses. Based on this insight, we break-up the BTB into multiple smaller BTBs, each storing offsets of different length. This enables a dramatic reduction in storage for target addresses. We further compress tags to 16 bits and avoid the use of the basic-block-oriented BTB advocated in prior FDIP variants. The latter optimization eliminates the need to store the basic block size in each BTB entry. Our final design, called FDIP-X, uses an ensemble of 4 BTBs and always outperforms conventional FDIP with a unified basic-block-oriented BTB for equal storage budgets.
We propose an approach to data memory prefetching which augments the standard prefetch buffer with selection criteria based on performance and usage pattern of a given instruction. This approach is built on top of a pattern matching based prefetcher,
The Sphynx project was an exploratory study to discover what might be done to improve the heavy replication of in- structions in independent instruction caches for a massively parallel machine where a single program is executing across all of the cor
In this paper, we study how certain conditions can affect the transformations on the states of the memory of a strict load-store Maurer ISA, when half of the data memory serves as the part of the operating unit.
Customization of processor architectures through Instruction Set Extensions (ISEs) is an effective way to meet the growing performance demands of embedded applications. A high-quality ISE generation approach needs to obtain results close to those ach
Modern Systems-on-Chip (SoC) designs are increasingly heterogeneous and contain specialized semi-programmable accelerators in addition to programmable processors. In contrast to the pre-accelerator era, when the ISA played an important role in verifi