ﻻ يوجد ملخص باللغة العربية
The Sphynx project was an exploratory study to discover what might be done to improve the heavy replication of in- structions in independent instruction caches for a massively parallel machine where a single program is executing across all of the cores. While a machine with only many cores (fewer than 50) might not have any issues replicating the instructions for each core, as we approach the era where thousands of cores can be placed on one chip, the overhead of instruction replication may become unacceptably large. We believe that a large amount of sharing should be possible when the ma- chine is configured for all of the threads to issue from the same set of instructions. We propose a technique that allows sharing an instruction cache among a number of independent processor cores to allow for inter-thread sharing and reuse of instruction memory. While we do not have test cases to demonstrate the potential magnitude of performance gains that could be achieved, the potential for sharing reduces the die area required for instruction storage on chip.
Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIPs effectiveness is having a sufficiently large BTB to accommodate the applications branch working set. In this wor
In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important but complicated. As memory demands grow and data movement overheads increasingly limit performa
Modern Systems-on-Chip (SoC) designs are increasingly heterogeneous and contain specialized semi-programmable accelerators in addition to programmable processors. In contrast to the pre-accelerator era, when the ISA played an important role in verifi
The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count be
Cache coherence protocols such as MESI that use writer-initiated invalidation have high complexity and sometimes have poor performance and energy usage, especially under false sharing. Such protocols require numerous transient states, a shared direct