As memory capacity has outstripped TLB coverage, large data applications suffer from frequent page table walks. We investigate two complementary techniques for addressing this cost: reducing the number of accesses required and reducing the latency of each access. The first approach is accomplished by opportunistically flattening the page table: merging two levels of traditional 4KB page table nodes into a single 2MB node, thereby reducing the table's depth and the number of indirections required to search it. The second is accomplished by biasing the cache replacement algorithm to keep page table entries during periods of high TLB miss rates, as these periods also see high data miss rates and are therefore more likely to benefit from having the smaller page table in the cache than to suffer from increased data cache misses. We evaluate these approaches for both native and virtualized systems and across a range of realistic memory fragmentation scenarios, describe the limited changes needed in our kernel implementation and hardware design, identify and address challenges related to self-referencing page tables and kernel memory allocation, and compare results across server and mobile systems using both academic and industrial simulators for robustness. We find that flattening does reduce the number of accesses required on a page walk (to 1.0), but its performance impact (+2.3%) is small due to Page Walker Caches (already 1.5 accesses). Prioritizing caching has a larger effect (+6.8%), and the combination improves performance by +9.2%. Flattening is more effective on virtualized systems (4.4 to 2.8 accesses, +7.1% performance), due to 2D page walks. By combining the two techniques we demonstrate a state-of-the-art +14.0% performance gain and -8.7% dynamic cache energy and -4.7% dynamic DRAM energy for virtualized execution with very simple hardware and software changes.
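To make the depth argument in the abstract above concrete, the following minimal sketch counts the memory accesses of a single page walk in the worst case, assuming no TLB or Page Walker Cache hits. The function names are hypothetical, and the flattened case assumes every pair of levels has been merged, which the abstract describes as opportunistic rather than guaranteed. The sketch does not reproduce the measured averages quoted above (1.5, 1.0, 4.4, 2.8), which include caching effects.

```python
def native_walk_accesses(levels: int) -> int:
    """Uncached native walk: one memory access per page table level."""
    return levels


def nested_walk_accesses(guest_levels: int, host_levels: int) -> int:
    """Uncached 2D (nested) walk.

    Each guest level holds a guest-physical pointer, so reading it costs a
    full host walk plus the guest entry itself. The final guest-physical
    address of the data needs one more host walk.
    """
    per_guest_level = host_levels + 1
    return guest_levels * per_guest_level + host_levels


if __name__ == "__main__":
    # Conventional 4-level tables versus tables flattened to 2 levels
    # (assumption: all levels merged pairwise).
    for levels in (4, 2):
        native = native_walk_accesses(levels)
        nested = nested_walk_accesses(levels, levels)
        print(f"{levels} levels: native={native}, nested={nested}")
```

Under these assumptions the nested count grows roughly with the product of the guest and host depths, which is consistent with the abstract's observation that flattening helps virtualized systems more than native ones.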
Asymptotic Causal Diamonds (ACDs) are a natural flat space analogue of AdS causal wedges, and it has been argued previously that they may be useful for understanding bulk locality in flat space holography. In this paper, we use ACD-inspired ideas to argue that there exist natural candidates for Quantum Extremal Surfaces (QES) and entanglement wedges in flat space, anchored to the conformal boundary. When there is a holographic screen at finite radius, we can also associate entanglement wedges and entropies to screen sub-regions, with the system naturally coupled to a sink. The screen and the boundary provide two complementary ways of formulating the information paradox. We explain how they are related and show that in both formulations, the flat space entanglement wedge undergoes a phase transition at the Page time in the background of an evaporating Schwarzschild black hole. Our results closely parallel recent observations in AdS, and reproduce the Page curve. The fact that a variation of the argument can be phrased directly in flat space, without reliance on AdS, is a strong indication that entanglement wedge phase transitions may be key to the information paradox in flat space as well. Along the way, we give evidence that the entanglement entropy of an ACD is a well-defined, and likely instructive, quantity. We further note that the picture of the sink we present here may have an understanding in terms of sub-matrix deconfinement in a large-$N$ setting.
Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) have been proposed to exploit both the capacity advantage of NVM and the latency and dynamic energy advantages of DRAM. An important problem for such systems is how to place data between DRAM and NVM to improve system performance. In this paper, we devise the first mechanism, called UBM (page Utility Based hybrid Memory management), that systematically estimates the system performance benefit of placing a page in DRAM versus NVM and uses this estimate to guide data placement. UBM's estimation method consists of two major components. First, it estimates how much an application's stall time can be reduced if the accessed page is placed in DRAM. To do this, UBM comprehensively considers access frequency, row buffer locality, and memory level parallelism (MLP) to estimate the application's stall time reduction. Second, UBM estimates how much each application's stall time reduction contributes to overall system performance. Based on this estimation method, UBM can determine and place the most critical data in DRAM to directly optimize system performance. Experimental results show that UBM improves system performance by 14% on average (and up to 39%) compared to the best of three state-of-the-art mechanisms for a large number of data-intensive workloads from the SPEC CPU2006 and Yahoo Cloud Serving Benchmark (YCSB) suites.
Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM), which enables computation inside the memory chip. However, existing PiM architectures often lack support for complex operations, since supporting these operations increases design complexity, chip area, and power consumption. We introduce pLUTo (processing-in-memory with lookup table [LUT] operations), a new DRAM substrate that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables the efficient execution of complex operations in-memory, which has been a long-standing challenge in the domain of PiM. When running a state-of-the-art binary neural network in a single DRAM subarray, pLUTo outperforms the baseline CPU and GPU implementations by $33\times$ and $8\times$, respectively, while simultaneously achieving energy savings of $110\times$ and $80\times$.
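The general idea of replacing a complex operation with a table lookup, which the abstract above builds on, can be illustrated in software. The sketch below is only an analogy: it says nothing about how pLUTo stores or queries tables inside DRAM, and the helper names `build_lut` and `lut_query` are hypothetical. Population count is used as the example operation because it commonly appears in binary neural network inference; that choice is an assumption of this sketch, not a detail taken from the abstract.

```python
def build_lut(func, input_bits: int) -> list:
    """Precompute func for every possible input of the given bit width."""
    return [func(x) for x in range(1 << input_bits)]


def lut_query(lut: list, inputs: list) -> list:
    """Answer a batch of queries by indexing the table, with no recomputation."""
    return [lut[x] for x in inputs]


# Example operation (assumption): 8-bit population count.
popcount_lut = build_lut(lambda x: bin(x).count("1"), 8)

inputs = [0b00000000, 0b10101010, 0b11111111, 0b00010001]
results = lut_query(popcount_lut, inputs)
print(results)
```

The table costs 2^n entries for n input bits, so this approach trades storage for computation, which is why a dense storage substrate is attractive for it.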
Rowhammer attacks that corrupt level-1 page tables to gain kernel privilege are the most detrimental to system security and hard to mitigate. However, recently proposed software-only mitigations are not effective against such kernel privilege escalation attacks. In this paper, we propose an effective and practical software-only defense, called SoftTRR, to protect page tables from all existing rowhammer attacks on x86. The key idea of SoftTRR is to refresh the rows occupied by page tables when a suspicious rowhammer activity is detected. SoftTRR is motivated by DRAM-chip-based target row refresh (ChipTRR) but eliminates its main security limitation (i.e., ChipTRR tracks a limited number of rows and thus can be bypassed by many-sided hammer). Specifically, SoftTRR protects an unlimited number of page tables by tracking memory accesses to the rows that are in close proximity to page-table rows and refreshing the page-table rows once the tracked access count exceeds a pre-defined threshold. We implement a prototype of SoftTRR as a loadable kernel module, and evaluate its security effectiveness, performance overhead, and memory consumption. The experimental results show that SoftTRR protects page tables from real-world rowhammer attacks and incurs small performance overhead as well as memory cost.
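The tracking policy described in the abstract above (count accesses to rows near page-table rows, refresh the page-table rows once a count exceeds a threshold) can be sketched as follows. This is a minimal user-space model under stated assumptions, not the SoftTRR kernel module: the class name, the integer row indices, the use of index distance as physical adjacency, and the refresh log are all hypothetical.

```python
from collections import defaultdict


class AdjacentRowTracker:
    """Toy model: rows are integers, adjacency is index distance (assumption)."""

    def __init__(self, page_table_rows, threshold: int, distance: int = 1):
        self.page_table_rows = set(page_table_rows)
        self.threshold = threshold
        self.distance = distance
        self.counts = defaultdict(int)  # per-row access counters
        self.refreshed = []             # log standing in for a real row refresh

    def _nearby_page_table_rows(self, row: int) -> list:
        return sorted(
            r for r in self.page_table_rows
            if r != row and abs(r - row) <= self.distance
        )

    def on_access(self, row: int) -> list:
        """Record one access; return the page-table rows refreshed, if any."""
        victims = self._nearby_page_table_rows(row)
        if not victims:
            return []  # row is not close to any page table, so it is not tracked
        self.counts[row] += 1
        if self.counts[row] <= self.threshold:
            return []
        # Count exceeded the threshold: refresh nearby page-table rows, reset.
        self.counts[row] = 0
        self.refreshed.extend(victims)
        return victims


tracker = AdjacentRowTracker(page_table_rows=[10], threshold=3)
events = [tracker.on_access(11) for _ in range(4)]
print(events)
```

Because the counters are kept only for rows near page tables, the tracked state in this model scales with the number of page-table rows rather than with a fixed hardware table size.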
Quantum interference on the kagome lattice generates electronic bands with narrow bandwidth, called flat bands. Crystal structures incorporating this lattice can host strong electron correlations with non-standard ingredients, but only if these bands lie at the Fermi level. In the six compounds with the CoSn structure type (FeGe, FeSn, CoSn, NiIn, RhPb, and PtTl) the transition metals form a kagome lattice. The two iron variants are robust antiferromagnets, so we focus on the latter four and investigate their thermodynamic and transport properties. We consider these results and calculated band structures to locate and characterize the flat bands in these materials. We propose that CoSn and RhPb deserve the community's attention for exploring flat band physics.