No Arabic abstract
Non-volatile random access memory (NVRAM) offers byte-addressable persistence at speeds comparable to DRAM. However, with caches remaining volatile, automatic cache evictions can reorder updates to memory, potentially leaving persistent memory in an inconsistent state upon a system crash. Flush and fence instructions can be used to force ordering among updates, but are expensive. This has motivated significant work studying how to write correct and efficient persistent programs for NVRAM. In this paper, we present FliT, a C++ library that facilitates writing efficient persistent code. Using the librarys default mode makes any linearizable data structure durable with minimal changes to the code. FliT avoids many redundant flush instructions by using a novel algorithm to track dirty cache lines. The FliT library also allows for extra optimizations, but achieves good performance even in its default setting. To describe the FliT librarys capabilities and guarantees, we define a persistent programming interface, called the P-V Interface, which FliT implements. The P-V Interface captures the expected behavior of code in which some instructions effects are persisted and some are not. We show that the interface captures the desired semantics of many practical algorithms in the literature. We apply the FliT library to four different persistent data structures, and show that across several workloads, persistence implementations, and data structure sizes, the FliT library always improves operation throughput, by at least $2.1times$ over a naive implementation in all but one workload.
Gradecast is a simple three-round algorithm presented by Feldman and Micali. The current work presents a very simple algorithm that utilized Gradecast to achieve Byzantine agreement. Two small variations of the presented algorithm lead to improved algorithms for solving the Approximate agreement problem and the Multi-consensus problem. An optimal approximate agreement algorithm was presented by Fekete, which supports up to 1/4 n Byzantine nodes and has message complexity of O(n^k), where n is the number of nodes and k is the number of rounds. Our solution to the approximate agreement problem is optimal, simple and reduces the message complexity to O(k * n^3), while supporting up to 1/3 n Byzantine nodes. Multi consensus was first presented by Bar-Noy et al. It consists of consecutive executions of l Byzantine consensuses. Bar-Noy et al., show an optimal amortized solution to this problem, assuming that all nodes start each consensus instance at the same time, a property that cannot be guaranteed with early stopping. Our solution is simpler, preserves round complexity optimality, allows early stopping and does not require synchronized starts of the consensus instances.
A surge in artificial intelligence and autonomous technologies have increased the demand toward enhanced edge-processing capabilities. Computational complexity and size of state-of-the-art Deep Neural Networks (DNNs) are rising exponentially with diverse network models and larger datasets. This growth limits the performance scaling and energy-efficiency of both distributed and embedded inference platforms. Embedded designs at the edge are constrained by energy and speed limitations of available processor substrates and processor to memory communication required to fetch the model coefficients. While many hardware accelerator and network deployment frameworks have been in development, a framework is needed to allow the variety of existing architectures, and those in development, to be expressed in critical parts of the flow that perform various optimization steps. Moreover, premature architecture-blind network selection and optimization diminish the effectiveness of schedule optimizations and hardware-specific mappings. In this paper, we address these issues by creating a cross-layer software-hardware design framework that encompasses network training and model compression that is aware of and tuned to the underlying hardware architecture. This approach leverages the available degrees of DNN structure and sparsity to create a converged network that can be partitioned and efficiently scheduled on the target hardware platform, minimizing data movement, and improving the overall throughput and energy. To further streamline the design, we leverage the high-level, flexible SoC generator platform based on RISC-V ROCC framework. This integration allows seamless extensions of the RISC-V instruction set and Chisel-based rapid generator design. Utilizing this approach, we implemented a silicon prototype in a 16 nm TSMC process node achieving record processing efficiency of up to 18 TOPS/W.
It was recently shown that a version of the greedy algorithm gives a construction of fault-tolerant spanners that is size-optimal, at least for vertex faults. However, the algorithm to construct this spanner is not polynomial-time, and the best-known polynomial time algorithm is significantly suboptimal. Designing a polynomial-time algorithm to construct (near-)optimal fault-tolerant spanners was given as an explicit open problem in the two most recent papers on fault-tolerant spanners ([Bodwin, Dinitz, Parter, Vassilevka Williams SODA 18] and [Bodwin, Patel PODC 19]). We give a surprisingly simple algorithm which runs in polynomial time and constructs fault-tolerant spanners that are extremely close to optimal (off by only a linear factor in the stretch) by modifying the greedy algorithm to run in polynomial time. To complement this result, we also give simple distributed constructions in both the LOCAL and CONGEST models.
In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.
The problem of attaining energy efficiency in distributed systems is of importance, but a general, non-domain-specific theory of energy-minimal scheduling is far from developed. In this paper, we classify the problems of energy-minimal scheduling and present theoretical foundations of the same. We derive results concerning energy-minimal scheduling of independent jobs in a distributed system with functionally similar machines with different working and idle power ratings. The machines considered in our system can have identical as well as different speeds. If the jobs can be divided into arbitrary parts, we show that the minimum-energy schedule can be generated in linear time and give exact scheduling algorithms. For the cases where jobs are non-divisible, we prove that the scheduling problems are NP-hard and also give approximation algorithms for the same along with their bounds.