No Arabic abstract
With the rapid development of scientific computation, more and more researchers and developers are committed to implementing various workloads/operations on different devices. Among all these devices, NVIDIA GPU is the most popular choice due to its comprehensive documentation and excellent development tools. As a result, there are abundant resources for hand-writing high-performance CUDA codes. However, CUDA is mainly supported by only commercial products and there has been no support for open-source H/W platforms. RISC-V is the most popular choice for hardware ISA, thanks to its elegant design and open-source license. In this project, we aim to utilize these existing CUDA codes with RISC-V devices. More specifically, we design and implement a pipeline that can execute CUDA source code on an RISC-V GPU architecture. We have succeeded in executing CUDA kernels with several important features, like multi-thread and atomic instructions, on an RISC-V GPU architecture.
RISC-V is a relatively new, open instruction set architecture with a mature ecosystem and an official formal machine-readable specification. It is therefore a promising playground for formal-methods research. However, we observe that different formal-methods research projects are interested in different aspects of RISC-V and want to simplify, abstract, approximate, or ignore the other aspects. Often, they also require different encoding styles, resulting in each project starting a new formalization from-scratch. We set out to identify the commonalities between projects and to represent the RISC-V specification as a program with holes that can be instantiated differently by different projects. Our formalization of the RISC-V specification is written in Haskell and leverages existing tools rather than requiring new domain-specific tools, contrary to other approaches. To our knowledge, it is the first RISC-V specification able to serve as the interface between a processor-correctness proof and a compiler-correctness proof, while supporting several other projects with diverging requirements as well.
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there does not currently exist an efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmers burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not checkpointed) within a single processs memory. This eliminates the high overhead of inter-process communication in earlier approaches, and has fewer limitations.
Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 G keys/s (and 2.1 Gpair/s).
We introduce a type and effect system, for an imperative object calculus, which infers sharing possibly introduced by the evaluation of an expression, represented as an equivalence relation among its free variables. This direct representation of sharing effects at the syntactic level allows us to express in a natural way, and to generalize, widely-used notions in literature, notably uniqueness and borrowing. Moreover, the calculus is pure in the sense that reduction is defined on language terms only, since they directly encode store. The advantage of this non-standard execution model with respect to a behaviourally equivalent standard model using a global auxiliary structure is that reachability relations among references are partly encoded by scoping.
As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a constraint for concurrent programs that require relative forward progress to terminate. Using this constraint, we synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, this allows us to determine the expected status of each litmus test -- i.e. whether it is guaranteed eventual termination -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, an intuitive progress model defined by prior work and hypothesized to describe the workgroup schedulers of existing GPUs.