No Arabic abstract
The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configurations performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernels performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernels statistical profile. This strategy is effective in the presence of frequently-recurring computation and communication kernels, which is characteristic to algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
Partial evaluation has recently been used for processing SPARQL queries over a large resource description framework (RDF) graph in a distributed environment. However, the previous approach is inefficient when dealing with complex queries. In this study, we further improve the partial evaluation and assembly framework for answering SPARQL queries over a distributed RDF graph, while providing performance guarantees. Our key idea is to explore the intrinsic structural characteristics of partial matches to filter out irrelevant partial results, while providing performance guarantees on a network trace (data shipment) or the computational cost (response time). We also propose an efficient assembly algorithm to utilize the characteristics of partial matches to merge them and form final results. To improve the efficiency of finding partial matches further, we propose an optimization that communicates variables candidates among sites to avoid redundant computations. In addition, although our approach is partitioning-tolerant, different partitioning strategies result in different performances, and we evaluate different partitioning strategies for our approach. Experiments over both real and synthetic RDF datasets confirm the superiority of our approach.
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
We consider an extension to the geometric amoebot model that allows amoebots to form so-called emph{circuits}. Given a connected amoebot structure, a circuit is a subgraph formed by the amoebots that permits the instant transmission of signals. We show that such an extension allows for significantly faster solutions to a variety of problems related to programmable matter. More specifically, we provide algorithms for leader election, consensus, compass alignment, chirality agreement and shape recognition. Leader election can be solved in $Theta(log n)$ rounds, w.h.p., consensus in $O(1)$ rounds and both, compass alignment and chirality agreement, can be solved in $O(log n)$ rounds, w.h.p. For shape recognition, the amoebots have to decide whether the amoebot structure forms a particular shape. We show how the amoebots can detect a parallelogram with linear and polynomial side ratio within $Theta(log{n})$ rounds, w.h.p. Finally, we show that the amoebots can detect a shape composed of triangles within $O(1)$ rounds, w.h.p.
This paper proposes an efficient framework to execute Smart Contract Transactions (SCTs) concurrently based on object semantics, using optimistic Single-Version Object-based Software Transactional Memory Systems (SVOSTMs) and Multi-Version OSTMs (MVOSTMs). In our framework, a multi-threaded miner constructs a Block Graph (BG), capturing the object-conflicts relations between SCTs, and stores it in the block. Later, validators re-execute the same SCTs concurrently and deterministically relying on this BG. A malicious miner can modify the BG to harm the blockchain, e.g., to cause double-spending. To identify malicious miners, we propose Smart Multi-threaded Validator (SMV). Experimental analysis shows that the proposed multi-threaded miner and validator achieve significant performance gains over state-of-the-art SCT execution framework.