
Accelerating Amoebots via Reconfigurable Circuits

Publication date: 2021
Language: English





We consider an extension to the geometric amoebot model that allows amoebots to form so-called circuits. Given a connected amoebot structure, a circuit is a subgraph formed by the amoebots that permits the instant transmission of signals. We show that such an extension allows for significantly faster solutions to a variety of problems related to programmable matter. More specifically, we provide algorithms for leader election, consensus, compass alignment, chirality agreement and shape recognition. Leader election can be solved in $\Theta(\log n)$ rounds, w.h.p., consensus in $O(1)$ rounds, and both compass alignment and chirality agreement can be solved in $O(\log n)$ rounds, w.h.p. For shape recognition, the amoebots have to decide whether the amoebot structure forms a particular shape. We show how the amoebots can detect a parallelogram with linear and polynomial side ratio within $\Theta(\log n)$ rounds, w.h.p. Finally, we show that the amoebots can detect a shape composed of triangles within $O(1)$ rounds, w.h.p.
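To make the role of a circuit concrete, the following is a minimal simulation sketch of coin-toss leader election over a single shared circuit. It is an illustration under stated assumptions, not the algorithm from the paper: it assumes all amoebots sit on one global circuit, that a beep sent in a round is heard by every amoebot in that same round, and that the simulator may stop as soon as one candidate remains (real amoebots cannot observe that count and would need a separate termination rule). The function name `elect_leader` and its parameters are hypothetical.

```python
import random


def elect_leader(n, seed=None):
    """Simulate coin-toss leader election over one global circuit.

    Assumption: every amoebot is connected to the same circuit, so a beep
    sent in a round is heard by all amoebots in that round.
    Returns a tuple (leader_id, rounds_used).
    """
    if n < 1:
        raise ValueError("need at least one amoebot")
    rng = random.Random(seed)
    candidates = list(range(n))
    rounds = 0
    # Simplification: the simulator stops once a single candidate is left.
    # Real amoebots cannot read this count directly.
    while len(candidates) > 1:
        rounds += 1
        # Each remaining candidate tosses a fair coin.
        heads = [c for c in candidates if rng.random() < 0.5]
        # A beep on the circuit behaves like a logical OR over all senders.
        beep_heard = len(heads) > 0
        if beep_heard:
            # Candidates that tossed tails hear the beep and withdraw.
            candidates = heads
        # If nobody beeped, all candidates stay and the round is repeated.
    return candidates[0], rounds


if __name__ == "__main__":
    leader, rounds = elect_leader(1024, seed=1)
    print(f"leader={leader} after {rounds} rounds")
```

In this sketch, each round in which a beep is heard removes about half of the remaining candidates in expectation, which is the intuition behind a round count that grows logarithmically in $n$.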




Read More

We show that piezoelectric strain actuation of acoustomechanical interactions can produce large phase velocity changes in an existing quantum phononic platform: aluminum nitride on suspended silicon. Using finite element analysis, we demonstrate a piezo-acoustomechanical phase shifter waveguide capable of producing $\pm\pi$ phase shifts for GHz-frequency phonons in tens of microns with tens of volts applied. Then, using the phase shifter as a building block, we demonstrate several phononic integrated circuit elements useful for quantum information processing. In particular, we show how to construct programmable multi-mode interferometers for linear phononic processing and a dynamically reconfigurable phononic memory that can switch between an ultra-long-lifetime state and a state strongly coupled to its bus waveguide. From the master equation for the full open quantum system of the reconfigurable phononic memory, we show that it is possible to perform read and write operations with over 90% quantum state transfer fidelity for an exponentially decaying pulse.
The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration's performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel's performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel's statistical profile. This strategy is effective in the presence of frequently recurring computation and communication kernels, which is characteristic of algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate performance prediction accuracy obtained by our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.
Accelerating deep model training and inference is crucial in practice. Existing deep learning frameworks usually concentrate on optimizing training speed and pay less attention to inference-specific optimizations. Model inference differs from training in terms of computation: for example, parameters are refreshed at each gradient update step during training but kept invariant during inference. These special characteristics of model inference open new opportunities for its optimization. In this paper, we propose a hardware-aware optimization framework, namely Woodpecker-DL (WPK), to accelerate inference by taking advantage of multiple joint optimizations from the perspectives of graph optimization, automated searches, domain-specific language (DSL) compiler techniques and system-level exploration. In WPK, we investigated two new automated search approaches, based on genetic algorithms and reinforcement learning respectively, to hunt for the best operator code configurations targeting specific hardware. A customized DSL compiler is further attached to these search algorithms to generate efficient code. To create an optimized inference plan, WPK systematically explores high-speed operator implementations from third-party libraries in addition to our automatically generated code and singles out the best implementation per operator for use. Extensive experiments demonstrated that on a Tesla P100 GPU, we can achieve a maximum speedup of 5.40 over cuDNN and 1.63 over TVM on individual convolution operators, and run up to 1.18 times faster than TensorRT for end-to-end model inference.
Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracy, sparse models often carry randomly distributed weights, leading to irregular computations. Consequently, sparse models cannot achieve meaningful speedup on commodity hardware (e.g., GPUs) built for dense matrix computations. As such, prior works usually modify existing architectures or design completely new sparsity-optimized architectures to exploit sparsity. We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. Our work builds upon the insight that matrix multiplication generally breaks a large matrix into multiple smaller tiles for parallel execution. We propose a tiling-friendly tile-wise sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain high accuracy. We implement and evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95x speedup over the dense model.
As supercomputers continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding. To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve the I/O performance. However, little work has been done on effectively offloading lossy compression onto FPGA-based SmartNICs to reduce the compression overhead. In this paper, we propose a hardware-algorithm co-design of an efficient and adaptive lossy compressor for scientific data on FPGAs (called CEAZ) to accelerate parallel I/O. Our contribution is fourfold: (1) We propose an efficient Huffman coding approach that can adaptively update Huffman codewords online based on codewords generated offline (from a variety of representative scientific datasets). (2) We derive a theoretical analysis to support precise control of the compression ratio under an error-bounded compression mode, enabling accurate offline Huffman codeword generation. This also helps us create a fixed-ratio compression mode for consistent throughput. (3) We develop an efficient compression pipeline by adapting cuSZ's dual-quantization algorithm to our hardware use case. (4) We evaluate CEAZ on five real-world datasets with both a single FPGA board and 128 nodes from the Bridges-2 supercomputer. Experiments show that CEAZ outperforms the second-best FPGA-based lossy compressor by 2X in throughput and 9.6X in compression ratio. It also improves MPI_File_write and MPI_Gather throughputs by up to 25.8X and 24.8X, respectively.