AccD: A Compiler-based Framework for Accelerating Distance-related Algorithms on CPU-FPGA Platforms

Added by Yuke Wang

Publication date 2019

fields Informatics Engineering

Language English

Authors Yuke Wang - Boyuan Feng - Gushu Li

Distributed Parallel and Cluster Computing Programming Languages

As a promising way to boost the performance of distance-related algorithms (e.g., K-means and KNN), FPGA-based acceleration has attracted considerable attention, but it also comes with numerous challenges. In this work, we propose AccD, a compiler-based framework for accelerating distance-related algorithms on CPU-FPGA platforms. Specifically, AccD provides a domain-specific language to unify distance-related algorithms effectively, and an optimizing compiler to reconcile the benefits of algorithmic optimization on the CPU with those of hardware acceleration on the FPGA. The output of AccD is a high-performance and power-efficient design that can be easily synthesized and deployed on mainstream CPU-FPGA platforms. Intensive experiments show that AccD designs achieve 31.42x speedup and 99.63x better energy efficiency on average over standard CPU-based implementations.

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

76 - Ji Liu , Abdullah-Al Kafi , Xipeng Shen 2020

OpenCL for FPGA enables developers to design FPGAs using a programming model similar to the one used for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing works either focus primarily on optimizing single kernels or depend solely on channels to design multi-kernel pipelines. In this paper, we propose a source-to-source compiler framework, MKPipe, for optimizing multi-kernel workloads in OpenCL for FPGA. Besides channels, we propose new schemes to enable multi-kernel pipelines. Our optimizing compiler employs a systematic approach to explore the tradeoffs of these optimization methods. To enable more efficient overlapping between kernel executions, we also propose a novel workitem/workgroup-id remapping technique. Furthermore, we propose new algorithms for throughput balancing and resource balancing to tune the optimizations applied to individual kernels in the multi-kernel workloads. Our results show that our compiler-optimized multi-kernels achieve up to 3.6x (1.4x on average) speedup over the baseline, in which the kernels have already been optimized individually.

Distributed Parallel and Cluster Computing Performance

Sparse Tucker Tensor Decomposition on a Hybrid FPGA-CPU Platform

67 - Weiyun Jiang , Kaiqi Zhang , Colin Yu Lin 2020

Recommendation systems, social network analysis, medical imaging, and data mining often involve processing sparse high-dimensional data. Such high-dimensional data are naturally represented as tensors, and they cannot be efficiently processed by conventional matrix or vector computations. Sparse Tucker decomposition is an important algorithm for compressing and analyzing these sparse high-dimensional data sets. When energy efficiency and data privacy are major concerns, hardware accelerators on resource-constrained platforms become crucial for the deployment of tensor algorithms. In this work, we propose a hybrid computing framework containing a CPU and an FPGA to accelerate sparse Tucker factorization. This algorithm has three main modules: tensor-times-matrix (TTM), Kronecker products, and QR decomposition with column pivoting (QRP). We accelerate the former two modules on a Xilinx FPGA and the latter one on a CPU. Our hybrid platform achieves $23.6\times \sim 1091\times$ speedup and over $93.519\% \sim 99.514\%$ energy savings compared with a CPU on synthetic and real-world datasets.
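To make the tensor-times-matrix (TTM) module named in the abstract above concrete, here is a minimal sketch of the mode-n product for a sparse tensor. It illustrates only the mathematical operation, not the paper's FPGA design. The coordinate-dictionary storage and the function name `sparse_ttm` are assumptions made for this example.

```python
from collections import defaultdict


def sparse_ttm(coo, matrix, mode):
    """Mode-`mode` product of a sparse tensor with a dense matrix.

    coo:    dict mapping index tuples to nonzero values (hypothetical
            storage format chosen for this sketch).
    matrix: list of rows with shape (J, I_mode).
    Returns a dict in the same format, where the index at position
    `mode` now ranges over the J rows of `matrix`.
    """
    out = defaultdict(float)
    for idx, val in coo.items():
        i_n = idx[mode]
        for j, row in enumerate(matrix):
            u = row[i_n]
            if u != 0.0:
                # Y[..., j, ...] += X[..., i_n, ...] * U[j, i_n]
                out[idx[:mode] + (j,) + idx[mode + 1:]] += val * u
    return dict(out)
```

Only the stored nonzeros are visited, which is why sparse formats pay off when most tensor entries are zero.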

Distributed Parallel and Cluster Computing

Compiler-Driven FPGA Virtualization with SYNERGY

319 - Joshua Landgraf , Tiffany Yang , Will Lin 2021

FPGAs are increasingly common in modern applications, and cloud providers now support on-demand FPGA acceleration in data centers. Applications in data centers run on virtual infrastructure, where consolidation, multi-tenancy, and workload migration enable economies of scale that are fundamental to the providers' business. However, a general strategy for virtualizing FPGAs has yet to emerge. While manufacturers struggle with hardware-based approaches, we propose a compiler/runtime-based solution called Synergy. We show a compiler transformation for Verilog programs that produces code able to yield control to software at sub-clock-tick granularity according to the semantics of the original program. Synergy uses this property to efficiently support core virtualization primitives (suspend and resume, program migration, and spatial/temporal multiplexing) on hardware that is available today. We use Synergy to virtualize FPGA workloads across a cluster of Altera SoCs and Xilinx FPGAs on Amazon F1. The workloads require no modification, run within 3-4x of unvirtualized performance, and incur a modest increase in FPGA fabric utilization.

Distributed Parallel and Cluster Computing Hardware Architecture Programming Languages

KPynq: A Work-Efficient Triangle-Inequality based K-means on FPGA

73 - Yuke Wang , Zhaorui Zeng , Boyuan Feng 2019

K-means is a popular but computation-intensive algorithm for unsupervised learning. To address this issue, we present KPynq, a work-efficient triangle-inequality based K-means on FPGA for handling large-size, high-dimension datasets. KPynq leverages an algorithm-level optimization to balance performance against computation irregularity, and a hardware architecture design to fully exploit the pipeline and parallel processing capability of various FPGAs. In the experiments, KPynq consistently outperforms the CPU-based standard K-means in both speed (up to 4.2x) and energy efficiency (up to 218x).
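The triangle-inequality idea mentioned in the abstract above can be sketched in a few lines. This is a generic illustration of the well-known pruning rule for the K-means assignment step, not KPynq's actual algorithm or hardware design. The function names are hypothetical. The rule: if the distance between the current best centroid and candidate centroid c_j is at least twice the distance from the point to the current best, then c_j cannot be closer, so its distance need not be computed.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Return (labels, number of point-to-centroid distances computed)."""
    k = len(centroids)
    # Centroid-to-centroid distances are computed once per iteration.
    cc = [[euclidean(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    computed = 0
    for p in points:
        best = 0
        best_d = euclidean(p, centroids[0])
        computed += 1
        for j in range(1, k):
            # Triangle inequality: d(p, c_j) >= d(c_best, c_j) - d(p, c_best).
            # If d(c_best, c_j) >= 2 * d(p, c_best), c_j cannot be closer.
            if cc[best][j] >= 2.0 * best_d:
                continue
            d = euclidean(p, centroids[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed


def assign_brute_force(points, centroids):
    """Reference assignment that computes every distance."""
    return [min(range(len(centroids)),
                key=lambda j: euclidean(p, centroids[j]))
            for p in points]
```

The pruned version returns the same labels as the brute-force reference while skipping distance computations for well-separated centroids. The skipped work varies per point, which is one source of the computation irregularity the abstract refers to.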

Distributed Parallel and Cluster Computing

Relay: A High-Level Compiler for Deep Learning

108 - Jared Roesch , Steven Lyubomirsky , Marisa Kirisame 2019

Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressivity, composability, and portability. We present Relay, a new compiler framework for DL. Relay's functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of-the-art models. The introduction of Relay's expressive IR requires careful design of domain-specific optimizations, addressed via Relay's extension mechanisms. Using these extension mechanisms, Relay supports a unified compiler that can target a variety of hardware platforms. Our evaluation demonstrates Relay's competitive performance for a broad class of models and devices (CPUs, GPUs, and emerging accelerators). Relay's design demonstrates how a unified IR can provide expressivity, composability, and portability without compromising performance.

Machine Learning Programming Languages
