No Arabic abstract
OpenCL for FPGA enables developers to design FPGAs using a programming model similar for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing works either focus primarily on optimizing single kernels or solely depend on channels to design multi-kernel pipelines. In this paper, we propose a source-to-source compiler framework, MKPipe, for optimizing multi-kernel workloads in OpenCL for FPGA. Besides channels, we propose new schemes to enable multi-kernel pipelines. Our optimizing compiler employs a systematic approach to explore the tradeoffs of these optimizations methods. To enable more efficient overlapping between kernel execution, we also propose a novel workitem/workgroup-id remapping technique. Furthermore, we propose new algorithms for throughput balancing and resource balancing to tune the optimizations upon individual kernels in the multi-kernel workloads. Our results show that our compiler-optimized multi-kernels achieve up to 3.6x (1.4x on average) speedup over the baseline, in which the kernels have already been optimized individually.
We show how to quantify scalability with the Universal Scalability Law (USL) by applying it to performance measurements of memcached, J2EE, and Weblogic on multi-core platforms. Since commercial multicores are essentially black-boxes, the accessible performance gains are primarily available at the application level. We also demonstrate how our methodology can identify the most significant performance tuning opportunities to optimize application scalability, as well as providing an easy means for exploring other aspects of the multi-core system design space.
As a promising solution to boost the performance of distance-related algorithms (e.g., K-means and KNN), FPGA-based acceleration attracts lots of attention, but also comes with numerous challenges. In this work, we propose AccD, a compiler-based framework for accelerating distance-related algorithms on CPU-FPGA platforms. Specifically, AccD provides a Domain-specific Language to unify distance-related algorithms effectively, and an optimizing compiler to reconcile the benefits from both the algorithmic optimization on the CPU and the hardware acceleration on the FPGA. The output of AccD is a high-performance and power-efficient design that can be easily synthesized and deployed on mainstream CPU-FPGA platforms. Intensive experiments show that AccD designs achieve 31.42x speedup and 99.63x better energy efficiency on average over standard CPU-based implementations.
We present RDMAbox, a set of low level RDMA optimizations that provide better performance than previous approaches. The optimizations are packaged in easy-to-use kernel and user space libraries for applications and systems in data center. We demonstrate the flexibility and effectiveness of RDMAbox by implementing a kernel remote paging system and a user space file system using RDMAbox. RDMAbox employs two optimization techniques. First, we suggest RDMA request merging and chaining to further reduce the total number of I/O operations to the RDMA NIC. The I/O merge queue at the same time functions as a traffic regulator to enforce admission control and avoid overloading the NIC. Second, we propose Adaptive Polling to achieve higher efficiency of polling Work Completion than existing busy polling while maintaining the low CPU overhead of event trigger. Our implementation of a remote paging system with RDMAbox outperforms existing representative solutions with up to 4? throughput improvement and up to 83% decrease in average tail latency in bigdata workloads, and up to 83% reduction in completion time in machine learning workloads. Our implementation of a user space file system based on RDMAbox achieves up to 5.9? higher throughput over existing representative solutions.
Selecting the right compiler optimisations has a severe impact on programs performance. Still, the available optimisations keep increasing, and their effect depends on the specific program, making the task human intractable. Researchers proposed several techniques to search in the space of compiler optimisations. Some approaches focus on finding better search algorithms, while others try to speed up the search by leveraging previously collected knowledge. The possibility to effectively reuse previous compilation results inspired us toward the investigation of techniques derived from the Recommender Systems field. The proposed approach exploits previously collected knowledge and improves its characterisation over time. Differently from current state-of-the-art solutions, our approach is not based on performance counters but relies on Reaction Matching, an algorithm able to characterise programs looking at how they react to different optimisation sets. The proposed approach has been validated using two widely used benchmark suites, cBench and PolyBench, including 54 different programs. Our solution, on average, extracted 90% of the available performance improvement 10 iterations before current state-of-the-art solutions, which corresponds to 40% fewer compilations and performance tests to perform.
FPGAs are increasingly common in modern applications, and cloud providers now support on-demand FPGA acceleration in data centers. Applications in data centers run on virtual infrastructure, where consolidation, multi-tenancy, and workload migration enable economies of scale that are fundamental to the providers business. However, a general strategy for virtualizing FPGAs has yet to emerge. While manufacturers struggle with hardware-based approaches, we propose a compiler/runtime-based solution called Synergy. We show a compiler transformation for Verilog programs that produces code able to yield control to software at sub-clock-tick granularity according to the semantics of the original program. Synergy uses this property to efficiently support core virtualization primitives: suspend and resume, program migration, and spatial/temporal multiplexing, on hardware which is available today. We use Synergy to virtualize FPGA workloads across a cluster of Altera SoCs and Xilinx FPGAs on Amazon F1. The workloads require no modification, run within 3-4x of unvirtualized performance, and incur a modest increase in FPGA fabric utilization.