No Arabic abstract
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through IBM OCAPI (Open Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 5.3x and 12.7x when running two different compound stencil kernels. NERO reduces the energy consumption by 12x and 35x for the same two kernels over the POWER9 system with an energy efficiency of 1.61 GFLOPS/Watt and 21.01 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through IBM CAPI2 (Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 4.2x and 8.3x when running two different compound stencil kernels. NERO reduces the energy consumption by 22x and 29x for the same two kernels over the POWER9 system with an energy efficiency of 1.5 GFLOPS/Watt and 17.3 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
Cloud applications are increasingly relying on hundreds of loosely-coupled microservices to complete user requests that meet an applications end-to-end QoS requirements. Communication time between services accounts for a large fraction of the end-to-end latency and can introduce performance unpredictability and QoS violations. This work presents our early work on Dagger, a hardware acceleration platform for networking, designed specifically with the unique qualities of microservices in mind. The Dagger architecture relies on an FPGA-based NIC, closely coupled with the processor over a configurable memory interconnect, designed to offload and accelerate RPC stacks. Unlike the traditional cloud systems that use PCIe links as the NIC I/O interface, we leverage memory-interconnected FPGAs as networking devices to provide the efficiency, transparency, and programmability needed for fine-grained microservices. We show that this considerably improves CPU utilization and performance for cloud RPCs.
The ongoing shift of cloud services from monolithic designs to microservices creates high demand for efficient and high performance datacenter networking stacks, optimized for fine-grained workloads. Commodity networking systems based on software stacks and peripheral NICs introduce high overheads when it comes to delivering small messages. We present Dagger, a hardware acceleration fabric for cloud RPCs based on FPGAs, where the accelerator is closely-coupled with the host processor over a configurable memory interconnect. The three key design principle of Dagger are: (1) offloading the entire RPC stack to an FPGA-based NIC, (2) leveraging memory interconnects instead of PCIe buses as the interface with the host CPU, and (3) making the acceleration fabric reconfigurable, so it can accommodate the diverse needs of microservices. We show that the combination of these principles significantly improves the efficiency and performance of cloud RPC systems while preserving their generality. Dagger achieves 1.3-3.8x higher per-core RPC throughput compared to both highly-optimized software stacks, and systems using specialized RDMA adapters. It also scales up to 84 Mrps with 8 threads on 4 CPU cores, while maintaining state-of-the-art us-scale tail latency. We also demonstrate that large third-party applications, like memcached and MICA KVS, can be easily ported on Dagger with minimal changes to their codebase, bringing their median and tail KVS access latency down to 2.8 - 3.5us and 5.4 - 7.8us, respectively. Finally, we show that Dagger is beneficial for multi-tier end-to-end microservices with different threading models by evaluating it using an 8-tier application implementing a flight check-in service.
We present a vision for the Erudite architecture that redefines the compute and memory abstractions such that memory bandwidth and capacity become first-class citizens along with compute throughput. In this architecture, we envision coupling a high-density, massively parallel memory technology like Flash with programmable near-data accelerators, like the streaming multiprocessors in modern GPUs. Each accelerator has a local pool of storage-class memory that it can access at high throughput by initiating very large numbers of overlapping requests that help to tolerate long access latency. The accelerators can also communicate with each other and remote memory through a high-throughput low-latency interconnect. As a result, systems based on the Erudite architecture scale compute and memory bandwidth at the same rate, tearing down the notorious memory wall that has plagued computer architecture for generations. In this paper, we present the motivation, rationale, design, benefit, and research challenges for Erudite.
Deep convolutional neural networks have achieved remarkable progress in recent years. However, the large volume of intermediate results generated during inference poses a significant challenge to the accelerator design for resource-constraint FPGA. Due to the limited on-chip storage, partial results of intermediate layers are frequently transferred back and forth between on-chip memory and off-chip DRAM, leading to a non-negligible increase in latency and energy consumption. In this paper, we propose block convolution, a hardware-friendly, simple, yet efficient convolution operation that can completely avoid the off-chip transfer of intermediate feature maps at run-time. The fundamental idea of block convolution is to eliminate the dependency of feature map tiles in the spatial dimension when spatial tiling is used, which is realized by splitting a feature map into independent blocks so that convolution can be performed separately on individual blocks. We conduct extensive experiments to demonstrate the efficacy of the proposed block convolution on both the algorithm side and the hardware side. Specifically, we evaluate block convolution on 1) VGG-16, ResNet-18, ResNet-50, and MobileNet-V1 for ImageNet classification task; 2) SSD, FPN for COCO object detection task, and 3) VDSR for Set5 single image super-resolution task. Experimental results demonstrate that comparable or higher accuracy can be achieved with block convolution. We also showcase two CNN accelerators via algorithm/hardware co-design based on block convolution on memory-limited FPGAs, and evaluation shows that both accelerators substantially outperform the baseline without off-chip transfer of intermediate feature maps.