Radio astronomical imaging arrays comprising large numbers of antennas, O(10^2-10^3), have posed a signal processing challenge because of the required O(N^2) cross-correlation of signals from every antenna pair and the requisite signal routing. This motivated the implementation of a Packetized Correlator architecture that applies Field Programmable Gate Arrays (FPGAs) to the O(N) F-stage, which transforms time-domain data to the frequency domain, and Graphics Processing Units (GPUs) to the O(N^2) X-stage, which performs an outer product among the spectra of all antennas. The design is readily scalable to at least O(10^3) antennas. Fringes, visibility amplitudes, and sky images obtained during field testing are presented.
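To make the F/X division of labor concrete, the following is a minimal NumPy sketch of the X-stage: for every frequency channel it forms the outer product of the per-antenna spectra and accumulates over time. The array shapes and function name are illustrative assumptions, not the paper's actual GPU implementation.

import numpy as np

# Sketch of an FX correlator X-stage: per-channel outer product of antenna
# spectra, accumulated over time. Shapes are illustrative assumptions.
def x_stage(spectra):
    """spectra: complex array of shape (n_time, n_chan, n_ant);
    returns visibilities of shape (n_chan, n_ant, n_ant)."""
    n_time, n_chan, n_ant = spectra.shape
    vis = np.zeros((n_chan, n_ant, n_ant), dtype=np.complex128)
    for t in range(n_time):
        for c in range(n_chan):
            v = spectra[t, c]                  # spectra of all antennas in one channel
            vis[c] += np.outer(v, v.conj())    # N x N cross-correlation matrix
    return vis / n_time                        # time-averaged visibilities

rng = np.random.default_rng(0)
s = rng.normal(size=(16, 4, 8)) + 1j * rng.normal(size=(16, 4, 8))
print(x_stage(s).shape)                        # (4, 8, 8)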
The digital correlator is a crucial element of a modern radio telescope. In this paper we describe a scalable design for the correlator system of the Tianlai pathfinder array, an experiment dedicated to testing the key technologies for conducting a 21 cm intensity mapping survey. The correlator follows the FX design: it first performs the Fast Fourier Transform (FFT), implemented as a Polyphase Filter Bank (PFB), on a Collaboration for Astronomy Signal Processing and Electronics Research (CASPER) Reconfigurable Open Architecture Computing Hardware-2 (ROACH2) board, and then computes the cross-correlations on Graphics Processing Units (GPUs). The design has been tested both in the laboratory and in actual observations.
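As an illustration of the F-stage channelization performed on the ROACH2 board, here is a minimal NumPy sketch of a polyphase filter bank: the time stream is weighted by a windowed-sinc prototype filter, folded into taps, summed, and Fourier transformed. The channel count, tap count, and window choice are illustrative assumptions, not the Tianlai firmware parameters.

import numpy as np

# Sketch of an F-stage polyphase filter bank (PFB). Parameters are illustrative.
def pfb_channelize(x, n_chan=1024, ntaps=4):
    # windowed-sinc prototype filter spanning ntaps blocks of n_chan samples
    win = np.hamming(n_chan * ntaps) * np.sinc(
        np.arange(n_chan * ntaps) / n_chan - ntaps / 2)
    n_spectra = len(x) // n_chan - (ntaps - 1)
    spectra = np.empty((n_spectra, n_chan), dtype=np.complex128)
    for i in range(n_spectra):
        seg = x[i * n_chan:(i + ntaps) * n_chan] * win          # apply prototype filter
        spectra[i] = np.fft.fft(seg.reshape(ntaps, n_chan).sum(axis=0))
    return spectra

tone = np.cos(2 * np.pi * 0.1 * np.arange(65536))
print(pfb_channelize(tone).shape)              # (61, 1024)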
We present an overview of the ICE hardware and software framework, which economically implements large arrays of interconnected FPGA-based data acquisition, signal processing, and networking nodes. The system was conceived for radio, millimeter, and sub-millimeter telescope readout systems whose requirements go beyond typical off-the-shelf processing systems, such as careful control of interference signals produced by the digital electronics and clocking of all elements in the system from a single precise observatory-derived oscillator. A new generation of telescopes operating in these frequency bands, designed with a vastly increased emphasis on digital signal processing to support their detector multiplexing technology or high-bandwidth correlators, with data rates exceeding a terabyte per second, is becoming common. The ICE system is built around a custom FPGA motherboard that uses a Xilinx Kintex-7 FPGA and an ARM-based co-processor. The system is specialized for specific applications through software, firmware, and custom mezzanine daughter boards that interface to the FPGA through the industry-standard FMC specification. For high-density applications, the motherboards are packaged in 16-slot crates with ICE backplanes that implement a low-cost passive full-mesh network between the motherboards in a crate, allow high-bandwidth interconnection between crates, and enable data offload to a computer cluster. A Python-based control software library automatically detects and operates the hardware in the array. Examples of specific telescope applications of the ICE framework are presented, namely the frequency-multiplexed bolometer readout systems used for the SPT and Simons Array and the digitizer, F-engine, and networking engine for the CHIME and HIRAX radio interferometers.
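One common role for such a full-mesh crate interconnect in an interferometer F-engine is the "corner turn": each board produces all frequency channels for its own antenna inputs, while the correlator needs all antennas for a subset of channels. The sketch below shows the data reshuffling this implies; the board count, channel count, and antenna numbers are illustrative assumptions, not the CHIME or HIRAX configuration.

import numpy as np

# Sketch of a corner turn across a 16-slot full-mesh crate. Sizes are illustrative.
n_boards = 16
n_chan = 1024                    # assumed divisible by n_boards
n_ant_per_board = 8
chunk = n_chan // n_boards

# data[b]: (n_chan, n_ant_per_board) spectra produced by board b
rng = np.random.default_rng(0)
data = [rng.normal(size=(n_chan, n_ant_per_board)) for _ in range(n_boards)]

# Every board sends a distinct 1/n_boards slice of its channels to every other
# board, so each destination gathers all antennas for its own channel slice.
regrouped = [
    np.concatenate([data[src][dst * chunk:(dst + 1) * chunk]
                    for src in range(n_boards)], axis=1)
    for dst in range(n_boards)
]
print(regrouped[0].shape)        # (64, 128): all 128 antennas, channels 0..63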
We present the implementation and performance of a class of directionally unsplit Riemann-solver-based hydrodynamic schemes on Graphics Processing Units (GPUs). These schemes, including the MUSCL-Hancock method, a variant of the MUSCL-Hancock method, and the corner-transport-upwind method, are embedded into the adaptive-mesh-refinement (AMR) code GAMER. Furthermore, a hybrid MPI/OpenMP model is investigated, which enables the full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance. Performance benchmarks are conducted on the Dirac GPU cluster at NERSC/LBNL using up to 32 Tesla C2050 GPUs. A single GPU achieves speed-ups of 101 (25) and 84 (22) for uniform-mesh and AMR simulations, respectively, compared with the performance using one (four) CPU core(s), and this excellent performance persists in multi-GPU tests. In addition, we make a direct comparison between GAMER and the widely adopted CPU code Athena (Stone et al. 2008) in adiabatic hydrodynamic tests and demonstrate that, at the same accuracy, GAMER achieves a performance speed-up of two orders of magnitude.
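As a reminder of what the MUSCL-Hancock predictor-corrector step does, the following is a minimal 1D sketch for scalar linear advection with minmod-limited slopes; it is purely illustrative and far simpler than the full GPU/AMR Euler solver in GAMER.

import numpy as np

# Sketch of MUSCL-Hancock for u_t + a u_x = 0 with a > 0 and periodic boundaries.
def minmod(p, q):
    return np.where(p * q > 0, np.sign(p) * np.minimum(np.abs(p), np.abs(q)), 0.0)

def muscl_hancock_step(u, a, dt, dx):
    slope = minmod(np.roll(u, -1) - u, u - np.roll(u, 1))
    # predictor: evolve the right-face state of each cell by a half time-step
    u_face = u + 0.5 * slope * (1 - a * dt / dx)
    # corrector: upwind flux for a > 0 uses the state from the left cell
    flux = a * u_face
    return u - dt / dx * (flux - np.roll(flux, 1))

x = np.linspace(0.0, 1.0, 200, endpoint=False)
u = np.exp(-100 * (x - 0.5) ** 2)
dx = x[1] - x[0]
for _ in range(100):
    u = muscl_hancock_step(u, a=1.0, dt=0.4 * dx, dx=dx)
print(u.max())                   # the pulse is advected with little diffusion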
FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency compared with GPUs. However, it is challenging for FPGA-based solutions to achieve higher throughput than their GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies strongly with batch size. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
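The bitwise convolution exploited here rests on a simple identity: when weights and activations take values in {-1, +1}, a dot product reduces to an XNOR followed by a population count, which maps efficiently onto FPGA logic. Below is a minimal Python sketch of that identity; the bit-packing and names are illustrative assumptions, not the accelerator's actual datapath.

import numpy as np

def binary_dot(a_words, w_words, n):
    """a_words, w_words: uint64 words holding n packed {0,1} bits (1 encodes +1)."""
    xnor = ~(a_words ^ w_words)                         # bit = 1 where signs agree
    matches = sum(bin(int(word)).count("1") for word in xnor)
    matches -= 64 * len(a_words) - n                    # discount padding bits
    return 2 * matches - n                              # agreements minus disagreements

rng = np.random.default_rng(1)
n = 256
a = rng.integers(0, 2, n, dtype=np.uint8)
w = rng.integers(0, 2, n, dtype=np.uint8)
a_words = np.packbits(a).view(np.uint64)
w_words = np.packbits(w).view(np.uint64)
reference = int(np.dot(2 * a.astype(int) - 1, 2 * w.astype(int) - 1))
print(binary_dot(a_words, w_words, n), reference)       # the two values agree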
The next generation of Adaptive Optics (AO) systems on large telescopes will require immense computational performance and memory bandwidth, both of which are challenging with the technology available today. The objective of this work is to create a future-proof adaptive optics platform on an FPGA architecture that scales with the number of subapertures, pixels per subaperture, and external memory. We have created a scalable adaptive optics platform with an off-the-shelf FPGA development board, which provides an AO reconstruction time limited only by the external memory bandwidth. SPARC uses the same logic resources irrespective of the number of subapertures in the AO system. This paper is aimed at embedded developers who are interested in the FPGA design and the accompanying hardware interfaces. The central theme of this paper is to show how scalability is incorporated at different levels of the FPGA implementation. This work is a continuation of Part 1 of the paper, which explains the concept, objectives, control scheme, and method of validation used for testing the platform.
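The core real-time computation in such an AO system is a large matrix-vector multiply: measured wavefront-sensor slopes times a precomputed reconstruction matrix yields deformable-mirror commands, and streaming that matrix every frame is what ties reconstruction time to external memory bandwidth. The sketch below illustrates the operation; the dimensions are illustrative assumptions, not SPARC's configuration.

import numpy as np

# Sketch of the matrix-vector multiply at the core of AO wavefront reconstruction.
n_subaps = 40 * 40               # subapertures in the wavefront sensor (assumed)
n_slopes = 2 * n_subaps          # one x and one y slope per subaperture
n_actuators = 41 * 41            # deformable-mirror actuators (assumed)

rng = np.random.default_rng(2)
R = rng.normal(size=(n_actuators, n_slopes))   # precomputed reconstruction matrix
slopes = rng.normal(size=n_slopes)             # measured slopes for one frame

commands = R @ slopes                          # per-frame DM command update
print(commands.shape)                          # (1681,)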