JANUS: an FPGA-based System for High Performance Scientific Computing

609 0 0.0 ( 0 )

Download Cite

Added by Andrea Maiorano

Publication date 2008

fields Informatics Engineering

and research's language is English

Authors F. Belletti - M. Cotallo - A. Cruz

Hardware Architecture

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

This paper describes JANUS, a modular massively parallel and reconfigurable FPGA-based computing system. Each JANUS module has a computational core and a host. The computational core is a 4x4 array of FPGA-based processing elements with nearest-neighbor data links. Processors are also directly connected to an I/O node attached to the JANUS host, a conventional PC. JANUS is tailored for, but not limited to, the requirements of a class of hard scientific applications characterized by regular code structure, unconventional data manipulation instructions and not too large data-base size. We discuss the architecture of this configurable machine, and focus on its use on Monte Carlo simulations of statistical mechanics. On this class of application JANUS achieves impressive performances: in some cases one JANUS processing element outperfoms high-end PCs by a factor ~ 1000. We also discuss the role of JANUS on other classes of scientific applications.

rate research

FPGA-based Hyrbid Memory Emulation System

86 - Fei Wen , Mian Qin , Paul V. Gratz 2020

Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of two memory classes point towards the design of hybrid memory systems containing multiple classes of main memory. In the iterative and incremental development of new architectures, the timeliness of simulation completion is critical to project progression. Hence, a highly efficient simulation method is needed to evaluate the performance of different hybrid memory system designs. Design exploration for hybrid memory systems is challenging, because it requires emulation of the full system stack, including the OS, memory controller, and interconnect. Moreover, benchmark applications for memory performance test typically have much larger working sets, thus taking even longer simulation warm-up period. In this paper, we propose a FPGA-based hybrid memory system emulation platform. We target at the mobile computing system, which is sensitive to energy consumption and is likely to adopt NVM for its power efficiency. Here, because the focus of our platform is on the design of the hybrid memory system, we leverage the on-board hard IP ARM processors to both improve simulation performance while improving accuracy of the results. Thus, users can implement their data placement/migration policies with the FPGA logic elements and evaluate new designs quickly and effectively. Results show that our emulation platform provides a speedup of 9280x in simulation time compared to the software counterpart Gem5.

Hardware Architecture

Strategies for High-Throughput FPGA-based QC-LDPC Decoder Architecture

618 - Swapnil Mhaske , Hojin Kee , Tai Ly 2015

We propose without loss of generality strategies to achieve a high-throughput FPGA-based architecture for a QC-LDPC code based on a circulant-1 identity matrix construction. We present a novel representation of the parity-check matrix (PCM) providing a multi-fold throughput gain. Splitting of the node processing algorithm enables us to achieve pipelining of blocks and hence layers. By partitioning the PCM into not only layers but superlayers we derive an upper bound on the pipelining depth for the compact representation. To validate the architecture, a decoder for the IEEE 802.11n (2012) QC-LDPC is implemented on the Xilinx Kintex-7 FPGA with the help of the FPGA IP compiler [2] available in the NI LabVIEW Communication System Design Suite (CSDS) which offers an automated and systematic compilation flow where an optimized hardware implementation from the LDPC algorithm was generated in approximately 3 minutes, achieving an overall throughput of 608Mb/s (at 260MHz). As per our knowledge this is the fastest implementation of the IEEE 802.11n QC-LDPC decoder using an algorithmic compiler.

Hardware Architecture Information Theory Information Theory

High Performance Power Spectrum Analysis Using a FPGA Based Reconfigurable Computing Platform

327 - Yogindra Abhyankar 2011

Power-spectrum analysis is an important tool providing critical information about a signal. The range of applications includes communication-systems to DNA-sequencing. If there is interference present on a transmitted signal, it could be due to a natural cause or superimposed forcefully. In the latter case, its early detection and analysis becomes important. In such situations having a small observation window, a quick look at power-spectrum can reveal a great deal of information, including frequency and source of interference. In this paper, we present our design of a FPGA based reconfigurable platform for high performance power-spectrum analysis. This allows for the real-time data-acquisition and processing of samples of the incoming signal in a small time frame. The processing consists of computation of power, its average and peak, over a set of input values. This platform sustains simultaneous data streams on each of the four input channels.

Instrumentation and Methods for Astrophysics

Programmable FPGA-based Memory Controller

143 - Sasindu Wijeratne , Sanket Pattnaik , Zhiyu Chen 2021

Even with generational improvements in DRAM technology, memory access latency still remains the major bottleneck for application accelerators, primarily due to limitations in memory interface IPs which cannot fully account for variations in target applications, the algorithms used, and accelerator architectures. Since developing memory controllers for different applications is time-consuming, this paper introduces a modular and programmable memory controller that can be configured for different target applications on available hardware resources. The proposed memory controller efficiently supports cache-line accesses along with bulk memory transfers. The user can configure the controller depending on the available logic resources on the FPGA, memory access pattern, and external memory specifications. The modular design supports various memory access optimization techniques including, request scheduling, internal caching, and direct memory access. These techniques contribute to reducing the overall latency while maintaining high sustained bandwidth. We implement the system on a state-of-the-art FPGA and evaluate its performance using two widely studied domains: graph analytics and deep learning workloads. We show improved overall memory access time up to 58% on CNN and GCN workloads compared with commercial memory controller IPs.

Hardware Architecture Artificial Intelligence Distributed Parallel and Cluster Computing

A high performance scientific cloud computing environment for materials simulations

462 - Kevin Jorissen , Fernando D. Vila , 2011

We describe the development of a scientific cloud computing (SCC) platform that offers high performance computation capability. The platform consists of a scientific virtual machine prototype containing a UNIX operating system and several materials science codes, together with essential interface tools (an SCC toolset) that offers functionality comparable to local compute clusters. In particular, our SCC toolset provides automatic creation of virtual clusters for parallel computing, including tools for execution and monitoring performance, as well as efficient I/O utilities that enable seamless connections to and from the cloud. Our SCC platform is optimized for the Amazon Elastic Compute Cloud (EC2). We present benchmarks for prototypical scientific applications and demonstrate performance comparable to local compute clusters. To facilitate code execution and provide user-friendly access, we have also integrated cloud computing capability in a JAVA-based GUI. Our SCC platform may be an alternative to traditional HPC resources for materials science or quantum chemistry applications.

Computational Physics Materials Science Computational Engineering