Graph Neural Networks (GNNs) have recently received considerable interest because of their success in learning representations from graph-structured data. However, GNNs exhibit different compute and memory characteristics than traditional Deep Neural Networks (DNNs). Graph convolutions require feature aggregations from neighboring nodes (known as the aggregation phase), which leads to highly irregular data accesses. GNNs also have a very regular compute phase that can be broken down into matrix multiplications (known as the combination phase). All recently proposed GNN accelerators employ different dataflows and microarchitecture optimizations for these two phases, as well as different communication strategies between them. However, as more custom GNN accelerators are proposed, it becomes harder to classify them qualitatively and contrast them quantitatively. In this work, we present a taxonomy that describes several diverse dataflows for running GNN inference on accelerators, providing a structured way to describe and compare the design space of GNN accelerators.
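To make the two phases concrete, the following minimal sketch implements one GNN layer in NumPy. The toy graph, feature sizes, and ReLU nonlinearity are illustrative assumptions, not details from the paper; the point is the contrast between the irregular, graph-dependent aggregation phase and the regular, dense combination phase.

```python
import numpy as np

def gnn_layer(features, neighbors, weight):
    """features: (N, F) node features; neighbors: per-node neighbor-id lists;
    weight: (F, F_out) dense weights. Toy example, not the paper's design."""
    # Aggregation phase: gather and sum each node's neighbor features.
    # The access pattern follows the graph structure, hence it is irregular.
    aggregated = np.stack([
        features[nbrs].sum(axis=0) if nbrs else np.zeros(features.shape[1])
        for nbrs in neighbors
    ])
    # Combination phase: a regular dense matrix multiplication
    # (plus a point-wise nonlinearity, here ReLU).
    return np.maximum(aggregated @ weight, 0.0)

# Toy 4-node graph with 8-dimensional features and a 4-unit output layer.
neighbors = [[1, 2], [0], [0, 3], [2]]
features = np.random.rand(4, 8)
weight = np.random.rand(8, 4)
print(gnn_layer(features, neighbors, weight).shape)  # (4, 4)
```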
Optimizing data-intensive workflow execution is essential to many modern scientific projects such as the Square Kilometre Array (SKA), which will be the largest radio telescope in the world, collecting terabytes of data per second for the next few decades. At the core of the SKA Science Data Processor is the graph execution engine, which schedules tens of thousands of algorithmic components to ingest and transform millions of parallel data chunks in order to solve a series of large-scale inverse problems within the power budget. To tackle this challenge, we have developed the Data Activated Liu Graph Engine (DALiuGE) to manage data processing pipelines for several SKA pathfinder projects. In this paper, we discuss the DALiuGE graph scheduling sub-system. By extending previous studies on graph scheduling and partitioning, we lay the foundation for polynomial-time optimization methods that minimize both workflow execution time and resource footprint while satisfying the resource constraints imposed by individual algorithms. We show preliminary results obtained from three radio astronomy data pipelines.
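As an illustration of the kind of polynomial-time scheduling problem at stake, the sketch below performs greedy list scheduling of a small task DAG onto a fixed pool of workers. The DAG, costs, and earliest-finish heuristic are illustrative assumptions, not the DALiuGE algorithm itself.

```python
from collections import defaultdict

def list_schedule(tasks, deps, cost, num_workers):
    """Greedy list scheduling of a task DAG (illustrative, not DALiuGE's
    algorithm). tasks: ids in any order; deps: {task: set of predecessors};
    cost: {task: runtime}. Returns {task: (worker, start, finish)}."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    children = defaultdict(list)
    for t, preds in deps.items():
        for p in preds:
            children[p].append(t)
    ready = [t for t in tasks if indeg[t] == 0]
    free_at = [0.0] * num_workers            # time each worker becomes free
    finish, schedule = {}, {}
    while ready:
        t = ready.pop(0)
        # A task may start once all of its predecessors have finished.
        earliest = max((finish[p] for p in deps.get(t, ())), default=0.0)
        w = min(range(num_workers), key=lambda i: max(free_at[i], earliest))
        start = max(free_at[w], earliest)
        free_at[w] = finish[t] = start + cost[t]
        schedule[t] = (w, start, finish[t])
        for c in children[t]:                # release tasks that became ready
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return schedule

# Toy pipeline: ingest -> {calibrate, flag} -> image, on two workers.
deps = {"calibrate": {"ingest"}, "flag": {"ingest"}, "image": {"calibrate", "flag"}}
cost = {"ingest": 2.0, "calibrate": 3.0, "flag": 1.0, "image": 4.0}
print(list_schedule(["ingest", "calibrate", "flag", "image"], deps, cost, 2))
```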
A critical challenge for modern system design is meeting the overwhelming performance, storage, and communication bandwidth demands of emerging applications within a tightly bound power budget. As the time and power, and hence the energy, spent in data communication by far exceed the energy spent in actual data generation (i.e., computation), (re)computing data can easily become cheaper than storing and retrieving (pre)computed data. Therefore, trading computation for communication can improve energy efficiency by minimizing the energy overhead incurred by data storage, retrieval, and communication. This paper provides a taxonomy for the computation vs. communication trade-off, along with a quantitative characterization.
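The break-even point behind this trade-off can be captured in a back-of-the-envelope comparison: recomputation wins whenever the energy of the extra operations falls below the energy of fetching the stored result. The per-operation and per-byte energy constants below are assumed orders of magnitude, not figures from the paper.

```python
# Assumed energy constants (rough orders of magnitude, not paper figures).
E_OP_PJ = 1.0          # energy per arithmetic operation, in picojoules
E_DRAM_BYTE_PJ = 20.0  # energy per byte fetched from off-chip DRAM

def recompute_is_cheaper(ops_to_recompute, bytes_to_fetch):
    """Recomputing wins when ops * E_op < bytes * E_byte."""
    return ops_to_recompute * E_OP_PJ < bytes_to_fetch * E_DRAM_BYTE_PJ

# Recomputing a 64-byte value with 100 operations costs ~100 pJ,
# versus ~1280 pJ to fetch the stored copy: recomputation wins.
print(recompute_is_cheaper(ops_to_recompute=100, bytes_to_fetch=64))  # True
```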
This paper aims to reduce the effort of learning, deploying, and integrating several frameworks for the development of e-Science applications that combine simulations with High-Performance Data Analytics (HPDA). We propose a way to extend task-based management systems to support continuous input and output data, enabling the combination of task-based workflows and dataflows (hereafter, Hybrid Workflows) under a single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements. To illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library and a fully functional prototype extending COMPSs, a mature, general-purpose, task-based, parallel programming model. The library can be easily integrated with existing task-based frameworks to provide support for dataflows. It also provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python, enabling complex workflows to handle any data type without dealing directly with the streaming back-end.
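The sketch below illustrates the Hybrid Workflow pattern in plain Python: a simulation task continuously publishes partial results to a stream while an analytics task consumes them as they arrive. The Stream class and task functions are hypothetical stand-ins for the Distributed Stream Library, not its actual API.

```python
import queue
import threading

class Stream:
    """Simplified stand-in for an object stream (here, an in-process queue);
    a hypothetical interface, not the Distributed Stream Library's API."""
    _CLOSED = object()

    def __init__(self):
        self._q = queue.Queue()

    def publish(self, item):
        self._q.put(item)

    def close(self):
        self._q.put(self._CLOSED)

    def poll(self):
        """Yield items until the producer closes the stream."""
        while True:
            item = self._q.get()
            if item is self._CLOSED:
                return
            yield item

def simulation(stream, steps):
    for t in range(steps):
        stream.publish({"step": t, "state": t * t})  # partial results
    stream.close()

def analytics(stream):
    for snapshot in stream.poll():
        print("analyzing step", snapshot["step"])

s = Stream()
producer = threading.Thread(target=simulation, args=(s, 5))
producer.start()
analytics(s)   # consumes data while the simulation is still running
producer.join()
```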
Low-latency, high-throughput inference on Convolutional Neural Networks (CNNs) remains a challenge, especially for applications requiring large input or large kernel sizes. 4F optics provides a way to accelerate CNNs by converting convolutions into Fourier-domain point-wise multiplications that are computationally free in the optical domain. However, existing 4F CNN systems suffer from an all-positive sensor readout issue that makes the implementation of a multi-channel, multi-layer CNN unscalable or even impractical. In this paper we propose a simple channel tiling scheme for 4F CNN systems that utilizes the high resolution of the 4F system to perform channel summation inherently in the optical domain before sensor detection, so the outputs of different channels can be correctly accumulated. Compared to the state of the art, channel tiling gives similar accuracy, significantly better robustness to sensing quantization error (33% improvement in required sensing precision) and noise (10 dB reduction in tolerable sensing noise), half (0.5X) the total filters required, a 10-50X+ throughput improvement, and as much as a 3X reduction in required output camera resolution/bandwidth. Requiring no additional optical hardware, the proposed channel tiling approach addresses an important throughput and precision bottleneck of high-speed, massively parallel optical 4F computing systems.
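The mathematical principle that channel tiling exploits can be checked numerically: a spatial convolution becomes a point-wise multiplication in the Fourier domain, and by linearity the per-channel products can be accumulated before any detection step. The sizes and random data below are illustrative assumptions; the sketch models the arithmetic, not the optical hardware.

```python
import numpy as np

def fourier_conv(x, k):
    """Circular 2D convolution of x with k via FFT point-wise multiply."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=x.shape)))

rng = np.random.default_rng(0)
C, H, W = 3, 32, 32                     # channels, height, width (assumed)
x = rng.standard_normal((C, H, W))      # multi-channel input
k = rng.standard_normal((C, 5, 5))      # one 5x5 filter per channel

# Per-channel Fourier-domain products, accumulated *before* detection --
# the summation that channel tiling realizes optically on a single plane.
field = sum(np.fft.fft2(x[c]) * np.fft.fft2(k[c], s=(H, W)) for c in range(C))
out_pre_detection = np.real(np.fft.ifft2(field))

# Reference: convolve each channel separately, then sum afterwards.
out_reference = sum(fourier_conv(x[c], k[c]) for c in range(C))
assert np.allclose(out_pre_detection, out_reference)
```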
This work presents a heterogeneous communication library for clusters of processors and FPGAs. This library, Shoal, supports the Partitioned Global Address Space (PGAS) memory model for applications. PGAS is a shared-memory model for clusters that distinguishes between local and remote memory access. Through Shoal and its common application programming interface for hardware and software, applications can be more freely migrated to the optimal platform and deployed onto dynamic cluster topologies. The library is evaluated with a thorough suite of microbenchmarks to establish its latency and throughput. We also present an implementation of the Jacobi iterative method that demonstrates the ease with which applications can be moved between platforms to yield faster run times. Through this work, we demonstrate the feasibility of using a PGAS programming model for multi-node heterogeneous platforms.
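The sketch below illustrates the PGAS idea in a few lines: one global address space is partitioned across nodes, every node can read or write any address, and local accesses bypass the network. The class and its get/put interface are illustrative assumptions, not Shoal's actual API.

```python
class PGASNode:
    """Toy PGAS node: a hypothetical stand-in, not Shoal's interface."""

    def __init__(self, rank, words_per_node, fabric):
        self.rank = rank
        self.words = words_per_node
        self.local = [0] * words_per_node   # this node's memory partition
        self.fabric = fabric                # rank -> PGASNode (stand-in network)

    def _locate(self, addr):
        """Map a global address to (owner rank, local offset)."""
        return addr // self.words, addr % self.words

    def put(self, addr, value):
        owner, off = self._locate(addr)
        if owner == self.rank:
            self.local[off] = value              # fast local store
        else:
            self.fabric[owner].local[off] = value  # remote write (network hop)

    def get(self, addr):
        owner, off = self._locate(addr)
        src = self.local if owner == self.rank else self.fabric[owner].local
        return src[off]

# Two-node cluster sharing a 16-word global address space.
fabric = {}
n0, n1 = PGASNode(0, 8, fabric), PGASNode(1, 8, fabric)
fabric.update({0: n0, 1: n1})
n0.put(12, 42)      # address 12 lives on node 1: remote write
print(n1.get(12))   # node 1 reads the same address locally -> 42
```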