Background: Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; the Tianhe-2 supercomputer, currently top of the TOP500 list, is built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to that of GPU counterparts, as a MIC board contains only ~60 cores (while a GPU board typically has over a thousand cores). Results: To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner, MICA, that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. Experiments on aligning 150bp paired-end reads show that MICA using one MIC board is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC boards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). Summary: MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using fewer than 400 nodes. MICA has impressive performance even though the current MIC is at its initial stage of development (the next generation of MIC has been announced for release in late 2014).
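The multi-board scaling claim in the abstract above can be checked with a short calculation. The sketch below uses only the two speedups quoted there (4.9x for one board, 14.1x for three boards, both relative to BWA-MEM); the function name `scaling_efficiency` is hypothetical and is not part of MICA.

```python
def scaling_efficiency(speedup_one: float, speedup_n: float, n_boards: int) -> float:
    """Fraction of ideal linear scaling achieved when going from 1 to n boards.

    speedup_one and speedup_n are speedups over the same baseline
    (here BWA-MEM), as quoted in the abstract.
    """
    ideal = speedup_one * n_boards  # perfect linear scaling
    return speedup_n / ideal


# Figures quoted in the abstract: 4.9x with one board, 14.1x with three.
eff = scaling_efficiency(4.9, 14.1, 3)
print(f"Parallel efficiency on 3 boards: {eff:.1%}")
```

Under these figures, three boards deliver about 96% of ideal linear scaling, which is consistent with the "very efficient scale-up" described.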
We propose a solution to the increased computational demands of Extremely Large Telescope (ELT) scale adaptive optics (AO) real-time control with the Intel Xeon Phi Knights Landing (KNL) Many Integrated Core (MIC) Architecture. The computational demands of an AO real-time controller (RTC) scale with the fourth power of telescope diameter and so the next generation ELTs require orders of magnitude more processing power for the RTC pipeline than existing systems. The Xeon Phi contains a large number (> 64) of low-power x86 CPU cores and high-bandwidth memory integrated into a single socketed server CPU package. The increased parallelism and memory bandwidth are crucial to providing the performance for reconstructing wavefronts with the required precision for ELT scale AO. Here, we demonstrate that the Xeon Phi KNL is capable of performing ELT scale single conjugate AO real-time control computation at over 1.0 kHz with less than 20 μs RMS jitter. We have also shown that with a wavefront sensor camera attached the KNL can process the real-time control loop at up to 966 Hz, the maximum frame-rate of the camera, with jitter remaining below 20 μs RMS. Future studies will involve exploring the use of a cluster of Xeon Phis for the real-time control of the MCAO and MOAO regimes of AO. We find that the Xeon Phi is highly suitable for ELT AO real-time control.
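To make the fourth-power scaling and the timing figures in the abstract above concrete, here is a minimal sketch. The telescope diameters (8.2 m for an existing large telescope and 39.3 m for an ELT-class telescope) are illustrative assumptions that do not come from the abstract, and the function names are hypothetical.

```python
def rtc_compute_ratio(d_old_m: float, d_new_m: float) -> float:
    """Relative RTC compute demand, assuming demand scales as diameter**4."""
    return (d_new_m / d_old_m) ** 4


def jitter_fraction(loop_rate_hz: float, jitter_us: float) -> float:
    """RMS jitter expressed as a fraction of one control-loop period."""
    period_us = 1e6 / loop_rate_hz  # frame period in microseconds
    return jitter_us / period_us


# Assumed diameters (not from the abstract): 8.2 m versus 39.3 m.
ratio = rtc_compute_ratio(8.2, 39.3)
print(f"Compute demand ratio: ~{ratio:.0f}x")

# Figures from the abstract: a 1.0 kHz loop with 20 microseconds RMS jitter.
print(f"Jitter as a share of the frame period: {jitter_fraction(1000.0, 20.0):.1%}")
```

With these assumed diameters the ratio is several hundred, in line with "orders of magnitude", and 20 μs of jitter is about 2% of a 1 ms frame period.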
Cosmic dust particles effectively attenuate starlight. Their absorption of starlight produces emission spectra from the near- to far-infrared, which depend on the sizes and properties of the dust grains and on the spectrum of the heating radiation field. The near- to mid-infrared is dominated by emission from very small grains. Modeling the absorption of starlight by these particles is, however, computationally expensive and a significant bottleneck for self-consistent radiation transport codes treating the heating of dust by stars. In this paper, we summarize the formalism for computing the stochastic emissivity of cosmic dust, which was developed in earlier works, and present a new library, HEATCODE, implementing this formalism for arbitrary grain properties and heating radiation fields. Our library is highly optimized for general-purpose processors with multiple cores and vector instructions, with hierarchical memory cache structure. The HEATCODE library also runs efficiently on co-processor cards implementing the Intel Many Integrated Core (Intel MIC) architecture. We discuss in detail the optimization steps that we took for the Intel MIC architecture, which also significantly benefited the performance of the code on general-purpose processors, and provide code samples and performance benchmarks for each step. The HEATCODE library performance on a single Intel Xeon Phi coprocessor (Intel MIC architecture) is approximately 2 times that of a general-purpose two-socket multicore processor system with approximately the same nominal power consumption. The library supports heterogeneous calculations employing host processors simultaneously with multiple coprocessors, and can be easily incorporated into existing radiation transport codes.
The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol, accelerating the entire process. No single methodological approach can achieve the necessary accuracy with the required efficiency. Here we describe multiple algorithmic innovations to overcome this fundamental limitation, and the development and deployment of computational infrastructure at scale that integrates multiple artificial intelligence and simulation-based approaches. Three measures of performance are: (i) throughput, the number of ligands per unit time; (ii) scientific performance, the number of effective ligands sampled per unit time; and (iii) peak performance, in flop/s. The capabilities outlined here have been used in production for several months as the workhorse of the computational infrastructure to support the capabilities of the US-DOE National Virtual Biotechnology Laboratory in combination with resources from the EU Centre of Excellence in Computational Biomedicine.
The Deep Graph Library (DGL) was designed as a tool to enable structure learning from graphs, by supporting a core abstraction for graphs, including the popular Graph Neural Networks (GNN). DGL contains implementations of all core graph operations for both the CPU and GPU. In this paper, we focus specifically on CPU implementations and present performance analysis, optimizations and results across a set of GNN applications using the latest version of DGL (0.4.3). Across 7 applications, we achieve speed-ups ranging from 1.5x to 13x over the baseline CPU implementations.
Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires a distributed computing environment, usually deploying multicore processors. Since the application can be easily parallelized for distributed memory systems, we focus on performance improvements on a single-socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time. We improved the performance of these kernels by 1) improving cache reuse, 2) simplifying the algorithms, 3) replacing small fragmented memory allocations with a few large contiguous ones, 4) software prefetching, and 5) SIMD utilization wherever applicable, along with a massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2x, 183x, and 8x speedups on the three kernels, respectively, resulting in up to 3.5x and 2.4x speedups on end-to-end compute time over the original BWA-MEM on a single thread and a single socket of an Intel Xeon Skylake processor, respectively. To the best of our knowledge, this is the highest reported speedup over BWA-MEM.
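The relationship between kernel-level and end-to-end speedups in the abstract above can be illustrated with Amdahl's law. This is a rough sketch, not the authors' analysis: it treats the three kernels as a single accelerated portion and takes the quoted "more than 85%" as exactly 0.85, which is an assumption. The per-kernel time shares are not given in the abstract, so no attempt is made to reproduce the reported 3.5x figure.

```python
def amdahl_speedup(accelerated_fraction: float, portion_speedup: float) -> float:
    """End-to-end speedup when a fraction of run time is sped up by a given factor."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / portion_speedup)


# Assumption: the three kernels account for exactly 85% of run time.
KERNEL_FRACTION = 0.85

# Upper bound: the kernels become free, so only the other 15% remains.
bound = amdahl_speedup(KERNEL_FRACTION, float("inf"))
print(f"End-to-end bound at 85% kernel share: {bound:.2f}x")

# Even a uniform 8x on the whole kernel portion stays well below that bound.
print(f"Uniform 8x on the kernels: {amdahl_speedup(KERNEL_FRACTION, 8.0):.2f}x")
```

Under this assumption the end-to-end speedup cannot exceed about 6.7x however fast the kernels become, which helps explain why a 183x gain on one kernel still translates into a 3.5x overall improvement. A larger true kernel share would raise the bound.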