Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how applying machine learning techniques can enrich the analysis process.
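To illustrate the kind of performance query the web application abstracts, the following is a minimal sketch of querying a provenance database for per-task execution statistics. The database file, table, and column names used here are hypothetical placeholders, not the actual BioWorkbench schema.

    # Hypothetical sketch of a provenance performance query, in the spirit of
    # the pre-built BioWorkbench queries. Table and column names are assumed,
    # not the real schema.
    import sqlite3

    conn = sqlite3.connect("provenance.db")  # assumed SQLite provenance database
    query = """
        SELECT task_name,
               COUNT(*)        AS executions,
               AVG(duration_s) AS avg_duration_s,
               MAX(duration_s) AS max_duration_s
        FROM task_execution            -- hypothetical per-task record table
        GROUP BY task_name
        ORDER BY avg_duration_s DESC
    """
    for task, n, avg_s, max_s in conn.execute(query):
        print(f"{task}: {n} runs, avg {avg_s:.1f}s, max {max_s:.1f}s")
    conn.close()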
Bioinformatics pipelines depend on shared POSIX file systems for their input, output, and intermediate data storage, and containerization makes it more difficult for these workloads to access shared file systems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully; however, the storage solutions were complex and suboptimal, because Kubernetes has no established resource type to represent the concept of a data source. As more and more applications move to Kubernetes for batch processing, end users are burdened with configuring and optimising data access, as we experienced ourselves. In this article, we introduce the concept of a Dataset and its corresponding custom resource as a native Kubernetes object. We leverage the Dataset Lifecycle Framework (DLF), which takes care of the low-level details of data access in Kubernetes pods. Its pluggable architecture is designed for the development of caching, scheduling, and governance plugins, which together manage the entire lifecycle of the Dataset custom resource. We use DLF to serve data from object stores to both ML and non-ML pipelines running on Kubeflow. With DLF, training data is fed into ML models directly, without first being downloaded to local disks, which makes data input scalable. We also improve the durability of training metadata by storing it in a Dataset, which simplifies setting up TensorBoard separately from the notebook server. For the non-ML case, we simplify the 1000 Genomes Project pipeline by injecting Datasets into it dynamically. In addition, our preliminary results indicate that the pluggable caching mechanism can improve performance significantly.
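As a rough illustration, the sketch below registers a Dataset backed by an object-store bucket using the standard Kubernetes Python client. The API group/version and spec field names follow published DLF (Datashim) examples and may differ between releases; the endpoint, bucket, and credentials are placeholders.

    # Sketch: registering a DLF Dataset via the Kubernetes Python client.
    # Group/version and spec fields follow DLF (Datashim) examples and may
    # vary across releases; credentials and bucket names are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    dataset = {
        "apiVersion": "com.ie.ibm.hpsys/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": "training-data"},
        "spec": {
            "local": {
                "type": "COS",                       # S3-compatible object store
                "endpoint": "https://s3.example.com",
                "bucket": "my-training-bucket",
                "accessKeyID": "PLACEHOLDER",
                "secretAccessKey": "PLACEHOLDER",
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="com.ie.ibm.hpsys", version="v1alpha1",
        namespace="default", plural="datasets", body=dataset,
    )
    # Pods can then request the data by carrying labels such as
    # dataset.0.id=training-data and dataset.0.useas=mount (DLF convention),
    # so the framework injects the mount without manual PV/PVC plumbing.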
The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is the ability to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach to data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source, high-performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail and show how it can be imported as a library into existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimal overhead, including with popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.
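To give a flavour of the library-style integration described above, the sketch below joins two tables through Cylon's Python binding alongside pandas. The PyCylon API shown (DataFrame, merge) is assumed from the project's examples and may differ between Cylon versions.

    # Rough sketch of using Cylon's Python binding (PyCylon) next to pandas.
    # The PyCylon API shown is assumed from project examples and may differ
    # between versions.
    import pandas as pd
    from pycylon import DataFrame

    left = DataFrame(pd.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.0, 30.0]}))
    right = DataFrame(pd.DataFrame({"id": [2, 3, 4], "y": [0.2, 0.3, 0.4]}))

    # Local (single-process) join; with a CylonEnv configured over MPI the
    # same call pattern can run as a distributed join across workers.
    joined = left.merge(right=right, on=["id"])
    print(joined)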
An emerging class of data-intensive applications involves the geographically dispersed extraction of complex scientific information from very large collections of measured or computed data. Such applications arise, for example, in experimental physics, where the data in question is generated by accelerators, and in simulation science, where the data is generated by supercomputers. So-called Data Grids provide essential infrastructure for such applications, much as the Internet provides essential services for applications such as e-mail and the Web. We describe here two services that we believe are fundamental to any Data Grid: reliable, high-speed transport and replica management. Our high-speed transport service, GridFTP, extends the popular FTP protocol with new features required for Data Grid applications, such as striping and partial file access. Our replica management service integrates a replica catalog with GridFTP transfers to provide for the creation, registration, location, and management of dataset replicas. We present the design of both services and preliminary performance results. Our implementations exploit security and other services provided by the Globus Toolkit.
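The following is a purely conceptual sketch, not the Globus or GridFTP API: it only illustrates the two ideas named in the abstract, namely resolving a logical file name to physical replicas via a catalog, and fetching a byte range of the chosen replica over several parallel streams (partial file access and striping).

    # Conceptual illustration only (not the Globus/GridFTP API): replica
    # lookup followed by a parallel, partial-file transfer of one replica.
    from concurrent.futures import ThreadPoolExecutor

    # A replica catalog maps a logical file name to its physical copies.
    replica_catalog = {
        "lfn://experiment/run42.dat": [
            "/mnt/site-a/run42.dat",
            "/mnt/site-b/run42.dat",
        ],
    }

    def fetch_range(path, offset, length):
        """Read one chunk of the file; stands in for one striped stream."""
        with open(path, "rb") as f:
            f.seek(offset)
            return offset, f.read(length)

    def partial_fetch(logical_name, offset, length, streams=4):
        replica = replica_catalog[logical_name][0]   # pick an available copy
        chunk = (length + streams - 1) // streams    # ceil-divide the range
        ranges = [(replica, offset + i * chunk, min(chunk, length - i * chunk))
                  for i in range(streams) if i * chunk < length]
        with ThreadPoolExecutor(max_workers=streams) as pool:
            parts = sorted(pool.map(lambda r: fetch_range(*r), ranges),
                           key=lambda t: t[0])
        return b"".join(data for _, data in parts)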
To harness the potential of advanced computing technologies, efficient (real-time) analysis of large amounts of data is as essential as front-line simulations. To optimise this process, experts need to be supported by appropriate tools that allow them to interactively guide both the computation and the data exploration of the underlying simulation code. The main challenge is to seamlessly feed the user's requirements back into the simulation. State-of-the-art attempts to achieve this have resulted in the insertion of so-called check- and break-points at fixed places in the code. Depending on the size of the problem, this can still compromise the benefits of such an approach, thus preventing the experience of truly interactive computing. To leverage the concept for a broader range of applications, it is essential that a user receives an immediate response from the simulation to his or her changes. Our generic integration framework, targeted at the needs of the computational engineering domain, supports distributed computations as well as on-the-fly visualisation in order to reduce latency and enable a high degree of interactivity with only minor code modifications. Specifically, the regular course of a simulation coupled to our framework is interrupted in small, cyclic intervals, followed by a check for updates. When new data is received, the simulation restarts automatically with the updated settings (boundary conditions, simulation parameters, etc.). To obtain rapid, albeit approximate, feedback from the simulation under perpetual user interaction, a multi-hierarchical approach is advantageous. Using several different engineering test cases, we demonstrate the flexibility and effectiveness of our approach.
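A minimal sketch of the cyclic interruption scheme described above, assuming a hypothetical simulation object with step(), apply_settings(), and restart() methods; the non-blocking queue stands in for the framework's actual communication layer.

    # Minimal sketch of the cyclic check-for-updates loop described above.
    # `sim` and the update channel are hypothetical stand-ins for the
    # framework's real components.
    import queue

    def run_interactive(sim, updates: "queue.Queue", steps_per_cycle=10):
        """Advance the simulation in small cycles, checking for user input."""
        while not sim.finished():
            # Regular course of the simulation, interrupted in small cycles.
            for _ in range(steps_per_cycle):
                sim.step()
            sim.send_snapshot()                      # on-the-fly visualisation
            try:
                new_settings = updates.get_nowait()  # non-blocking check
            except queue.Empty:
                continue
            # New data received: restart with updated boundary conditions etc.
            sim.apply_settings(new_settings)
            sim.restart(coarse_first=True)  # multi-hierarchical: coarse grid first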
In response to the soaring need for human mobility data, especially during disaster events such as the COVID-19 pandemic, and the associated big data challenges, we develop a scalable online platform for extracting, analyzing, and sharing multi-source, multi-scale human mobility flows. Within the platform, an origin-destination-time (ODT) data model is proposed to work with scalable query engines to handle heterogeneous mobility data in large volumes and with extensive spatial coverage, allowing billions of origin-destination (OD) flows to be extracted, queried, and aggregated in parallel on the server side. An interactive spatial web portal, ODT Flow Explorer, is developed to allow users to explore multi-source mobility datasets at user-defined spatiotemporal scales. To promote reproducibility and replicability, we further develop ODT Flow REST APIs that provide researchers with the flexibility to access the data programmatically via workflows, code, and programs. Demonstrations are provided to illustrate the potential of integrating the APIs with scientific workflows and with the Jupyter Notebook environment. We believe the platform, coupled with the derived multi-scale mobility data, can assist human mobility monitoring and analysis during disaster events such as the ongoing COVID-19 pandemic and benefit both scientific communities and the general public in understanding human mobility dynamics.
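A sketch of the kind of programmatic access a notebook-based demonstration might use; the endpoint URL, query parameters, and response layout below are placeholders, not the published ODT Flow API specification.

    # Sketch of calling an OD-flow REST API from a notebook. The URL,
    # parameter names, and response fields are placeholders, not the
    # published ODT Flow API.
    import requests
    import pandas as pd

    BASE_URL = "https://example.org/odtflow/api/v1/flows"  # placeholder endpoint

    params = {
        "source": "twitter",      # mobility data source
        "scale": "county",        # spatial scale of the O/D units
        "start": "2020-03-01",
        "end": "2020-03-31",
        "origin": "US-WA",        # origin region of interest
    }

    resp = requests.get(BASE_URL, params=params, timeout=60)
    resp.raise_for_status()
    flows = pd.DataFrame(resp.json()["flows"])  # assumed response layout
    print(flows.head())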