Dataset Lifecycle Framework and its applications in Bioinformatics

Added by David Yuan

Publication date 2021

Fields Informatics Engineering

Language English

Authors Yiannis Gkoufas Technology

Distributed Parallel and Cluster Computing Emerging Technologies

Abstract

Bioinformatics pipelines depend on shared POSIX filesystems for their input, output and intermediate data storage. Containerization makes it more difficult for workloads to access shared filesystems. In our previous study, we ran both ML and non-ML pipelines on Kubeflow successfully. However, the storage solutions were complex and suboptimal, because Kubernetes has no established resource type to represent the concept of a data source. More and more applications are running on Kubernetes for batch processing. End users are burdened with configuring and optimising data access, as we have experienced ourselves. In this article, we introduce a new concept, the Dataset, and its corresponding resource as a native Kubernetes object. We leverage the Dataset Lifecycle Framework (DLF), which takes care of all the low-level details of data access in Kubernetes pods. Its pluggable architecture is designed for the development of caching, scheduling and governance plugins. Together, they manage the entire lifecycle of the Dataset custom resource. We use DLF to serve data from object stores to both ML and non-ML pipelines running on Kubeflow. With DLF, training data are fed directly into ML models without being downloaded to local disks, which makes the input scalable. We have enhanced the durability of training metadata by storing it in a dataset, which also simplifies the setup of TensorBoard, separated from the notebook server. For the non-ML pipeline, we have simplified the 1000 Genomes Project pipeline with datasets injected into the pipeline dynamically. In addition, our preliminary results indicate that the pluggable caching mechanism can improve performance significantly.
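As a rough illustration of what "a Dataset as a native Kubernetes object" could look like, the following Python sketch builds a Dataset manifest for an object-store bucket and labels a pod so that a framework such as DLF could mount the dataset into it. The abstract does not give the resource schema: the API group `com.ie.ibm.hpsys/v1alpha1`, the `spec.local` field names and the `dataset.<n>.id` / `dataset.<n>.useas` pod labels are assumptions modelled on the open-source DLF (Datashim) project and should be checked against its documentation. The helper names `make_dataset_manifest` and `attach_dataset` are hypothetical.

```python
import copy
import json

# Assumed API group/version of the Dataset custom resource (not stated in the abstract).
API_VERSION = "com.ie.ibm.hpsys/v1alpha1"


def make_dataset_manifest(name, endpoint, bucket, access_key_id, secret_access_key):
    """Return a dict describing a Dataset backed by an object-store bucket.

    All field names under "spec" are assumptions for illustration.
    """
    return {
        "apiVersion": API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "local": {
                "type": "COS",  # assumed type tag for S3-compatible object storage
                "endpoint": endpoint,
                "bucket": bucket,
                "accessKeyID": access_key_id,
                "secretAccessKey": secret_access_key,
                "readonly": "true",  # assumed to be a string flag
            }
        },
    }


def attach_dataset(pod_manifest, dataset_name, index=0):
    """Return a copy of a pod manifest labelled to request a dataset mount.

    The label keys are assumptions; the input manifest is left unchanged.
    """
    pod = copy.deepcopy(pod_manifest)
    labels = pod.setdefault("metadata", {}).setdefault("labels", {})
    labels[f"dataset.{index}.id"] = dataset_name
    labels[f"dataset.{index}.useas"] = "mount"
    return pod


if __name__ == "__main__":
    # Placeholder endpoint, bucket and credentials; replace with real values.
    dataset = make_dataset_manifest(
        "training-data", "https://s3.example.org", "my-bucket", "KEY", "SECRET"
    )
    pod = attach_dataset({"kind": "Pod", "metadata": {"name": "trainer"}}, "training-data")
    print(json.dumps(dataset, indent=2))
    print(json.dumps(pod, indent=2))
```

The point of the sketch is the division of labour the abstract describes: the user declares where the data lives once, and each workload only names the dataset it needs, leaving the mounting details to the framework.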

BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments

98 - Maria Luiza Mondelli, Thiago Magalhães, Guilherme Loss 2018

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.

Distributed Parallel and Cluster Computing Databases

Orchestrating the Development Lifecycle of Machine Learning-Based IoT Applications: A Taxonomy and Survey

126 - Bin Qian, Jie Su, Zhenyu Wen 2019

Machine Learning (ML) and Internet of Things (IoT) are complementary advances: ML techniques unlock the full potential of IoT with intelligence, and IoT applications increasingly feed data collected by sensors into ML models, using the results to improve their business processes and services. Hence, orchestrating ML pipelines that encompass model training and implication within the holistic development lifecycle of an IoT application often leads to complex system integration. This paper provides a comprehensive and systematic survey of the development lifecycle of ML-based IoT applications. We outline the core roadmap and taxonomy, and subsequently assess and compare existing standard techniques used in each stage.

Distributed Parallel and Cluster Computing Machine Learning Networking and Internet Architecture

Toward Scalable Machine Learning and Data Mining: the Bioinformatics Case

95 - Faraz Faghri, Sayed Hadi Hashemi, Mohammad Babaeizadeh 2017

In an effort to overcome the data deluge in computational biology and bioinformatics and to facilitate bioinformatics research in the era of big data, we identify some of the most influential algorithms that have been widely used in the bioinformatics community. These top data mining and machine learning algorithms cover classification, clustering, regression, graphical model-based learning, and dimensionality reduction. The goal of this study is to guide the focus of scalable computing experts in applying new storage and scalable computation designs to the bioinformatics algorithms that most merit their attention, following the engineering maxim of "optimize the common case".

Distributed Parallel and Cluster Computing Machine Learning Machine Learning

A Benchmarking Framework for Interactive 3D Applications in the Cloud

131 - Tianyi Liu, Sen He, Sunzhou Huang 2020

With the growing popularity of cloud gaming and cloud virtual reality (VR), interactive 3D applications have become a major type of workload for the cloud. However, despite their growing importance, there is limited public research on how to design cloud systems to efficiently support these applications, due to the lack of an open and reliable research infrastructure, including benchmarks and performance analysis tools. The challenges of generating human-like inputs under various system/application randomness and dissecting the performance of complex graphics systems make it very difficult to design such an infrastructure. In this paper, we present the design of a novel cloud graphics rendering research infrastructure, Pictor. Pictor employs AI to mimic human interactions with complex 3D applications. It can also provide in-depth performance measurements for the complex software and hardware stack used for cloud 3D graphics rendering. With Pictor, we designed a benchmark suite with six interactive 3D applications. Performance analyses were conducted with these benchmarks to characterize 3D applications in the cloud and reveal new performance bottlenecks. To demonstrate the effectiveness of Pictor, we also implemented two optimizations to address two performance bottlenecks discovered in a state-of-the-art cloud 3D-graphics rendering system, which improved the frame rate by 57.7% on average.

Distributed Parallel and Cluster Computing Graphics

Nonlinear nonlocal multicontinua upscaling framework and its applications

67 - Wing T. Leung, Eric T. Chung, Yalchin Efendiev 2018

In this paper, we discuss multiscale methods for nonlinear problems. The main idea of these approaches is to use local constraints and solve problems in oversampled regions for constructing macroscopic equations. These techniques are intended for problems without scale separation and high contrast, which often occur in applications. For linear problems, the local solutions with constraints are used as basis functions. This technique is called the Constraint Energy Minimizing Generalized Multiscale Finite Element Method (CEM-GMsFEM). GMsFEM identifies macroscopic quantities based on rigorous analysis. In the corresponding upscaling methods, the multiscale basis functions are selected such that the degrees of freedom have physical meanings, such as averages of the solution on each continuum. This paper extends the linear concepts to nonlinear problems, where the local problems are nonlinear. The main concept consists of: (1) identifying macroscopic quantities; (2) constructing appropriate oversampled local problems with coarse-grid constraints; (3) formulating macroscopic equations. We consider two types of approaches. In the first approach, the solutions of local problems are used as basis functions (in a linear fashion) to solve nonlinear problems. This approach is simple to implement; however, it lacks the nonlinear interpolation, which we present in our second approach. In this approach, the local solutions are used as a nonlinear forward map from local averages (constraints) of the solution in the oversampling region. This local fine-grid solution is further used to formulate the coarse-grid problem. Both approaches are discussed on several examples and applied to single-phase and two-phase flow problems, which are challenging because of the convection-dominated nature of the concentration equation.

Numerical Analysis
