ترغب بنشر مسار تعليمي؟ اضغط هنا

Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics

49   0   0.0 ( 0 )
 نشر من قبل Umberto Ferraro Petrillo
 تاريخ النشر 2018
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of map-reduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mers counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collection of biological sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a fully exploitation of the underly- ing distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting a very good scalability. We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.


قيم البحث

اقرأ أيضاً

Several fundamental changes in technology indicate domain-specific hardware and software co-design is the only path left. In this context, architecture, system, data management, and machine learning communities pay greater attention to innovative big data and AI algorithms, architecture, and systems. Unfortunately, complexity, diversity, frequently-changed workloads, and rapid evolution of big data and AI systems raise great challenges. First, the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload is not scalable, or even impossible for Big Data and AI benchmarking. Second, it is prohibitively expensive to tailor the architecture to characteristics of one or more application or even a domain of applications. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs, each class of which we call a data motif. On the basis of our previous work that identifies eight data motifs taking up most of the run time of a wide variety of big data and AI workloads, we propose a scalable benchmarking methodology that uses the combination of one or more data motifs---to represent diversity of big data and AI workloads. Following this methodology, we present a unified big data and AI benchmark suite---BigDataBench 4.0, publicly available from~url{http://prof.ict.ac.cn/BigDataBench}. This unified benchmark suite sheds new light on domain-specific hardware and software co-design: tailoring the system and architecture to characteristics of the unified eight data motifs other than one or more application case by case. Also, for the first time, we comprehensively characterize the CPU pipeline efficiency using the benchmarks of seven workload types in BigDataBench 4.0.
In one of the most important methods in Density Functional Theory - the Full-Potential Linearized Augmented Plane Wave (FLAPW) method - dense generalized eigenproblems are organized in long sequences. Moreover each eigenproblem is strongly correlated to the next one in the sequence. We propose a novel approach which exploits such correlation through the use of an eigensolver based on subspace iteration and accelerated with Chebyshev polynomials. The resulting solver, parallelized using the Elemental library framework, achieves excellent scalability and is competitive with current dense parallel eigensolvers.
The paper adopts parallel computing systems for predictive analysis in both CPU and GPU leveraging Spark Big Data platform. The traffic dataset is adopted to predict the traffic jams in Los Angeles County. It is collected from a popular platform in t he USA for tracking information on the road using the device information and reports shared by the users. Large-scale traffic data set can be stored and processed using both GPU and CPU in this Scalable Big Data systems. The major contribution of this paper is to improve the performance of machine learning in distributed parallel computing systems with GPU to predict the traffic congestion. We show that the parallel computing can be achieve using both GPU and CPU with the existing Apache Spark platform. Our method can be applicable to other large scale datasets in different domains. The process modeling, as well as results, are interpreted using computing time and metrics: AUC, Precision and Recall. It should help the traffic management in Smart City.
The practical realization of managing and executing large scale scientific computations efficiently and reliably is quite challenging. Scientific computations often involve thousands or even millions of tasks operating on large quantities of data, su ch data are often diversely structured and stored in heterogeneous physical formats, and scientists must specify and run such computations over extended periods on collections of compute, storage and network resources that are heterogeneous, distributed and may change constantly. We present the integration of several advanced systems: Swift, Karajan, and Falkon, to address the challenges in running various large scale scientific applications in Grid environments. Swift is a parallel programming tool for rapid and reliable specification, execution, and management of large-scale science and engineering workflows. Swift consists of a simple scripting language called SwiftScript and a powerful runtime system that is based on the CoG Karajan workflow engine and integrates the Falkon light-weight task execution service that uses multi-level scheduling and a streamlined dispatcher. We showcase the scalability, performance and reliability of the integrated system using application examples drawn from astronomy, cognitive neuroscience and molecular dynamics, which all comprise large number of fine-grained jobs. We show that Swift is able to represent dynamic workflows whose structures can only be determined during runtime and reduce largely the code size of various workflow representations using SwiftScript; schedule the execution of hundreds of thousands of parallel computations via the Karajan engine; and achieve up to 90% reduction in execution time when compared to traditional batch schedulers.
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed so lution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row, column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا