Putting Data Science Pipelines on the Edge

Published by: Genoveva Vargas-Solar

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Ali Akoglu, Genoveva Vargas-Solar

Databases

Abstract

This paper proposes a composable Just in Time Architecture for Data Science (DS) Pipelines named JITA-4DS and associated resource management techniques for configuring disaggregated data centers (DCs). DCs under our approach are composable based on vertical integration of the application, middleware/operating system, and hardware layers, customized dynamically to meet application Service Level Objectives (SLOs), i.e., application-aware management. Thereby, pipelines utilize a set of flexible building blocks that can be dynamically and automatically assembled and re-assembled to meet dynamic changes in the workloads' SLOs. To assess disaggregated DCs, we study how to model and validate their performance in large-scale settings.

Genoveva Vargas-Solar, Ali Akoglu, Md Sahil Hassan (2021)

This paper targets the execution of data science (DS) pipelines supported by data processing, transmission and sharing across several resources executing greedy processes. Current data science pipeline environments provide various infrastructure services with computing resources such as general-purpose processors (GPPs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and Tensor Processing Units (TPUs), coupled with platform and software services to design, run and maintain DS pipelines. These one-size-fits-all solutions impose the complete externalization of data pipeline tasks. However, some tasks can be executed on the edge, and the backend can provide just-in-time resources to ensure ad-hoc and elastic execution environments. This paper introduces an innovative composable Just in Time Architecture for configuring DCs for Data Science Pipelines (JITA-4DS) and associated resource management techniques. JITA-4DS is a cross-layer management system that is aware of both the application characteristics and the underlying infrastructures, breaking the barriers between the application, middleware/operating system, and hardware layers. Vertical integration of these layers is needed to build a customizable Virtual Data Center (VDC) that meets the dynamically changing requirements of data science pipelines, such as performance, availability, and energy consumption. Accordingly, the paper presents an experimental simulation devoted to running data science workloads and determining the best strategies for scheduling the allocation of resources implemented by JITA-4DS.

Distributed, Parallel, and Cluster Computing

Lux: Always-on Visualization Recommendations for Exploratory Data Science

Doris Jung-Lin Lee, Dixin Tang, Kunal Agarwal (2021)

Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose Lux, an always-on framework for accelerating visual insight discovery in data science workflows. When users print a dataframe in their notebooks, Lux recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. Lux features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that through the use of a careful design and three system optimizations, Lux adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate Lux in terms of usability via a controlled first-use study and interviews with early adopters, finding that Lux helps fulfill the needs of data scientists for visualization support within their dataframe workflows. Lux has already been embraced by data science practitioners, with over 1.9k stars on GitHub within its first 15 months.

Databases; Human-Computer Interaction

Auto-Pipeline: Synthesizing Complex Data Pipelines By-Target Using Reinforcement Learning and Search

Junwen Yang, Yeye He, Surajit Chaudhuri (2021)

Recent work has made significant progress in helping users to automate single data preparation steps, such as string-transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). In this work, we propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines with both string transformations and table-manipulation operators. We propose a novel by-target paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target, users would provide input tables (e.g., csv or json files), and point us to a target table (e.g., an existing database table or BI dashboard) to demonstrate what the output from the desired pipeline would schematically look like. While the problem is seemingly underspecified, our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space to make the problem tractable. We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search. Experiments on large numbers of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines with up to 10 steps.
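To illustrate how key and functional-dependency (FD) constraints can prune candidate outputs, here is a minimal Python sketch. It is not the Auto-Pipeline implementation: the function names (`is_key`, `holds_fd`, `satisfies_target_constraints`) and the sample tables are hypothetical, and the sketch only assumes that constraints observed on the target table are checked against the output of each candidate pipeline.

```python
from typing import Hashable, Sequence

Row = dict[str, Hashable]  # one table row, column name -> value


def is_key(rows: Sequence[Row], cols: Sequence[str]) -> bool:
    """Return True if the values in `cols` are unique across all rows."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in cols)
        if value in seen:
            return False
        seen.add(value)
    return True


def holds_fd(rows: Sequence[Row], lhs: Sequence[str], rhs: Sequence[str]) -> bool:
    """Return True if the functional dependency lhs -> rhs holds on `rows`."""
    mapping = {}
    for row in rows:
        left = tuple(row[c] for c in lhs)
        right = tuple(row[c] for c in rhs)
        # The first right-hand value seen for `left` must match all later ones.
        if mapping.setdefault(left, right) != right:
            return False
    return True


def satisfies_target_constraints(rows, keys, fds) -> bool:
    """Keep a candidate output only if every target key and FD holds on it."""
    return all(is_key(rows, k) for k in keys) and all(
        holds_fd(rows, lhs, rhs) for lhs, rhs in fds
    )


# Hypothetical constraints read off a target table.
target_keys = [["order_id"]]
target_fds = [(["customer"], ["city"])]

# Hypothetical outputs of two candidate pipelines.
candidate_a = [
    {"order_id": 1, "customer": "a", "city": "X"},
    {"order_id": 2, "customer": "a", "city": "X"},
    {"order_id": 3, "customer": "b", "city": "Y"},
]
# A faulty join duplicates order 3 with a conflicting city.
candidate_b = candidate_a + [{"order_id": 3, "customer": "b", "city": "Z"}]

kept = [
    name
    for name, rows in [("a", candidate_a), ("b", candidate_b)]
    if satisfies_target_constraints(rows, target_keys, target_fds)
]
```

In this sketch, only candidate `a` survives the check; a search procedure could discard candidate `b` without inspecting it further, which is the sense in which such constraints shrink the search space.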

Databases; Machine Learning

Automating Data Science: Prospects and Challenges

Tijl De Bie, Luc De Raedt, Jose Hernandez-Orallo (2021)

Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights:
* Automation in data science aims to facilitate and transform the work of data scientists, not to replace them.
* Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction.
* Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.

Databases; Machine Learning

Science Pipelines for the Square Kilometre Array

Jamie Farnes, Ben Mort, Fred Dulwich (2018)

The Square Kilometre Array (SKA) will be both the largest radio telescope ever constructed and the largest Big Data project in the known Universe. The first phase of the project will generate on the order of 5 zettabytes of data per year. A critical task for the SKA will be its ability to process data for science, which will need to be conducted by science pipelines. Together with polarization data from the LOFAR Multifrequency Snapshot Sky Survey (MSSS), we have been developing a realistic SKA-like science pipeline that can handle the large data volumes generated by LOFAR at 150 MHz. The pipeline uses task-based parallelism to image, detect sources, and perform Faraday Tomography across the entire LOFAR sky. The project thereby provides a unique opportunity to contribute to the technological development of the SKA telescope, while simultaneously enabling cutting-edge scientific results. In this paper, we provide an update on current efforts to develop a science pipeline that can enable tight constraints on the magnetised large-scale structure of the Universe.

Instrumentation and Methods for Astrophysics; Cosmology and Nongalactic Astrophysics
