Automating Data Science: Prospects and Challenges

354 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Chris Williams

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Tijl De Bie - Luc De Raedt - Jose Hernandez-Orallo

قواعد البيانات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights: * Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. * Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction. * Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.

قيم البحث

68 - Masoud Salehpour , Joseph G. Davis 2020

There is growing interest in the use of Knowledge Graphs (KGs) for the representation, exchange, and reuse of scientific data. While KGs offer the prospect of improving the infrastructure for working with scalable and reusable scholarly data consiste nt with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, the state-of-the-art Data Management Systems (DMSs) for processing large KGs leave somewhat to be desired. In this paper, we studied the performance of some of the major DMSs in the context of querying KGs with the goal of providing a finely-grained, comparative analysis of DMSs representing each of the four major DMS types. We experimented with four well-known scientific KGs, namely, Allie, Cellcycle, DrugBank, and LinkedSPL against Virtuoso, Blazegraph, RDF-3X, and MongoDB as the representative DMSs. Our results suggest that the DMSs display limitations in processing complex queries on the KG datasets. Depending on the query type, the performance differentials can be several orders of magnitude. Also, no single DMS appears to offer consistently superior performance. We present an analysis of the underlying issues and outline two integrated approaches and proposals for resolving the problem.

قواعد البيانات

Industrial Big Data Analytics: Challenges, Methodologies, and Applications

72 - JunPing Wang , WenSheng Zhang , YouKang Shi 2018

While manufacturers have been generating highly distributed data from various systems, devices and applications, a number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges for industrial big data analytics is real-time analysis and decision-making from massive heterogeneous data sources in manufacturing space. This survey presents new concepts, methodologies, and applications scenarios of industrial big data analytics, which can provide dramatic improvements in velocity and veracity problem solving. We focus on five important methodologies of industrial big data analytics: 1) Highly distributed industrial data ingestion: access and integrate to highly distributed data sources from various systems, devices and applications; 2) Industrial big data repository: cope with sampling biases and heterogeneity, and store different data formats and structures; 3) Large-scale industrial data management: organizes massive heterogeneous data and share large-scale data; 4) Industrial data analytics: track data provenance, from data generation through data preparation; 5) Industrial data governance: ensures data trust, integrity and security. For each phase, we introduce to current research in industries and academia, and discusses challenges and potential solutions. We also examine the typical applications of industrial big data, including smart factory visibility, machine fleet, energy management, proactive maintenance, and just in time supply chain. These discussions aim to understand the value of industrial big data. Lastly, this survey is concluded with a discussion of open problems and future directions.

قواعد البيانات

Data Management in Time-Domain Astronomy: Requirements and Challenges

336 - Chen Yang , Xiaofeng Meng , Zhihui Du 2018

In time-domain astronomy, we need to use the relational database to manage star catalog data. With the development of sky survey technology, the size of star catalog data is larger, and the speed of data generation is faster. So, in this paper, we ma ke a systematic and comprehensive introduction to process the data in time-domain astronomy, and valuable research questions are detailed. Then, we list candidate systems usually used in astronomy and point out the advantages and disadvantages of these systems. In addition, we present the key techniques needed to deal with astronomical data. Finally, we summarize the challenges faced by the design of our database prototype.

قواعد البيانات

Putting Data Science Pipelines on the Edge

103 - Ali Akoglu , Genoveva Vargas-Solar 2021

This paper proposes a composable Just in Time Architecture for Data Science (DS) Pipelines named JITA-4DS and associated resource management techniques for configuring disaggregated data centers (DCs). DCs under our approach are composable based on v ertical integration of the application, middleware/operating system, and hardware layers customized dynamically to meet application Service Level Objectives (SLO - application-aware management). Thereby, pipelines utilize a set of flexible building blocks that can be dynamically and automatically assembled and re-assembled to meet the dynamic changes in the workloads SLOs. To assess disaggregated DCs, we study how to model and validate their performance in large-scale settings.

قواعد البيانات

An analytical framework for data stream mining techniques based on challenges and requirements

364 - Mahnoosh Kholghi (Department of Electronic , Computer , IT 2011

A growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are such examples. The imminent need for turning such data into useful information and knowledge augments the development of systems, algorithms and frameworks that address streaming challenges. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures represented in models and patterns in non stopping streams of information. Generally, two main challenges are designing fast mining methods for data streams and need to promptly detect changing concepts and data distribution because of highly dynamic nature of data streams. The goal of this article is to analyze and classify the application of diverse data mining techniques in different challenges of data stream mining. In this paper, we present the theoretical foundations of data stream analysis and propose an analytical framework for data stream mining techniques.

قواعد البيانات