ترغب بنشر مسار تعليمي؟ اضغط هنا

Vamsa: Automated Provenance Tracking in Data Science Scripts

393   0   0.0 ( 0 )
 نشر من قبل Subru Krishnan
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

There has recently been a lot of ongoing research in the areas of fairness, bias and explainability of machine learning (ML) models due to the self-evident or regulatory requirements of various ML applications. We make the following observation: All of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features/labels of an ML model. We discuss the challenges in capturing such information in the context of Python, the most common language used by data scientists. We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage, and performance. We also evaluate Vamsas accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsas precision and recall range from 90.4% to 99.1% and its latency is in the order of milliseconds for average size scripts. Drawing from our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.

قيم البحث

اقرأ أيضاً

The recent success of machine learning (ML) has led to an explosive growth both in terms of new systems and algorithms built in industry and academia, and new applications built by an ever-growing community of data science (DS) practitioners. This qu ickly shifting panorama of technologies and applications is challenging for builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, by performing the largest analysis of DS projects to date, focusing on questions that can help determine investments on either side. Specifically, we download and analyze: (a) over 6M Python notebooks publicly available on GITHUB, (b) over 2M enterprise DS pipelines developed within COMPANYX, and (c) the source code and metadata of over 900 releases from 12 important DS libraries. The analysis we perform ranges from coarse-grained statistical characterizations to analysis of library imports, pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret, and dare to draw a few (actionable, yet subjective) conclusions on (a) what systems builders should focus on to better serve practitioners, and (b) what technologies should practitioners bet on given current trends. We plan to automate this analysis and release associated tools and results periodically.
For data-centric systems, provenance tracking is particularly important when the system is open and decentralised, such as the Web of Linked Data. In this paper, a concise but expressive calculus which models data updates is presented. The calculus i s used to provide an operational semantics for a system where data and updates interact concurrently. The operational semantics of the calculus also tracks the provenance of data with respect to updates. This provides a new formal semantics extending provenance diagrams which takes into account the execution of processes in a concurrent setting. Moreover, a sound and complete model for the calculus based on ideals of series-parallel DAGs is provided. The notion of provenance introduced can be used as a subjective indicator of the quality of data in concurrent interacting systems.
Regression problems that have closed-form solutions are well understood and can be easily implemented when the dataset is small enough to be all loaded into the RAM. Challenges arise when data is too big to be stored in RAM to compute the closed form solutions. Many techniques were proposed to overcome or alleviate the memory barrier problem but the solutions are often local optimal. In addition, most approaches require accessing the raw data again when updating the models. Parallel computing clusters are also expected if multiple models need to be computed simultaneously. We propose multiple learning approaches that utilize an array of sufficient statistics (SS) to address this big data challenge. This memory oblivious approach breaks the memory barrier when computing regressions with closed-form solutions, including but not limited to linear regression, weighted linear regression, linear regression with Box-Cox transformation (Box-Cox regression) and ridge regression models. The computation and update of the SS array can be handled at per row level or per mini-batch level. And updating a model is as easy as matrix addition and subtraction. Furthermore, multiple SS arrays for different models can be easily computed simultaneously to obtain multiple models at one pass through the dataset. We implemented our approaches on Spark and evaluated over the simulated datasets. Results showed our approaches can achieve closed-form solutions of multiple models at the cost of half training time of the traditional methods for a single model.
Microsofts internal big data analytics platform is comprised of hundreds of thousands of machines, serving over half a million jobs daily, from thousands of users. The majority of these jobs are recurring and are crucial for the companys operation. A lthough administrators spend significant effort tuning system performance, some jobs inevitably experience slowdowns, i.e., their execution time degrades over previous runs. Currently, the investigation of such slowdowns is a labor-intensive and error-prone process, which costs Microsoft significant human and machine resources, and negatively impacts several lines of businesses. In this work, we present Griffin, a system we built and have deployed in production last year to automatically discover the root cause of job slowdowns. Existing solutions either rely on labeled data (i.e., resolved incidents with labeled reasons for job slowdowns), which is in most cases non-existent or non-trivial to acquire, or on time-series analysis of individual metrics that do not target specific jobs holistically. In contrast, in Griffin we cast the problem to a corresponding regression one that predicts the runtime of a job, and show how the relative contributions of the features used to train our interpretable model can be exploited to rank the potential causes of job slowdowns. Evaluated over historical incidents, we show that Griffin discovers slowdown causes that are consistent with the ones validated by domain-expert engineers, in a fraction of the time required by them.
Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that can faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but with heterogeneity in consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to 9.2% accuracy drop, 2.32x lengthened training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two potential factors for performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers consider necessary heterogeneity during the evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا