ﻻ يوجد ملخص باللغة العربية
Microsofts internal big-data infrastructure is one of the largest in the world -- with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, faced with evolving hardware/software/workloads this manual tuning approach had reached its limit -- we had plateaued. In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our cluster dynamic behavior in a set of machine learning (ML) models based on collected system data. These models power automated optimization procedures for parameter tuning, and inform our leadership in critical decisions around engineering and capacity management (such as hardware and data center design, software investments, etc.). We combine observational tuning (i.e., using models to predict system behavior without direct experimentation) with judicious use of flighting (i.e., conservative testing in production). This allows us to support a broad range of applications that we discuss in this paper. KEA continuously tunes our cluster configurations and is on track to save Microsoft tens of millions of dollars per year. At the best of our knowledge, this paper is the first to discuss research challenges and practical learnings that emerge when tuning an exabyte-scale data infrastructure.
In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are colle
Given a replicated database, a divergent design tunes the indexes in each replica differently in order to specialize it for a specific subset of the workload. This specialization brings significant performance gains compared to the common practice of
We document the data transfer workflow, data transfer performance, and other aspects of staging approximately 56 terabytes of climate model output data from the distributed Coupled Model Intercomparison Project (CMIP5) archive to the National Energy
Performance tuning of Database Management Systems(DBMS) is both complex and challenging as it involves identifying and altering several key performance tuning parameters. The quality of tuning and the extent of performance enhancement achieved greatl
Data mining has been widely recognized as a powerful tool to explore added value from large-scale databases. Finding frequent item sets in databases is a crucial in data mining process of extracting association rules. Many algorithms were developed t