
A Framework for Auditing Data Center Energy Usage and Mitigating Environmental Footprint

Added by: Justin Gould
Publication date: 2021
Language: English
Authors: Justin Gould





As the Data Science field continues to mature and we collect more data, the demand to store and analyze those data will continue to increase. This growth in data availability and demand for analytics will strain data centers and compute clusters, with implications for both energy costs and emissions. As the world battles a climate crisis, it is prudent for organizations that operate data centers to have a framework for combating rising energy costs and emissions while meeting the demand for analytics work. In this paper, I present a generalized framework that organizations can use to audit a data center's energy efficiency, understand the resources required to operate it, and identify effective steps to improve efficiency and lower environmental impact.
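One concrete quantity such an audit is likely to report is Power Usage Effectiveness (PUE), the ratio of total facility energy to the energy delivered to IT equipment. The sketch below illustrates only the calculation; the function name and readings are hypothetical and not taken from the paper.

# A minimal sketch of Power Usage Effectiveness (PUE), a common data center
# efficiency metric that an energy audit of this kind might include.
# The values below are illustrative placeholders, not measured data.

def power_usage_effectiveness(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy (1.0 is the ideal lower bound)."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

if __name__ == "__main__":
    # Hypothetical monthly readings for a single facility.
    pue = power_usage_effectiveness(total_facility_kwh=1_500_000, it_equipment_kwh=1_000_000)
    print(f"PUE: {pue:.2f}")  # 1.50, i.e. 0.5 kWh of overhead per kWh of IT load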



Related Research

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely used computer vision and audio DNNs, which typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data-loading library, DNN training time is reduced significantly (by as much as 5x on a single server).
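As an illustration of the data-stall concept only (not the DS-Analyzer or CoorDL APIs), the following sketch times how long a training loop waits on the data pipeline versus how long it spends computing; the loader and step function are hypothetical stand-ins.

# A minimal sketch: measure data stall time (waiting for fetch/pre-processing)
# versus compute time in a generic training loop. Illustrative only.
import random
import time

def train_with_stall_timing(dataloader, step_fn, num_steps=100):
    stall_time = 0.0
    compute_time = 0.0
    batches = iter(dataloader)
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(batches)     # blocks while data is fetched and preprocessed
        t1 = time.perf_counter()
        step_fn(batch)            # stand-in for forward/backward/optimizer step
        t2 = time.perf_counter()
        stall_time += t1 - t0
        compute_time += t2 - t1
    total = stall_time + compute_time
    print(f"data stall fraction: {stall_time / total:.1%}")

if __name__ == "__main__":
    fake_loader = ([random.random() for _ in range(32)] for _ in range(100))  # hypothetical batches
    train_with_stall_timing(fake_loader, step_fn=sum, num_steps=100)
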
Improving datacenter operations is vital for the digital society. We posit that doing so requires our community to shift from operational aspects taken in isolation to holistic analysis of datacenter resources, energy, and workloads. In turn, this shift will require new analysis methods and open-access, FAIR datasets with fine temporal and spatial granularity. In this work, we leverage one of the rare public datasets providing fine-grained information on datacenter operations. Using it, we show strong evidence that fine-grained information reveals new operational aspects. We then propose a method for holistic analysis of datacenter operations, providing statistical characterization of node, energy, and workload aspects. We demonstrate the benefits of our holistic analysis method by applying it to the operations of a datacenter infrastructure with over 300 nodes. Our analysis reveals both generic and ML-specific aspects, and further details how the operational behavior of the datacenter changed during the 2020 COVID-19 pandemic. We make over 30 main observations, providing holistic insight into the long-term operation of a large-scale, public scientific infrastructure. We suggest such observations can help immediately with performance engineering tasks such as predicting future datacenter load, and also long-term with the design of datacenter infrastructure.
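As a rough illustration of per-node statistical characterization of this kind (not the paper's dataset or code), the sketch below assumes a fine-grained metrics table with hypothetical columns node_id, power_watts, cpu_util, and jobs_running.

# A minimal sketch of per-node characterization of energy and workload metrics.
# Column names and values are hypothetical.
import pandas as pd

def summarize_nodes(metrics: pd.DataFrame) -> pd.DataFrame:
    """Aggregate energy and workload statistics per node."""
    return metrics.groupby("node_id").agg(
        mean_power_w=("power_watts", "mean"),
        p95_power_w=("power_watts", lambda s: s.quantile(0.95)),
        mean_cpu_util=("cpu_util", "mean"),
        mean_jobs=("jobs_running", "mean"),
    )

if __name__ == "__main__":
    df = pd.DataFrame({
        "node_id": ["n1", "n1", "n2", "n2"],
        "power_watts": [210.0, 240.0, 180.0, 200.0],
        "cpu_util": [0.6, 0.8, 0.4, 0.5],
        "jobs_running": [3, 4, 1, 2],
    })
    print(summarize_nodes(df))
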
Popular distributed ledger technology (DLT) systems using proof-of-work (PoW) for Sybil attack resistance have extreme energy requirements, drawing stern criticism from academia, businesses, and the media. DLT systems building on alternative consensus mechanisms, foremost proof-of-stake (PoS), aim to address this downside. In this paper, we take a first step towards comparing the energy requirements of such systems to understand whether they achieve this goal equally well. While multiple studies have analyzed the energy demands of individual blockchains, little comparative work has been done. We approach this research question by formalizing a basic consumption model for PoS blockchains. Applying this model to six archetypal blockchains generates three main findings: First, we confirm the concerns around the energy footprint of PoW by showing that Bitcoin's energy consumption exceeds the energy consumption of all PoS-based systems analyzed by at least three orders of magnitude. Second, we illustrate that there are significant differences in energy consumption among the PoS-based systems analyzed, with permissionless systems having an overall larger energy footprint. Third, we point out that the type of hardware that validators use has a considerable impact on whether PoS blockchains' energy consumption is comparable with or considerably larger than that of centralized, non-DLT systems.
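In the spirit of such a basic consumption model (without reproducing the paper's formalization or figures), a first-order sketch scales network energy with the number of validators and the per-validator hardware power draw; all numbers below are illustrative.

# A minimal first-order proof-of-stake energy model: total energy grows with the
# validator count and the power draw of the hardware each validator runs on.
# Values are illustrative placeholders, not the paper's figures.

def pos_network_energy_kwh(num_validators: int, watts_per_validator: float, hours: float) -> float:
    """Energy (kWh) = validators x per-validator power (W) x time (h) / 1000."""
    return num_validators * watts_per_validator * hours / 1000.0

if __name__ == "__main__":
    yearly_hours = 24 * 365
    # Hypothetical comparison: commodity servers vs. small single-board machines.
    print(pos_network_energy_kwh(10_000, 40.0, yearly_hours))  # ~3.5 million kWh/year
    print(pos_network_energy_kwh(10_000, 5.0, yearly_hours))   # ~0.44 million kWh/year
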
Understanding and tuning the performance of extreme-scale parallel computing systems demands a streaming approach due to the computational cost of applying offline algorithms to vast amounts of performance log data. Analyzing large streaming data is challenging because the rate at which data arrive and the limited time to comprehend them make it difficult for analysts to sufficiently examine the data without missing important changes or patterns. To support streaming data analysis, we introduce a visual analytics framework comprising three modules: data management, analysis, and interactive visualization. The data management module collects various computing and communication performance metrics from the monitored system using streaming data processing techniques and feeds the data to the other two modules. The analysis module automatically identifies important changes and patterns at the required latency. In particular, we introduce a set of online and progressive analysis methods for not only controlling the computational costs but also helping analysts better follow the critical aspects of the analysis results. Finally, the interactive visualization module provides the analysts with a coherent view of the changes and patterns in the continuously captured performance data. Through a multi-faceted case study on performance analysis of parallel discrete-event simulation, we demonstrate the effectiveness of our framework for identifying bottlenecks and locating outliers.
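As one example of the kind of online, single-pass statistic a streaming analysis module might maintain within a latency budget (this is a generic technique, not the framework's code), Welford's algorithm tracks a running mean and variance of a performance metric.

# A minimal sketch of an online statistic: Welford's running mean and variance.
# The streamed metric values are hypothetical.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

if __name__ == "__main__":
    stats = RunningStats()
    for latency_ms in (12.0, 15.5, 11.2, 40.3):  # hypothetical streamed metric values
        stats.update(latency_ms)
    print(f"mean={stats.mean:.2f} ms, variance={stats.variance:.2f}")
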
Jason Dai, Yiheng Wang, Xin Qiu (2018)
This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on Apache Hadoop/Spark clusters so as to directly process production data, and to serve as part of end-to-end data analysis pipelines for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data-parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and war stories of users that have adopted BigDL to address their challenges (i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).
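The data-parallel idea underlying this style of training can be sketched in plain Python (without BigDL's or Spark's actual APIs): each data partition computes a local gradient, the gradients are averaged, and the shared weights are updated synchronously. The linear model and data below are hypothetical.

# A minimal sketch of synchronous data-parallel training: per-partition gradients
# are averaged before each weight update. Not BigDL or Spark code.

def local_gradient(w, partition):
    # Gradient of mean squared error for a linear model y = w * x on one partition.
    return sum(2 * (w * x - y) * x for x, y in partition) / len(partition)

def data_parallel_step(w, partitions, lr=0.01):
    grads = [local_gradient(w, p) for p in partitions]  # conceptually, one task per partition
    avg_grad = sum(grads) / len(grads)                  # aggregate across partitions
    return w - lr * avg_grad

if __name__ == "__main__":
    # Hypothetical data split into two partitions; true relationship is y = 3x.
    partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
    w = 0.0
    for _ in range(200):
        w = data_parallel_step(w, partitions)
    print(round(w, 2))  # converges to 3.0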
