ﻻ يوجد ملخص باللغة العربية
Improving datacenter operations is vital for the digital society. We posit that doing so requires our community to shift, from operational aspects taken in isolation to holistic analysis of datacenter resources, energy, and workloads. In turn, this shift will require new analysis methods, and open-access, FAIR datasets with fine temporal and spatial granularity. We leverage in this work one of the (rare) public datasets providing fine-grained information on datacenter operations. Using it, we show strong evidence that fine-grained information reveals new operational aspects. We then propose a method for holistic analysis of datacenter operations, providing statistical characterization of node, energy, and workload aspects. We demonstrate the benefits of our holistic analysis method by applying it to the operations of a datacenter infrastructure with over 300 nodes. Our analysis reveals both generic and ML-specific aspects, and further details how the operational behavior of the datacenter changed during the 2020 COVID-19 pandemic. We make over 30 main observations, providing holistic insight into the long-term operation of a large-scale, public scientific infrastructure. We suggest such observations can help immediately with performance engineering tasks such as predicting future datacenter load, and also long-term with the design of datacenter infrastructure.
Workflows are prevalent in todays computing infrastructures. The workflow model support various different domains, from machine learning to finance and from astronomy to chemistry. Different Quality-of-Service (QoS) requirements and other desires of
Blue Waters is a Petascale-level supercomputer whose mission is to enable the national scientific and research community to solve grand challenge problems that are orders of magnitude more complex than can be carried out on other high performance com
Workload characterization is an integral part of performance analysis of high performance computing (HPC) systems. An understanding of workload properties sheds light on resource utilization and can be used to inform performance optimization both at
Serverless computing has emerged as an attractive deployment option for cloud applications in recent times. The unique features of this computing model include, rapid auto-scaling, strong isolation, fine-grained billing options and access to a massiv
Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with m