We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) comparing reliability and performance metrics between healthy and unhealthy components of similar types to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore, and our evaluation shows that it can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues with negligible monitoring overhead.
Due to its high performance and decreasing cost per bit, flash is becoming the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exh
The start of data taking at the Large Hadron Collider will herald a new era in data volumes and distributed processing in particle physics. Data volumes of hundreds of Terabytes will be shipped to Tier-2 centres for analysis by the LHC experiments us
Erasure codes are increasingly being studied in the context of implementing atomic memory objects in large-scale asynchronous distributed storage systems. When compared with the traditional replication-based schemes, erasure codes have the potential
In today's enterprise storage systems, supported data services such as snapshot delete or drive rebuild can cause tremendous performance interference if executed inline along with heavy foreground IO, often leading to missing SLOs (Service Level Objec
To achieve reliability in distributed storage systems, data has usually been replicated across different nodes. However, the increasing volume of data to be stored has motivated the introduction of erasure codes, a storage-efficient alternative to rep