Thoughts on Systems

Emil Sit

Apr 30, 2008 - 1 minute read - Research Technology failures hosting

Characterizing failures in a data center

Part of my research has been investigating how to build storage systems that can provide availability and durability despite failures. It's been interesting to see recent papers that characterize failures, such as Ethan Katz-Bassett's NSDI paper about Hubble, or last year's papers about drive failure characteristics from Google and from several high-performance computing facilities. Today, while catching up on reading High Scalability, I came across a Jeff Dean presentation about scalability at Google. It includes fascinating anecdotal tidbits about failures over a year in a typical Google data center, with frequency, impact, and recovery time, such as:

  • 0.5 overheating events: power down most machines in <5 mins, ~1–2 days to recover
  • 1 PDU failure: 500–1000 machines suddenly disappear, ~6 hours to recover
  • 1 network rewiring: rolling ~5% of machines down over a 2-day span
  • 5 racks go wonky: 40–80 machines see 50% packet loss
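Just for fun, these rates invite a quick back-of-envelope estimate of aggregate downtime. The sketch below is mine, not from the presentation: it assumes a hypothetical 10,000-machine data center and fills in rough recovery-time guesses wherever the slide only gives a range.

```python
# Rough estimate of machine-hours lost per year in a hypothetical
# 10,000-machine data center, using the anecdotal event rates above.
# Affected-machine counts and downtime hours are illustrative
# assumptions, not figures from the presentation.

events = [
    # (event, events/year, machines affected, hours down per event)
    ("overheating",      0.5, 9500, 36),  # most machines, ~1.5 days to recover
    ("PDU failure",      1.0,  750,  6),  # 500-1000 machines, ~6 hours
    ("network rewiring", 1.0,  500, 24),  # ~5% of machines, assume ~1 day each
    ("wonky racks",      5.0,   60, 12),  # 40-80 machines, assume ~half a day
]

total = 0.0
for name, rate, machines, hours in events:
    lost = rate * machines * hours
    print(f"{name:18s} ~{lost:>7.0f} machine-hours/year")
    total += lost

print(f"{'total':18s} ~{total:>7.0f} machine-hours/year")
```

Even with generous guesses, that works out to under 200,000 machine-hours a year out of roughly 87 million available, which is why masking these failures in software, rather than preventing them in hardware, is the economical choice.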

His presentation includes several other classes of rack, network, and machine failures that you can expect to see with real hardware, and that scalable, distributed systems have to cope with and hopefully mask. For the full list of failures, you can view the presentation (1.2 MB PDF) or the video. I wonder how well Chord/DHash would fare in such an environment…