I am entertaining myself during test runs by reading Ford et al.'s paper, Availability in Globally Distributed Storage Systems, from Google. This is a fantastic paper, and it shows how large datacentre datasets give researchers a real advantage, so it's good that we can all read it now it's been published.
1. Storage node failures are not independent.
2. Most failures last less than 15 minutes, so the liveness protocol should be tuned not to react before then.
3. Most correlated failures are rack-local: whole racks, switches and rack UPSs fail together.
4. Managed operations (OS upgrades &c) cause a big chunk of outages.
Point #3 is the key one. The paper argues that if you store every copy of a block on a different rack, you get far better resilience than placing multiple copies on a single rack, and the latter is exactly what Hadoop does. Hadoop's block placement policy hard-codes the rule "two copies per rack are OK", a choice made to save bandwidth on the backplane.
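To see why rack-diverse placement wins, here is a minimal sketch (not Hadoop code; block and rack names are made up) comparing the two policies under the correlated-failure mode the paper highlights, the loss of one whole rack:

```python
def replicas_surviving(placement, failed_rack):
    """Count replicas of a block that survive when one rack fails.

    placement: list of rack ids, one entry per replica of the block.
    """
    return sum(1 for rack in placement if rack != failed_rack)

def worst_case_survivors(placement):
    """Fewest replicas left after the worst possible single-rack failure."""
    return min(replicas_surviving(placement, rack) for rack in set(placement))

# Hadoop-style default: two replicas share one rack, one is elsewhere.
hadoop_style = ["rack1", "rack1", "rack2"]
# Rack-diverse placement: every replica on a different rack.
diverse = ["rack1", "rack2", "rack3"]

print(worst_case_survivors(hadoop_style))  # 1 replica left if rack1 dies
print(worst_case_survivors(diverse))       # 2 replicas left after any rack loss
```

With the same replication factor of three, the rack-diverse layout always keeps two copies through any single-rack failure, while the Hadoop-style layout can drop straight to one.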
Now that switches with excellent backplane bandwidth are available from vendors like HP at prices Cisco wouldn't dream of, rack locality matters less. What you save in bandwidth you lose in availability: lose a single rack and you have to re-replicate every block that had two copies on that rack and now has only one, and until that re-replication completes you are vulnerable to data loss.
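The exposure window is exactly the set of blocks left with a single surviving replica. A small sketch (hypothetical block and rack names, nothing Hadoop-specific) of how you would enumerate them after a rack failure:

```python
def at_risk_blocks(placements, failed_rack):
    """Return blocks left with exactly one replica after failed_rack is lost.

    placements: dict mapping block id -> list of racks holding its replicas.
    These are the blocks where one more failure means permanent data loss.
    """
    risky = []
    for block, racks in placements.items():
        survivors = [r for r in racks if r != failed_rack]
        if len(survivors) == 1:
            risky.append(block)
    return risky

placements = {
    "blk_001": ["rack1", "rack1", "rack2"],  # two copies shared rack1
    "blk_002": ["rack1", "rack2", "rack3"],  # rack-diverse
}
print(at_risk_blocks(placements, "rack1"))  # ['blk_001']
```

Under the Hadoop-style policy, every block written from a node in the dead rack lands in that risky list at once, which is why the re-replication burst after a rack loss is both large and urgent.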
That needs to be fixed, either in 0.23 or its successor.