Towards a Topology of Failure
The Apache community is not just the mailing lists and the get-togethers: it is also the Planet Apache aggregate blog, which lets committers share their thoughts -and read yours. This makes for unexpected connections.
After I posted my comments on Availability in Globally Distributed Storage Systems, Phil Steitz posted a wonderful article on the mathematics behind it. This impressed me, not least because of his ability to get TeX-grade equations into HTML. What he did was look at the real theory behind it, and he even attempted to implement a dynamic programming solution to the problem.
I'm not going to be that ambitious, but I will try to link this paper -and the other ones on server failures- into a new concept, "Failure Topology". This is an excessively pretentious phrase, but I like it -if it ever takes off I can claim I was the first person to use it, as I can with "continuous deployment".
The key concept of Failure Topology is that failures of systems often follow topologies. Rack outages can be caused by rack-level upgrades. Switch-level outages can take out one or more racks and are driven by the network topology. Power outages can be caused by the failure of power supplies to specific servers, sets of servers, specific racks or even quadrants of a datacentre. There's also the notion of specific hardware instances, such as server batches with the same run of HDDs or CPU revisions.
Failure topologies, then, are maps of the computing infrastructure that show how these things are related, and where the risk lies. A power topology would be a map of the power input to the datacentre. We have the switch topology for Hadoop, but it is optimised for network efficiency, rather than looking at the risk of switch failure. A failure-aware topology would need to know which racks were protected by duplicate ToR switches and view them as less at risk than single-switch racks. Across multiple sites you'd need to look at the regional power grids and the different telcos. Then there's the politics overlay: which government controls the datacentre sites; whether or not that government is part of the EU and hence has data protection rights; whether there are some DMCA-style takedown rules.
You'd also need to look at physical issues: fault lines, whether the sites were downwind of Mt St Helens-class volcanoes. That takes you from abstract topologies to physical maps.
What does all this mean? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant, Ganesh Ananthanarayanan argues that as switch and backplane bandwidth increases you don't have to worry about where your code runs relative to the data. I concur: with 10GbE and emerging backplanes, network bandwidth means that switch-local vs switch-remote placement will become less important. Which means you can stop worrying about Hadoop topology scripts driving code execution policies. Doing this opens a new possibility:
Writing a script to model the failure topology of the datacentre.
You want to move from a simple "/switch2/node21" map to one that includes power sources, switches, racks and shared PSUs in servers: something like "/ups1/switch2/rack3/psu2/node21". This models not the network hierarchy, but the failure topology of the site. Admittedly, it assumes that switches and racks share the same UPS, but if the switch's power source goes away, the rack is partitioned and effectively offline anyway -so you may as well model it that way.
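Such a script could follow the same contract as Hadoop's existing rack-awareness scripts: the framework invokes it with one or more hostnames or IP addresses as arguments, and it prints one path per line, in order. A minimal sketch -the host inventory and path names here are hypothetical; a real site would generate the mapping from its asset-management data:

```python
#!/usr/bin/env python
"""Sketch of a failure-topology script for Hadoop.

Instead of a plain "/switch/rack" network map, each host resolves to a
path encoding its shared failure domains: UPS, switch, rack, PSU.
"""
import sys

# Hypothetical inventory: host -> failure-domain path (UPS/switch/rack/PSU).
FAILURE_TOPOLOGY = {
    "node21": "/ups1/switch2/rack3/psu2",
    "node22": "/ups1/switch2/rack3/psu1",
    "node40": "/ups2/switch5/rack9/psu1",
}

# Fallback for hosts missing from the inventory, analogous to Hadoop's
# "/default-rack" convention.
DEFAULT_PATH = "/default-ups/default-switch/default-rack/default-psu"


def resolve(host):
    """Map a hostname or IP to its failure-topology path."""
    return FAILURE_TOPOLOGY.get(host, DEFAULT_PATH)


if __name__ == "__main__":
    # One path per line on stdout, in the same order as the arguments.
    for host in sys.argv[1:]:
        print(resolve(host))
```

The depth of the path is what matters: a placement policy walking this tree from the root would see that two replicas under "/ups1" share a power-failure domain even if they sit in different racks, which the network-only map cannot express.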
I haven't played with this yet, but as I progress with my patch to allow Hadoop to focus on keeping all blocks on separate racks for availability, this failure-topology notion could be the longer-term topology that the ops team needs to define.
[Photo: sunrise from the snowhole on the (dormant) volcano Mt Hood -not far from Google's The Dalles datacentre]