Alternate Hadoop filesystems

Hillgrove Porter House: Smoke room

My coverage of Netapp's Hadoop story not only generated a blip in network traffic to the site, but a blip in people from NetApp viewing my LinkedIn profile (remember: the mining cuts both ways), including one Val Bercovici -I hope nobody was too offended. That's why I stuck in a disclaimer about my competence*.

I'm not against running MapReduce -or the entire Hadoop stack- against alternate filesystems. There are some good cases where it makes sense. Other filesystems offer security, NFS mounting, the ability to be used by other applications and other features. HDFS is designed to scale well on "commodity" hardware, (where servers containing Xeon E5 series parts with 64+GB RAM, 10GbE and 8-12 SFF HDDs are considered a subset of "commodity"). HDFS makes some key assumptions
  1. The hardware is unreliable --replication addresses this.
  2. Disk and network bandwidth is limited --again, replication addresses this.
  3. Failures are mostly independent, therefore replication is a good strategy for avoiding data loss
  4. Sub-POSIX semantics are all you need --makes handling partitioning easier, locking, etc.
The independence of failure is something I will have to look at some other time, the key thing being it doesn't extend to virtualised "cloud" infrastructures" where all your VMs may be hosted on the same physical machine. A topic for another day.

What I will point to instead is a paper by Xyratex looking at running Hadoop MR jobs over Lustre. I'm not actually going to argue with this paper (much) as it does appear to show substantial performance benefits with the combination of Lustre + Infiniband. They then cost it out and argue that for the same amount of storage, the RAIDed LustreFS capacity is less and that delivers long term power savings as well as purchase price.

I don't know if the performance numbers stand up, but at the very least Lustre does offer other features that HDFS doesn't: it's more general purpose, NFS mountable, supports many small files, and, with the right network, delivers lots of data to anywhere in the cluster. Also Eric Barton of Whamcloud bought us beer at the Hillgrove at our last Hadoop get together.

Cost wise, I look at those number and don't know what to make of them. On page 12 of their paper (#14 in the PDF file) and say that you only need to spend 100x$7500 for the number of nodes in the cluster, so the extra costs of Infiniband networking , and $104K for the storage are justifiable, as the total cluster capital cost comes in at half, and the power budget would be half too, leading to lower running costs. These tables I would question.

They've achieved much of the cost saving by saying "oh look, you only need half as many servers!" True, but that's cut your computation power in half too. You'd have to look hard at the overhead imposed by the datanode work on your jobs to be sure that you really can go down to half as many disks. Oh, and each node would gain from having a bit of local FS storage for logs and overspill data, because it costs more in the big servers, and local storage is fine there.

IBM have stood Hadoop up against GPFS. This makes sense if you have an HPC cluster with a GPFS filesystem nearby,  want an easier programming model than MPI --or the ability to re-use the Hadoop++ layers. GPFS delivers fantastic bandwidth to any node in the cluster, it just makes the cost of storage high. You may want to consider having HDDs in the nodes, using that for the low value bulk data, and using GPFS for output data, commonly read data, and anything where your app does want to seek() a lot. There's nothing in the DFS APIs that stop you having the job input or output FS separate from your fs.default.name, after all.
When running Hadoop in AWS EC2, again the S3 FS is a good ingress/egress FS. Nobody uses it for the intermediate work, but the pay by the kilo storage costs are lower than the blockstore rates, and where you want to keep the cold data.

There's also the work my colleague Johannes did on deploying Hadoop inside the IBRIX filesystem:
This design doesn't try and deploy HDFS above an existing RAIDed storage system; it runs location-aware work inside the filesystem itself. What does that give you?
  • The existing FS features: POSIX, mounting, etc.
  • Different scalability behaviours.
  • Different failure modes: bits of the FS can go offline, and if one server fails another can take over that storage
  • RAID-level replication, not 3X.
  • Much better management tooling.
I don't know what the purchasing costs would be -talk to someone in marketing there- but I do know the technical limits.
  • Less places to run code next to the data, so more network traffic.
  • Rack failures would take data offline completely until it returned. And as we know, rack failures are correlated
  • Classic filesystems are very fussy about OS versions and testing, which may create conflict between the needs of the Hadoop layers and the OS/FS.
Even so: all these designs can make sense, depending on your needs. What I was unimpressed by was trying to bring up HDFS on top of hardware that offered good availability guarantees anyway -because HDFS is designed to expect and recover from failure of medium availability hardware.

(*) For anyone doubts those claims about my competence and the state of the kitchen after I bake anything would have changed their opinions after seeing what happened when I managed to drop a glass pint bottle of milk onto a counter in the middle of the kitchen. I didn't know milk could travel that far.

[Photo: Hillgrove Porter Stores. A fine drinking establishment on the hill above Stokes Croft. Many beers are available, along with cheesy chips]


Attack Surface

Pedestrians: push button for a cup of tea

I allege that
  1. Any program that is capable of parsing untrusted content is vulnerable to exploitation by malicious entities.
  2. Programs which parse binary file contents are particularly vulnerable due to the tendency of such formats to use offset pointers and other values which the parsers assume are valid -but which "fuzzing" can be used to discover new exploits.
  3. Programs that treat the data as containing executable content -even within a sandbox- are vulnerable to any exploit that permits code to escape the sandbox, or simply to Denial of Service attacks in which resources in the sandbox such as memory, CPU time and network bandwidth are consumed.
  4. Recent incidents involving the generation of false certificates for popular web sites have shown that signed web sites and signed content cannot be consistently trusted.
  5. Any program that can process content downloaded by an HTTP client application is capable of parsing untrusted content, and so vulnerable to exploitation.
  6. Any program that can process content received in the payload of email messages is capable of parsing untrusted content, and so vulnerable to exploitation.
  7. Any program that can process delivered in untrusted physical media is capable of parsing untrusted content, and so vulnerable to exploitation.
The last of these items concerns me today. The Stuxnet worm has shown how a well-engineered piece of malware can propagate between windows machines, and across an airgap from windows machines to SCADA infrastructure by way of USB infection. The Sony Rootkit Scandal showed that even large business were willing to use this to sneak on anti-copying features on what pretended to be a music CD.

A key conclusion from the CERT advisory on the Sony Incident was Disable automatically running CD-ROMs by editing the registry to change the Autorun value to 0 (zero) as described in Microsoft Article 155217.

I've long disabled autorun on windows boxes, and I also remember testing a long time ago whether or not you could autorun a CD while the screen was locked. Answer: no. Windows waits until the screen is unlocked before an autorun application is executed. This provides some defence against malicious CDs, though not as much as disabling autorun completely.

I don't worry too much about windows security these days because my main machines are Linux boxes, which keep themselves up to date with apt-get update. This keeps browsers, flash, JVMs, OpenOffice current, which reduces their vulnerability. I also keep an eye on the SANS newsfeed to see if there are any 0-day exploits out, and take proactive action if anything gets out into the wild that we are at risk from. I've stopped fielding service requests from family members saying "there is something on my machine -can you clean it up". If I am asked, my response is "I will reformat your HDD and install ubuntu on it". FWIW, I have done this for one relative and it works really well.

I am confident that I do keep my machines locked down, and know that the primary vulnerability on an Linux system on which I do open source development is that I implicitly have to trust everyone who has commit rights to every application that I use on it.

If there is one issue I have with Linux is that it is over-paranoid. For example, once the screen is locked I can't suspend the system. This means that I can't suspend the laptop until I unlock it and hit the suspend button. Somewhere on launchpad there was a bugrep about that, but people who looked after servers were saying "no!", despite the fact that if I had physical access to the power button, it would be only four-seconds away from a hard power off, and then, unless the system had the CD-ROM, USB and PXE boot disabled, one to two minutes from being my box. It's annoying but a sign that security is taken over-seriously.

Imagine my surprise then, last week, on inserting a CD-ROM from an external source when suddenly the system started grinding away, with a tiny thumbnail of each PDF image appearing one after the other. For that to happen the desktop had to have scanned through the list of files, handed each PDF file off to some parser and got a 64x64 graphic back. It wasn't the wasted effort that concerned me -it was the way it did this without me even asking.

I did some research and discovered an article on the topic. Essentially the Linux desktop standard allows for autostart.sh files to be executed on mounting a media device. The the thumbnailing is something else: nautilus defaults to doing this for local files -and views all files on mounted local media as "local", rather than "untrusted content on an untrusted filesystem". Given that exploits have been found in the thumbnailing code, having Nautilus thumbnail files on an external device is insane.

The best bit: this happens even while the desktop is locked.

There I am, regularly saying rude words because I thought I'd suspended my laptop, but as the screen was locked it's just stayed powered up, slowly killing the battery and/or overheading in a bag. I accepted this inconvenience as I thought it was a security feature for my own good. It turns out that it was irrelevant as all you needed to do was plug in a USB stick with a any .dvi, .png or .pdf file designed to exploit some security hole and my box is 0wned. Which, if done while I am logged in, offers access to the encrypted bits of the box.

There is a workaround -disable all thumbnailing even of local content, as well as the usual "do nothing" action on handling media insertions. I've done that everywhere. Yet what annoys me is this: what were the dev team thinking. On the one hand there is the power management group saying "we won't let people suspend while the desktop is locked for security reasons", while some other people are thinking "let's generate thumbnails whenever new media is inserted". The strength of OSS is that anyone can contribute. The weakness: a lot of them aren't devious enough to recognise what they think of a feature is someone else's route of access to your systems

[Photo: Park Street]


Disclaimer: I reserve the right to not know what I'm talking about

break time

There have been some links to this blog with the implications that either because I am a researcher at HP Laboratiories with commit rights to a number of projects -including Apache Hadoop- then my opinions may have some value.

Not so.

Everything I say is obviously my own opinions, do not reflect those of my employer, may be at odds with corporate strategies of which I am blissfully unaware of, if I do mention corporate plans they may change on a whim, etc, etc.

I do work at HP Labs -which is a fun place to work- but think about what that means. It means that I don't do stuff that goes straight into shipping products. The code I write is not considered safe for humanity. I may talk about it, I may write about it, some may get into the field, but only after rigorous review by people who care about such things.

This makes it a bit like a university, except that I don't teach courses. Cleary I am not considered safe near students either, in case I damage thier minds.

"Oh well", people say, "he's a Hadoop committer -that must count for something?"

It does, but I don't get to spend enough time there to keep up with the changes, let alone do interesting stuff of my own ideas.

Whenever I create an issue and say "I will do this", other people in the project -Arun, Tom, Todd and others will stare at their inbox in despair, with the same expression on their faces that my family adopts when I say "I will bake something".

Because then, yes, I may come up with something unusual and possibly even tasty, but they will know that the kitchen will be left with a thin layer of flour over everything, there will be pools of milk, and whatever went in the oven went into the closest bowl-shaped thing I could reach with one hand, rather than anything designed to be used in an oven, let alone watertight. Even if I didn't use flour or milk.

Bear that in mind before believing anything written here.

[Photo:me somewhere in the Belledone mountains; French Alps. For some reason French villages never like it when people on bicycles come to a halt]


Towards a Topology of Failure

Mt Hood Expedition

The Apache community is not just the mailing lists and the get togethers: it is the planet apache aggregate blog; this lets other committers share their thoughts -and look at yours. This makes for unexpected connections.

After I posted my comments on Availability in Globally Distributed Storage Systems, Phil Steitz posted wonderful article on the mathematics behind it. This impressed me, not least because of his ability to get TeX-grade equations into HTML. What he did do was look at the real theory behind it, and even attempted to implement Dynamic Programming solution the problem.

I'm not going to be that ambitious, but I will try and link this paper -and the other ones on server failures, into a new concept, "Failure Topology". This is an excessively pretentious phrase, but I like it -if it ever takes off I can claim I was the first person to use it, as I can do with "continuous deployment"

The key concept of Failure Topology is that failures of systems often follow topologies. Rack outages can be caused by rack-level upgrades. Switch-level outages can take out 1* racks and are are driven by the network topology. Power outages can be caused by the failure of power supplies to specific servers, sets of servers, specific racks or even quadrants of a datacentre. There's the also the notion of specific hardware instances, such as server batches with the same run of HDDs or CPU revisions.

Failure topologies, then, are maps of the computing infrastructure that show how these things are related, and where the risk lies. A power topology would be a map of the power input to the datacentre. We have the switch topology for Hadoop, but it is optimised for network efficiency, rather than looking at the risk of switch failure. A failure-aware topology would need to know which racks were protected by duplicate ToR switches and view them as less at risk then single-switch racks. Across multiple sites you'd need to look at the regional power grids, the different telcos. Then is the politics overlay: what government controls the datacentre sites; whether or not that government is part of the EU and hence has data protection rights, or whether there's some DMCA-style takedown rules.

You'd also need to look at physical issues: fault lines, whether the sites were downwind of Mt St Helen's class volcanoes. That goes from abstract topologies to physical maps.

What does all this mean? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant, Ganesha Ananthanarayanan argues that as switch and backplane bandwidth increases you don't have to worry about where your code runs relative to the data. I concur: with 10GbE and emerging backplanes, network bandwidth means that switch-local vs switch-remote will become less important. Which means you can stop worrying about Hadoop topology scripts driving code execution policies. Doing this now opens a new possibility:

Writing a script to model the failure topology of the datacentre.

You want to move from a simple "/switch2/node21" map to one that includes power sources, switches, racks, shared-PSUs in servers, something like "/ups1/switch2/rack3/psu2/node21". This models not the network hierarchy, but the failure topology of the site. Admittedly, it assumes that switches and racks share the same UPS, but if the switch power source goes away, the rack is partitioned and effectively offline anyway -so you may as well do that.

I haven't played with this yet, but as I progress with my patch to allow Hadoop to focus on keeping all blocks on separate racks for availability, this failure topology notion could be the longer term topology that the ops team need to define.

[Photo: sunrise from the snowhole on the (dormant) volcano Mt Hood -not far from Google's Dalles datacentre]


Solving the Netapp Open Solution for Hadoop Solutions Guide

See No Evil

I have been staring at the Netapp Open Solution for Hadoop Solutions Guide. I now have a headache.

Where do I begin? It's interesting to read a datasheet about Hadoop that is targeted at enterprise filesystem people. Not management "Hadoop solves all your problems", not developer "you have to rewrite all your apps", but something designed to convince people who worry about file systems that HDFS is a wrongness on the face of the planet. That's the only reason it would make sense. And like anyone trying to position a competing product to HDFS they have to completely discredit HDFS.

This paper, which tries to rejustify splitting the storage and compute platforms -the whole thing that the HDFS/Hadoop is designed to eliminate on cost grounds- has to try pretty hard.

In fact, I will say that after reading this paper that the MapR/EMC story makes a lot more sense. As what the Netapp paper is trying to say -unlike EMC-  is "the Big Data platform that Hadoop is wonderful, provided you ignore the cost model of SATA-on-server that HDFS and equivalent filesystems offer for topology-aware applications".

This must have been a hard article to write, so I must give the marketing team some credit for the attempt, even if they got it so wrong technically.

First: did we really need new acronyms? Really? By Page 4 there is a new one, "REPCOUNT", that I have never come across before. Then the acronym "HDFS3" that I first thought meant version 3 of HDFS, but before I could switch to JIRA and see if someone had closed an issue "design, implement and release V3 of HDFS", I see it really means "A file saved to HDFS with the block.replication.factor attribute set to 3", or more simply, 3x replicated blocks. No need to add another acronym to a document that is knee deep in them.

Now, for some more serious criticisms, starting at page 5

"For each 1 petabyte of data, 3 terabytes of raw capacity are needed".

If that were true, this would not be a criticism. No, it would be a sign of someone getting such a fantastic compression ration that Facebooks leading edge 30 PB server would fit into 30x3TB LFF HDDs, which you could fit into about 3U's worth of rack. If only. And don't forget that RAID uses 1.6X raw storage: it replicates too, just more efficiently.

Then there's the ingest problem. A whole section of this paper is dedicated to scaring people that they won't be able to pull in data fast enough because the network traffic will overload the system. We are not -outside my home network- using Ether over Power as the backplane any more. You can now get 46 servers with 12x3TB HDDs in them onto a single rack, with a 10GbE switch that run "with all the lights on". On a rack like that - which offers 500+TB of storage- you can sustain 10 GbE between any two servers. If the ingress and egress servers are hooked up to the same switch you could in theory move a Gigabyte per second between any two nodes or in and out the cluster. Nobody has an ingress right in that range, except maybe the physicists and their "odd" experiments. Everyone else can predict their ingress rate fairly simply, it is "look around at the amount of data you discard today". That's your ingress rate. Even a terabyte per day is pretty significant -yet on a 10GbE switched you could possibly import that from a single in-centre-off-HDFS-node in under 20 minutes. For anything out of the datacentre, your bandwidth will probably be less than 10 Gigabits, unless your name is Facebook, Yahoo!, Google or Microsoft.

Summary: ingress rate is unlikely to be bottlenecked by in-rack bandwidth even with 3x replication.

Next: Intra-HDFS bandwidth issues.

The paper says today that "A typical Hadoop node has 2GbE interfaces". No. A typical node tends to have a single 1GbE interface; bonded 2x1 GbE is rarer, as it's harder to commission.
Go that way and add on separate ToR switches and all concerns about "network switch failure" go away. You'd have to lose two network switches or power to the entire rack to trigger re-replication. At least NetApp did notice that 10GbE is going on the mainboard and said "Customers building modest clusters (of less than 128 nodes) with modern servers should consider 10GbaseT network topologies."

I'd say something different, which is "128 nodes is not a modest cluster". If you are building your rack from paired six-core CPUs with 12 LFF HDDs each with 3TB of SATA (how's that for acronyms?), then your cluster has 1536 individual cores. It will have 12*3*128 = 4608TB of storage: four petabytes. That's is not modest. That is something that can sort a petabyte of data with. Fit that up with a good 10GbE switch and enough RAM -and the latest 0.23 alpha release with Todd's CRC32c-in-SSE patch- and you could probably announce you have just won the petabyte sorting record.

Summary: "a 128 node cluster built of current CPUs and SATA storage is not modest".

It's a modest size -three racks- but it will probably be the largest file system your organisation owns. Think about that. And think about the fact that when people say "big data" what they mean is "lots of low value data". If you don't store all of it (aggressively compressed), you will regret it later.

Page 6: Drive failures. This  page and its successors wound me up so much I had to stop. You see, I've been slowly doing some changes to trunk and 0.23 to tune HDFS's failure handling here, based on papers on failures by Google, Microsoft and -um- Netapp. This means I am starting to know more about the subject, and defend my statements.

The Netapp paper discusses the consequences of HDD Failure rates using the failure rate on 5K-node cluster to scare people. Yes, there will be a high failure rate there, but it's a background noise. The latest version of Hadoop 0.20.20x doesn't  require a DataNode restart when a disk fails -take it away and only that 1-3 TB of data fails. When a disk fails -and this is where whoever wrote the paper really misses the point - that 2TB of missing data is still scattered across all other clusters in the rack.

If you have 128 servers in your "modest" cluster, 2TB disks and a block size of 256MB, then there were about 8000 blocks in the disk, which are now scattered across (128-1) servers. Sixtyish blocks per server. With twelve disks per server, that's about five blocks per SATA disk (=2500 MB). Even if -as claimed- the throughput of a single SATA disk is 30MB/s, those five blocks will take under two minutes to be read off disk. I'm not going to follow this chain through to network traffic as you'd also have to factor in the fact that servers are reading network packets from blocks being replicated to it at the same time and saving them to disk too (it should expect 60 blocks incoming), but the key point is this: on a 128 node cluster, the loss of a single 2TB cluster will generate a blip of network traffic, and your cluster will carry on as normal. No need for "24-hour staffing of data center technicians". No need for the ops teams to even get texted when a disk fails. Yes, loss of a full 36TB server is a bit more dramatic, but with 10 GbE it's still manageable. This paper is just trying to scare you based on a failure to understand how block replication is implicit striping of a file across the entire cluster, and they haven't looked at Hadoop's source to see what it does on a failure. Hint: look at BlockManager.java. I have.

To summarise:

The infrastructure handles failures, the apps are designed for it. This is not your old enterprise infrastructure. We have moved out of the enterprise world of caring about every disk that fails, and into a world of statistics. 

I'm going to stop here. I can't be bothered to read the low level details if the high level stuff is so painfully wrong, either through inadequate technical review of the marketing team's world view, or deliberately through a failure to understand HDFS.

[Update 21:57 GMT]. Actually it is weirder than I first thought. This is still HDFS, just running on more expensive hardware. You get the (current) HDFS limitations: no native filesystem mounting, a namenode to care about, security on a par with NFS, without the cost savings of pure-SATA-no-licensing-fees. Instead you have to use RAID everywhere, which not only bumps up your cost of storage, puts you at risk of RAID controller failure and errors in the OS drivers for those controller (hence their strict rules about which Linux releases to trust). If you do follow their recommendations and rely on hardware for data integrity, you've cut down the probability of node-local job execution, so all FUD about replication traffic is now moot as at least 1/3 more of your tasks will be running remote -possibly even with the Fair Scheduler, which waits for a bit to see if a local slot becomes free. What they are doing then is adding some HA hardware underneath a filesystem that is designed to give strong availability out of medium availability hardware. I have seen such a design before, and thought it sucked then too.  Information week says this is a response to EMC, but it looks more like NetApp's strategy to stay relevant, and Cloudera are partnering with them as NetApp offered them money and if it sells into more "enterprise customers" then why not? With the extra hardware costs of NetApp the cloudera licenses will look better value, and clearly both NetApp and their customers are in need of the hand-holding that Cloudera can offer.

I just wish someone from Cloudera had reviewed that solutions paper for technical validity before it got published.

[Artwork: ARYZ, Nelson Street]


Lost in time

I've been following the Occupy London debacle. And it is disaster, a PR mess for the City of London and the Church of England.

One thing about the UK is there are lots of historical things about. The castles, the iron-age hill forts, the straight roads build by the romans, the town names derived from latin. Some retain their grandeur, but have (somewhat) moved on with the times, like here: Oxford, whose curricular has been upgraded for new ideas (calculus), and whose buildings are now tourist attractions as well as a place of learning.
Oxford Colleges

Then there's the City of London. I've cycled through it a few times at the weekend: empty. Boring. Almost nobody lives there. What's interesting is the last fortnight has show how much of a relic of middle ages it is, how much power it has -and, therefore, how much power the business based there have.

The key points are
  • It has special rights w.r.t parliament and "the crown", including lobbying powers in parliament -a parliament which it considers itself mostly exempt from.
  • It's electorate is derived not just from the citizens, but from businesses and organisations based in the City. The citizens have the minority vote.
  • The process for becoming leader of the city is some process that doesn't even pretend to be democratic.
What it means today is this: the part of Britain which contains the headquarters of the most powerful business in the city is effectively self-governed by those business, and independent from the rest of the country.
Night Chill

Which is funny, as the People's Republic of Stokes Croft has taken up that "Passport to Pimlico" theme of an independent state within a city. They just do it as joke. Well, the joke is on them.