2011-11-26

Alternate Hadoop filesystems

Hillgrove Porter House: Smoke room


My coverage of NetApp's Hadoop story not only generated a blip in network traffic to the site, but also a blip in people from NetApp viewing my LinkedIn profile (remember: the mining cuts both ways), including one Val Bercovici -I hope nobody was too offended. That's why I stuck in a disclaimer about my competence*.


I'm not against running MapReduce -or the entire Hadoop stack- against alternate filesystems. There are some good cases where it makes sense. Other filesystems offer security, NFS mounting, the ability to be used by other applications, and other features. HDFS is designed to scale well on "commodity" hardware (where servers containing Xeon E5 series parts with 64+GB RAM, 10GbE and 8-12 SFF HDDs are considered a subset of "commodity"). HDFS makes some key assumptions:
  1. The hardware is unreliable --replication addresses this.
  2. Disk and network bandwidth is limited --again, replication addresses this.
  3. Failures are mostly independent, therefore replication is a good strategy for avoiding data loss
  4. Sub-POSIX semantics are all you need --this makes handling partitioning, locking, etc. easier.
The independence of failures is something I will have to look at some other time; the key thing is that it doesn't extend to virtualised "cloud" infrastructures, where all your VMs may be hosted on the same physical machine. A topic for another day.
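
Back to assumptions 1 and 2 for a moment: replication is the knob HDFS gives you there -more copies means more tolerance of lost disks, and more machines holding a local copy to read from. A minimal sketch against the Hadoop FileSystem API, with made-up paths and replication factors:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The cluster's default filesystem; on a Hadoop cluster this is HDFS.
    FileSystem fs = FileSystem.get(conf);

    // Widely read data: more replicas means more nodes with a local copy.
    fs.setReplication(new Path("/data/shared/lookup-table"), (short) 5);

    // Cold data: fewer replicas, at the cost of less resilience.
    fs.setReplication(new Path("/data/archive/old-logs"), (short) 2);
  }
}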

What I will point to instead is a paper by Xyratex looking at running Hadoop MR jobs over Lustre. I'm not actually going to argue with this paper (much), as it does appear to show substantial performance benefits from the combination of Lustre + Infiniband. They then cost it out and argue that for the same amount of usable storage, the RAIDed Lustre setup needs less raw capacity, which delivers long-term power savings as well as a lower purchase price.

I don't know if the performance numbers stand up, but at the very least Lustre does offer other features that HDFS doesn't: it's more general purpose, NFS mountable, supports many small files, and, with the right network, delivers lots of data to anywhere in the cluster. Also Eric Barton of Whamcloud bought us beer at the Hillgrove at our last Hadoop get together.

Cost-wise, I look at those numbers and don't know what to make of them. On page 12 of their paper (#14 in the PDF file) they say that you only need to spend 100 x $7500 on the nodes in the cluster (that's $750K before the network and storage), so the extra cost of Infiniband networking and $104K for the storage are justifiable, as the total cluster capital cost comes in at half that of the HDFS cluster they compare against, and the power budget would be half too, leading to lower running costs. These tables I would question.

They've achieved much of the cost saving by saying "oh look, you only need half as many servers!" True, but that cuts your computation power in half too. You'd have to look hard at the overhead the DataNode work imposes on your jobs to be sure you really can go down to half as many servers. Oh, and each node would still gain from having a bit of local FS storage for logs and overspill data, because keeping that on the shared storage costs more, and local disks are fine for it.

IBM have stood Hadoop up against GPFS. This makes sense if you have an HPC cluster with a GPFS filesystem nearby and want an easier programming model than MPI --or the ability to re-use the Hadoop++ layers. GPFS delivers fantastic bandwidth to any node in the cluster; it just makes the cost of storage high. You may want to consider having HDDs in the nodes, using those for the low-value bulk data, and using GPFS for output data, commonly read data, and anything where your app does want to seek() a lot. There's nothing in the DFS APIs that stops you having the job input or output FS separate from your fs.default.name, after all.
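
Something like this, for example -only a sketch, with placeholder hostnames and a gpfs:// scheme that assumes a Hadoop FileSystem client is registered for GPFS under that name. The default FS stays on HDFS; the job's input and output just use fully qualified URIs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MixedFilesystemJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The cluster default stays on HDFS; intermediate data lives there.
    conf.set("fs.default.name", "hdfs://namenode:8020");

    Job job = new Job(conf, "mixed-filesystem job");
    job.setJarByClass(MixedFilesystemJob.class);

    // Input and output use fully qualified URIs on the other filesystem.
    FileInputFormat.addInputPath(job, new Path("gpfs://shared/datasets/input"));
    FileOutputFormat.setOutputPath(job, new Path("gpfs://shared/results/run1"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
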
When running Hadoop on AWS EC2, again the S3 FS is a good ingress/egress filesystem. Nobody uses it for the intermediate work, but the pay-by-the-kilo storage costs are lower than the blockstore rates, and it's where you want to keep the cold data.
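
The same trick works in the other direction: once a run is done, push the results out of the cluster and into S3. A sketch, with a made-up bucket and paths, and with credentials that would normally sit in core-site.xml rather than in code:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ArchiveToS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials for the "native" S3 client; better kept in core-site.xml.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    FileSystem s3 = FileSystem.get(URI.create("s3n://my-archive-bucket"), conf);

    // Copy the results into the pay-by-the-kilo store; keep the HDFS copy.
    FileUtil.copy(hdfs, new Path("/results/run1"),
                  s3, new Path("/archived/run1"),
                  false /* deleteSource */, conf);
  }
}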

There's also the work my colleague Johannes did on deploying Hadoop inside the IBRIX filesystem:
This design doesn't try to deploy HDFS above an existing RAIDed storage system; it runs location-aware work inside the filesystem itself. What does that give you?
  • The existing FS features: POSIX, mounting, etc.
  • Different scalability behaviours.
  • Different failure modes: bits of the FS can go offline, and if one server fails another can take over that storage
  • RAID-level replication, not 3X.
  • Much better management tooling.
I don't know what the purchasing costs would be -talk to someone in marketing there- but I do know the technical limits.
  • Fewer places to run code next to the data, so more network traffic.
  • Rack failures would take data offline completely until the rack came back. And as we know, rack failures are correlated.
  • Classic filesystems are very fussy about OS versions and testing, which may create conflict between the needs of the Hadoop layers and the OS/FS.
Even so: all these designs can make sense, depending on your needs. What I was unimpressed by was trying to bring up HDFS on top of hardware that offered good availability guarantees anyway -because HDFS is designed to expect and recover from failure of medium availability hardware.

(*) Anyone who doubts those claims about my competence -and the state of the kitchen after I bake anything- would have changed their opinion after seeing what happened when I managed to drop a glass pint bottle of milk onto a counter in the middle of the kitchen. I didn't know milk could travel that far.

[Photo: Hillgrove Porter Stores. A fine drinking establishment on the hill above Stokes Croft. Many beers are available, along with cheesy chips]

2 comments:

  1. Very good article. I was responsible for some HPC projects at Sun Microsystems. Just another thing: Lustre (inside Sun) was called the "Scratch Filesystem", largely because of HPC workload behaviors (MPI I/O).

  2. Just to clarify: "Scratch" because losing nodes would eventually make your filesystem break (not a question of if, but a question of when).


Comments are usually moderated -sorry.