Nobody ever got fired for using Hadoop on a cluster

Over a weekend in London I enjoyed reading a recent Microsoft Research paper from Rowstron et al., Nobody ever got fired for using Hadoop on a cluster.

The key points they make are
  1. A couple of hundred GB of DRAM costs less than a new server
  2. Public stats show that most Hadoop analysis jobs use only a few tens of GB.
  3. If you can load all the data into DRAM on a single server, you can do far more sophisticated and stateful algorithms than MapReduce
They then go on to demonstrate this with some algorithms and examples examples that show that you can do analysis in RAM way, way faster than by streaming HDD data past many CPUs.
Avebury Stone Circle

This paper makes a good point. Stateless MapReduce isn't ideal way to work with things, which is why counters (cause JT scale problems), per-task state (dangerous but convenient) and multiple iterations (purer; inefficient) come out to play. It's why graph stuff is good.

Think for a moment though, what Hadoop HDFS+MR delivers.
  • Very low cost storage of TB to PB of data -letting you keep all the historical data for analysis.
  • I/O bandwidth that scales with the #of HDDs.
  • An algorithm that, being isolated and stateless, is embarassingly parallel
  • A buffered execution process that permits recovery from servers that fail
  • An execution engine that can distribute work across tens, hundreds or even thousands of machines.
HDFS delivers the storage, MapReduce provides a way to get at the data. As Rowstron and colleagues note, for "small" datasets, datasets that fit into RAM, it's not always the best approach. Furthermore, that falling cost of DRAM means that you can start predicting cost/GB of RAM in future, and should start thinking "what can I do with 256GB RAM on a single server?" The paper asks the question: Why then, Hadoop? Well, one reason the paper authors note themselves:
The load time is significant, representing 20–25% of the total execution time. This suggests that as we scale up single servers by increasing memory sizes and core counts, we will need to simultaneously scale up I/O bandwidth to avoid this becoming a bottleneck.
A server with 16 SFF HDDs can give you 16 TB of storage today; 32 TB in the future, probably matched by 32 cores at that time. The IO bandwidth, even with those 16 disks, will be a tenth of what you get from ten such servers. It's the IO bandwidth that was a driver for MapReduce -as the original Google paper points out.  Observing in a 2012 paper that IO bandwidth was lagging DRAM isn't new -that's a complaint going back to the late 1980s.

If you want great IO bandwidth from HDDs, you need lots of them in parallel. RAID-5 filesystems with striped storage deliver this at a price; HDFS delivers it at a tangibly lower price. As the cost of SDDs falls, when they get integrated into the motherboards, you'll get something with better bandwidth and latency numbers (I'm ignoring wear levelling here, and hoping that at the right price point SSD could be used for cold data as well as warm data). SSD at the price/GB of today's HDDs would let you store hundreds of TB in servers, transform the power budget of a Hadoop cluster, and make random access much less expensive. That could be a big change.

Even with SSDs, you need lots in parallel to get the bandwidth -more than a single server's storage capacity. If, in, say 5-10 years you could get away with a "small" cluster with a few tens of machines, ideally SSD storage, and lots of DRAM per server, you'd get an interesting setup. A small enough set of machines that a failure would be significantly less likely, changing the policy needed to handle failures. Less demand for stateless operations and use of persistent storage to propagate results; more value in streaming across stages. Less bandwidth problems, especially with multiple 10GBe links to every node. Lots of DRAM and CPUs.

This would be a nice "Hadoop 3.x" cluster.

Use an SSD-aware descendent of HDFS that was wear-leveling aware, maybe mix HDD for cold storage, SSD for newly added data. Execute work that used RAM not just for the computations, but for the storage of results. Maybe use some of that RAM for replicated storage -as work is finished, forward it to other nodes that just keep it in RAM for the next stages in a batch mode, stream directly to them for other uses. You'd gain storage capacity and bandwidth that a single server will always lag compared to a set of servers, while being able to run algorithms that can be more stateful.

In this world, for single rack clusters, you'd care less about disk locality (the network can handle that better than before), and more about which tier of storage the data was. Which is what Ananthanarayanan et al., argued in Disk-Locality in Datacenter Computing Considered Irrelevant. You could view that rack as storage system with varying degrees of latency to files, request server capacity 64 GB at a time. The latter, of course, is what the YARN RM lets you do, though currently its memory slots are measured in smaller numbers like 4-8GB.

YARN could work as a scheduler here -with RAM-centric algorithms running in it. What you'd have to do is ensure that most of the blocks of a file is stored in the same rack (trivial in a single-rack system), so that all bandwidth consumed is rack-local. Then bring up the job on a single machine, ask for the data and hope that network bandwidth coming off each node is adequate for the traffic generated by this job and all the others. Because at load time, there will be a lot of network IO. Provided that data load is only a small fraction of the analysis -the up front load time- this may be possible.

It seems to me that the best way to keep that network bandwidth down would be to store the data in RAM for a series of queries. You'd bring up something that acted as an in-RAM cache of the chosen dataset(s), then run multiple operations against it -either sequential or, better yet, parallelised across the many cores in that server. Because there'd be no seek time penalty, you can do work in the way the algorithm wants, not sequentially. Because it's in DRAM, different threads could work across the entire set simultaneously.

Yes, you could have fun here.

People have already recognised that trend towards hundreds of GB of RAM and tens of cores -the graph algorithms are being written for that. Yet they also benefit from having more than one server, as that helps them scale beyond a single server.

Being able to host a single-machine node with 200GB of data is something HDFS+YARN would support, but so would GPFS+Platform -so where's the compelling reason for Hadoop? The answer has to be that which today justifies running Giraph or Hama in the cluster -for specific analyses that can be performed with data generated by other Hadoop code, and so that the output can be processed downstream within the Hadoop layers.

For the MS Research example, you'd run an initial MR job to get the subsets of the data you want to work with out of the larger archives, then feed that data into the allocated RAM-analysis node. If that data could be streamed in during the reduce phase, network bandwidth would be less, though restart costs higher. The analysis node could be used for specific algorithms that know that access to any part of the local data is at RAM speeds, but that any other HDFS input file is available if absolutely necessary. HDFS and HBase could both be output points.

That's the way to view this world: not an either/or situation, which is where a lot of the Anti-MapReduce stories seem to start off, but with the question "what can you do here?", where the here is "a cluster with petabytes of data able to analyse it with MapReduce and Graph infrastructures -and able to host other analysis algorithms too?". Especially as once that cluster uses YARN to manage resources, you can run new code without having to buy new machines or hide the logic in long lived Map tasks.

The challenge then becomes, not "how to process all the data loaded into RAM on a server", but, "how to work with data that is stored in RAM across a small set of servers?" -the algorithm problem. At the infrastructure level, "how to effectively place data and schedule work for such algorithms, especially if the no of servers/job and duration is such that failures can be handled by checkpointing rather than restart that shard of the job". And "we can rent extra CPU time off the IaaS layer"

Finally, for people building up clusters, that perennial question crops up: many small servers vs fewer larger servers? I actually think the larger servers have a good story, provide you stick to servers with affordable ASPs and power budgets. More chance of local data, more flexible resource allocation (you could run something that used 128GB of RAM and 8 cores), and less machines to worry about. The cost: storage capacity and bandwidth; the failure of a high-storage-capacity node generates more traffic. Oh, and you need to worry about network load more too, but less machines reduces the cost of faster switches.

[Photo: the neolithic Avebury stone circle, midway between Bristol and London]

[updated: reread the text and cleaned up]

No comments:

Post a Comment

Comments are usually moderated -sorry.