Hadoop: going way beyond Google


Lars George has some nice slides up on Hadoop and its futures Hadoop is dead, long live Hadoop! -but it shows a slide that I've seen before and it concerns me

It's on p39 where's a list of papers published by google tying them in to Hadoop projects, implying that all Hadoop is is a rewrite of their work. While I love Google Research papers, they make great reads, we need to move beyond what Google's published work, because that strategy has a number flaws

Time: with the lag from google using to writing about it being 2+ years, plus the time to read and redo the work, that's a 4 year gap. Hardware has moved on, the world may be different. Googles assumptions and requirements need to be reassessed before redoing their work. That's if it works -the original MR paper elided some details needed to get it to actually work [1].

Relevance outside google. Google operate -and will continue to operate- at a scale beyond everyone else. They are doing things at the engineering level -extra checksums on all IPCs- because at their scale the probability of bit errors sneaking past the low-level checksums is tangible. Their Spanner system implements cross-datacentre transactions through the use of GPS to bound time. The only other people doing that in public are telcos who are trying to choreograph time over their network.

Its good that google are far ahead -that does help deal with the time lag of implementing variants of their work. When the first GFS and MR papers came out, the RDBMS vendors may have looked and said "not relevant", now they will have meetings and presentations on "what to do about Hadoop". No, what I'm more worried about is whether there's a risk that things are diverging -and its cloud/VM hosting that's a key one. The public cloud infrastructures: AWS, Azure, Rackspace, show a different model of working -and Netflix have shown how it changes how you view applications. That cloud is a different world view: storage, compute, network are all billable resources.

Hadoop in production works best on physical hardware, to keep costs of storage and networking so low -and because if you try hard you can keep that cluster busy, especially with YARN to run interesting applications. Even so we all use cloud infras too because they are convenient. I have a little one-node Hadoop 2.1 cluster on a linux VM so I can do small-scale Hoya functional tests even when offline. I have an intermittent VM on rackspace so I can test the Swift FileSystem code over there. And if you do look at what Netflix open source, they've embraced that VM-on-demand architecture to scale their clusters up and down on demand.

It's the same in enterprises: as well as the embedded VMWare installed base, OpenStack is getting some traction, and then there is EC2, where dev teams can implement applications under the radar of ops and IT. Naughty, but as convenient as getting a desktop PC was in the 1980s. What does it mean? It means that Hadoop has to work well in such environments, even if virtual disk IO suffers and storage gets more complex.

Other work. Lots of other people have done interesting work. If you look at Tez, it clearly looks at the Dryad from MS Research. But there's also some opportunities to learn the Stratosphere project, that assume a VM infrastructure from the beginning -and build their query plans around that.

Google don't talk about cluster operations.

This is really important. All the stuff about running google's clusters are barely hinted at most of the time. To be fair, neither does any one else, it's "the first rule of the petabyte club".

[Boukharos08] GCL Viewer - A study in improving the understanding of GCL programs. This is a slightly blacked out paper discussing Google's General Config Language, a language with some similarities to SmartFrog: templates, cross-references, enough operations to make it hard to statically resolve, debugging pain. That paper looks at the language -what's not covered is the runtime. What takes those definitions and converts it to deployed applications, applications that may now span datacentres?

[Schwarzkopf13]: Omega: flexible, scalable schedulers for large compute clusters. This discusses SLA-driven scheduling in large clusters, and comes with some some nice slides.

The Omega paper looks at the challenge of scheduling mixed workloads in the same cluster, short-lived analytics queries, longer processing operations: PageRank &c, and latency-sensitive services. Which of course is exactly where we are going with YARN -indeed, Schwarzkopf et al [2] cover YARN, noting that it will need to support more complex resource allocations than just RAM. Which is of course, exactly where we are going. But what the Omega paper doesn't do is provide any easy answers -I don't think there are any, and if I'm mistaken then none of the paper's authors have mentioned it [3].

Comparing YARN to Omega, yes, Omega is ahead. But being ahead is subtly different from being radically new: the challenge of scheduling mixed workloads is the new problem for us all -which is why I'm not only excited to see a YARN paper accepted in a conference, I'm delighted to see mentions of Hoya in it. Because I want the next Google papers on cluster scheduling to talk about Hoya [4].

Even so: scheduling is only part of the problem: Management of a set of applications with O(1) ops scaling is the other [5]. That is a secret that doesn't covered, and while it isn't as exciting as new programming paradigms for distributed programming [6] it is as utterly critical to datacentre-scale systems as those programming paradigms and the execution and storage systems to run them [7].

Where else can the Hadoop stack innovate? I think the top of the stack -driven by application need is key -with the new layers driven by the new applications: the new real-world data sources, the new web applications, the new uses of data. There's also the tools and applications for making use of all this data that's being collected and analysed. That's where a lot of innovation is taking place -but outside of twitter, LinkedIn and Netflix, there's not much in the way of public discussion of them or sharing of the source code. I think companies need to recognise the benefits of opening up your application infrastructure (albeit not the algorithms, datasets or applications), and get it out before other people open up competitive alternatives that reduce the relevance of their own project.

[1] I only wrote that to make use of the word elision.
[2] The elison of the other authors to "et al" is why you need 3+ people on a paper if you are the first author.
[3] This not because I'm utterly unknown to all of them and my emails are being filtered out as if they were 419 scams. I know John from our days at HP Labs.
[4] Even if to call me out for being mistaken and vow never to speak to me.
[5] Where the 1 in the O(1) has a name like Wittenauer.
[6] Or just getting SQL to work across more boxes.
[7] Note that Hortonworks is hiring, we'd love people to come and join us on these problems in the context of Hadoop -and unlike Google you get to share both your work and your code.

[Photo: Sepr on Lower Cheltenham Place, Montpelier]


  1. Interesting take. So, how many different target environments do you see?

    - on-premise hardware
    - VMs in Cloud
    - Google-style large-scale cluster

    Is that it? More? Fewer? And where do you see Hadoop applying? To #1 and #2? Do you see #2 and #3 living side-by-side in the long term?

  2. On-prem splits into physical and on-prem virtual .

    Where do I see things going? I don't know, What do you think?

  3. Hello,

    I think having a gap of 5 years to google's innovations is not so bad.

    The world moves on and in 5-10 years (once Apache projects re-implemented google's publications) other organizations also have the problems, google had 10 years before.

    But you're right, the community outside of Google needs to come up with their own ideas. I actually think that this is also intended by google. So hopefully, we see a transfer of knowledge back into google, from open source!

    Secondly I wanted to point out (as a Stratosphere developer) that the paper you cited is quite old (http://stratosphere.eu/assets/papers/Nephele_09.pdf). Our system is actually a lot more than Nephele (our low-layer network and task management system). We have extended the operator model to include not only Map and Reduce but also Join, Cross and more to come. You can write code with Scala, similar to Scalding ..

    Stratosphere runs on premise on the bare metal and on virtualized infrastructures. (EC2 of course too). We're working currently to run it in YARN.


Comments are usually moderated -sorry.