Steve Loughran: Hadoop in Practice

Recent train journeys to and from London have given me a chance to get the laptop out and read some of the collected PDFs of things I know I should read.

I was given a PDF copy of Hadoop in Practice [Holmes, 2012] on account of the fact that I'd intermittently been in the preview program -but I'd not looked at it in any detail until now. The (unexpectedly) slow train journeys to and from London have been an opportunity to unfold the laptop and read it -and, at home, while I wait for EC2 to respond to Whirr requests, to read it to the end -though not in as much detail as it deserves.

The key premises of this book are:

1. You've read one of the general purpose "this is Hadoop" books -either the Definitive Guide or Hadoop in Action.
2. You want to do more with Hadoop.
3. You aren't concerned with managing the cluster.
4. You are concerned about how to integrate a Hadoop cluster with the rest of your organisation.

#3 means that there's nothing here on metrics, logging or low-level things. This is a book for developers and (yes) architects; less the operations people. Even so, the sections on integration with other systems, especially hooking up to log sources and databases, cover things that ops people need to know about.
Although it starts off with a quick overview of Hadoop and MapReduce, internals -such as how HDFS works- are relegated to appendices for the curious. Instead, the first detailed chapter looks at Ingress and Egress, or, so as not to scare readers, "Moving Data in and out of Hadoop", looking mostly at Flume, mentioning Chukwa and Scribe, and then moving on to using Oozie-scheduled MR jobs to pull data -something shown in an example in the book.

It doesn't delve into the aspects of this problem you'd need to worry about in production -data rates, the risk that MR pull jobs can either overload the endpoints or, unless they are split up well, can create imbalanced filesystems. Ops problems -or just too much to worry about right now. What it does do is show why a workflow engine like Oozie is useful: to automate the regular work.

The book glues the Hadoop ecosystem together. Want to parse XML? Grab the XML input reader from Mahout. Want to work with JSON? Twitter's Elephant Bird… etc. In fact the serialization chapter went into the depths of XML and JSON parsing -and showed the problems, so justifying the next stage: Protobuf, Avro and Thrift.

There's a chapter on tuning problems which focuses more on code-level issues than hardware; this is where the line between ops & developers gets blurred. I think I'd have approached the problem in a different order, but the tactics are all valid.

Installation-wise, Alex points everyone at a version of CDH without LZO support; he has to talk people through building it. I don't know where Cloudera stand on that, though I know yum -y install hadoop-lzo works for HDP, and is up there with hadoop-native hadoop-pipes hadoop-libhdfs and snappy as RPMs to add (update: see below). I'd have liked to have seen Bigtop as the centre of the universe, so as to be more neutral -something to hope for in the second edition.

There are a few chapters on "data science" stuff: bloom filters, simple graph operations, R & Hadoop integration. I get the feeling that this section is very handy if you know your statistics and want to do work with a new toolset. The problem I have there is a personal one: I've forgotten too much of what I knew about statistics. min, max, mean, Poisson, Gaussian and Weibull distributions; the notion of Markov chains: these are all concepts I know about -but ask me the equation behind a Poisson distribution and I stare as blankly at the questioner as our pet rabbit does when asked why he's been chewing power cables: there's no comprehension going on behind the eyeballs. I really need something that covers "statistics for people who used to know it vaguely -using R & Pig as the tools". There's a good argument for all developers to know more stats. This book isn't that -it does assume you know your statistics, at least better than I do.
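For reference, the Poisson equation is short enough to write down: the probability of seeing exactly k events when the mean rate is λ is λ^k e^(-λ) / k!. What follows is a minimal Python sketch of that formula, not something taken from the book; the function name poisson_pmf is purely illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean event count is lam.

    Implements P(k; lam) = lam**k * exp(-lam) / k!
    (poisson_pmf is an illustrative name, not from the book.)
    """
    if k < 0:
        # A negative event count is impossible.
        return 0.0
    return (lam ** k) * math.exp(-lam) / math.factorial(k)


if __name__ == "__main__":
    # Chance of exactly 2 events when 3 are expected on average.
    print(poisson_pmf(2, 3.0))
```

As a sanity check, summing poisson_pmf(k, lam) over all k for a fixed lam should come out at 1, and the weighted sum of k should give back lam.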

Alex Holmes delves into MRUnit, which is a good way to unit test individual operations. I tend to do something else: MiniMRCluster -but that one, while more authentic, can push problems onto different threads and so make it harder to identify root causes of problems -or isolate tests. MRUnit doesn't have that flaw, and nor does LocalJobRunner -which also gets coverage. The only thing that grated against me there was that the tests were done in Java -I've been using Groovy as my test language for the whole of 2012, and the sheer verbosity of setting up lists in Java, and the crudeness of JUnit's assertions compared to Groovy's assert statements, are painful to look at.

For anyone who's never used Groovy, its assert statement takes advantage of the compile-on-demand features of the language. On an assertion failure, the output walks through the entire expression tree, evaluates every part in turn and gives you the complete tree for your debugging pleasure. You can write one all-encompassing assertion, rather than break down each part of a large query into various assertNotNull, assertTrue, assertEquals calls -and if the single assert fails, there should be enough information for you to track down the cause. That's why I like testing in Groovy, irrespective of whether or not your production code is in Java.

Other points: the ebook comes with your email address at the bottom, but no epub-esque security. This works on your Linux workstation as well as whatever tablet you choose to own -and relies on publicity & guilt to stop sharing. Which is probably a good strategy. The ebook also comes with a feature I've never seen before: the page numbers in the contents match exactly the page numbers in the book -there must be some FrameMaker magic that tells Preview &c the offset to apply after the user hits the "go to page" button.

Summary: this isn't a book for newbies -precisely because it delves into Applied Hadoop. Even so, it's something you ought to have to hand, just so you aren't one of the people posting questions to user@hadoop that everyone else stares at and generally refuses to answer: the "hello, I have got a pseudo-distributed cluster that cannot find localhost, here is the screenshot of the DOS console, please help!!!" -while forgetting to even include the screenshot of their hadoop.bat command line failing as they've forgotten to do something foundational like install Java.

Everyone but @castagna will learn something new -in fact maybe even him, because he needs something to read on test runs and trains to London (which is where I'm writing this, somewhere between Reading and London Paddington)

Update: Eric Sammer says of the LZO thing "hadoop-lzo in cdh, it's because of license concerns that we don't distrib."

Steve Loughran

2012-10-17

Hadoop in Practice - "Applied Hadoop"
