HDP released -and what it took to get there

This week the Hortonworks Data Platform launched -it ships on Friday. I've been working with it for a while, and I'm really excited about the launch and about seeing it get into real people's hands.

Most of the coverage will be about features, the management console, other things for enterprise users -like availability. I'm going to look at another aspect of the product: the time and effort put in by the QA people.
[Photo caption: Cricket before the bridge]
The Hadoop stack is open source, and it's designed to run on everything from a few machines to very large clusters. Some of the design decisions only make sense at the large scale. For example, tasks are only handed out to task trackers when they check in with heartbeats or task completion events -because in a large cluster, 1000+ servers checking in generate enough traffic that handing work out this way is the only way to stop the JobTracker overloading. This design can lead to a worse startup time in very small clusters (tip: decrease the heartbeat interval), but it's something that had to be done for the scale-up.

Then there are all the bugs that only show up at scale -like the need to have datanodes report in on a different port number than the one used for DFS client operations. That stops a namenode overloaded with FS requests from missing datanode heartbeats and thinking more datanodes are down, which leads to a cascade failure of the cluster. These are scale problems that get found and fixed on the big clusters -which means that nobody else will hit the same problems. And, with the big clusters running the same code as the smaller ones, all the smaller clusters -the majority- get to benefit from the ongoing performance enhancements, scheduler improvements and other features which the teams working with the large clusters develop.
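
To put rough numbers on the heartbeat trade-off mentioned above, here is a toy model in Python. It is not Hadoop code: the function names, the cluster sizes and the intervals are all made up for illustration, and it assumes every tracker checks in exactly once per interval and that work turns up at a random point within that interval.

```python
# Toy model of heartbeat-driven task assignment. Not Hadoop code:
# every name and number here is an illustrative assumption.

def heartbeats_per_second(trackers: int, interval_s: float) -> float:
    """Check-ins per second the master sees if each tracker reports once per interval."""
    return trackers / interval_s


def mean_wait_for_work_s(interval_s: float) -> float:
    """Average delay before an idle tracker next checks in and can be given a task,
    assuming the work became available at a uniformly random point in the interval."""
    return interval_s / 2.0


if __name__ == "__main__":
    # (number of trackers, heartbeat interval in seconds) -hypothetical values
    scenarios = [(5, 3.0), (5, 0.5), (1000, 3.0), (1000, 0.5)]
    for trackers, interval_s in scenarios:
        load = heartbeats_per_second(trackers, interval_s)
        wait = mean_wait_for_work_s(interval_s)
        print(f"{trackers:>5} trackers, {interval_s:.1f}s interval: "
              f"{load:.1f} heartbeats/s at the master, ~{wait:.2f}s mean wait for work")
```

In this model a shorter interval costs a five-node cluster almost nothing and gets work started sooner, while the same setting on 1000 nodes multiplies the check-in traffic the master has to absorb. That is the shape of the trade-off, not a measurement of any real cluster.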

To use Hadoop in a big cluster, you need to be sure it works at scale -and with the performance to back up the scale. That's a test problem which very few organisations can address. This marks a big difference between these datacentre-scale platforms and the rest of the Apache portfolio. I can build and test my own version of Hadoop to run in small clusters (indeed, I've done just that), but it's not something that anyone could trust to work at large scale.

What's even more important is that it's not just HDFS and the MR layer, it's the entire stack. This is what the QA people have pulled off -testing everything up the stack, at scale. The problems they've reported have been turned into JIRAs and then, through the work of the Hortonworks developers and many others in the Hadoop dev community, into patches -patches that are now in the Apache codebase.

What the QA team have done, then, is help ensure that the Apache releases are qualified to work as a coherent stack even at a scale most people won't encounter. Which means that the rest of us can be confident that it will work for us too, whether it's the HDP product, or just the plain Apache artifacts you pull down in your build process(*).

Which is why, despite all the great stuff that's gone into the release, including stuff of mine that I'm really proud to have covered, I think it's the work of the QA people, testing the code on the big 1000+ node cluster, that deserves credit and recognition. They are the people who have given us a stack that works.

(*) I know, there's a small catch there -different configurations- which I'll look at some other time.

[Photo: Saturday afternoon cricket in front of the Clifton Suspension Bridge]
