2013-06-30

Hoya: HBase on YARN

I didn't go to the Hadoop Summit, though it sounds really fun. I am having lots of fun at a Big Data in Science workshop at Imperial College instead, where the problems include "will the code to process my data still work in 50 years?", as well as those the Square Kilometre Array will have (10x the physics dataset, sources spread across a desert, datacentre in the desert too).

What did make it over to the Summit is some of my code, the latest of which is Hoya, HBase on YARN. I have been busy coding this for the last four weeks:
Outside office

Having the weather nice enough to work outside is lovely. Sadly, the wifi signal out there is awful, which doesn't matter until I need to do Maven things, at which point I have to run inside and hold the laptop vertically beneath the base station two floors above.

Coding Hoya at the office

It's not that readable, but up on my display is the flexing code in Hoya: the bit in the AM that handles a request from the client to add or remove nodes. It's wonderfully minimal code: all it does is compare the (possibly changed) number of worker nodes wanted with the current value, then decide whether to ask the RM for more nodes (using the predefined memory requirements of a Region Server), or to release nodes -in which case the RM will kill the Region Servers, leaving the HBase master to notice this and handle the loss.
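
To make that concrete, here's a minimal sketch of what such flex logic can look like against the Hadoop 2 AMRMClient API. This is not the Hoya source: the class, the field names and the container bookkeeping are all illustrative.

    import java.util.LinkedList;
    import org.apache.hadoop.yarn.api.records.ContainerId;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.util.Records;

    /** A sketch of the flex idea, not the Hoya source; names are illustrative. */
    public class FlexSketch {
      private static final int WORKER_PRIORITY = 1;
      private final AMRMClient<ContainerRequest> amRMClient;
      private final int regionServerMemoryMB;
      // containers currently hosting Region Servers; kept up to date by the
      // allocation callbacks (elided here)
      private final LinkedList<ContainerId> workerContainers =
          new LinkedList<ContainerId>();

      public FlexSketch(AMRMClient<ContainerRequest> amRMClient,
          int regionServerMemoryMB) {
        this.amRMClient = amRMClient;
        this.regionServerMemoryMB = regionServerMemoryMB;
      }

      /** Compare the wanted worker count with the current one; act on the difference. */
      public void flexCluster(int desiredWorkers) {
        int delta = desiredWorkers - workerContainers.size();
        if (delta > 0) {
          // ask the RM for more containers, sized for a Region Server
          Resource capability = Records.newRecord(Resource.class);
          capability.setMemory(regionServerMemoryMB);
          Priority priority = Records.newRecord(Priority.class);
          priority.setPriority(WORKER_PRIORITY);
          for (int i = 0; i < delta; i++) {
            amRMClient.addContainerRequest(
                new ContainerRequest(capability, null, null, priority));
          }
        } else {
          // release surplus containers: the RM kills those Region Servers and
          // the HBase master notices the loss and reassigns their regions
          for (int i = 0; i < -delta; i++) {
            amRMClient.releaseAssignedContainer(workerContainers.removeLast());
          }
        }
      }
    }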

Asking for more nodes leaves the YARN RM to satisfy the request; it then calls back to the AM saying "here they are". At which point Hoya sets up a launch request containing references to all the config files and binaries that need to go to the target machine, and a command line that is simply the hbase command line. There is no need for a Hoya-specific piece of code running on every worker node; YARN does all the work there.
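
That launch request is a standard YARN ContainerLaunchContext. Here's a hedged sketch of the setup, where the paths, the exact hbase arguments and the localResourceFor() helper are assumptions, not the actual Hoya code:

    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.LocalResource;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.util.Records;

    /** A sketch of the launch step; paths, command and helper are illustrative. */
    public abstract class LaunchSketch {
      /** Wrap an HDFS path as a LocalResource for YARN to localize (elided). */
      protected abstract LocalResource localResourceFor(Path path) throws IOException;

      public void launchRegionServer(NMClient nmClient, Container container,
          Path hbaseArchive, Path generatedConfDir) throws Exception {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);

        // references to the binaries and config files that need to go to the
        // target machine; YARN downloads them before starting the process
        Map<String, LocalResource> resources = new HashMap<String, LocalResource>();
        resources.put("hbase", localResourceFor(hbaseArchive));
        resources.put("conf", localResourceFor(generatedConfDir));
        ctx.setLocalResources(resources);

        // the command line is just the hbase one, with output going to YARN's logs
        ctx.setCommands(Collections.singletonList(
            "hbase/bin/hbase --config conf regionserver start"
            + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/out.log"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/err.log"));

        nmClient.startContainer(container, ctx);
      }
    }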

Some other aspects of Hoya for the curious
  • Hoya can take a reference to a pre-installed HBase instance: one installed by management tools such as Ambari, or kickstart-installed onto all the hosts. Hoya will ignore any template configuration file there, pushing out its own conf/ dir under the transient YARN-managed directories and pointing HBase at it.
  • Although HBase supports multiple masters, Hoya just creates a single master, exec'd off the Hoya AM. In a multi-master deployment, all but the live HBase master are simply waiting for ZK to give them a chance to go live -they're there for failure recovery. It's not clear we need that, not if YARN restarts the AM for us.
  • Hoya remembers its cluster details in a ~/.hoya/clusters/${clustername} directory, including the HBase data, a snapshot of the configuration, and the JSON file used to specify the cluster.  You can machine-generate the cluster spec if you want.
  • The getClusterStatus() AM API call returns a JSON description of the live cluster, in the same JSON format. It just adds details about every live node in the cluster. It turns out that classic Hadoop RPC has a max string size of <32K, so I'll need to rework that for larger clusters, or switch to protobuf, but the idea is simple: the same JSON structure is used for both the abstract specification of the cluster, and the description of the instantiated cluster. Some former colleagues will be noting that's been done before, to which the answer is "yes, but this is simpler and with a more structured format, as well as no cross-references".
  • I've been evolving the YARN-679 "generic service entry point" for starting both clients and services. This instantiates the service named on the command line, hooking up signal handling to stop it. It then invokes -if present- an interface method, int runService(), to run the service, exiting with the returned error code. Oh, and it passes down the command line args (after extracting and applying conf file references and in-line definitions from them) before Service.init(Config) is called. This entry point is designed to eliminate all the service-specific entry points, but also to provide some in-code access points -letting you use it to create and run a service from your own code, passing in the command line args as a list/varargs. I used that a lot in my tests, but I'm not yet sure the design is right. Evolution and peer review will fix that. There's a sketch of the idea after this list.
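
Here's that sketch, reconstructed from the description above; the RunService interface name and the launcher details are assumptions based on the int runService() method described, not the actual YARN-679 patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.service.Service;

    /** Optional interface: if a service implements it, runService() is invoked
        and its result becomes the process exit code. */
    interface RunService {
      int runService() throws Throwable;
    }

    public class ServiceLauncher {
      public static void main(String[] args) throws Throwable {
        // instantiate the service named on the command line
        final Service service = (Service) Class.forName(args[0]).newInstance();

        // hook up signal handling to stop the service
        Runtime.getRuntime().addShutdownHook(new Thread() {
          @Override
          public void run() {
            service.stop();
          }
        });

        // conf file references and in-line definitions would be extracted from
        // args and applied here, before init() is called (elided)
        service.init(new Configuration());
        service.start();

        int exitCode = 0;
        if (service instanceof RunService) {
          // run the service in this thread; its result is the exit code
          exitCode = ((RunService) service).runService();
        } else {
          // otherwise just block until the service stops itself
          service.waitForServiceToStop(0);
        }
        service.stop();
        System.exit(exitCode);
      }
    }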

Developing against YARN in its last few weeks of pre-beta stabilisation was entertaining -there was a lot of change. A big piece of it -YARN-117- was my work; getting it in meant that I could switch from the fork I was using to that branch, after which I was updating hadoop/branch-2, patching my code to fix any compile issues, and retesting every morning. Usually this was seamless; one day it took me until mid-afternoon for everything to work, and an auth-related patch on a Saturday stopped test clusters working until Monday. Vinod was wonderfully helpful here, as was Devaraj with testing 50+ node clusters. Finally, on the Tuesday, the groovyc support in Maven stopped working for all of us in the EU who caught an incompatible dependency upgrade first. To their credit, the Groovy dev team responded fast, not only with a full fix out by the end of the day, but with some rapid suggestions on how to get back to a working build. It's just when you are trying to get something out for a public event that these things hit your schedule: plan for them.

Also: Hoya is written in a mix of Java (some of the foundational stuff) and Groovy -all the tests and the AM & client themselves. This was my second attempt at a Groovy YARN app, "Grumpy" being my first pass back in Spring 2012, during my break between HPLabs and Hortonworks. That code was never finished and was too out of date to bother with; instead I started with the current DistributedShell example, tracking the changes made to it during the pre-beta phase and pulling them over. The good news: a big goal of Hadoop 2.1 is stable protobuf-based YARN protocols, with stable classes to help.

Anyway, Hoya works as a PoC, and we should be letting it out for people to play with soon. As Devaraj has noted: we aren't committed to sticking with Groovy. Some features were useful -lists, maps, closures, and @CompileStatic, which finds problems fast as well as speeding up the code- but the language was a bit quirky and I'm not sure it was worth the hassle. For other people about to write YARN apps, have a look at Continuuity Weave and see if that simplifies things.


P.S.: we are hiring.