Someone -and I won't name them- commented on my proposal for a Hadoop Summit EU talk, Secrets of YARN development: "I am reading YARN source code for the last few days now and am curious to get your thoughts on this topic - as I think HOYA is a bad example (sorry!) and even the DistributedShell is not making any sense."
My response: I don't believe that DShell is a good reference architecture for a YARN app. It sticks all the logic for the AM into the service class itself, doesn't do much on failures, avoids the whole topics of RPC and security. It introduces the concepts but if you start with it and evolve it, you end up with a messy codebase that is hard to test -and you are left delving into the MR code to work out how to deal with YARN RM security tokens, RPC service setup, and other details that you'd need to know in production
- Embraces the service model as the glue to building a more complex application. Shows my SmartFrog experience in building workflows and apps from service aggregation.
- Completely splits the model of the YARN app from the YARN-integration layer, producing a model-controller design. Where the model can be tested independently of YARN itself.
- Provides a mock YARN runtime to test some aspects of the system --failures, placement history, best-effort placement-history reload after unplanned AM failures --and lays the way for simulating the model can handle 1000+ clusters.
- Contains a test suite that even kills HBase masters and Region Servers to verify that the system recovers.
- Implements the secure RPC stuff that Dshell doesn't and which isn't documented anywhere that I could find.
- Bundles itself up into a tarball with a launcher script -it does not rely on Hadoop or YARN being installed on the client machine.
Where it is weak is
- It's now got too sophisticated for an intro to YARN.
- I made the mistake of using protobuf for RPC which is needless complexity and pain. Unless you really, really want interop and waste a couple of days implementing marshalling code I'd stick to the classic Hadoop RPC. Or look at Thrift.
- I need to revisit and cleanup of bits of the client side provider/template setup logic.
- We need to implement anti-affinity by rejecting multiple assignments to the same host for non-affine roles.
- It's pure AM-side, starting HBase or Accumulo on the remote containers, but doesn't try hooking the containers up to the AM for any kind of IPC.
- We need to improve its failure handling with more exponential backoff, moving average blacklisting and some other details. This is really fascinating, and as Andrew Purtell pointed me at phi-accrual failure detection, is clearly an opportunity to some interesting work.
I also filed a JIRA "rework DShell to be a good reference design", which means implement the MVC split and add a secure RPC service API to cover that topic.
Otherwise: have a look at the twill project in incubation. If someone is going to start writing a YARN app, I'd say: start there.