A note on distributed computing diagnostics

Sepr @ See No Evil

I've just stuck in my first mildly interesting co-authored Hadoop patch for a while, HADOOP-7466: "Add a standard handler for socket connection problems which improves diagnostics"

It tries to address two problems.

Inadequate Diagnostics in the Java Runtime

Despite Java being "the language of the Internet" or whatever Sun used to call it, when you get any kind of networking problem (Connection Refused, No Route to Host, Unknown Host, Socket Timed Out), the standard Java exception messages don't bother to tell you which host isn't responding, what port is refusing connections or anything else useful. In a room with 2000 machines, it's not that useful to know that one of them can't talk to another. You need to know which machine is having problems, what other machine it is trying to talk to, and whether it's the HDFS level or something above. But no, the exception text never gets any better. Whoever wrote those messages didn't read Waldo's "A Note on Distributed Computing", and thinks that if two machines are near each other nothing can possibly go wrong.

Whatever they were thinking, if they tried to submit exception messages like that to the Hadoop codebase today, the review process would probably bounce them back with a "make this vaguely useful". The patch tries to fix this by taking the exception and the (hostname, port) of the source and destination (if known), and then including these details in the exception text. This helps people like me know what's gone wrong with our configuration and/or network.
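To make the idea concrete, here is a minimal sketch in Java of that kind of wrapping. It is not the code in the patch: the class name SocketDiagnostics, the method names withEndpoints and endpoint, and the exact message wording are all made up for illustration. It assumes the caller knows (some of) the local and remote host and port, and treats a null host or a non-positive port as "not known".

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration only; not the class added by the patch. */
final class SocketDiagnostics {
    private SocketDiagnostics() {}

    /**
     * Build a new exception of the same broad type as the original,
     * with a message that names both ends of the failed connection.
     * The original exception is kept as the cause, so no stack trace is lost.
     */
    static IOException withEndpoints(IOException e,
                                     String localHost, int localPort,
                                     String destHost, int destPort) {
        String text = "Call from " + endpoint(localHost, localPort)
                + " to " + endpoint(destHost, destPort)
                + " failed: " + e;
        IOException wrapped;
        if (e instanceof BindException) {
            wrapped = new BindException(text);
        } else if (e instanceof ConnectException) {
            wrapped = new ConnectException(text);
        } else if (e instanceof NoRouteToHostException) {
            wrapped = new NoRouteToHostException(text);
        } else if (e instanceof UnknownHostException) {
            wrapped = new UnknownHostException(text);
        } else if (e instanceof SocketTimeoutException) {
            wrapped = new SocketTimeoutException(text);
        } else {
            wrapped = new IOException(text);
        }
        wrapped.initCause(e);
        return wrapped;
    }

    /** Format "host:port", leaving out whichever part is not known. */
    static String endpoint(String host, int port) {
        String h = (host == null) ? "(unknown host)" : host;
        return (port > 0) ? h + ":" + port : h;
    }
}
```

With this sketch, wrapping a plain ConnectException("Connection refused") raised on worker17 while calling namenode:8020, with the local port unknown, gives a ConnectException whose message is "Call from worker17 to namenode:8020 failed: java.net.ConnectException: Connection refused". Keeping the exception type the same means existing catch clauses for ConnectException and friends carry on working.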

Inadequate understanding of the fundamental network error messages

This is something I despair of. There are people out there that haven't done enough homework to know what a ConnectionRefused exception means, and ask for help when they see it. Again, again and again. Same for all the other common error messages.

The people who are trying to set up Hadoop clusters who don't yet know what these error messages mean are way out of their depth. That should be an appendix to Waldo's paper: the many layers of historical code underneath are not transparent; it helps to have read Tanenbaum's "Computer Networks" book, it helps to spend some time writing code at the socket layer, just to understand what goes wrong at that level. Trying to download the Hadoop artifacts and then push them out to a small set of machines without this basic knowledge is dooming these people to days of confusion, which inevitably propagates to the mailing lists and bug trackers. Usually someone posts a stack trace to the -user and -dev lists, then starts repeating it every hour until someone answers; the total cost of wasted time is surprisingly high.

The patch, therefore, also adds references to the wiki pages for ConnectionRefused, UnknownHost, NoRouteToHost, BindException and SocketTimeout, all of which list some possible causes and some tips on debugging the problem. They also say: this is your network, your configuration, you are going to have to fix it yourself.
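As a rough illustration of the wiki-link part, here is a small Java sketch that maps an exception type to one of the page names above and appends a link to the error text. Again, this is not the patch itself: the class WikiHints, its methods, the message wording and especially the base URL (a placeholder on example.org) are assumptions made for the example. Only the page names come from the list above.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration only; not the class added by the patch. */
final class WikiHints {
    /** Placeholder base URL: an assumption, not the real wiki address. */
    static final String WIKI_BASE = "https://wiki.example.org/hadoop/";

    private WikiHints() {}

    /** Wiki page name for a known network failure, or null if there isn't one. */
    static String pageFor(IOException e) {
        if (e instanceof BindException) return "BindException";
        if (e instanceof ConnectException) return "ConnectionRefused";
        if (e instanceof NoRouteToHostException) return "NoRouteToHost";
        if (e instanceof UnknownHostException) return "UnknownHost";
        if (e instanceof SocketTimeoutException) return "SocketTimeout";
        return null;
    }

    /** The exception text, followed by a pointer to the matching wiki page if any. */
    static String describe(IOException e) {
        String page = pageFor(e);
        return (page == null)
                ? e.toString()
                : e + "; for more details see: " + WIKI_BASE + page;
    }
}
```

Calling WikiHints.describe(new UnknownHostException("nn1")) in this sketch returns "java.net.UnknownHostException: nn1; for more details see: https://wiki.example.org/hadoop/UnknownHost", while an exception with no matching page is returned unchanged.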

Will it stop people asking for help? Unlikely. But it may get them to learn what the messages mean, and why it is a problem on their side. Because it's not my problem.

[Artwork by Sepr]


H work

This system needs a push over the edge

On my todo list, then, is to catch up with what's going on in Hadoop, get some of my minor issues checked in and get involved with the fun stuff in Trunk. That includes the 0.23 YARN stuff, but I'm also starting to think there are some data integrity risks that ought to be addressed.

First step: building everything. I am particularly excited to see that Hadoop-trunk now requires me to download and build protocol buffers [README]. I shall be updating the relevant wiki pages so that I can remember what to do on the other machines.


301 Moved Permanently

I've moved my blog from 1060.org here. Why? Well, the team at 1060research have pushed out a new release of their NetKernel product, on which the blog was running, and if I wanted to retain the URLs I'd have to upgrade the code myself.  Being lazy and all, I opted not to.

What have I been up to since going offline?
  1. On twitter @steveloughran. Idle chatter.
  2. Finishing up some major project at work that has kept me busy for the past 12-18 months. I am feeling more relaxed now.
  3. Coding in Groovy. It's like Java only better, and trivial to switch between the two. There's great IDE support in IntelliJ IDEA too.
  4. Doing some proper Computer Science stuff, as opposed to Software Engineering.
  5. Paper Reading. This may seem dull, but there is a lot of interesting stuff out there. If you spend too much time knee-deep in various projects' codebases you get sucked into various issues (log4j configuration etc), and out of touch with higher level problems. 
  6. Holiday on the south coast of England. Not far; lazy. I lost a camera, which is a pity, but I've replaced it already.
My most recent set of readings was all the big datacentre papers on DRAM and HDD failures. A good way to revise concepts like Poisson Distributions, Gaussian Distributions, Weibull Distributions (which I never knew of before), and lots of other math-hard problems. It's shocking how much maths I have forgotten; I keep seeing these bits in the papers where they do integration or differentiation and say "clearly then" and the words mean nothing to me. I will have to revise some fundamentals.

With the work I've been doing wrapping up, I'm hoping to get more involved in Hadoop and Hadoop related work. I don't have a schedule for that -I'm just reading now.