Stonebraker is at it again, publishing another denunciation of the Hadoop stack, with the main critique being its reimplementations of old stuff at Google. Well, I've argued in the past that we should be thinking past Google, but not for the reason Stonebraker picks up on.
My issue is that hardware and systems are changing, and that we should be designing for what comes next: mixed-moving-to-total SSD, many-core NUMA, faster networking, and virtualization as the deployment model. Which is what people are working on in different parts of Hadoop-land, I'm pleased to say.
But yes, we do like our Google papers: while incomplete, Google still goes to a lot of effort to describe what it has built. We don't get to read much about, say, Amazon's infrastructure by comparison. Snowden seems to have some slideware on large government projects, but they don't go into the implementation details, and given that the goal of one of those projects was to sample Yahoo! chat frames, we can take it as a given that those apps are pretty dated too. That leaves: Microsoft, Yahoo! Labs, and what LinkedIn and Netflix are doing, the latter all appearing in source form and integrating with Hadoop from day one. No need to read and implement when you can just git clone and build locally.
There's also the very old work, and if everyone is going to list their favourite papers —I'm going to look at some other companies: DEC, Xerox, Sun and Apollo.
What the people there did then was profound at the time, and much of it forms the foundation of what we are building today. Some aspects of the work have fallen by the wayside —something we need to recognise, and consider whether that happened because it was ultimately perceived as a wrongness on the face of the planet (CORBA), or because some less-powerful-yet-adequate interim solution became incredibly successful. We also need to look at what we have inherited from that era, whether the assumptions and goals of that period hold today, and consider the implications of those decisions as applied to today's distributed systems.
Starting with Birrell and Nelson, 1984, Implementing Remote Procedure Calls.
This is it: the paper that said "here is how to make calling a remote machine look like calling some subroutine on a local system".
"The primary purpose of our RPC project was to make distributed computation easy. Previously, it was observed within our research community that the construction of communicating programs was a difficult task, undertaken only by members of a select group of communication experts."

Prior to this, people got to write their own communication mechanism, whatever it was. Presumably some low-level socket-ish link between two processes and hand-marshalled data.
After Birrell's paper, RPC became the way. It didn't just explain the idea, it showed the core implementation architecture: stubs on the client, a service "exporting" its interface by way of a server-side stub, and an RPC engine to receive requests, unmarshal them and hand them off to the server code.
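To make that concrete, here is a minimal in-process sketch of the stub pattern in plain Java. The names (TimeService and friends) are invented for illustration, and the "transport" is a direct method call rather than a socket, but the shape is the one the paper describes: the client stub marshals the call, the server stub unmarshals it and hands it to the service code.

```java
// Minimal sketch of the stub/dispatch pattern: client calls an ordinary interface
// method, a dynamic proxy "marshals" the call, a server stub dispatches it to the
// real implementation. Names are invented; a real transport would use a socket.
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Arrays;

public class RpcSketch {

  // The service interface shared by client and server.
  interface TimeService {
    String now(String zone);
  }

  // Server-side implementation: the "exported" service.
  static class TimeServiceImpl implements TimeService {
    public String now(String zone) {
      return "2024-01-01T00:00:00 " + zone;   // stand-in for real work
    }
  }

  // Server stub: receives a marshalled request, finds the method, invokes it.
  static class ServerStub {
    private final Object target;
    ServerStub(Object target) { this.target = target; }

    Object dispatch(String methodName, Object[] args) throws Exception {
      for (Method m : target.getClass().getMethods()) {
        if (m.getName().equals(methodName) && m.getParameterCount() == args.length) {
          return m.invoke(target, args);      // hand off to the service code
        }
      }
      throw new NoSuchMethodException(methodName);
    }
  }

  // Client stub: a dynamic proxy that marshals the call and ships it to the server stub.
  static TimeService clientStub(ServerStub server) {
    InvocationHandler handler = (proxy, method, args) -> {
      System.out.println("marshalling call: " + method.getName() + Arrays.toString(args));
      return server.dispatch(method.getName(), args);
    };
    return (TimeService) Proxy.newProxyInstance(
        TimeService.class.getClassLoader(), new Class<?>[]{TimeService.class}, handler);
  }

  public static void main(String[] args) throws Exception {
    ServerStub server = new ServerStub(new TimeServiceImpl());
    TimeService remote = clientStub(server);
    // Looks exactly like a local call; that is the whole point, and the whole trap.
    System.out.println(remote.now("UTC"));
  }
}
```

The caller writes remote.now("UTC") and nothing in the syntax admits that a network might be involved, which is exactly the property the rest of this post worries about.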
It also did service location, via Xerox Grapevine, in which services were located by type and instance. This allowed instances to be moved around, and sets of services to be enumerated. It also removed any hard-coding of service to machine, something that classic DNS has tended to reinforce. Yes, we have a global registry, but now we have to hard-code services to machines and ports ("namenode1:50070"), then try to play games with floating IP addresses, which can be cached by apps (JVMs, by default), or pass lists of the set of all possible hosts down to the clients...tricks we have to do because DNS is all we are left with.
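For contrast, here is a hypothetical sketch of what lookup-by-type-and-instance buys you. ServiceRegistry and every name in it are made up for illustration; this is not Grapevine's API or anything shipping today, just the difference between binding to a location and binding to a service.

```java
// Hypothetical sketch: resolve a service by what it is (type, instance) rather than
// where it is (host:port), so the binding can move without every client changing.
import java.util.Map;
import java.util.Optional;

public class RegistrySketch {

  record Endpoint(String host, int port) { }

  // Lookup by type and instance, Grapevine-style; all names here are invented.
  interface ServiceRegistry {
    Optional<Endpoint> lookup(String type, String instance);
  }

  public static void main(String[] args) {
    // What we usually do: the location is baked into configuration.
    Endpoint hardCoded = new Endpoint("namenode1", 50070);

    // What lookup-by-type gives you: ask for "hdfs-namenode"/"production" and let
    // the registry decide which machine currently offers it.
    ServiceRegistry registry = (type, instance) ->
        Optional.ofNullable(
            Map.of("hdfs-namenode/production", new Endpoint("nn-active.example", 8020))
               .get(type + "/" + instance));

    Endpoint resolved = registry.lookup("hdfs-namenode", "production")
                                .orElse(hardCoded);   // fall back to the baked-in address
    System.out.println("talking to " + resolved.host() + ":" + resolved.port());
  }
}
```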
Ignoring that, RPC has become one of the core ways to communicate between machines. For people saying "REST!": if the client-side implementation is making what appear to be blocking procedure calls, I'm viewing it as a descendant of RPC; same if you are using JAX-RS to implement a back end that maps down to a blocking method call. That notion of "avoid a state machine by implementing state implicitly in the stack of the caller and its PC register" is there. And as Lamport makes clear: a computer is a state machine.
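A JAX-RS resource makes the point: the annotations say REST, but the handler is a blocking method call whose state lives on the request thread's stack. This is just an illustrative class, it needs a JAX-RS runtime behind it to serve anything, and newer stacks use the jakarta.ws.rs packages rather than javax.ws.rs.

```java
// Illustrative JAX-RS resource: "REST" on the wire, a blocking method call on the server.
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/status")
public class StatusResource {

  @GET
  @Path("/{node}")
  @Produces(MediaType.TEXT_PLAIN)
  public String nodeStatus(@PathParam("node") String node) {
    // The request thread blocks right here until this returns or fails,
    // exactly the local-procedure-call model RPC was built on.
    return lookUpStatus(node);
  }

  private String lookUpStatus(String node) {
    return node + ": healthy";   // stand-in for the real back-end call
  }
}
```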
Hence: RPC is the dominant code-side metaphor for client and server applications to talk to each other, irrespective of the inter-machine protocols.
There are a couple of others that spring to mind:
message passing. Message passing comes into and falls out of fashion. What is an Enterprise Service Bus but a very large message delivery system? And if you look at Erlang-based services, messages are a core design. Then there's Akka: message-based passing within the application, which is starting to appeal to me as an architecture for complex applications (there's a rough sketch of the style after this list).
shared state. Entries in a distributed filesystem, ZooKeeper znodes, tuple-spaces and even RDBMS tables cover this. Programs talk indirectly by publishing something to a (possibly persistent) location; recipients poll or subscribe to changes.
We need to look at those more.
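As a taste of the first style, here is a deliberately plain message-passing sketch: no Akka or Erlang, just a queue and a worker thread, with all the names invented. The point is that the sender posts a message and carries on, rather than holding its own stack hostage inside someone else's call.

```java
// Plain-JDK sketch of the message-passing style: a mailbox, a worker that owns its
// state and processes one message at a time, and senders that never block on the work.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MessagePassingSketch {

  record Message(String sender, String body) { }

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Message> mailbox = new LinkedBlockingQueue<>();

    // The "actor": a thread that processes its mailbox sequentially.
    Thread worker = new Thread(() -> {
      try {
        while (true) {
          Message m = mailbox.take();             // block only on the mailbox
          if ("poison".equals(m.body())) return;  // shutdown message
          System.out.println("processing " + m.body() + " from " + m.sender());
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    worker.start();

    // Senders just enqueue and move on; no caller stack is held across the interaction.
    mailbox.put(new Message("client-1", "recount containers"));
    mailbox.put(new Message("client-2", "report status"));
    mailbox.put(new Message("main", "poison"));
    worker.join();
  }
}
```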
Meanwhile, returning to the RPC paper, another bit of the abstract merits attention:
"RPC will, we hope, remove unnecessary difficulties, leaving only the fundamental difficulties of building distributed systems: timing, independent failure of components, and the coexistence of independent execution environments."

As the authors call out, all you need to do is handle "things happening in parallel" and "things going wrong". They knew these issues existed from the outset, yet because RPC makes the fact that you are calling stuff remotely "transparent", it's easy for us developers to forget about the failure modes and the delays.
Which is the wonderful and terrible thing about RPC calls: they look just like local calls until things go wrong, and then they either fail unexpectedly, or block for an indeterminate period. Which, if you are not careful, propagates.
Case in point: HDFS-6803, "Documenting DFSClient#DFSInputStream expectations reading and preading in concurrent context". That seems such a tiny little detail, wondering whether the positioned read operation readFully(long position, byte[] buffer, int offset, int length) should be thread-safe and, if the implementation is up to it, reentrant.
To which my current view is yes, but we shouldn't make it invisible to others. Why? If we try to hide that a seek/read/seek sequence is in progress, you either have to jump through hoops caching and serving up a previous position, or make getPos() synchronized, as proposed in HDFS-6813. Which effectively means that getPos() has been elevated from a lightweight variable fetch to something that can block until an RPC call sequence in a different thread has completed, successfully or otherwise. And that's a very, very different world. People calling that getPos() operation may not expect an in-progress positioned read to surface, but nor do they expect it to take minutes, which it can now do.
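To see why that matters, here is an illustrative sketch, emphatically not the DFSInputStream code, of what happens once getPos() shares a lock with a positioned read that does remote I/O; the sleep stands in for an RPC that is slow or timing out.

```java
// Illustrative only: once getPos() is synchronized on the same lock as a positioned
// read that performs remote I/O, a "lightweight" position query stalls for as long
// as that remote call does.
public class BlockingPosSketch {

  private long pos;

  // Positioned read: holds the instance lock while it talks to a remote service.
  public synchronized void readFully(long position, byte[] buffer, int offset, int length)
      throws InterruptedException {
    long saved = pos;
    pos = position;                 // the hidden seek the text talks about
    slowRemoteRead(buffer, offset, length);
    pos = saved;                    // restore, so the seek stays "invisible"
  }

  // Was a trivial field read; now it queues up behind any in-flight readFully().
  public synchronized long getPos() {
    return pos;
  }

  private void slowRemoteRead(byte[] buffer, int offset, int length)
      throws InterruptedException {
    Thread.sleep(60_000);           // stand-in for an RPC that is timing out
  }

  public static void main(String[] args) throws Exception {
    BlockingPosSketch in = new BlockingPosSketch();
    new Thread(() -> {
      try { in.readFully(1024, new byte[16], 0, 16); } catch (InterruptedException ignored) { }
    }).start();
    Thread.sleep(100);               // let the positioned read grab the lock first
    System.out.println(in.getPos()); // blocks for about a minute, not microseconds
  }
}
```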
And where do we have such potentially-blocking RPC calls inside synchronized methods? All over the place. At least in my code. Because most of the time RPC calls work; they work so well we've gotten complacent.
(currently testing scheduled-window-reset container failure counting)
(photo: looking ahead on the Col de La Machine, Vercors Massif, one of the most exposed roads I've ever encountered)