I've just been catching up on the RapGenius/Heroku row: Heroku's "random placement" load distribution algorithm was killing performance, as requests were still being routed to blocked worker nodes -dynos.
It's an interesting problem. To me, the failure to explain this to customers is indefensible: it's not an unexpected problem, it's an intrinsic part of the architecture, so talk about it sooner rather than later.
What about the actual "random distribution of incoming requests" architecture?
- Eliminates the need to share queue statistics with the front-end nodes, so it scales better with the number of front-end and back-end nodes.
- If all requests take a similar time to execute (some normal distribution with no long tail), and incoming requests arrive in their own predictable pattern (say, a Poisson distribution), then scattered scheduling should work well. You've got independent incoming requests being scattered out to a pool of workers that will get their work done. Scattering the work randomly leads to a balanced spread of the load from each front-end node; if all front-end nodes choose a different set of random nodes at each time t, then the total work on the system gets distributed.
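A minimal sketch of that scattering (worker counts and request counts are invented, nothing here is Heroku's actual code): several front ends each pick workers independently at random, and the per-worker load stays close to the mean.

```python
import random
from collections import Counter

WORKERS = 8
FRONT_ENDS = 4
REQUESTS_PER_FRONT_END = 10_000

load = Counter()
for fe in range(FRONT_ENDS):
    rng = random.Random(fe)        # each front end scatters independently
    for _ in range(REQUESTS_PER_FRONT_END):
        load[rng.randrange(WORKERS)] += 1

mean = FRONT_ENDS * REQUESTS_PER_FRONT_END / WORKERS
for worker in range(WORKERS):
    n = load[worker]
    print(f"worker {worker}: {n} requests ({n / mean:.1%} of mean)")
```

No coordination between the front ends, yet every worker ends up within a few percent of the mean -exactly the property that makes random placement attractive.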
Where does it fail?
- If the front-end nodes aren't random enough -they keep picking the same workers and building up queues there.
- If the front-end nodes aren't independently random -they pick the same nodes as each other while leaving the rest idle. Again, queues build up.
- If failed worker nodes aren't detected.
- If the time for worker nodes to complete a request has a long tail built in to some of the requests, causing those requests to take up much more time and reducing throughput on that node.
- If overloaded worker nodes aren't detected, and queues build up worse there.
The last two failure modes -long-tail requests, with requests still hitting the overloaded worker nodes- appear to be the problem hitting RapGenius. And once those queues build up, it takes a long time for them to go away: latency sucks.
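Those two failure modes are easy to reproduce in a toy simulation (all parameters here are invented): random placement with Poisson arrivals, first with uniform service times, then with a 2% long tail of slow requests. Mean latency degrades noticeably in the long-tailed case even though the average service time is almost identical.

```python
import random

def simulate(service_time, workers=8, requests=20_000, mean_gap=0.08, seed=42):
    """Random placement with Poisson arrivals; returns mean request latency."""
    rng = random.Random(seed)
    free_at = [0.0] * workers          # when each worker's queue drains
    now = 0.0
    total_latency = 0.0
    for _ in range(requests):
        now += rng.expovariate(1.0 / mean_gap)   # next arrival
        w = rng.randrange(workers)               # random placement
        start = max(now, free_at[w])             # wait behind that worker's queue
        free_at[w] = start + service_time(rng)
        total_latency += free_at[w] - now
    return total_latency / requests

uniform = simulate(lambda r: 0.5)                # every request takes 0.5s
# 2% of requests take 5s; mean service time is still ~0.49s
long_tail = simulate(lambda r: 5.0 if r.random() < 0.02 else 0.4)
print(f"mean latency, uniform service times:   {uniform:.2f}s")
print(f"mean latency, long-tail service times: {long_tail:.2f}s")
```

The slow requests block a queue that random placement keeps feeding regardless, so everything stuck behind them pays the price.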
What could be done here?
- Worker nodes to provide queue information to the front-end nodes, which could pick two or three workers and route to the one with the shortest queue. Needs: a protocol to poll before submitting; assumes that polling 3 nodes is enough to find one that isn't overloaded.
- tuple-space style "who wants this work?" scattered requests: any of the worker nodes can accept a request based on its workload. This is a pull model rather than a push model. Needs: a T-space that can handle the rate of change.
- Something central to collect workload stats and share it with the front end nodes. Needs: something central, data collection & publishing.
- Workers to publish their queue details to something distributed (T-Space or multicast load details), front-end nodes to collect this and use it in their decision making.
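The first option -probe a couple of workers and route to the least loaded- is the classic "power of two choices" trick, and it is surprisingly effective. A sketch as a balls-into-bins experiment (the function and its parameters are mine):

```python
import random

def place(balls, bins, choices, seed=0):
    """Place each ball in the least-loaded of `choices` random bins."""
    rng = random.Random(seed)
    load = [0] * bins
    for _ in range(balls):
        picks = rng.sample(range(bins), choices)
        best = min(picks, key=lambda b: load[b])
        load[best] += 1
    return max(load)

one_choice = place(10_000, 100, choices=1)    # pure random placement
two_choices = place(10_000, 100, choices=2)   # best of two
print("max bin load, random placement:", one_choice)
print("max bin load, best of two:     ", two_choices)
```

Adding just one extra probe per request collapses the worst-case queue depth towards the mean, which is why "poll 2-3 nodes" is usually enough.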
Before everyone goes "that's obvious! Why didn't they think of it!", a cautionary tale.
Way back in 2001, working on one of the early Web Services, XMLRPC requests would come in to some Layer 7 (L7) load-balancing thing that would then direct work to the app-server back ends for processing. Those machines would talk to other services in the cluster, and return the responses.
Except one day, one of the JVMs failed to find one of the nodes it depended on; some transient DNS failure. Of course, as we all now know, Java caches DNS failures unless you tell it not to: that transient failure was uprated to a permanent failure.
As a result, incoming XMLRPC requests did end up getting serviced very quickly -they were serviced by being turned into failure responses and sent back to the caller.
Which meant that the queue for that app server was significantly shorter than for all the other nodes in the cluster.
Which meant that the L7 router directed more work its way -amplifying a failure rate from, say, 1 in 8 requests, to about 1 in 2.
The front end router didn't know that the requests were failing, all it knew was that the observed liveness of the app server "fields requests, returns responses" seemed valid. It was fulfilling its goal of "send work to the machines with the shortest queue".
It was just unfortunate that the machine with the shortest queue wasn't actually live, in the strict "performing useful work" definition of liveness.
That's why a random distribution of requests has some appeal to me: not only is it simpler, it avoids that failure-amplification which we encountered, where fast-failing requests create the illusion of a shorter queue.
As a fun exercise, consider what it would take to automate detection of the failed node. It's not enough for the L7 router to observe that HTTP 500 responses are coming back, because a series of invalid requests could trigger that. You need to submit valid requests to the application from outside, and, if any of them fail, track down which worker node is playing up.
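A sketch of that outside-in probe, with made-up worker URLs and a made-up reference endpoint -the real work is deciding what a "valid reference request" looks like for your application:

```python
import urllib.request

WORKERS = ["http://worker1:8080", "http://worker2:8080"]   # assumed hosts
REFERENCE_PATH = "/api/echo?msg=liveness-check"            # assumed endpoint

def unhealthy_workers(workers, path, timeout=5):
    """Probe each worker directly with a known-good request, bypassing
    the load balancer; return the nodes that fail to answer it."""
    bad = []
    for base in workers:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                if resp.status != 200:
                    bad.append(base)
        except OSError:   # connection refused, DNS failure, HTTP error...
            bad.append(base)
    return bad

# unhealthy_workers(WORKERS, REFERENCE_PATH) -> the misbehaving nodes
```

Probing each node directly, rather than through the balancer, is the point: it separates "this node fails valid work" from "some node somewhere fails some requests".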
That, for the curious, is precisely why Apache Axis has two features that I added based on my experiences here.
1. A health page, happyaxis.jsp. This gives the system state, returns 200 if happy, 500 if not -for the things upstream to react to- and is designed for humans to parse too.
2. The other feature, which is much less obvious, is that Axis can include the hostname in an AxisFault; with its addHostnameIfNeeded() method, a hostname may be returned to the caller. If enabled, you can push some of the fault diagnostics down to the clients: people phone you up saying "I am getting failures from 'host4'", rather than phoning you up saying "One request in eight is failing". Your remote automated liveness tests submitting reference work will now let you track down this problem without waiting for the call -or looking at the request logs to see which box has the problems. In a virtual world, rm that VM.
Anyway, the key points are these.
- Random distribution of workload is both simpler to implement and somewhat resilient to failure modes that manifest as shorter request queues.
- Adding diagnostics to error responses aids in tracking down problems.
Update: I've thought of another way to identify an overloaded worker node.
- Have each worker node declare its (moving) average service interval from accepting onto the queue to completion.
- Have the front-end servers maintain a moving average of the service interval declared by the last few worker nodes given work.
- The front end servers pick a worker node at random, as before.
- Include a query of the service interval & queue depth in the request-forwarding protocol. The front-end servers connect to a node in the back, ask it for its service interval, then decide whether or not to forward the request.
- The front-end node can then implement a local policy for job submission that compares the selected worker node's currently quoted service interval with its moving average of job submissions. If the interval is significantly higher, the front-end node may choose to probe another worker node to see if its interval is significantly lower than that of the first node probed. If so, work could be sent there -and the moving average updated.
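The steps above can be sketched as follows (the class names, the 1.5x "significantly higher" threshold, and the moving-average weight are all my own guesses, not a spec):

```python
import random

class Worker:
    """Stand-in for a worker node that quotes its current service interval."""
    def __init__(self, interval):
        self.interval = interval
    def service_interval(self):
        return self.interval

class FrontEnd:
    def __init__(self, workers, threshold=1.5, alpha=0.2):
        self.workers = workers
        self.threshold = threshold   # "significantly higher" cut-off
        self.alpha = alpha           # exponential moving-average weight
        self.avg_interval = None

    def _note(self, interval):
        """Fold a quoted interval into the moving average."""
        if self.avg_interval is None:
            self.avg_interval = interval
        else:
            self.avg_interval += self.alpha * (interval - self.avg_interval)

    def pick_worker(self, rng):
        first = rng.choice(self.workers)
        quote = first.service_interval()     # query before forwarding
        if self.avg_interval is None or quote <= self.threshold * self.avg_interval:
            self._note(quote)
            return first
        # quote is significantly above average: try one alternative node
        second = rng.choice(self.workers)
        quote2 = second.service_interval()
        chosen, best = (second, quote2) if quote2 < quote else (first, quote)
        self._note(best)
        return chosen

workers = [Worker(0.1)] * 7 + [Worker(2.0)]   # one overloaded node
fe = FrontEnd(workers)
rng = random.Random(0)
picks = [fe.pick_worker(rng) for _ in range(1000)]
slow_share = sum(w.interval == 2.0 for w in picks) / len(picks)
print(f"share of requests sent to the overloaded node: {slow_share:.1%}")
```

With pure random placement the overloaded node would see 1 in 8 requests; with the interval check it only gets work when both probes happen to land on it, so its share drops sharply while the fast nodes absorb the rest.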
[photo: a giant lizard made out of old CDs on a building in Madrid]