Reading the paper The φ Accrual Failure Detector has made me realise something that I should have recognised before: blacklisting errant nodes it too harsh -we should be assigning a score to them and then ordering them based on perceived reliability, rather than a simple reliable/unreliable flag.
In particular: the smaller the cluster, the more you have to make do with unreliable nodes. It doesn't matter if your car is unreliable, if it is all you have. You will use it, even if it means you end up trying to tape up an exhaust in a car park in Snowdonia, holding the part in place with a lead acid battery mistakenly placed on its side.
Similarly, on a 3-node cluster, if you want three region servers on different nodes, you have to accept that they all get in, even if sometimes unreliable.
This changes how you view cluster failures. We should track the total failures over time, and some weighted moving average of recent failures -the latter to give us a score of unreliability, giving us a reliability score of 1-reliability, assuming I can normalise unreliability to a floating point value in the range 0-1.
When specifically requesting nodes, we only ask for those with a recent reliability over a threshold; when we get them back we first sort for reliability and try to allocate all role instances to the most reliable nodes (sometimes YARN gives you more allocations than you asked for). We may have some allocations on nodes > the reliability threshold.
That threshold will depend on cluster size -we need to tune that based on the cluster size provided by the RM (issue: does it return current cluster size or maximum cluster size).
What to do with allocations above the threshold?
- discard them, ask for a new instance immediately: high risk of receiving the old one again
- discard them, wait, then ask for a new instance: lower risk.
- ask for a new instance before discarding the old one the soonest of (when the new allocation comes in, some time period after making the request). This probably has the lowest risk precisely because if there is capacity in the cluster we can't get that old container, we'll get a new one on an arbitrary node. If there isn't capacity, when we release the container some time period after making the request, we get it back again. That delayed release is critical to ensuring we get something back if there is no space.
If we do start off giving all nodes a reliability of under 100%, then we can even distinguish "unknown" from "known good" and "known unreliable". This gives applications a power they don't have today -a way to not trust the as-yet-unknown parts of a cluster
If using this for HDD monitoring, I'd certainly want to consider brand new disks as less than 100% reliable at first, and try to avoid storing data in >1 drive below a specific reliability threshold, though that just makes block placement even more complex
I like this design --I just the need the relevant equations