Toxic Ideas

Whenever I walk down to the Pop-up Patisserie on Stokes Croft to buy my Pain-aux-Amandes for drinking with my morning Illy Coffee, I cross a lovely little square on the King's Down, above Nine Tree Hill,


This lovely square is actually a war grave. A thirteen gun fort was built there during the English Civil War, a fort captured by the Royalists in 1643. in 1645 it fell to Cromwell's New Model Army —and everyone in the fort killed.

Looking at the genteel civility of these georgian houses built 150 years after the war, it's hard to conceive of what happened here, but it did.

That is: by the end of the Civil War, people in the country side were prepared to kill fellow citizens "without quarter". No expansionist battle for territory between cultures, language speakers, or the like. Simply, initially, differing beliefs —going from abstract details to executions in under five years.

Perhaps the ideas were so fundamentally incompatible that yes, killing the holders of the conflicting beliefs was the only possible way to have peace in the land —the civil war being thus the conflict between ideas instantiated. Or what was an initially a metaphorical battle over the notion of sovereignty at the level of Parliament and the King, created a conflict which resulted in a war where soldiers on the ground were told what they were fighting for in terms they could understand and kill for —even if bore no relation to those root cause.

Which brings me to the present day.

UKIP are not the National Socialist German Workers’ Party. Farage doesn't appear to have any leanings towards fasicm(*). He's just a racist populist who complains about being picked on when everyone denounces his poster which was "unfortunate" to come out the day an elected MP —something Farage has never achieved, despite many attempts— was killed by someone far to the right of UKIP. That alleged killer is unlikely to have been directly radicalized by the runup to the #Brexit vote —though one must suspect, he wouldn't have been voting for remain.

Because what the Leave campaigns have come down to, in the end, is anti-European racism: it is for Brexit what anti-Scots advertising was the conservative party in the 2015 UK elections. Except the Brexit message is coming at the end of months and months —years, even of press hate directed at "migrants" and "refugees", "swarms of them". That's where the hate has been nurtured: not in a few weeks in the summer of 2016, nor even in a poster which was appalling enough before what took place later that day. It's in the front pages of the Daily Mail and the Daily Express every morning. When I nip round to my local shop to buy a baguette, I have to struggle to avoid making eye contact with some half-page-height headline text above a photo, text invariably starting with "MIGRANTS", or, failing that, MUSLIMS

That's what's closest to Nazism: the relentless selling of hate against a group of people. Here though, possibly more for profit than any ideology.

Yet that hate they sell seeps in the minds of people —that's what would have radicalized the alleged killer. The Daily Mail.

And what the Vote Leave campaign has done is legitimised that hate: made it OK. And irrespective of the outcome of the referendum, it's going to remain.

If Britain votes to leave, post-whatever economic shocks hit our economy and exchange rate, while the negotiations go on with the EU about how we should still be friends, all those people worked up into a hate of the foreigner will be demanding they get what they were promised. It's going to be fundamentally unpleasant for anyone considered "not one of us". I don't know whether me and my family meet that criteria: I do know I'd be exercising my ancestral right to an Irish passport later on in June, not just for me —but for my son. Because I'm don't know if a joint UK/US citizen of Irish/Scottish/Gujarati descent would be on the UKIP approved list —or if he wants to live in a country where UKIP has influence. And while that US passport gives him one exit option —things don't look good there either.

If, as I hope happens, Britain votes to stay in the EU, all that stirred up hate will remain.

Assuming it's a close vote, the losers aren't going to say "oh, well, never mind" —there's going to be a lot of angry resentment. And the "remain" coalition have helped stoke up that anger and resentment, with their dire predictions of doom and devastation. Disappointed Leave voters and activists are going to feel their victory was stolen, and are unlikely to go away. Perhaps hopefully, Farage himself will —and then be forever remembered for "that poster"; his rivers of bloodˆ epitath.

Those in the Conservative Party may form a majority of MPs, as well as members —they aren't going to away: expect a coup attempt later on in the year, with Boris waiting to graciously accept the crown.

And the rest of the nation? Those papers are going to keep churning out their hate, now with "if only they'd listen", and "we warned you" accompanying each article of lies, exaggeration and hate. People are going to keep reading them —and keep getting angry.

Whatever happens, it's not going to be pleasant.

(*) Do note, however, that his mock attempt to resign as leader of the party and it's refusal to be accepted by the party does mimic that of Hitler's threat to resign in 1921 from the then-nascent NZ party. (The Rise and Fall of the Third Reich, Shirer, 1961, Chapter 2).

[Photo, Freemantle Square, Kingsdown, Site of Prior's Fort, 1642-1645; a small plaque in the centre of the green remembers the dead. Nine-tree hill got its name at the time, as did the nearby Cromwell Road].


No, you didn't want that stack trace

Tail of a test run.

Test failure

What am I doing today? End to end performance testing of the new S3A implementation. Or not.

2016-06-16 18:17:32,339 INFO  s3.S3aIOSuite (Logging.scala:logInfo(58)) -
 Loading configuration from ../cloud.xml
  java.lang.RuntimeException: Unable to load a Suite class that was discovered in the
   runpath: org.apache.spark.cloud.s3.S3aIOSuite
  at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:84)
  at org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:38)
  at org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:37)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

So, a test case wouldn't load. No details. Well, let's paste in the stack trace and see if the IDE will tell me where things went wrong.

catch {
  case e: Exception => {
    // TODO: Maybe include the e.getClass.getName
    // and the message for e in the message
    // cannotLoadDiscoveredSuite,
    // because Jess had the problem
    // That gradle cut off the stack trace so she
    // couldn't see the cause.
    val msg = Resources("cannotLoadDiscoveredSuite", suiteClassName)
    throw new RuntimeException(msg, e)   ***HERE ***

Ah. The inner cause has been lost, whoever wrote the test runner knew this as someone called "Jess" complained. And the reaction was not to address it, but just add a TODO todo note.

I generally automatically -1 any patch whose test cases lose exception data. This test runner isn't quite doing that, but it is cutting the message from the top of the list, relying on the test reporter to display the whole list. Which, clearly, the scalatest reporter is not doing.

Having written a distributed JUnit runner in the distant path, I understand a key issue with test runners: they're hard to test. You essentially have to bootstrap your own testing infrastructure, then generate the failure conditions you see from tests. Which does mean: you get to stare at the stack traces. I think what we are seeing here is evidence that whoever is developing scalatest doesn't use maven.

This also shows what's wrong with TODO notes: they're technical debt you forget about.

At least if you keep adding more and more things to your issue tracker, you've got a nice shared graph of filed vs completed. Whose project tracks the number of TODO line items in a source tree?

(Update: found the problem. The test in question had a base class with an after{} cleanup clause, as did a mixed in trait. Whatever error message scalatest raises at that point, well, it doesn't make it through the testrunner unscathed)


Fear of Dependencies

There are some things to be scared of; some things to view as a challenge and embrace anyway.

Peter hasn't done this climb before. He thinks it will be fun. Peter is wrong

Here, Hardknott Pass falls into the challenge category —at least in summertime. You know you'll get up, the only question is "cycling" or "walking".

Hardknott in Winter is a different game, its a "should I be trying to get up here at all" kind of issue. Where, for reference, the answer is usually: no. Find another way around.

Upgrading dependencies to Hadoop jitters between the two, depending on what upgrade is being proposed.

And, as the nominal assignee of HADOOP-9991, "upgrade dependencies", I get to see this.

We regularly get people submitting one line patches "upgrade your dependency so you can work with my project' —and they are such tiny diffs people think "what a simple patch, it's easy to apply"

The problem is they are one line patches that can lead to the HBase, Hive or Spark people cornering you and saying things like "why do you make my life so hard?"

Before making a leap to Java 9, we're trapped whatever we do. Upgrade: things downstream break. Don' t upgrade, things downstream break when they update something else, or pull in a dependency which has itself updated.

While Hadoop has been fairly good at keeping its own services stable, where it causes problems is in applications that pull in the Hadoop classpath for their own purposes: HBase, Hive, Accumulo, Spark, Flink, ...

Here's my personal view on the risk factor of various updates.

Critical :

We know things will be trouble —and upgrades are full cross-project epics

  • protobuf., This will probably never be updated during the lifespan of Hadoop 2, given how google broke its ability to link to previously generated code.
  • Guava. Google cut things. Hadoop ships with Guava 11 but has moved off all deleted classes so runs happily against Guava 16+. I think it should be time just to move up, on the basis of Java 8 compatibility alone.
  • Jackson. The last time we updated, everything worked in Hadoop, but broke HBase. This makes everyone very said
  • In Hive and Spark: Kryo. Hadoop core avoids that problem; I did suggest adding it purely for the pain it would cause the Hive team (HADOOP-12281) —they knew it wasn't serious but as you can see, others got a bit worried. I suspect it was experience with my other POM patches that made them worry.
I think a Jackson update is probably due, but will need conversations with the core downstream projects. And perhaps bump up Guava, given how old it is.

High Risk

Failures are traumatic enough we're just scared of upgrading unless there's a good reason.
  • jetty/servlets. Jetty has been painful (threads in the Datanodes to peform liveness monitoring of Jetty is an example of workarounds), but it was a known and managed problem). Plan is to move off jetty entirely and -> jersey + grizzly.
  • Servlet API.
  • jersey. HADOOP-9613 shows how hard that's been
  • Tomcat. Part of the big webapp set
  • Netty —again, a long standing sore point (HADOOP-12928, HADOOP-12927)
  • httpclient. There's a plan to move off Httpclient completely, stalled on hadoop-openstack. I'd estimate 2-3 days there, more testing than anything else. Removing a dependency entirely frees downstream projects from having to worry about the version Hadoop comes with.
  • Anything which has JNI bindings. Examples: leveldb, the codecs
  • Java. Areas of trauma: Kerberos, java.net, SASL,

With the move of trunk to Java 8, those servlet/webapp versions all need to be rolled.

Medium Risk

These are things where we have to be very cautious about upgrading, either because of a history of brittleness, or because failures would be traumatic
  • Jets3t. Every upgrade of jets3t moved the bugs. It's effectively frozen as "trouble, but a stable trouble", with S3a being the future
  • Curator 2.x ( see HADOOP-11612 ; HADOOP-11102) I had to do a test rebuild of curator 2.7 with guava downgraded to Hadoop's version to be confident that there were no codepaths that would fail. That doesn't mean I'm excited by Curator 3, as it's an unknown.
  • Maven itself
  • Zookeeper -for its use of guava.
Here I'm for leaving Jets3t alone; and, once that Guava is updated, curator and ZK should be aligned.

Low risk:

Generally happy to upgrade these as later versions come out.

I don't know which category the AWS SDK and azure SDKs fall into. Their jackson SDK dependency flags them as a transitive troublespot.

Life would be much easier if (a) the guava team stopped taking things away and (b) either jackson stopped breaking things or someone else produced a good JSON library. I don't know of any -I have encountered worse.

2016-05-31 Update: ZK doesn't use Guava. That's curator I'm thinking of.  Correction by Chris Naroth.


Distributed Testing: making use of the metrics

3Dom, St Werburghs


In this article I introduce the concept of Metrics-first Testing, and show how instrumenting the internals of classes, enabling them to be published as metrics, enables better testing of distributed systems, while also offering potential to provide more information in production.

Exporting instrumented classes in the form of remotely accessible metrics permits test runners to query the state of the System Under Test, both to make assertions about its state, and to collect histories and snapshots of its state for post-run diagnostics.

This same observable state may be useful in production —though there is currently no evidence to support this hypothesis.

There are a number of issues with the concept. A key one is if these metrics do provde useful in production, then they become part of the public API of the system, and must be supported across future versions.

Introduction: Metrics-first Testing

I've been doing more scalatest work, as part of SPARK-7889, SPARK-1537, SPARK-7481. Alongside that, in SLIDER-82, anti-affine work placement across a YARN cluster, And, most recently, wrapping up S3a performance and robustness for Hadoop 2.8, HADOOP-11694, where the cost of an HTTP reconnect appears on a par with reading 800KB of data, meaning: you are better off reading ahead than breaking a connection on any forward seek under ~900KB. (that's transatlantic to an 80MB FTTC connection; setup time is fixed, TCP slow start also means that the longer the connection is held, the better the bandwidth gets)

On these projects, I've been exploring the notion of metrics-first testing. That is: your code uses metric counters as a way of exposing the observable state of the core classes, and then tests can query those metrics, either at the API level or via web views.

Here's a test for HADOOP-13047,: S3a Forward seek in stream length to be configurable

  public void testReadAheadDefault() throws Throwable {
    describe("Verify that a series of forward skips within the readahead" +
        " range do not close and reopen the stream");
    executeSeekReadSequence(32768, 65536);
    assertEquals("open operations in " + streamStatistics,
        1, streamStatistics.openOperations);

Here's the output

testReadAheadDefault: Verify that a series of forward skips within the readahead
  range do not close and reopen the stream

2016-04-26 11:54:25,549 INFO  Reading 623 blocks, readahead = 65536
2016-04-26 11:54:29,968 INFO  Duration of Time to execute 623 seeks of distance 32768
 with readahead = 65536: 4,418,524,000 nS
2016-04-26 11:54:29,968 INFO  Time per IOP: 7,092,333 nS
2016-04-26 11:54:29,969 INFO  Effective bandwidth 0.000141 MB/S
2016-04-26 11:54:29,970 INFO  StreamStatistics{OpenOperations=1, CloseOperations=0,
  Closed=0, Aborted=0, SeekOperations=622, ReadExceptions=0, ForwardSeekOperations=622,
  BackwardSeekOperations=0, BytesSkippedOnSeek=20381074, BytesRead=20381697,
  BytesRead excluding skipped=623, ReadOperations=0, ReadsIncomplete=0}

I'm collecting internal metrics of a stream, and using that to make assertions about the correctness of the code. Here, that if I set the readahead range to 64K, then a series of seek and read operations stream through the file, rather than break and reconnect the HTTPS link.

This matters a lot, as shown by one of the other tests, which times an open() call as well as that to actually read the data

testTimeToOpenAndReadWholeFileByByte: Open the test file
  s3a://landsat-pds/scene_list.gz and read it byte by byte

2016-04-26 11:54:47,518 Duration of Open stream: 181,732,000 nS
2016-04-26 11:54:51,688 Duration of Time to read 20430493 bytes: 4,169,079,000 nS
2016-04-26 11:54:51,688 Bandwidth = 4.900481  MB/S
2016-04-26 11:54:51,688 An open() call has the equivalent duration of
  reading 890,843 bytes

Now here's a Spark test using the same source file and s3a connector

ctest("CSVgz", "Read compressed CSV", "") {
    val source = sceneList
    sc = new SparkContext("local", "test", newSparkConf(source))
    val sceneInfo = getFS(source).getFileStatus(source)
    logInfo(s"Compressed size = ${sceneInfo.getLen}")
    val input = sc.textFile(source.toString)
    val (count, started, time) = duration2 {
    logInfo(s" size of $source = $count rows read in $time nS")
    assert(ExpectedSceneListLines <= count)
    logInfo(s"Filesystem statistics ${getFS(source)}")

Which produces, along with the noise of a local spark run, some details on what the FS got up to
2016-04-26 12:08:25,901  executor.Executor Running task 0.0 in stage 0.0 (TID 0)
2016-04-26 12:08:25,924  rdd.HadoopRDD Input split: s3a://landsat-pds/scene_list.gz:0+20430493
2016-04-26 12:08:26,107  compress.CodecPool - Got brand-new decompressor [.gz]
2016-04-26 12:08:32,304  executor.Executor Finished task 0.0 in stage 0.0 (TID 0). 
  2643 bytes result sent to driver
2016-04-26 12:08:32,311  scheduler.TaskSetManager Finished task 0.0 in stage 0.0 (TID 0)
  in 6434 ms on localhost (1/1)
2016-04-26 12:08:32,312  scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks
  have all completed, from pool 
2016-04-26 12:08:32,315  scheduler.DAGScheduler ResultStage 0 finished in 6.447 s
2016-04-26 12:08:32,319  scheduler.DAGScheduler Job 0 finished took 6.560166 s
2016-04-26 12:08:32,320  s3.S3aIOSuite  size of s3a://landsat-pds/scene_list.gz = 464105
  rows read in 6779125000 nS

2016-04-26 12:08:32,324 s3.S3aIOSuite Filesystem statistics
  partSize=104857600, enableMultiObjectsDelete=true,
  statistics {
    20430493 bytes read,
     0 bytes written,
     3 read ops,
     0 large read ops,
     0 write ops},
     metrics {{Context=S3AFileSystem}

What's going on here?

I've instrumented S3AInputStream, instrumentation which is then returned to its S3AFileSystem instance.
This instrumentation can not only be logged, it can be used in assertions.

And, as the FS statistics are actually Metrics2 data, they can be collected from running applications.

By making the observable state of object instances real metric values, I can extend their observability from unit tests to system tests —all the way to live clusters.

  1. This makes assertions on the state of remote services a simple matter of GET /service/metrics/$metric + parsing.
  2. It ensures that the internal state of the system is visible for diagnostics of both test failures and production system problems. Here: how is the file being accessed? Is the spark code seeking too much —especially backwards? Were there any transient IO problems which were recovered from?
    These are things which the ops team may be grateful for in the future, as now there's more information about what is going on.
  3. It encourages developers such as myself to write those metrics early, at the unit test time, because we can get immediate tangible benefit from their presence. We don't need to wait until there's some production-side crisis and then rush to hack in some more logging. Classes are instrumented from the outset. Indeed, in SPARK-11373 I'm actually implementing the metrics publishing in the Spark History server —something the SPARK-7889 code is ready for.
Metrics-first testing, then, is instrumenting the code and publishing it for assertions in unit tests, and for downstream test suites.

I'm just starting to experiment with this metrics-first testing.

I have ambitions to make metric capture and monitoring a more integral part of test runs. In particular, I want test runners to capture those metrics. That's either by setting up the services to feed the metrics to the test runner itself, capturing the metrics directly by polling servlet interfaces, or capturing them indirectly via the cluster management tools.

Initially that'll just be a series of snapshots over time, but really, we could go beyond that and include in test reports the actual view of the metrics: what happened to various values over time? when when Yarn timeline server says its average CPU was at 85%, what was the spark history server saying its cache eviction rate was?

Similarly, those s3a counters are just initially for microbenchmarks under hadoop-tools/hadoop-aws, but they could be extended up the stack, through Hive and spark queries, to full applications. It'll be noisy, but hey, we've got tooling to deal with lots of machine parseable noise, as I call it: Apache Zeppelin.

What are the flaws in this idea?


Relevance of metrics beyond tests.

There's the usual issue: the metrics we developers put in aren't what the operations team need. That's inevitable, but at least we are adding lots of metrics into the internal state of the system, and once you start instrumenting your code, you are more motivated to continue to add the others.


Representing Boolean values

I want to publish a boolean metric: has the slider App Master had a node map update event from the YARN RM? That's a bool, not the usual long value metrics tools like. The fix there is obvious for anyone who has programmed in C:
public class BoolMetric extends AtomicBoolean implements Metric, Gauge<integer> {

  public Integer getValue() {
    return get() ? 1 : 0;
It's not much use as a metric, except in that case that you are trying to look at system state and see what's wrong. It actually turns out that you don't get an initial map —something which GETs off the Coda Hale JSON metric servlet did pick up in a minicluster test. It's already paid for itself. I'm happy. It's just it shows the mismatch between what is needed to monitor a running app, things you can have triggers and graphs of, and simple bool state view.


Representing Time

I want to track when an update happened, especially relative to other events across the system. I don't see (in the Coda Hale metrics) any explicit notion of time other than histograms of performance. I want to publish a wall time, somehow. Which leaves me with two options. (a) A counter listing the time in milliseconds *when* something happened. (b) A counter listing the time in milliseconds *since* something happened. From a monitoring perspective, (b) is better: you could set an alarm if the counter value went over an hour.

From a developer perspective, absolute values are easier to test with. They also support the value "never" better, with something "-1" being a good one here. I don't know what value of "never" would be good in a time-since-event value which couldn't be misinterpreted by monitoring tools. A value of -1 could be construed as good, though if it had been in that state for long enough, it becomes bad. Similarly, starting off with LONG_MAX as the value would set alarms off immediately. Oh, and either way, the time isn't displayed as a human readable string. In this case I'm using absolute times.

I'm thinking of writing a timestamp class that publishes an absolute time on one path, and a relative time on an adjacent path. Something for everyone


 The performance of AtomicLongs

Java volatile variables are slightly more expensive than C++ ones, as they act as barrier operations rather than just telling the compiler never to cache them. But they are still simple types.

In contrast, Atomic* are big bits of Java code, with lots of contention if many threads try to update some metric. This is why Coda Hale use a an AtomicAccumulator class, one that eventually surfaces in Java 8..

But while having reduced contention, that's still a piece of java code trying to acquire and release locks.

It would only take a small change in the JRE for volatile, or perhaps some variant, atomic to implement atomic ++ and += calls at the machine code level, so the cost of incrementing a volatile would be almost the same as setting it.

We have to assume that Sun didn't do that in 1995-6 as they were targeting 8 bit machines, where even incrementing a 16 bit short value was not something all CPUs could guarantee to do atomically.

Nowadays, even watches come with 32 bit CPUs; phones are 64 bit. It's time for Oracle to look ahead and conclude that it's time for even 64 bit volatile addition to made atomic.

For now, I'm making some of the counters which I know are only being updated within thread-safe code (or code that says "should only be used in one thread") volatile; querying them won't hold up the system.


 Metrics are part of your public API

This is the troublesome one: If you start exporting information which your ops team depends on, then you can't remove it. (Wittenauer, a reviewer of a draft of this article, made that point quite clearly). And of course, you can't really tell which metrics end up being popular. Not unless you add metrics for that, and, well, you are down a slippery slope of meta-metrics at that point.

The real issue here becomes not exposing more information about the System Under Test, but exposing internal state which may change radically across versions.

What I'm initially thinking of doing here is having a switch to enable/disable registration of some of the more deeply internal state variables. The internal state of the components are not automatically visible in production, but can be turned on with a switch. That should at least make clear that some state is private.

However, it may turn out that the metrics end up being invaluable during troubleshooting; something you may not discover until you take them away.

Keeping an eye on troubleshooting runbooks and being involved in support calls will keep you honest there.


Pressure to align your counters into a bigger piece of work

For the S3a code, this surfaces in HDFS-10175; a proposal to make more of those FS level stats visible, so that at the end of an SQL query run, you can get aggregate stats on what all filesystems have been up to. I do think this is admirable, and with the costs of an S3 HTTP reconnect being 0.1s, it's good to know how many there are.

At the same time, these overreaching goals shouldn't be an excuse to hold up the low level counters and optimisations which can be done at a micro level —what they do say is "don't make this per-class stuff public" until we can do it consistently. The challenge then becomes technical: how to collect metrics which would segue into the bigger piece of work, are useful on their own, and which don't create a long term commitment of API maintenance.



As described by Jakob Homan: " Large distributed systems can overwhelm metrics aggregators.  For instance, Samza jobs generated so many metrics LI's internal system blanched and we had to add a feature to blacklist whole metric classes "

These low-level metrics may be utterly irrelevant to most processes, yet, if published and recorded, will add extra load to the monitoring infrastructure.

Again, this argues for making the low-level metrics off by default, unless explicitly enabled by a debugging switch.

In fact, it almost argues for having some metric enabling profile similar to log4J settings, where you could turn on, say, the S3a metrics at DEBUG level for a run, leaving it off elsewhere. That could be something to investigate further.

Perhaps I could start by actually using the log level of the classes as the cue to determine which metrics to register:
if (LOG.isDebugEnabled) {

Related work

I've been trying to find out who else has done this, and what worked/didn't work, but there doesn't seem too much in published work. There's a lot of coverage of performance testing —but this isn't that. This about a philosophy of instrumenting code for unit and system tests, using metrics as that instrumentation —and in doing so not only enabling better assertions to be made about the state of the System Under Test, but hopefully providing more information for production monitoring and diagnostics.


In Distributed Testing, knowing more about state of the System Under Test aids both assertions and diagnostics. By instrumenting the code better, or simply making the existing state accessible as metrics, it becomes possible to query that state during test runs. This same instrumentation may then be useful in the System In Production —though that it is something which I currently lack data about.


It's not often (ever?) that I get people to review blog posts before I publish them: this one I did as it's introducing concepts in system testing which impacts everything from code to production system monitoring. Thank you to the reviewers: Jakob Homan, Chris Douglas, Jay Kreps, Allen Wittenauer, Martin Kleppman.

I don't think I've addressed all their feedback, especially Chris's (security, scope,+ others), and Jay went into detail on how structured logging would be superior —something I'll let him expound on in the Confluent blog.

Interestingly, I am exposing the s3a metrics as log data, —it lets me keep those metrics internal, and lets me see their values in Spark tests without changing that code.

AW pointed out that I was clearly pretty naive in terms of what modern monitoring tools could do, and should do more research there: On first blush, this really feels naïve as to the state of the art of monitoring tools, especially in the commercial space where a lot of machine learning is starting to take shape (e.g., Netuitive, Circonus, probably Rocana, etc, etc). Clearly I have to do this...

(Artwork: 3Dom in St Werburgh's)


Testing against S3 and object stores —keep your keys out of SCM

Nobody wants to commit their AWS credentials to an SCM repository, not when there are robots looking for your keys so they can spin up bitcoin mining VMs.
Sunday Morning St Werburghs

That's why git-secrets is something you should preinstall on all repos where you go near repositories. Unfortunately, it's pattern matching picks up a lot of invalid patterns in the Hadoop codebase, and a few in spark. In the absence of a decent pattern to only scan text files, after installing I edit out the regexps from .git/config and rely on it scanning purely for the strings in ~/.aws/credentials.

That keeps the keys out of SCM, if everyone is set up that way.

Which leaves the next problem: if you save your hadoop configuration files to SCM, how do you get AWS keys into system configurations?

The answer: XInclude

This is one of my configuration files, specifically hadoop-tools/hadoop-aws/src/test/resources/auth-keys.xml

This is a special file, tagged as .gitignore to keep it out of the repos. Yet still the keys are in that source tree, still at risk of sneaking out.

To deal with this, stick an absolute XInclude reference into the file, pointing to the configuration file where the keys really live.

  <include xmlns="http://www.w3.org/2001/XInclude"
    href="file:///home/stevel/.aws/hadoop-keys.xml" />

This tells the Hadoop config loader to grab the keys from a file in ~/.aws; one which lives carefully out of the SCM-managed space.

Provided the contents of that directory are kept private, my keys will not get checked in.

They can, however leak, in various ways, including
  • In the /config URL of a service
  • Code which accidentally logs it.
  • If it gets saved to files/datasets used in bug reports.
  • Malicious code running in your system which grabs the keys and exports/logs them. This is why no OSS jenkins servers are set up with the keys needed to test against object store.
Clearly, more should be done to lock it down.

There are some patches under the S3A phase 2 JIRA to support credentials better, [HADOOP-12723, HADOOP-12537]. This is somewhere where anyone who can test these patches is invaluable. I'm currently doing the review-then-commit in S3, but I'm only doing it at weekends, when I have some spare time, and even then, as a full test run takes 2+ hours, not reviewing very much.

Anyone can review Hadoop patches, confirm whether they worked or not, show how they didn't. It's the way to verify that forthcoming code from other people works for you, and of contributing time and effort into the community. Other people have done the coding —time to help with the testing.

Especially now I've just documented how to keep the keys safe when you set up S3 for a test run. Once the auth-keys.xml file is in the src/test/resources directory, Maven tests s3, depending on the specific properties, s3a, s3n and s3a are all tested.

Update: 2016-06-20: I've discovered that if you use the id:secret in a URL for an S3x filesystem, e.g s3a://AWS01:sec/ret@mybucket then the secrets get logged everywhere, because the Hadoop code assumes there aren't secrets in the Path or Filesystem URIs. We've cranked back a bit on the leakage (by stripping it from the FS URIs), but it still gets out everywhere. Fix: don't do that. Hadoop 2.8 will explicitly tell you off whenever you do so.

[photo: Mina Road Park, St Werburghs]


Voxxdev Bristol 2016: theory meets practitioners

I've been hanging out with other bits of the Bristol tech/dev world recently, rather than the usual big data suspects.

Two weeks ago, I actually infiltrated the local Test meetup, which was entertaining, not just for the Bath Gem Ale which JustEat has on tap in their meeting area, but because I sit there, beer in hand, while speakers covered Applied Exploratory Testing, what its like turning up as a QA team to a new company and the patterns-of-untestability that you encounter (and how to start to get things under control).

This week, I took a break from worrying about the semantics of flush() and its impact of durability of timeline 1.5 event histories. (i.e. why my incomplete apps aren't showing up in an ATS-backed  Spark History server if file:// is the intermediate FS of the test run). Wandered down to the Watershed Cinema with Tom White and a light hangover related to Tom White's overnight stay including an evening visit the Bravas Tapas Bar —and into a one-day dev conference, Voxxdev Bristol 2016.

It was a good day. Oracle have been putting a lot of effort into the conf as a way raising visibility of what's going on in tech in the area to make more people aware the west of england a more interesting place to be than London, and with other companies and one of the local universities had put together a day-long conference.

I was one of the speakers; I'd put in my Household Infosec talk, but the organisers wanted something more code-related, and opted for Hadoop and Kerberos, the Madness Beyond The Gate. I don't think that was the right talk for the audience. It's really for people writing code to run inside a Hadoop cluster, and to explain to the QE and support people that the reason they suffer so much is those developers aren't testing on secure clusters. (that's the version I gave at Cloudera Palo Alto last month). When you are part way into a talk and you realise that you can't assume the audience knows how HDFS works then you shouldn't really be talking to them about retrieving block-tokens from the NN from a YARN app handed off a delegation token in the launch context by a client authed against the KDC. Normally I like a large fraction of the audience to come out of a talk feeling they'd benefited; this time I'm not sure.

I felt a bit let down the oracle big data talk, though impressed that people are still writing Swing apps. I was even more disappointed by the IoT talk, where he not only accused Hadoop of being insecure (speaker missed my talk, then), most of his slides seemed lifted from elsewhere: one a cisco IoT arch, one dell hadoop cluster design. Julio pointed out later that the brontobyte slide was one HP Labs have been using. Tip: if you use other's slides, either credit them or make sure the authors aren't attendees.

I really liked some of the other talks. There was a great keynote by a former colleague, Dave Cliff, now at Bristol Uni, talking about what they are up to. This is the lecture series on their cloud computing course. 

Cloud Computing Curriculum

That's a big change given that in 2010, all they had was a talk Julio and I gave in the HPC course

I might volunteer to give one of the new course's talks in exchange for being able to sit in on the other lectures (and exemption from exams, tutorials and homework, obviously)

My favourite talk turned out to be Out of the Fire Swamp, by Adrian Colyer.

Adrian writes "The morning paper" blog, which is something I'm aware of, in awe of and in fear of. Why fear? There's too many papers to read; I can't get through 1/day and to track the blog would only make it clear how behind I was. I already have a directory tree full of papers I think I need to understand. Of course, if you do read a related paper/day, it probably gets easier, except I'm still trying to complete [Ulrich99], as my and relate it to modern problems.

Adrian introduced the audience to data serialization, causality and happens-before, then into linearalizability [HW90]

Read Committed

This was a really good talk.

Expectations and Reality

All the code we write is full of assumptions. We assume that n + 1 > n, though we know in our head that if n = 2^31 and its stored in a signed int32, that doesn't hold. (more formally, in two's complement binary arithmetic, for all registers of width w, n+1 > n only holds for all n where n < 2^(w-1).

Sometimes even that fundational n+1 > n assumption catches us out. We assume that two assignments in source code happen in order, though in fact not only does the JVM reserve the right to re-order things, even, in the past, wrongly —and anyway, the CPU can reorder stuff as well.

What people aren't aware of in modern datacentre-scale computing is what assumptions the systems underneath have made in order to give their performance/consistency/availability/liveness/persistence features or whatever it is they offer. To put differently: we think we know what the systems we depend on do, but every so often we find out our assumptions were utterly wrong. What Adrian covered is the foundational CS-complete assumptions, that you had really be asking hard questions about when you review technologies.

He also closed with a section on future system trends, especially storage, with things like  non-volatile DIMMS (some capacitor + SSD to do a snapshot backup on power loss), faster SSD and those persistent technologies with performance between DRAM and SSD: looking at a future where tomorrow's options about durability vs. retrieval time are going to be (at least from some price points), significantly different from today's. Which means that we'd better stop hard coding those assumptions into our systems.

Overall, a nice event, especially given it in the second closest place to my house where you could hold a conference (Bristol University would have been 7 minutes rather than 15 minutes walk). I had the pleasure of meeting some new people in the area working on stuff, including the illustrious James Strachan who'd come up from Somerset for the day.

I liked the blend of some CS "Lamport Layer" work with practical stuff; gives you both education and things you can make immediate use of. I know Martin Kleppman has been going round evangelising classic distributed computing problems to a broader audience, and Berlin Buzzwords 2016 has a talk Towards consensus on distributed consensus  it's clearly something the conferences need.

If there was a fault, as well as some of the talks not being ideal for the audience (mine), I'd say it's got the usual lack of diversity of a tech conference. You could say "well. that's the industry", but it doesn't have to be, and even if that is the case it doesn't have to be so unbalanced in the speakers. In the BBuzz even not only are there two women keynoting, we submission reviewers were renewing the talks anonymously: we didn't know who was submitting, instead going on the text alone.

For the next Bristol conference, I'd advocate going closer in Uni/industry collaboration by offering some of the students tickets. Maybe even some of the maths & physics students rather than just CS. I also think maybe there should be a full strand dedicated to CS theory. Seriously: we can do causality, formality, set theory & relational algebra, paxos+ZK+REEF, reliability theory, graphs etc. Things that apply to the systems we use, but stuff people kind of ignore. I like this idea ... I can even think of a couple of speakers. Me? I'd sit in the audience trying to keep up.

[HW90] Herlihy and Wing, Linearizability: A Correctness Condition for Concurrent Objects, 1990.

[Ulrich99] Ulrich, A.W., Zimmerer, P. and Chrobok-Diening, Test architectures for testing distributed system, 1999. Proceedings of the 12th International Software Quality Week.


trouble ahead

It's now over a month since I outlined my 2016 plans: get fitter, do more on testability.

What's the progress?

I've got a place on the 2016 Fred Whitton Challenge, which is widely regarded as the hardest one-day "fun" ride in the UK.

Having done it in 2014, I am minded to concur

That time 4 people got helicoptered off; I have no intention of qualifying for the same transport this year.

Instead: I now have to get fit enough to do 110 miles up 30% gradients, when I'm currently fit enough to do 25-30 miles.

In 10 weeks.