Developer INFOSEC in a Post-Sony era

There's going to be a dev conference in Bristol in February, Voxxed Days Bristol, where I'll be presenting my Hadoop and Kerberos talk, maybe with a demo of the Slider kdiag command, which, after a couple of iterations, I intend to move into Hadoop core as part of HADOOP-12649: Improve UGI diagnostics and failure handling. Any version which gets into Hadoop core will be able to use some extra diagnostics I hope to add to UGI itself, such as the ability to query the renew time of a ticket, get the address of the KDC and probe for it, maybe more. Because Kerberos doesn't like me this week, at least not with ZooKeeper.
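Why a kdiag-style tool helps: half the battle is simply confirming that the client can reach a KDC at all. A minimal sketch of such a probe, with hypothetical class and host names (this is not the Slider or Hadoop code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** A minimal kdiag-style probe: can we open a TCP connection to the
 *  KDC at all? Class and method names are mine, for illustration. */
public class KdcProbe {

  /** Kerberos KDCs conventionally listen on port 88. */
  public static final int KDC_PORT = 88;

  public static boolean kdcReachable(String host, int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      return true;
    } catch (IOException e) {
      // connection refused, timeout, unknown host: all mean "not reachable"
      return false;
    }
  }

  public static void main(String[] args) {
    String host = args.length > 0 ? args[0] : "kdc.example.org";
    System.out.println("KDC reachable: " + kdcReachable(host, KDC_PORT, 2000));
  }
}
```

A real diagnostic would go further (check the realm config, ticket renew times, UDP as well as TCP), but even this answers the most common "is it the network?" question.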

It should be a fun day and local developers should think about turning up. I was at the last little developer bash in November, where there was a great mix of talks ranging from the history of the Transputer to the challenge of implementing a mobile-app only bank.


What apparently hasn't made the cut is my other proposal, Household INFOSEC in a Post-Sony era. Which is a shame, as I have the outline of a talk there which would be seminal. This wouldn't be a talk of platitudes like "keep Flash up to date"; it'd be a look at my rollout of a two-tier network, with the IoT zone hooked directly up to the internet by the router provided by the ISP, and a DD-WRT router offering a separate trusted-machine subnet and a high-entropy-passworded, no-SSID-published wifi network restricted to devices I can control.

What I was really going to talk about, however, was how we are leaking personal information from all the devices we use, the apps we host in them, the cars we drive, the cameras we use (guess whose camera app considers camera GPS and timestamp information non-personally identifiable and hence uploadable? That's right: Sony). We leak that information, it gets stored elsewhere, and we now depend on the INFOSEC capabilities of those entities to keep that data private. And that's hard to pull off —even with Kerberos everywhere.

One aspect of my homework here is working out what data I have on my computers which I consider sensitive.

There's photos, which are more irreplaceable than anything else: my policy there is Disaster Recovery: using Google Photos as the off-site backup, and a local NAS server in the trusted subnet for on-site resilience to HDD failures. That server is powered off except for the weekly backups, reducing its transitive vulnerability to any ransomware that gets into the trust zone.

There's passwords to web sites, especially those with purchasing rights (amazon, etc), and financial institutions (paypal, banks) —in the UK my bank uses its chipped debit cards as the physical credential for login and cash transfer, so it's fairly secure. The others: less so, especially if my browser has been intentionally/unintentionally saving form information with things like CVV numbers.

And what else matters? I've come to the conclusion it is the credentials needed to gain write access to the ASF source code repositories. Not just Hadoop —it's the Ant source, it's Slider, it's anything else where I am creating code and any of the following criteria are met:
  1. The build is executed by a developer or a CI tool on a machine holding information (including their own credentials) to which someone malicious wants access.
  2. The build is executed by a developer or a CI tool on a machine running within a network to which someone malicious wants access.
  3. Generated/published artifacts are executed during a build process on a machine/network to which someone malicious wants access.
  4. The production code is executed somewhere where there is data which someone malicious wants to get at or destroy, or where adverse behaviour of the system is advantageous to someone malicious.
You can point to the Hadoop stack and say: it's going that way —but so is the rest of the OSS codebase. The LAMP stack, tomcat web servers, Xerces XML parsers, OpenOffice, linux device drivers, clipboard history savers like glipper, Python statistics libraries... etc. We live in a world where open source is everywhere, from the datacentre to the television. If anyone malicious has the opportunity to deliberately insert vulnerabilities into that code —then they get to spread them across the planet. That source code, then, is a juicy target both for anyone looking for 0-day exploits and for anyone wanting to insert new ones.

We've seen attacks on Kernel.org, and on the ASF. With the dominance of git as the SCM tool, and its use of SHA-1 checksums, the value of breaking into the servers is diminishing —what you need to do is get the malicious code checked in, that is: committed using the credentials of an authorised developer.

That'll be us then.

More succinctly: if the Internet becomes the location of an arms race, we're now targets en route to strategic goals by entities and organisations against whom we don't stand a chance.

How do you defend against nation states happy to give away USB-chargeable bicycle lights at an OSS conference? Nation states who have the ability to break through your tier-3 ISP's firewall and then the second-level DD-WRT router that you never locked down properly and haven't updated for three weeks. We don't stand a chance, not really.

No doubt I'll come over as excessively paranoid, but it's not as if I view my personal systems as a direct target. It's just that the source code repos to which I do have access are potentially of interest. And with other people in the Hadoop space building those same projects, something injected into the build using my credentials then has transitive access to everyone else who checks out and builds the same codebase. That's what worries me.

WTF do we do?

Short-term, I'm switching my release process to a VM that's only ever used for that, so at least the artifacts I generate aren't indirectly contaminated by malware; I also need to automate a final SHA1 audit of every dependent artifact.
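The hashing half of that audit needs nothing beyond the JDK; a rough sketch, with made-up class and method names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of a release-time audit: hash an artifact and compare it
 *  against a previously recorded checksum. Names are illustrative. */
public class ShaAudit {

  /** Hex-encoded SHA-1 of a byte array. */
  public static String sha1Of(byte[] data) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-1");
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is mandatory in every JRE, so this "can't happen"
      throw new IllegalStateException(e);
    }
  }

  /** True iff the file's SHA-1 matches the recorded value. */
  public static boolean verify(Path artifact, String expectedSha1)
      throws IOException {
    return sha1Of(Files.readAllBytes(artifact)).equals(expectedSha1);
  }
}
```

Loop that over every JAR in the local repository against a signed list of known checksums and you have the audit.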

Medium term: I need to come up with a plan for locking down my git ssh/svn credentials and passwords so they aren't trivially accessible to anything malicious running on any laptop of mine. I know github is moving to 2FA and U2F auth, but that's for web and API auth: not git repo access. What the Linux Kernel team have is a much better story: 2FA for granting 24h of write access from a specific IP address.

Long term: I have no idea whatsoever.

[photo: two Knog lights you charge up via USB ports. We should all know to be cautious about plugging in untrusted USB sticks —but who would turn down a freebie bike light given away at an OSS developer conference?]


Remembering the glaciers: Greenland

In July 2012, while wandering around on a flight to the US, I looked out the window and got to see what Greenland looks like from above.

When this ice melts, it raises the ocean.


And the mountains will be bare rock, as they have not been since before the ice ages began.


There's something deeply tragic about a planeload of people, sitting in their seats, windows shuttered, watching the videos streamed to keep the populace happy —while if they chose to look up they could see the great icefields melting. Arguably, that's a description for society as a whole —and all of us in it.


Remembering the glaciers: Fox Glacier, New Zealand

The southern hemisphere has its own ice. I can't find any of my 1994 Peru/Chile expedition photos, so nothing of the Southern Patagonian Ice Cap, the Torres del Paine or snow-capped volcanoes.

Instead, here's one from New Zealand, the Fox Glacier.


This was on South Island, somewhere down the west coast, December 1998.

What's impressive here is how it comes down almost to sea level: we were staying nearby, and this was a short, mostly flat walk away. It's in the temperate rain forests of the region. The fact that the entire valley looks fairly glacial implies that it's been bigger in the past —like many of the other fjords on the island, it shows the glaciers once made their way out to a lower sea level. (The same goes for Scottish sea glens, presumably.)

The whole glacier was making the sound of gently running water, right down by the terminal moraines. Apparently this glacier actually advanced between 2005 and 2009, but it's retreating again.

[Team: Steve, Bina]
[Photo: Fujichrome ASA100 slide (presumably) on an olympus zoom-lens compact camera I'd bought in NZ]


Remembering the Glaciers: Mont Pelvoux

Mont Pelvoux, Massif des Ecrins, French Alps


This was about two thirds through what turned out to be 14 hours on the move: out the mountain hut before 5 am, making our way up to the summit up a couloir from the south-west, continuing up the ice field, then continuing over the mountain and descending the other side. This was part of the descent, looking up at where we'd been: Glacier du Pelvoux.

Seracs are the waterfalls of glaciers; the ice may move a centimetre or two a day, but it doesn't do it smoothly: slowly the serac moves out past the rocks, building up weight until eventually it's so heavy that the overhanging part —and usually a section behind— separates from the rest of the glacier and starts to break off. The cracks expand, and eventually the serac itself falls. I've never seen or even heard that, but it's something you fear. Hence you get up early and move fast.

This was the first time I'd had to descend seracs, downclimbing front-pointing the crampons, needing to clear the base as it opened up into a deeper crevasse. We used ice screws for protection, with body belays; we were already roped up for glacier work in general. The last person in the group had to make do with the in-situ protection, which was inevitably some iced-up piece of wood. They were of course backed up by the people below, but even so: glad it wasn't me.

Google earth shows this peak in all its intensity.

I don't think I could do this any more; not just the technical aspects, or the terrifying exposure on the way up, but 14 hours of continuous movement in the mountains is pretty brutal. My body just wouldn't cope, not given my knee hasn't recovered from its 2006 shredded tendon, and I doubt I have the endurance for this, or even for the 12-hour non-stop drive from the Channel to the Alps. I'm never going to get a chance to take this photo again.

[Participants: SteveL, Will W, ^Jim, Mark S]
[Photo: Ilford B&W with Canon Sureshot, manual D&P onto Ilford A4 paper, with manual dodge/burn for the sky]


Remembering the Glaciers : Sunset over Mont Blanc

Sunset over Mont Blanc

Sunset over Mont Blanc.

From the west, obviously; Will and I had turned up with our bikes near Les Contamines and haggled over how much it would cost us to camp there. We'd already eaten, so all we were paying for was a shower. They gave us a discount in exchange for making us camp in the far corner and not telling anyone what we'd paid. Given that for the next two days our bathing facilities were mountain streams of meltwater a few minutes off the ice, the FFr we paid for the shower was worth it.

Wikipedia tells me that the date was 21 June, 1997. We'd been up the Col de la Colombière that morning to catch a mountain stage of the Tour de France. Pantani won the day, with the Col de Joux Plane being the big finale.

We were heading in the opposite direction: south on gravel tracks over to Lac Du Roselend, then continuing a partial bike circuit of Mont Blanc.

Imagine what this would look like without the ice? Still dramatic, but red in the sunset?

[Participants: Steve L, Will W]
[Film: Fujichrome 100ASA slide+ Canon Sureshot compact]


Remembering the Glaciers

This week in Paris is the last chance for any kind of international agreement to do anything at all to reduce the effects of global warming. There may be other conferences, later: but by then the topic of the conversation could be "it's too late". Today it's probably already too late to stay in that 2C zone, but it's not too late to try.

1991 zermatt ski panorama

I do not have any faith in my own country's politicians, given their actions this year:
  1. Essentially banning on-shore windmills through tightening planning permissions and allowing central government to block any proposals.
  2. Encouraging fracking through tax breaks and changing planning permissions to allow central government to approve any proposals.
  3. Extending the tax on carbon-generated electricity to —wait for it— renewable energy sources.
  4. Recently: killing Carbon Capture and Sequestration funding, despite CCS being the historic excuse for not needing to move away from coal and gas power: "CCS will handle it".
 +more. It says nothing good that the Chinese government is now clearly more forward-thinking on CO2 and the other environmental issues affecting its populace than a European democracy.

For the week, then: a photograph of a glacier a day. If the politicians —including our own— don't act, photographs will be all that future inheritors of the planet will have of them.

Today: snow and ice above Zermatt, Switzerland, in winter 1991; skiing nearby. Not sure of the peak, but as Zermatt would be to the right of this frame (we were staying on the eastern side of the valley; this is looking south) —this could be part of the Monte Rosa group.

B&W negatives (Ilford, presumably), on a compact 35mm camera; manual D&P, with this image zoomed in on part of the frame and the sky burned in manually. Pre-Photoshop, you used to have to wave bits of cardboard above the print to get the sky dark enough.


Well, what about Groovy then?

Following on from my comments on Scala: what about Apache Groovy?

Stokes Croft Graffiti, Sept 2015

It's been a great language for writing tests in, especially when you turn on static compilation, but there are some gotchas.
  1. It's easy to learn if you know Java, so the effort of adding it to a project is straightforward.
  2. It's got that auto log.info("timer is at $timer") string expansion. But the rules of when they get expanded are 'complex'.
  3. Lists and Maps are in the language. This makes building up structures for testing trivially easy.
  4. The maven groovy task is a nice compiler for java and groovy
  5. Groovydoc is at least 10x faster than javadoc. It makes you realise that javadoc hasn't had any engineering care for decades; right up there with rmic.
  6. Its closeness to Java makes it easy to learn, and you can pretty much paste Java code into groovy and have it work.
  7. The groovy assert statement is best in class. If it fails, it deconstructs the expression, listing the toString() value of every parameter. Provided you provide meaningful string values, debugging a failed expression is way easier than anything else.
  8. Responsive dev team. One morning some maven-always-updates artifact was broken, so stack traces followed the timezones into the next day. We in Europe hit it early and filed bugs; it was fixed before California noticed. Try that with javac-related bugs and you probably won't have got as far as working out how to set up an Oracle account for bug reporting before the sun sets and the video-conf hours begin.
As I said, I think it's great for testing. Here's a test of mine:

@CompileStatic     // we want this compiled in advance to find problems early
@Slf4j             // and inject a 'log' field for logging
class TestCommonArgParsing implements SliderActions, Arguments {

  public void testList1Clustername() throws Throwable {
    // note use of [ list ]
    ClientArgs ca = createClientArgs([ACTION_LIST, 'cluster1'])
    assert ca.clusterName == 'cluster1'
    assert ca.coreAction instanceof ActionListArgs
  }
}
What don't I like?
  1. Different scoping rules: everything defaults to public. Which you have to remember when switching languages
  2. Line continuation rules need to be tracked too ... you need the end of the line to be unfinished (e.g trail the '+' sign in a string concatenation). Scala makes a better job of this.
  3. Sometimes IDEA doesn't pick up use of methods and fields in your groovy source, so the refactorings may omit bits of your code.
  4. Sometimes that @CompileStatic tag results in compilation errors, normally fixed by commenting out that attribute. Implication: the static and dynamic compiler are 'different'
  5. Using == for .equals() is, again, a danger.
  6. The notion of truthiness in comparisons is very C/C++-ish: you need to know the rules. Null values are false, as are integers which are zero and strings which are empty. Safe strategy: just be explicit.
  7. It's not so much type inference as type erasure, with runtime stack traces as the consequence.
  8. The logic about when strings are expanded can sometimes be confusing.
  9. You can't paste from Groovy to Java without major engineering work. At least you can paste it at all (more than you could going from Scala to Java), but it makes converting test code to production code harder.
Here's an example of something I like. This is possible in Scala, and Java 8 will make it viable-ish too. It's a test which issues a REST call to get the current slider app state, pushes out a new value and then spins awaiting a remote state change, passing in a method as the probe parameter.

public void testFlexOperation() {
    // get current state
    def current = appAPI.getDesiredResources()

    // create a guaranteed unique field
    def uuid = UUID.randomUUID()
    def field = "yarn.test.flex.uuid"
    current.set(field, uuid)
    // push the new state out (method name assumed)
    appAPI.putDesiredResources(current)
    // spin until the change is visible remotely, using the probe below
    repeatUntilSuccess("probe for resource PUT",
        this.&probeForResolveConfValues,
        5000, 200,
        [
            "key": field,
            "val": uuid
        ],
        "Flex resources failed to propagate") {
      // only reached if the probe never succeeded
      def resolved = appAPI.getResolvedResources()
      fail("Did not find field $field=$uuid in\n$resolved")
    }
  }

  Outcome probeForResolveConfValues(Map args) {
    String key = args["key"]
    String val = args["val"]
    def resolved = appAPI.getResolvedResources()
    return Outcome.fromBool(resolved.get(key) == val)
  }

That Outcome class has three values: Success, Retry and Fail; the probe returns an Outcome, and the executor, repeatUntilSuccess, execs the probe closure until the duration of retries exceeds the timeout, the probe succeeds, or a Fail response triggers a fast failure. It allows my tests to iterate until success, but if the probe detects an unrecoverable failure —bail out fast. It's effective at avoiding long sleep() calls, which introduce needless delays on fast systems, and which are incredibly brittle on slow ones.

If you look at a lot of Hadoop test failures, they're of the form "test timeout on Jenkins", with a fix of "increase sleep times". Closure-based probes, with probes that detect unrecoverable failures (e.g. network unreachable is a hard failure, different from a 404, which may go away on retries), are a far better strategy. Anyway: closures are great in tests. Not just for probes, but for functions to call once a system is in a desired state (e.g. web server launched).
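The probe/Outcome pattern can be sketched in plain Java too; this is a hypothetical re-implementation for illustration, not the Slider classes:

```java
import java.util.function.Supplier;

/** Sketch of the probe/retry pattern: spin the probe until it succeeds,
 *  reports an unrecoverable failure, or the timeout expires. */
public class Probes {

  public enum Outcome { SUCCESS, RETRY, FAIL }

  public static boolean repeatUntilSuccess(Supplier<Outcome> probe,
      long timeoutMillis, long sleepMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      switch (probe.get()) {
        case SUCCESS:
          return true;                   // desired state reached
        case FAIL:
          return false;                  // unrecoverable: fail fast
        case RETRY:
          try {
            Thread.sleep(sleepMillis);   // transient: wait and re-probe
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
          }
          break;
      }
    }
    return false;                        // timed out
  }
}
```

The key design point is the three-valued result: a boolean probe can only say "keep waiting", whereas Fail lets a probe that has spotted an unrecoverable condition abort the whole wait immediately.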

We use Groovy in Slider for testing; the extra productivity and those assert statements are great. Bigtop uses it too —indeed, we use some Bigtop code as the basis for our functional tests. But would I advocate it elsewhere? I don't know. It's close enough to Java to co-exist, whereas I wouldn't advocate using Scala, Clojure or similar purely for testing Java code *within the same codebase*.

There's also Java 8 to consider. It is a richer language, giving us those closures, just not the maps or the lists. Or the assertions. Or the inline string expansion. But... it's very close to the production code, even if that production code is Java-7 only, and you can make a very strong case for learning Java 8. Accordingly,

  1. I'm happy to continue with Groovy in Slider tests.
  2. For an app where the production code was all Java, I wouldn't advocate adopting groovy now: Java 8 offers enough benefits to be the way forward.
  3. If anyone wants to add java-8 closure-based tests in Hadoop trunk, I'd be supportive there too.


Prime numbers —you know they make sense

The UK government has just published its new internet monitoring bill.

Sepr's Mural and the police

Apparently one of the great concessions is that they aren't going to outlaw any encryption they can't automatically decrypt. This is not a concession: someone has sat down with them and explained how RSA and elliptic curve crypto work, and said "unless you ban prime numbers > 256 bits, banning encryption is meaningless". What makes the government think they have any control over what at-rest encryption code gets built into phone operating systems, or between mobile apps written in California? They didn't have a chance of making this work; all they'd do is be laughed at from abroad while crippling UK developers. As an example, if releasing software with strong encryption were illegal, I'd be unable to make releases of Hadoop —simply due to HDFS encryption.
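For a sense of how unenforceable a prime-number ban would be: the stock JDK will hand out primes of whatever size you ask for. A toy demo (the class name is mine; the BigInteger API is standard):

```java
import java.math.BigInteger;
import java.security.SecureRandom;

/** The "contraband": a probable prime of more than 256 bits,
 *  straight out of the JDK, no special crypto library needed. */
public class PrimeDemo {

  /** A random probable prime with exactly the requested bit length. */
  public static BigInteger bigPrime(int bits) {
    return BigInteger.probablePrime(bits, new SecureRandom());
  }

  public static void main(String[] args) {
    BigInteger p = bigPrime(512);
    System.out.println(p.bitLength() + "-bit prime: " + p);
  }
}
```

One line of standard-library code, available on every laptop and phone in the country.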

You may as well assume that nation states already have the ability to read encrypted messages (somehow), and deal with that by "not becoming an individual threat to a nation state". The same goes for logging of HTTP/S requests. If someone really wanted to, they could. Except that until now the technical abilities of various countries had to be kept secret, because once some fact about breaking RC4 or DH forward secrecy becomes known, software changes.

What is this communications bill trying to do then? Not so much legalise what's possible today, but give the local police forces access to similar datasets for everyday use. That's probably an unexpected side-effect of the Snowden revelations: police forces round the country saying "ooh, we'd like to know that information", and this time demanding access to it in a way that they can be open about in court.

As it stands, it's mostly moot. Google knows where you were, what your search terms were and have all your emails, not just the metadata. My TV vendor claims the right to log what I watched on TV and ship it abroad, with no respect for safe-harbour legislation. As for our new car, it's got a modem built in and if it wants to report home not just where we've been driving but whether the suspension has been stressed, ABS and stability control engaged, or even what radio channel we were on, I would have no idea whatsoever. The fact that you have to start worrying about the INFOSEC policies of your car manufacturer shows that knowing which web sites you viewed is becoming less and less relevant.

Even so, the plan to record 12 months of every IP address's HTTP(S) requests, (and presumably other TCP connections) is a big step change, and not one I'm happy about. It's not that I have any specific need to hide from the state —and if I did, I'd try tunnelling through DNS, using Tor, VPNing abroad, or some other mechanism. VPNs are how you uprate any mobile application to be private —and as they work over mobile networks, deliver privacy on the move. I'm sure I could ask a US friend to set up a raspberry pi with PPTP in their home, just as we do for BBC iPlayer access abroad. And therein lies a flaw: if you really want to identify significant threats to the state and its citizens, you don't go on about how you are logging all TCP connections, as that just motivates people to go for VPNs. So we move the people that are most dangerous underground, while creating a dataset more of interest to the MPAA than anyone else.

[Photo: police out on Stokes Croft or Royal Wedding Day after the "Second Tesco Riot"]


Concept: Maintenance Debt

We need a new S/W dev term: Maintenance Debt.

Searching for the term shows it only crops up in the phrase Child Maintenance Debt; we need it for software.

OR Mini Loop Tour

Maintenance Debt: the Technical Debt a project takes on when they fork an OSS project.

All OSS code offers you the freedom to make a fork of the code; the right to make your own decisions as to the direction of the code. Git makes that branching a one-line action: git checkout -b myfork.

Once you fork, you have taken on the Maintenance Debt of that fork. You now have to:
  1. Build that code against the dependencies of your choice.
  2. Regression test that release.
  3. Decide which contributions to the main branch are worth cherry-picking into your fork.
  4. If you do want to cherry-pick, adapt those patches to your fork. The more you change your fork, the greater the cost of adaptation.
  5. Features you add to your branch which you don't contribute back become yours to maintain forever.
  6. If the OSS code adds a similar feature to yours, you are faced with the choice of adapt or ignore. Ignore it and your branch is likely to diverge fast and cherry-picking off the main branch becomes near-impossible.
That's what Maintenance Debt is, then: the extra work you take on when deciding to fork a project. Now, how to keep that cost down?
  1. Don't fork.
  2. Try and get others to add the features for you into the OSS releases.
  3. As people add the features/fixes you desire, provide feedback on that process.
  4. Contribute tests to the project to act as regression tests on the behaviours you need. This is a good one as with the tests inside a project, their patch and jenkins builds will catch the problem early, rather than you finding them later.
  5. If you do write code, fixes and features, work to get them in. It's not free; all of us have to put in time and effort to get patches in, but you gain in the long term.
  6. Set up your Jenkins/CI system so that you can do builds against nightly releases of the OSS projects you depend on (Hadoop publishes snapshots of branch 2 and trunk for this). Then complain when things break.
  7. Test beta releases of your dependencies, and the release candidates, and complain when things break. If you wait until the x.0 releases, the time to get a fix out is a lot longer —or worse, someone can declare that a feature is now "fixed" and cannot be reverted.

If you look at that list, testing crops up a lot. That's because compile and test runs are how you find out regressions. Even if you offload the maintenance debt to others, validating that their work meets your needs is a key thing to automate.

Get your regression testing in early.

[photo: looking west to the Coastal Range, N/W of Rickreall, Willamette Valley, OR, during a 4 day bike tour w/ friends and family]



Stokes Croft Graffiti, Sept 2015

I've been a busy bunny writing what has grown into a fairly large Spark patch: SPARK-1537, integration with the YARN timeline server. What starts as a straightforward POST event, GET event list, GET event, grows once you start taking into account Kerberos, transient failures of the endpoints, handling unparseable events (fail? Or skip that one?), and compatibility across versions. Oh, and testing all of this; I've got tests which spin up the YARN ATS and the Spark History Server in the same VM, then either generate an event sequence and verify it all works, or even replay some real application runs.

And in the process I have learned a lot of Scala, and some bits of Spark.

What do I like?
  • Type inference. And not the pretend inference of Java 5 or groovy
  • The match/case mechanism. This maps nicely to the SML case mechanism, with the bonus of being able to add conditions as filters (a la Erlang).
  • Traits. They took me a while to understand, until I realised that they were just C++ mixins with a structured inheritance/delegation model. And once so enlightened, using them became trivial. For example, in some of my test suites, the traits you mix in define which services are brought up for the test cases.
  • Lists and maps as primary language structures. Too much source is frittered away in Java creating those data structures.
  • Tuples. Again, why exclude them from a language?
  • Getting back to functional programming. I've done it before, see.

What am I less happy about?
  • The Scala collections model. Too much, too complex.
  • The fact that it isn't directly compatible with Java lists and maps. Contrast with Groovy.
  • Scalatest. More the runner than the tests: the ability to use arbitrary strings to name a test case means that I can't run (at least via maven) a specific test case within a class/suite by name. Instead I've been reduced to commenting out the other tests, which is fundamentally wrong.
  • I think it's gone overboard on various features...it has the, how do I say it, C++ feel.
  • The ability to construct operators using all the symbols on the keyboard may lead to code less verbose than java, but, when you are learning the specific classes in question, it's pretty impenetrable. Again, I feel C++ at work.
  • Having to look at some SBT builds. Never use "Simple" in a spec, it's as short-term as "Lightweight" or "New". I think I'll use "Complicated" in the next thing I build, to save time later.
Now, what's it like going back to doing some Java code? What do I miss?
  • Type inference. Even though its type inference is a kind that Milner wouldn't approve of, it's better than not having one.
  • Semicolons being mostly optional. It's so easy to keep writing broken code.
  • val vs var. I know, Java has "final", but it's so verbose we all ignore it.
  • Variable expansion in strings. Especially debugging ones.
  • Closures. I look forward to the leap to Java 8 coding there.
  • Having to declare exceptions. Note that in Hadoop we tend to say "throws IOException", which is a slightly less blatant way of saying everything "throws Exception". We have to consider Java's explicit exception naming idea one not to repeat, on the grounds that it makes maintenance a nightmare, and precludes different implementations of an interface from having (explicitly) different failure modes.
You switch back to Java and the code is verbose —and a lot of that is due to having to declare types everywhere, then build up lists and maps one by one, iterating over them equally laboriously. Again, Java 8 will narrow some of that gap.
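To make that gap concrete, here's the same map built the Java 7 way and as a Java 8 stream pipeline (a toy example of mine):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** The verbosity gap: building and transforming a collection
 *  the Java 7 way versus with a Java 8 stream pipeline. */
public class Verbosity {

  // Java 7: declare, allocate, then populate one entry at a time
  public static Map<String, Integer> lengthsJava7(List<String> words) {
    Map<String, Integer> lengths = new HashMap<String, Integer>();
    for (String word : words) {
      lengths.put(word, word.length());
    }
    return lengths;
  }

  // Java 8: one pipeline, element types inferred
  public static Map<String, Integer> lengthsJava8(List<String> words) {
    return words.stream()
        .collect(Collectors.toMap(w -> w, String::length));
  }
}
```

Still not Groovy's `[ant: 3, slider: 6]` literal, but a lot closer than the loop.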

When I go back to java, what don't I miss?
  • A compiler that crawls. I don't know why it is so slow, but it is. I think the sheer complexity of the language is a likely cause.
  • Chained over-terseness. Yes, I can do a t.map.fold.apply chain in Spark, but when you see a stack trace, having one action per line makes trying to work out what went wrong possible. It's why I code that way in Java, too. That said, I find myself writing more chained operations, even at the cost of stack-trace debuggability. Terseness is corrupting.
One interesting thing is that even though I've done big personal projects in Standard ML, we didn't do team projects in it. While I may be able to sit down and talk about correctness proofs of algorithms in functional programming, I can't discuss maintenance of someone else's code they wrote in a functional language, or how to write something for others to maintain.

Am I going to embrace Scala as the one-and-true programming language? No. I don't trust it enough yet, and I'd need broader use of the language to be confident I was writing things that were the right architecture.

What about Scala as a data-engineering language? One stricter than Python, but nimble enough to use in notebooks like Zeppelin?

I think from a pure data-science perspective, I'd say "Work at the Python level". Python is the new SQL: something which, once learned, can be used broadly. Everyone should know basic python. But for that engineering code, where you are hooking things up, mixing in existing Java libraries, Hadoop API calls and using things like Spark's RDDs and Dataframes, Scala works pretty well.


Time on multi-core, multi-socket servers

Stokes Croft Graffiti, Sept 2015

In Distributed Computing the notion of "when-ness" is fundamental; Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" paper is considered one of the foundational pieces of work.

But what about locally?

In the Java APIs, we have System.currentTimeMillis() and System.nanoTime() to return the time.

We experienced developers "know" that currentTimeMillis() is on the "wall clock", so that if things happen to that clock (manual/NTP clock shifts, VM migration), time can suddenly jump to a new value. And for that reason, nanoTime() is the one we should really be using to measure time, monotonically.

Except I now no longer trust it. I've known for a long time that CPU frequency could change its rate, but as of this week I've now discovered that on multi-socket (and older multi-core) systems, the nanoTime() value may be one or more of:

  1. Inconsistent across cores, hence non-monotonic on reads, especially reads likely to trigger thread suspend/resume (anything with sleep(), wait(), IO, accessing synchronized data under load).
  2. Not actually monotonic.
  3. Achieving consistency only by querying heavyweight counters, with possibly longer function execution time and lower granularity than the wall clock.
That is: modern NUMA, multi-socket servers are essentially multiple computers wired together, and we have a term for that: distributed system.

The standard way to read nanoTime() on an x86 part is reading the TSC counter, via the RDTSC opcode. Lightweight, though actually a synchronization barrier opcode.

Except every core in a server may be running at a different speed, and so have a different value for that counter. When code runs across cores, different numbers can come back.

In Intel's Nehalem chipset the TSC is shared across all cores on the same die, and clocked at a rate independent of the CPU: monotonic and consistent across the entire socket. Threads running in any core in the same die will get the same number from RDTSC —something that System.nanoTime() may use.

Fill in that second socket on your server, and you have lost that consistency, even if the parts and their TSC counters are running forwards at exactly the same rate. Any code you had which relied on TSC consistency is now going to break.

This is all ignoring virtualization: the RDTSC opcode may or may not be virtual. If it is: you are on your own.

Operating systems are aware of this problem, so may use alternative mechanisms to return a counter: which may be neither monotonic nor fast.

Here, then, is some reading on the topic.
The conclusion I've reached is that except for the special case of using nanoTime() in micro benchmarks, you may as well stick to currentTimeMillis() —knowing that it may sporadically jump forwards or backwards. Because if you switched to nanoTime(), you don't get any monotonicity guarantees, it doesn't relate to human time any more —and may be more likely to lead you into writing code which assumes a fast call with consistent, monotonic results.
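If you do measure durations with nanoTime(), one defensive pattern (a sketch of my own, not a Hadoop utility) is to clamp the delta, so a cross-core skew or backwards step reports as zero rather than a negative duration:

```java
public class Elapsed {
    private final long start = System.nanoTime();

    /**
     * Elapsed nanoseconds since construction, clamped at zero:
     * a backwards step in the counter reports as 0, never negative.
     */
    public long nanos() {
        return Math.max(0L, System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        Elapsed e = new Elapsed();
        Thread.sleep(50);
        // Always non-negative, even if the underlying counter misbehaved.
        System.out.println(e.nanos() >= 0);
    }
}
```

It hides the problem rather than fixing it, but at least downstream arithmetic never sees a negative interval.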


iOS 8.4, the Windows Vista of the iPad

I like to listen to music while coding, with my normal strategy being "on a Monday, pick an artist, leave them playing all week". It's a low-effort policy. Except iOS 8.4 has ruined my workflow.

iOS 8.4 music player getting playlists broken by building a sequence of the single file in the list

Now, while I think Ian Curtis's version of Sister Ray is possibly better than the Velvet Underground's, it doesn't mean I want to listen to it exclusively, yet this is precisely what the app appears to want. Both when I start the playlist, and sometimes even when it's been happily mixing the playlist. Sometimes it just gets into a state where the next (shuffled) track is the same as the current track, forever. And before anyone says "hey, you just hit the repeat-per-track option", here's the fun bit: it switched from repeat-playlist to repeat-track all on its own. That implies a state change, either in some local variable (how?) or that the app is persisting state and reloading it, and that persist/reload changed the value. As a developer, I suspect the latter, as it's easier to get a save/restore of an enum wrong.
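As a purely hypothetical illustration (nothing to do with Apple's actual code): persist an enum by its ordinal, reorder or insert a constant in the next release, and the restored value silently becomes a different setting.

```java
public class RepeatModeDemo {
    // Hypothetical v1 of a player, persisting the mode as its ordinal.
    enum RepeatModeV1 { OFF, REPEAT_PLAYLIST, REPEAT_TRACK }

    // Hypothetical v2 reorders the constants, shifting the ordinals.
    enum RepeatModeV2 { OFF, REPEAT_TRACK, REPEAT_PLAYLIST }

    public static void main(String[] args) {
        int saved = RepeatModeV1.REPEAT_PLAYLIST.ordinal(); // 1, written to disk by v1
        RepeatModeV2 restored = RepeatModeV2.values()[saved]; // read back by v2
        // The user's repeat-playlist choice has come back as repeat-track.
        System.out.println(restored); // REPEAT_TRACK
    }
}
```

Persisting by name() rather than ordinal() avoids that particular failure mode.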

The new UI doesn't help. Apparently using Siri helps, as you can just say "shuffle" and let it do the rest. I couldn't get that far, because every time I ask it to play my music, it warns me that this will break the Up Next list. That's the one that's decided my entire playlist consists of one MP3: Sister Ray, covered by Joy Division.

If one thing is clear: not only is the UI of iOS 8.4 music a retrograde step, it was shipped without enough testing, or with major known bugs. I don't know which is worse.

Siri (and Cortana, OK Google and Alexa) are all showing how speech recognition has improved over time, and we are starting to see more AI-hard applications running on mobile devices (Google Translate on Android is another classic), but it's moot if the applications that the speech recognition systems are hooked up to are broken.

Which is poignant, isn't it:
  • Cutting edge speech recognition software combining mobile devices & remote server-side processing: working
  • Music player application of a kind which even Apple have been shipping for over a decade, and which AMP and Napster had nailed down earlier: broken.
The usual tactic: rebooting? All the playlists are gone, even after a couple of attempts at resyncing with iTunes.

That's it, then: I cannot use the Apple music app on the iPad to listen to my music. Which, given that a key strategic justification for the 8.4 release is the Apple Music service, has to be a complete disaster.

This reminds me so much of the Windows Vista experience: an upgrade that was a downgrade. I had Vista on a laptop for a week before sticking Linux on it. I don't have that option here, only the promise that iOS 9 will make things "better".

I would go back to the iPod Nano, except I can't find the cable for it, so have switched to Google Play and streaming my in-cloud music backup. Which, from an Apple strategic perspective, can't rate very highly if I am not the only person abandoning the music player for alternatives that actually work.


Book Review, Hadoop Security, and distributed security in general

I've been reading the new ORA book, Hadoop Security, by Ben Spivey and Joey Echeverria. There aren't many reviews up there, so I'll put mine up.

  • Reasonable intro to Kerberos on Hadoop clusters
  • Covers the identity -> cluster user mapping problem
  • ACLs in HDFS, YARN &c covered nicely —explanation and configuration
  • Shows you pretty much how to configure every Hadoop service for authenticated and authorized access, audit logging and data & transport encryption
  • Has Sentry coverage, if that matters to you
  • Has some good "putting it all together" articles
  • Index seems OK
  • Avoids delving into the depths of implementation (strength and weakness)

Overall: good from an ops perspective, for anyone coding in/against Hadoop, background material you should understand —worth buying.

Securing Distributed Systems

I'd bought a copy of the ebook while it was still a work in progress, so I got to see the original Chapter 2, "securing distributed systems: chapter to come". I actually think they should have left that page as it was, on the basis that Distributed System Security is a Work in Progress. And while it's easy for all of us to say "defence in depth", none of us really practice that properly, even at home. Where is the two-tier network, with the fundamentally untrustable IoT layer (TVs, light bulbs, telephones, bittorrent servers) on a separate subnet from the critical household infrastructure: the desktops, laptops and home servers? How many of us keep our ASF, SSH and github credentials on an encrypted USB stick which must be unlocked for use? None of us. Bear that in mind whenever someone talks about security infrastructure: ask them how they lock down their house. (*)

Kerberos is the bit I worry about day to day, so how does it stack up?

I do think it covers the core concepts-as-a-user, and has a workflow diagram which presents time quite nicely. It avoids going into the details of the protocol which, as anyone who has ever read Coulouris & Dollimore will note, is mind-numbingly complex and hits the mathematics layer pretty hard. A good project for learning TLA+ would probably be "specify Kerberos".

ACLs are covered nicely too, while encryption covers HDFS, Linux FS and wire encryption, including the shuffle.

There's coverage of lots of the Hadoop stack, core Hadoop, HBase, Accumulo, Zookeeper, Oozie & more. There's some specifics on Cloudera bits: Impala, Sentry, but not exclusively and all the example configs are text files, not management tool centric: they'll work everywhere.

Overall then: a pretty thorough book on Hadoop security; for a general overview of security, Kerberos, ACLs and configuring Hadoop, it brings everything together into one place.

If you are trying to secure a Hadoop cluster, invest in a copy.


Now, where is it limited?

1. A lot of the book is configuration examples for N+ services & audit logs. It's a bit repetitive, and I don't think anybody would sit down and type those things in. However, there are so many config files in the Hadoop space that at least how to configure all the many services is covered. It just hampers the readability of the book.

2. I'd have liked to have seen the HDFS encryption mechanism illustrated, especially KMS integration. It's not something I've sat down to understand, and the same UML sequence diagram style used for Kerberos would have gone down well.

3. It glosses over precisely how hard it is to get Kerberos working, how your life will be frittered away staring at error messages which make no sense whatsoever, only for you to discover later they mean "java was auto updated and the new version can't do long-key crypto any more". There's nothing serious in this book about debugging a Hadoop/Kerberos integration which isn't working.

4. Its section on coding against Kerberos is limited to a couple of code snippets around UGI login and doAs. Given how much pain it takes to get Kerberos to work client-side, including ticket renewal, delegation token creation, delegation token renewal, debugging, etc., one and a half pages isn't even a start.

Someone needs to document Hadoop & Kerberos for developers —this book isn't it.

I assume that's a conscious decision by the authors, for a number of valid reasons
  • It would significantly complicate the book.
  • It's a niche product, being for developers within the Hadoop codebase.
  • It'd make maintenance of the book significantly harder.
  • To write it, you need to have experienced the pain of adding a new Hadoop IPC, writing client tests against in-VM ZooKeeper clusters locked down with MiniKDC instances, or tried to debug why Jersey+SPNEGO was failing after 18 hours of test runs.
The good news is that I have experienced the suffering of getting code to work on a secure Hadoop cluster, and want to spread that suffering more broadly.

For that reason, I would like to announce the work in progress, gitbook-toolchained ebook:

Kerberos and Hadoop: The Madness beyond the Gate

This is an attempt to write down things I've learned, using a Lovecraftian context to make clear this is forbidden knowledge that will drive the reader insane**. Which is true. Unfortunately, if you are trying to write code to work in a Hadoop cluster (especially YARN applications, or anything acting as a service for callers, be they REST or IPC) you need to know this stuff.

It's less relevant for anyone else, though the Error Messages to Fear section is one of the things I felt the Hadoop Security book would have benefited from.

As noted, the Madness Beyond the Gate book is a WiP and there's no schedule to extend or complete it —just something written during test runs. I may finish it; I may get bored and distracted. But I welcome contributions from others, together we can have something which will be useful for those people coding in Hadoop —especially those who don't have the luxury of knowing who added Kerberos support to Hadoop, or has some security experts at the end of an email connection to help debug SPNEGO pain.

I've also put down for a talk on the same topic at Apachecon EU Data —let's see if it gets in.

(*) Flash removed, except on Chrome browsers, which I've had to go round and update this week. The two-tier network is coming once I set up a Raspberry Pi as the bridge, though with Ethernet-over-power as the core backbone, life is tricky. And with PCs in the "trust zone", I'm still vulnerable to 0-days, the hazards imposed by other household users, and my use of apt-get, homebrew and maven & ivy in builds. I should really move to developing in VMs I destroy at the end of each week.

(**) plus it'd make for fantastic cover page art in an ORA book.


Why is so much of my life wasted waiting for test runs to complete?

I've spent the weekend enduring the pain of Kerberos-related functional test failures, test runs that take time to finish, especially as it's token expiry between deployed services which is the Source of Madness (copyright (c) 1988 MIT).

DSC_0128 - Nibaly

Anyone who follows me on Strava can infer when those runs take place: if it's a long one, I've nipped down to the road bike on the turbo trainer and done a bit of exercise while waiting for the results.

Which is all well and good except for one point: why do I have to wait so long?

While a test is running, the different services in the cluster are all generating timestamped events, "log messages" as they are usually known. The test runner itself is also generating a stream of events, from any client-side code and the wrapping JUnit/xUnit runners: again, tuples of (timestamp, thread, module, level, text) + implicitly (process, host). And of course there's the actual outcome of each test.

Why do I have to wait until the entire test run is completed for those results to appear?

There's no fundamental reason for that to be the case. It's just the way that the functional tests have evolved under the unit test runners: runners designed to run short-lived unit tests of little classes, runs where stdout and stderr were captured without any expectation of structured format. When <junit> completed individual test cases, it'd save the XML DOM built up in memory to an XML file under build/tests. After JUnit itself completed, the build.xml would have a <junitreport> task to map XML -> HTML in a wondrous piece of XSLT.
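The alternative model is small: emit each outcome the moment the test case finishes, rather than serialising a DOM when the run ends. A sketch (names and shape are mine; the event tuple is the one described above):

```java
import java.time.Instant;
import java.util.function.Consumer;

public class StreamingReporter {
    // The event tuple: (timestamp, thread, module, level, text).
    record TestEvent(Instant timestamp, String thread, String module,
                     String level, String text) {}

    private final Consumer<TestEvent> sink;

    StreamingReporter(Consumer<TestEvent> sink) {
        this.sink = sink;
    }

    /** Push the outcome downstream as soon as a test case completes. */
    void testFinished(String module, boolean passed, String name) {
        sink.accept(new TestEvent(Instant.now(),
                Thread.currentThread().getName(), module,
                passed ? "INFO" : "ERROR",
                name + (passed ? " PASSED" : " FAILED")));
    }

    public static void main(String[] args) {
        // The sink could be a console, an aggregator, or a live HTML view.
        StreamingReporter r =
                new StreamingReporter(e -> System.out.println(e.text()));
        r.testFinished("hdfs", true, "testRename");
        r.testFinished("yarn", false, "testTokenRenewal");
    }
}
```

The sink is pluggable, so the same events could feed a console, an aggregator and the XML file simultaneously.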

Maven surefire does exactly the same thing, except its build reporter doesn't make it easy to stream the results to both XML files and the console at the same time.

The CI tooling (CruiseControl and its successors, of which Jenkins is the near-universal standard) took those same XML reports and now generates its own statistics, again waiting for the reports to be generated at the end of the test run.

That means those of us who are waiting for a test to finish have a limited set of choices
  1. Tweak the logging and output to the console, stare at it waiting to see stack traces to go by
  2. Run a single failing test repeatedly until you fix it, again staring at the output. In doing so you neglect the rest of the code, until at the end of the day you are left with the choice of (a) running the hour-long test of everything to make sure there are no regressions, or (b) committing and pushing, expecting a remote Jenkins to find the problem, at which point you may have broken a test and either need to get those results again & fix them, or rely on the goodwill of a colleague (special callout: Ted Yu, the person who usually ends up fixing SLIDER-1 issues)
Usually I drift into the single-test mode, but first you need to identify the problem. And even then, if the test takes a few minutes, each iteration hurts. Then there's the hassle of collecting the logs, correlating events across machines and services to try and understand what's going on. If you want more detail, it's over to http:{service host:port}/logLevel, tuning up the logs to capture more events on the next iteration, and so you are off again.

A hard-to-identify problem becomes a "very slow to identify problem", or productivity killer.

Sitting waiting for tests is a waste of time for software engineers.

What to do?

There's parallelisation. Apparently there's some parallelised test runner that the Cloudera team has which we could perhaps pick up and make reusable. That would be great, but you are still waiting for the end of the test runs for the results, unless you are going to ssh into the hosts and play tail -f against log files, or grep for specific event texts.

What would be just as critical is: real time reporting of test results.

I've discussed before how we need to radically improve tests and test runners.

What we also have to recognise is that the test reporting infrastructure equally dates from the era of unit tests taking milliseconds, full test suites and XSL transformations of the results taking 10s of seconds at most.

The world has moved on from unit tests.

What do I want now? As well as the streaming out of those events in structured form directly to some aggregator, I want that test runner to be immediately publishing the aggregate event stream and test results to some viewer better than four consoles with tail -f streaming text files (or worse, XML reports). I want HTML pages as they come in, with my test report initially showing all tests enumerated, then filling up as tests run and fail. I'd like the runner to know (history, user input?) which tests were failing, and so run them first. If I checked in a patch to a specific test class, that'll be the one I want to run next, followed by everything else in the same module (assuming locality of test coverage).
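The "run known failures first" ordering is a few lines of code once the runner has failure history; here's a hypothetical sketch (no test runner actually exposes this API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

public class FailFirstOrdering {
    /**
     * Order tests so that those which failed in the last run execute first;
     * the sort is stable, so the original order is kept within each group.
     */
    static List<String> order(List<String> tests, Set<String> failedLastRun) {
        List<String> sorted = new ArrayList<>(tests);
        // false sorts before true: known failures move to the front.
        sorted.sort(Comparator.comparing((String t) -> !failedLastRun.contains(t)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> run = order(
                List.of("testA", "testB", "testC", "testD"),
                Set.of("testC"));
        System.out.println(run); // [testC, testA, testB, testD]
    }
}
```

The same comparator could chain on "recently edited test class first" for the locality-of-coverage heuristic.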

Once I've got this, the CI tooling I'm going to run will change. It won't be a central machine or pool of them, it'll be a VM hosted locally or some cloud infrastructure. Probably the latter, so it won't be fighting for RAM and CPU time with the IDE.

Whenever I commit and push a patch to my private branch, the tests should run.

It's my own personal CI instance, it gets to run my tests, and I get to have a browser window open keeping track of the status while I get on with doing other things.

We can do this: it's just the test runner reporting being switched from batch to streaming, with the HTML views adapting.

If we're building the largest distributed computing systems on the planet, we can't say that this is beyond us.

(Photo: Nibali descending from the Chartreuse Massif into Grenoble; Richie Porte and others just behind him doomed to failure on the climb to Chamrousse, TdF 2014 stage 13)


The Manchester Dataflow Machine: obscure historical computer architecture of the month

Milind has flagged up some nearby students at Bristol Uni attempting to reimplement the Transputer. I should look at that some time. For now, a little paper I was reading last week while frittering away an unexpected few days in a local hospital.

The Manchester Dataflow Machine

The MDM was some mid-1980s exploration of a post-microprocessor architecture, the same era as RISC, the Transputer and others. It was built from 74F-series logic, TTL rather than CMOS; performance numbers aren't particularly compelling by today's standards. What matters more is its architectural model.

In a classic von-Neumann CPU design, there's a shared memory for data and code; the program counter tells the CPU where to fetch the next instruction from. It's read, translated into an operation and then executed. Ops can work with memory; registers exist as an intermediate stage, essentially an explicit optimisation to deal with memory latency; branching is implemented by changing the PC. The order of execution of operations is defined by the order of machine code instructions (for anyone about to disagree with me there: wait; we are talking pure von-Neumann here). It's a nice simple conceptual model, but it has some flaws. A key one is that some operations take a long time: memory reads if there's a cache miss, some arithmetic operations (example: division). The CPU waits, "stalls", until the operation completes, even if there is a pipeline capable of executing different stages of more than one operation at a time.

What the MDM did was accept the inevitability of those delays -and then try to eliminate them by saying "there is no program counter, there are only data dependencies".

Instead of an explicitly ordered sequence of instructions, your code lists operations as unary or binary operations against the outputs of previous actions and/or fetches from memory. The CPU then executes those operations in an order which guarantees those dependencies are met, but where the exact order is chosen based on the explicit dependency graph, not the implicit one of the sequence of opcodes produced by the compiler/developer.

Implementation-wise, they had a pool of functional units, capable of doing different operations, wired up by something which would get the set of instructions from (somewhere), stick them in the set of potential operations, and as slots freed up in the functional units, dispatch operations which were ready. Those operations generated results, which would make downstream operations ready for dispatch.
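Java's CompletableFuture gives a small-scale feel for that dispatch-on-readiness model (my analogy, not anything from the MDM paper): operations with no mutual dependency may run in any order, and a dependent operation is dispatched only when its inputs are ready.

```java
import java.util.concurrent.CompletableFuture;

public class DataflowSketch {
    public static void main(String[] args) {
        // Two operations with no mutual dependency: free to run in any order.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> 6 * 7);
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> 10 + 10);

        // Dispatched only when both inputs are ready: the ordering comes
        // from the dependency graph, not from a program counter.
        CompletableFuture<Integer> sum = a.thenCombine(b, Integer::sum);

        System.out.println(sum.join()); // 62
    }
}
```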

The Manchester Dataflow Machine Architecture

This design offered parallel execution proportional to the number of functional units: add more adders, shifters, dividers and they could be kept busy. Memory IO? Again, a functional unit could handle a read or a write, though supporting multiple units may be trickier. Otherwise, the big limitation on performance comes in conditional branching: you can't schedule any work until you know its conditions are met. Condition evaluation, then, becomes a function of its own, with all code that comes after dependent on the specific outcomes of the condition.

To make this all usable, a dataflow language was needed; the one the paper talks about is SISAL. This looks like a functional language, one designed to compile down to the opcodes a dataflow machine needs.

Did it work? Yes. Did it get adopted? No. Classic procedural CPUs with classic procedural HLLs, compiling down to assembly language where the PC formed an implicit state variable, are what won out. It's how we code and what we code for.

And yet, what are we coding: dataflow frameworks and applications at the scale of the datacentre. What are MapReduce jobs but a two step dataflow? What is Pig but a dataflow language? Or Cascading? What are the query plans generated by SQL engines but different data flow graphs?

And if you look at Dryad and Tez, you've got a cluster-wide dataflow engine.

At the Petascale, then, we are working in the Dataflow Space.

What we aren't doing is working in that model in the implementation languages. Here we write procedural code that is either converted to JVM bytecodes (for an abstract register machine), or compiled straight down to assembly language. And those JVM bytecodes: compiled down to machine code at runtime. What those compilers can do is reorder the generated opcodes based on the dataflow dependency graph which they have inferred from the source. That is, even though we went and wrote procedurally, the compilers derived the data dependencies, and generated a sequence of operations which they felt were more optimal, based on their knowledge of and assumptions about the target CPU and the cache/memory architecture within which it resides.

And this is the fun bit, which explains why the MDM paper deserves reading: the Intel P6 CPU of the late 1990s is (as are all its successors) right at its heart built around a dataflow model. They take those x86 opcodes in the order lovingly crafted by the compiler or hard-core x86 assembly coder and go "you meant well, but due to things like memory read delays, let us choose a more optimal ordering for your routines". Admittedly, they don't use the MDM architecture; instead they use Tomasulo's algorithm from the IBM 360 mainframes.

A key feature there is "reservation stations", essentially register aliasing, addressing the issue that Intel parts have a limited and inconsistent set of registers. If one series of operations work on registers eax and ebx and a follow-on sequence overwrites those registers, the second set gets a virtual set of registers to play with. Hence, it doesn't matter if operations reuse a register, the execution order is really that of the data availability. The other big trick: speculative execution.

The P6 and successor parts will perform operations past a branch, provided the results of the operations can fit into (aliased) registers, and not anything with externally visible effects (writes, port IO, ...). The CPU tags these operations as speculative, and only realises them when the outcome of the branch is known. This means you could have a number of speculated operations, such as a read and a shift on that data, with the final output being visible once the branch is known to be taken. Predict the branch correctly and all pending operations can be realised; any effects made visible. To maintain the illusion of sequential non-speculative operation, all operations with destinations that can't be aliased away have to be blocked until the branch result is known. For some extra fun, any failures of those speculated operations can only be raised when the branch outcome is known. Furthermore, it has to be the first failing instruction in the linear, PC-defined sequence that must visibly fail first, even if an operation actually executed ahead of it had failed. That's a bit of complexity that gets glossed over when the phrase "out of order execution" is mentioned. More accurate would be "speculative data-flow driven execution with register aliasing and delayed fault realisation".

Now, for all that to work properly, that data flow has to be valid: dependencies have to be explicit. Which isn't at all obvious once you have more than one thread manipulating shared data, more than one CPU executing operations in orders driven by its local view of the data dependencies.

Initial state
int *p = 0;
int ready = 0;
int outcome=100;
int val = 0;

Thread 1
p = &outcome;
ready = 1;

Thread 2
if (ready) val = *p;

Neither thread knows of the implicit dependency of p only being guaranteed to be valid after 'ready' is set. If the dereference val = *p is speculatively executed before the condition if (ready) is evaluated, then instead of ready==true implying val == 100, you could now have a stack trace from attempting to read the value at address 0. This will of course be an obscure and intermittent bug which will only surface in the field on many-core systems, and never under the debugger.

The key point is: even if the compiler or assembly code orders things to meet your expectations, the CPU can make its own decisions. 

The only way to make your expectations clear is by getting the generated machine code to contain flags to indicate the happens-before requirements, which, when you think about it, is simply adding another explicit dependency in the data flow: a must-happen-before operator in the directed graph. There isn't an explicit opcode for that; instead, barrier opcodes go in, which tell the CPU to ensure that all operations listed in the machine code before that op will complete before the barrier. Equally importantly, nothing will be reordered or speculatively executed ahead of it: all successor operations will happen after. That is, the opcode becomes a dependency of all predecessor operations, and all that come after have a must-come-after dependency on this barrier. In x86, any operation with the LOCK attribute is a barrier, as are others (like RDTSCP). And in Java, the volatile keyword is mapped to a locked read or write, so is implicitly a barrier: no operations will be promoted ahead of a volatile R/W, either by javac or by the CPU, nor will any be delayed. This means volatile operations can be very expensive: if you have a series of them, even with no explicit data dependency, they will be executed in order. It also means that at compile time, javac will not move operations on volatile fields out of a loop, even if there's no apparent update to them.
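The race above, translated into plain Java and fixed with a volatile flag: the plain write to the data happens-before the volatile write to the flag, so a reader that observes the flag set is guaranteed to see the data. A minimal, deterministic sketch:

```java
public class VolatileHandoff {
    static int outcome;             // plain write, published via the flag below
    static volatile boolean ready;  // volatile: acts as the barrier

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            outcome = 100;  // happens-before the volatile write that follows
            ready = true;   // volatile write: nothing above may sink below it
        });
        writer.start();
        while (!ready) {    // volatile read: javac cannot hoist it out of the loop
            Thread.onSpinWait();
        }
        System.out.println(outcome); // guaranteed 100 once ready is observed
        writer.join();
    }
}
```

Drop the volatile keyword and both failure modes return: the reader may spin forever on a cached flag, or see the flag before the data.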

Given these details on CPU internals, it should be clear that we now have dataflow at the peta-scale, and at the micro-scale, where what appear to be sequential operations have their data dependencies used to reorder things for faster execution. It's only the bits in the middle that are procedural. Which is kind of ironic really: why can't it be dataflow all the way down? Well the MDM offered that, but nobody took up the offering.

Does it matter? Maybe not. Except that if you look at Intel's recent CPU work, it's adding new modules on the die for specific operations. First CRC, then AES encryption; the new erasure coding work in HDFS is using some other native operations. By their very nature, these modules are intended to implement in silicon algorithms which take many cycles to process per memory access. Which means they are inherently significantly slower than existing functional units in the CPU. Unless the hardware implementations are as fast as operations like floating point division, code that depends on the new operations' results is going to be held up for a while. Because all that OOO dataflow work is hidden, there's no way in the x86 code to send that work off asynchronously.

It'd be interesting to consider whether it would be better to actually have some dataflow view of those slow operations, something like Java's futures, where slow operations are fired off asynchronously, with a follow-up operation to block until the result of the operation is ready —with any failures being raised at this point. Do that and you start to give the coders and compiler writers visibility into where big delays can surface, and the ability to deal with them, or at least optimise around them.
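Java's futures already behave exactly that way at the software level, including the delayed fault realisation: fire the slow operation off, carry on, and any failure only surfaces at the point where you block for the result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class DeferredFailure {
    public static void main(String[] args) {
        // Fire off the slow operation asynchronously...
        CompletableFuture<Integer> slow = CompletableFuture.supplyAsync(() -> {
            throw new IllegalStateException("slow unit failed");
        });

        // ...carry on with other work, then block for the result.
        // The failure only surfaces here, at the realisation point.
        try {
            slow.join();
        } catch (CompletionException e) {
            System.out.println(e.getCause().getMessage());
        }
    }
}
```

The question is whether something equivalent could be surfaced at the instruction level, rather than kept hidden inside the OOO machinery.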

Of course, you do need to make that stuff visible in the language too.

Presumably someone has done work like that; I'm just not current enough with my reading.

Further Reading
[1] How Java is having its memory model tightened
[2] How C++ is giving you more options to do advanced memory get/set


3 years at Hortonworks!

In 2012 I handed in my notice at HP Laboratories and joined Hortonworks: this May is the third anniversary of my joining the team.


I didn't have to leave HP, and in the corporate labs I had reasonable freedom to work on things I found interesting. Yet it was through those interesting things that we'd discovered Hadoop. Paolo Castagna introduced me to it, as he bubbled with enthusiasm for what he felt was the future of server-side computing. At the time I was working on the problem of deploying and managing smaller systems, but doing so in the emergent cloud infrastructures. Hadoop was initially another interesting deployment problem: one designed to scale and cope with failures, yet also built on the assumption of a set of physical hosts, hosts with fixed names and addresses, hosts with persistent storage and whose failures would be independent. Some of the work I did at that time with Julio Guijarro included dynamic Hadoop clusters, Mombasa (the long haul route to see elephants). The work behind the scenes to give Hadoop services more dynamic deployments, HADOOP-3628, earned me Hadoop committership. While the branch was never merged in, the YARN service model shows its heritage.

While we were doing this, HP customers were also discovering Hadoop —and building their clusters. I remember the first email coming in from a sales team who had been asked to quote the terasort performance of their servers: the sales team hadn't heard of a terasort and didn't know what to do. We helped. Before long we were the back-end team on many of the big Hadoop deals, helping define and review the proposed hardware specs, reviewing and sometimes co-authoring the bid responses. And what bids they were! At first I thought a petabyte was a lot of storage —but soon some of the deals were for 10+, 20+ PB. Projects where issues like rack weight and HDD resonance were as key to worry about as power and the logistics of getting the servers delivered. Production lines which needed to be block-booked for a week or two, but in doing so allowing server customisation: USB ports surplus? Skip them. How many CPU sockets to fill, and with what SKU? Want 2.5" laptop HDDs for bandwidth over 3.5" capacity-oriented storage? All arrangeable, with even the option of a week-long burn-in and benchmark session as an optional extra. This would show that the system worked as requested, including setting benchmarks for sorting 5+ PB of data that would never be published out of fear of scaring people (bear that in mind when you read blog posts showing how technology X out-terasorts Hadoop —the really big Hadoop sort numbers are of 10+ PB on real clusters, not EC2 XXL SSD instances, and they don't get published).

These were big projects and it was really fun to be involved.

At the same time though, I felt that HP was missing the opportunity, the big picture. The server group was happy to sell the systems for x86 system margins, other groups to set them up. But where was the storage group? Giving us HPL folk grief for denying them the multi-PB storage deals —even though they lacked a Hadoop story and didn't seem to appreciate the technology. Networking? Doing great stuff for HFT systems where buffering was anathema; delivering systems for the enterprise capable of handling intermittent VM migration. But not systems optimised for sustained full link rate bandwidth, decent buffering and backbone scalability through technologies like TRILL or Shortest Path Bridging (you can get these now, BTW).

The whole Big Data revolution was taking place in front of HP: OSS software enabling massive scale storage and compute systems, the underlying commodity hardware making petabytes storable, and the explosion in data sources giving the data to work with. And while HP was building a significant portion of the clusters, it hadn't recognised that this was a revolution. It was reminiscent of the mid 1990s, when the idea of Web Servers was seen as "just another use of a unix workstation".

I left to be part of that Big Data revolution, joining the team I'd got to know through the OSS development, Hortonworks, and so defining the future, rather than despairing about HP's apparent failure to recognise change. Many of us from that era left: Audrey and I to Hortonworks, Steve and Scott to RedHat, Castagna to Cloudera. Before I get complaints from Julio and Chris —yes, some of the first generation of Hadoop experts are still there, the company is taking Big Data seriously, and there are now many skilled people working on it. Just not me.

What have I done in those three years? Lots of things! Some of the big ones include:
  • Hadoop 1 High Availability. One of the people I worked with at VMWare, Jun Ping, is now a valued colleague of mine.
  • OpenStack support: Much of the hadoop-openstack code is mine, particularly the tests.
  • The Hadoop FS Specification: defining a Python-like syntax for Spivey's Z notation, delving through the HDFS and Hadoop source to really define what a Hadoop filesystem is expected to do. From the OpenStack Swift work I'd discovered the unwritten assumptions & set out to define them, then build a test suite to help anyone trying to integrate their FS or object store with Hadoop to get started. This was my little Friday afternoon project; nobody asked me to do it -but now that it is there it's proven invaluable in getting the s3a S3 client working, as well as being one of the first checkpoints for anyone who wants to get Hadoop to work on other filesystems. Arguably that helps filesystem competitors —yet what it is really meant to do is give users a stable underpinning of the filesystem, beyond just the method signatures.
  • The YARN-117 service model. I didn't start that work, I just did my best to get the experience of the SmartFrog and HADOOP-3628 service models in there. I do still need to document it better, and get the workflow and service launcher into the core code base; Slider is built around them.
  • Hoya: proof of concept YARN application to show that HBase was deployable as a dynamic YARN application, and initial driver for the YARN-896 services-on-YARN work.
  • Apache Slider (incubating). A production quality successor to Hoya, combining the lessons from it with the Ambari agent experience, producing an engine to make many applications deployable on YARN through a minimal amount of Python code. Slider is integrated with Ambari, but it works standalone against ASF Hadoop 2.6 and the latest CDH 5.4 release (apparently). I've really got a good insight into the problems of placement of work where access to data has to be balanced with failure resilience; enough to write a paper if I felt like it —rather than just a blog post.
  • The YARN Service Registry. Again, something I need to explain more. An HA registry service for Hadoop clusters, where static and dynamic applications can be registered and used. Slider depends on it for client applications to find Slider and its deployed services; it is critical for internal bonding in the presence of failures. It's also the first bit of core Hadoop with a formal specification in TLA+.
  • Spark on YARN enhancements. SPARK-1537 is my first bit of work there, having the spark history server use the YARN timeline service. Spark internals in Scala, collaboration with the YARN team on REST API definitions and reapplying the test experience of Slider to accompany this with quality tests.
  • Recently: some spare-time work mentoring s3a into a production-ready state.
  • Working with colleagues to help shape our vision of the future of Hadoop. Apache Hadoop is a global OSS project, one which colleagues, competitors and users of the technology collaborate to build. I, like the rest of my colleagues, get a say there, helping define where we think it can go: then building it.

The latter is a key one to call out. At HP an inordinate amount of my time was spent trying to argue the case for things like Hadoop inside the company itself, mostly by way of PowerPoint-over-email. I don't have to do that any more. When we make decisions it's done rapidly, pulling in the relevant people, rather than the inertial hierarchy of indifference and refusal which I sometimes felt I'd encountered in HP.

Which is why working at Hortonworks is so great: I'm working with great people, on grand projects —yet doing this in a process where my code is in people's hands within weeks to months, and where an agile team keeps the company nimble. And pretty much all my work has shipped.

If you look at how the work has included applied formal methods, distributed testing, models of system failure and dynamic service deployment, I'm combining production software development with higher level work that is no different than what I was doing in a corporate R&D lab -except with shipping code.

Hortonworks is hiring. If what I've been up to -and how I've been doing it- sounds exciting: then get in touch. That particularly applies to my former HPL colleagues, who have to make their mind up where to go: ink vs enterprise. There is another option: us.


It's OK to submit patches without tests: just show the correctness proofs instead


In a window adjacent to the browser I'm typing this, I'm doing a clean build of trunk prior to adding a two line patch to Hadoop -and the 20+ lines needed to test that patch; IntelliJ burbles away for 5 minutes as it does on any changes to the POMs of a project which has a significant part of the Hadoop stack loaded.

The patch I'm going to write is to fix a bug introduced by a previous three line patch, one that didn't come with tests because it was "trivial".

It may have been a trivial patch, but it was a broken trivial patch. It's not so much that the test cases would have found this, but they'd have forced the author to think more about the inputs to the two-line method, and what outputs would be expected. Then we get some tests that generate the desired outputs for the different inputs, ones that guarantee that over time the behaviour is constant.
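To make that concrete, here's a hypothetical "trivial" two-line helper of the kind described above (everything here is invented for illustration, not the actual patch), together with the tests that force its author to think about the inputs: the obvious case, the missing case, and the malformed case.

```python
import unittest

# A hypothetical two-line helper: split a "host:port" string.
# The naive version works for the obvious input and breaks on the rest.
def split_host_port(address):
    host, _, port = address.rpartition(":")
    if not host:
        raise ValueError("no port in %r" % address)
    return host, int(port)

class TestSplitHostPort(unittest.TestCase):
    def test_host_and_port(self):
        self.assertEqual(("namenode", 8020), split_host_port("namenode:8020"))

    def test_missing_port(self):
        # writing this test forces the question: what *should* happen here?
        self.assertRaises(ValueError, split_host_port, "namenode")

    def test_non_numeric_port(self):
        self.assertRaises(ValueError, split_host_port, "namenode:http")
```

The 20+ lines of test are longer than the method, but they pin down the behaviour for all three classes of input, so a later "trivial" change can't silently break one of them.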

Instead we have a colleague spending a day trying to track down a remote functional test failure, one that has been reduced to a multi-hop stack trace problem. The big functional test suites did find that bug (good), but because the cost of debugging and isolating that failure is higher, handling it is more expensive. With better distributed test tooling, especially log aggregation and analysis, that cost should be lower —but it's still pretty needless for something that could have been avoided simply by thinking through the inputs.

Complex functional system tests should not be used as a substitute for unit tests on isolated bits of code. 

I'm not going to highlight the issue, or identify who wrote that patch, because it's not fair: it could be any of us, and I am just as guilty of submitting "trivial" patches. If something is 1-2 lines long, it's really hard to justify in your brain the effort of writing the more complex tests to go with it.

If the code actually works as intended, you've saved time and all is well. But if it doesn't, that failure shows up later in full stack tests (cost & time), the field (very expensive), and either way ends up being fixed the way the original could have been done.

And as documented many times before: it's that thinking about inputs & outputs that forces you to write good code.

Anyway: I have tests to write now, before turning to the kind of problem where those functional tests are justified, such as the Jersey client not working on Java 8. (I can replicate that in a unit test, but only in my Windows Server/Java 8 VM.)

In future, if I see anyone use "trivial patch" as a reason to not write tests, I'll be wheeling out the -1 veto.

I do, however, offer an exception: if people can prove their code works, I'll be happy.

(photo: wall in Brussels)


Distributed System Testing: where now, where next?

Confluent have announced they are looking for someone to work on an open source framework for distributed system testing.

I am really glad that they are sitting down to do this. Indeed, I've thought about sitting down to do it myself, the main reason I've been inactive there is "too much other stuff to do".


Distributed System Testing is the unspoken problem of Distributed Computing. In single-host applications, all you need to do is show that the application "works" on the target system, with its OS, environment (timezone, locale, ...), installed dependencies and application configuration.

In modern distributed computing you need to show that the distributed application works across a set of machines, in the presence of failures.

Equally importantly: when your tests fail, you need the ability to determine why they failed.

I think there is much scope to improve here, as well as on the fundamental problem: defining "works" in the context of distributed computing.

I should write a post on that in future. For now, my current stance is: we need stricter specification of desired observable behaviour and implementation details. While I have been doing some Formal Specification work within the Hadoop codebase, there's a lot more work to be done there —and I can't do it all myself.

Assuming that there is a good specification of behaviour, you can then go on to defining tests which observe the state of the system, within the configuration space of the system (now including multiple hosts and the network), during a time period in which failures occur. The observable state should continue to match the specification, and if not, you want to get the logs to determine why not. Note here that "observed state" can be pretty broad, and includes
  • Correct processing of requests
  • The ability to serialize an incoming stream of requests from multiple clients (or at least, to not demonstrate non-serialized behaviour)
  • Time to execute operations (performance)
  • Ability to support the desired request rate (scalability)
  • Persistence of state, where appropriate
  • Resilience to failures of dependent services, network, hosts
  • Reporting of detected failure conditions to users and machines (operations needs)
  • Ideally: ability to continue in the presence of byzantine failures. Or at least detect them and recover.
  • Ability to interact with different versions of software (clients, servers, peers)
  • Maybe: ability to interact with different implementations of the same protocol.
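As a sketch of how some of those observations turn into checks —every name here is hypothetical, invented for illustration— given some client for the system under test and a small spec of expected behaviour, several of the bullets above become assertions that can run while failures are being injected:

```python
import time

# Hypothetical sketch: assert that observed state matches the specification.
# `client` and the `spec` dictionary keys are assumptions for illustration.
def check_observable_state(client, spec):
    # correct processing of requests
    for request, expected in spec["request_expectations"]:
        assert client.execute(request) == expected
    # time to execute operations (performance)
    start = time.monotonic()
    client.execute(spec["timed_request"])
    assert time.monotonic() - start <= spec["max_latency_seconds"]
    # reporting of detected failure conditions (operations needs)
    for report in spec["expected_failure_reports"]:
        assert report in client.recent_failure_reports()
```

The interesting engineering is not in these assertions, but in keeping them true —and observable— while hosts and processes are being killed underneath the system.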
I've written some slides on this topic, way back in 2006, Distributed Testing with SmartFrog. There's even a sub-VGA video to go with it from the 2006 Google Test Automation Conference.

My stance there was
  1. Tests themselves can be viewed as part of a larger distributed system
  2. They can be deployed with your automated deployment tools, bonded to the deployed system via the configuration management infrastructure
  3. You can use the existing unit test runners as a gateway to these tests, but reporting and logging needs to be improved.
  4. Data analysis is a critical area to be worked on.
I didn't look at system failures; I don't think I was worried enough about that, which shows we weren't deploying things at scale —and this was before cloud computing took failures mainstream. Nowadays nobody can avoid thinking about VM loss at the very least.

Given I did those slides nine years ago, have things improved? Not much, no:
  • Test runners are still all generating the Ant XML test reports, along with the matching XSLT transforms, written by Stephane Bailliez in 2000/2001
  • Continuous Integration servers have got a lot better, but even Jenkins, wonderful as it is, presents results as if they were independent builds, rather than a matrix of (app, environment, time). We may get individual build happiness, but we don't get reports by test, showing that Hadoop/TestATSIntegrationFailures is failing intermittently on all Debian systems -but has been reliable elsewhere. The data is all there, but the reporting isn't.
  • Part of the problem is that they are still working with that XML format, one that, due to its use of XML attributes to summarise the run, buffers things in memory until the test case finishes, then writes out the results. stdout and stderr may get reported -but only for the test client, and even then, there's no awareness of the structure of log messages
  • Failure conditions aren't usually being explicitly generated. Sometimes they happen anyway —but then the complaints are about the build or the target host being broken.
  • Email reports from the CI tooling are also pretty terse. You may get the "build broken, test XYZ with commits N1-N2", but again, you get one per build, rather than a summary of overall system health.
  • With a large dependent graph of applications (hello, Hadoop stack!), there's a lot of regression testing that needs to take place —and fault tracking when something downstream fails. 
  • Those big system tests generate many, many logs, but they are often really hard to debug. If you haven't spent time with 3+ windows trying to sync up log events, you've not been doing test runs.
  • In a VM world, those VMs are often gone by the time you get told there's a problem.
  • Then there's the extended life test runs, the ones where we have to run things for a few days with Kerberos tokens set to expire hourly, while a set of clients generate realistic loads and random servers get restarted.
Things have got harder: bigger systems, more failure modes, a whole stack of applications —yet testing hasn't kept up.

In slider I did sit down to do something that would work within the constraints of the current test runner infrastructure yet still let us do functional tests against remote Hadoop clusters of variable size. Our functional test suite, funtests, uses Apache Bigtop's script launcher to start Slider via its shell/py scripts. This tests those scripts on all test platforms (though it turns out, not enough locales), and forces us to have a meaningful set of exit codes —enough to distinguish the desired failure conditions from unexpected ones. Those tests can deploy slider applications on secure/insecure clusters (I keep my VM configs on github, for the curious), deploy test containers for basic operations, upgrade test, failure handling tests. For failure generation our IPC protocol includes messages to kill a container, and to have the AM kill itself with a chosen exit code.
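The "meaningful exit codes" point deserves a sketch. The constants below are invented for illustration (Slider's actual codes differ): the idea is that scripted functional tests assert on the exit code of the launched process, distinguishing desired failure conditions from unexpected ones, rather than scraping stdout.

```python
import subprocess
import sys

# Hypothetical exit-code constants, in the spirit of a script launcher's
# contract with its functional tests. The values here are made up.
EXIT_SUCCESS = 0
EXIT_USAGE = 64                # bad command line
EXIT_INSTANCE_NOT_FOUND = 77   # the application instance does not exist

def run_expecting(command, expected_code):
    """Run a command and assert on its exit code, reporting stderr on mismatch."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != expected_code:
        raise AssertionError(
            "expected exit %d, got %d; stderr: %s"
            % (expected_code, result.returncode, result.stderr))
    return result

# Asking about a missing instance is a *desired* failure condition, e.g.:
#   run_expecting(["slider", "status", "no-such-cluster"], EXIT_INSTANCE_NOT_FOUND)
```

With a contract like this, a test that expects "instance not found" fails loudly if the script instead dies with a usage error or a stack trace.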

For testing slider-deployed HBase and accumulo we go one step further. Slider deploys the application, and we run the normal application functional test suites with slider set up to generate failures.

How do we do that? With the Slider Integral Chaos Monkey. That's something which can run in the AM, and, at a configured interval, roll some virtual dice to see if the enabled failure events should be triggered: currently container and AM (we make sure the AM isn't chosen in the container kill monkey action, and have a startup delay to let the test runs settle in before starting to react).
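A minimal sketch of that dice-rolling loop —the names and probability model here are assumptions for illustration, not Slider's actual implementation— looks something like this:

```python
import random
import time

# Hypothetical chaos-monkey loop: at a fixed interval, roll the dice for
# each enabled failure action. Bounded to `rolls` iterations so it can be
# driven (and tested) deterministically; the real thing would run for the
# life of the test, after a startup delay to let the run settle in.
def chaos_monkey(actions, interval, rolls, rng=None):
    """actions: (probability, trigger) pairs, e.g. (0.1, kill_random_container)."""
    rng = rng or random.Random()
    for _ in range(rolls):
        for probability, trigger in actions:
            if rng.random() < probability:
                trigger()   # e.g. kill a container, or the AM itself
        time.sleep(interval)
```

The per-action probabilities are what make it configurable: container kills can be frequent while AM kills stay rare, and an action is disabled simply by setting its probability to zero.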

Does it work? Yes. Which is good, because if things don't work, we've got the logs of all the machines in the cluster that ran slider to go through. Ideally, YARN-aggregated logs would suffice, but not if there's something up between YARN and the OS.

So: test runner I'm happy with. Remote deployment, failure injection, both structured and random. Launchable from my desktop and CI tooling; tests can be designed to scale. For testing rolling upgrades (Slider 0.80-incubating feature), we run the same client app while upgrading the system. Again: silence is golden.

Where I think much work needs to be done is what I've mentioned before: the reporting of problems and the tooling to determine why a test has failed.

We have the underlying infrastructure to stream logs out to things like HDFS or other services; there's nothing to stop us writing code to collect and aggregate those —with the recipient using the order of arrival to place an approximate time on events (not a perfect order, obviously, but better than log events with clocks that are wrong). We can collect those entire test run histories, along with as much environment information as we can grab and preserve. JUnit: system properties. My ideal: VM snapshots & virtual network configs.
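The arrival-order idea is simple enough to sketch (hypothetical code, not an existing tool): the collector stamps each incoming event with a monotonically increasing sequence number, giving an approximate global ordering even when the senders' clocks disagree.

```python
import itertools

# Hypothetical log collector: order events by arrival at the collector,
# not by the sender's (possibly wrong) timestamp.
class LogCollector:
    def __init__(self):
        self._seq = itertools.count()
        self.events = []

    def receive(self, host, line):
        # the sequence number is the approximate global ordering
        self.events.append((next(self._seq), host, line))
```

Events from different hosts interleave in the order they reached the collector, which is usually close enough to causality to make a merged trace readable —the thing those 3+ windows of raw logs never give you.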

Then we'd go beyond XSLT reports of test runs and move to modern big data analysis tools. I'm going to propose Spark here. Why? Because it covers local scripts, things in Jenkins & JUnit, and larger bulk operations. And for that get-your-hands-dirty test-debug festival, a notebook like Apache Zeppelin (incubating) can be the front end.

We should be using our analysis tools for the automated analysis and reporting of test runs, and the data science tooling for the debugging process.

Like I said, I'm not going to do all this. I will point to a lovely bit of code by Andrew Or @ databricks, spark-test-failures, which gets Jenkins's JSON-formatted test run history, determines flaky tests and posts the results to Google Docs. That's just a hint of what is possible —yet it shows the path forwards.
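The core of that kind of analysis fits in a few lines. Here's a sketch in the same spirit —the JSON shape is an assumption for illustration, not Jenkins's actual schema: a test counts as flaky if it has both passed and failed across the analysed builds.

```python
import json

# Hypothetical flaky-test detector over per-build JSON reports, each of the
# assumed form {"cases": [{"name": ..., "status": "PASSED"|"FAILED"}]}.
def flaky_tests(build_reports):
    outcomes = {}
    for report in build_reports:
        for case in json.loads(report)["cases"]:
            outcomes.setdefault(case["name"], set()).add(case["status"])
    # flaky: seen both passing and failing across the builds
    return sorted(name for name, statuses in outcomes.items()
                  if {"PASSED", "FAILED"} <= statuses)
```

That's the "reports by test, not by build" view from earlier: aggregate across builds first, then ask questions about each test's history.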

(Photo: "crayola", St Pauls:work commissioned by the residents)