Well, what about Groovy then?

Following on from my comments on Scala: what about Apache Groovy?

[Photo: Stokes Croft Graffiti, Sept 2015]

It's been a great language for writing tests in, especially when you turn on static compilation, but there are some gotchas.
  1. It's easy to learn if you know Java, so the cost of adding it to a project is low.
  2. It's got that automatic log.info("timer is at $timer") string expansion. But the rules of when strings get expanded are 'complex'.
  3. Lists and Maps are in the language. This makes building up structures for testing trivially easy.
  4. The Maven groovy task is a nice compiler for both Java and Groovy.
  5. Groovydoc is at least 10x faster than javadoc. It makes you realise that javadoc hasn't had any engineering care for decades; right up there with rmic.
  6. Its closeness to Java makes it easy to learn, and you can pretty much paste Java code into groovy and have it work.
  7. The groovy assert statement is best in class. If it fails, it deconstructs the expression, listing the toString() value of every parameter. Provided you supply meaningful string values, debugging a failed expression is far easier than with anything else.
  8. Responsive dev team. One morning some always-updating Maven artifact was broken, so a stack trace followed the transition of countries into the next day. We in Europe hit it early and filed bugs; it was fixed before California noticed. Try that with javac-related bugs and you probably won't have got as far as working out how to set up an Oracle account for bug reporting before the sun sets and the video-conf hours begin.
As I said, I think it's great for testing. Here's some test code of mine:

@CompileStatic     // we want this compiled in advance to find problems early
@Slf4j             // and inject a 'log' field for logging

class TestCommonArgParsing implements SliderActions, Arguments {

  public void testList1Clustername() throws Throwable {
    // note use of [ list ]
    ClientArgs ca = createClientArgs([ACTION_LIST, 'cluster1'])
    assert ca.clusterName == 'cluster1'
    assert ca.coreAction instanceof ActionListArgs
  }
}

What don't I like?
  1. Different scoping rules: everything defaults to public, which you have to remember when switching languages.
  2. Line continuation rules need to be tracked too: you need the end of the line to be unfinished (e.g. trail the '+' sign in a string concatenation) for the statement to continue. Scala makes a better job of this.
  3. Sometimes IDEA doesn't pick up use of methods and fields in your groovy source, so the refactorings may omit bits of your code.
  4. Sometimes that @CompileStatic tag results in compilation errors, normally fixed by commenting out the annotation. Implication: the static and dynamic compilers are 'different'.
  5. Using == for .equals() is, again, a danger.
  6. The notion of truthiness in comparisons is very C/C++-ish: you need to know the rules. Null values are false, as are zero integers and empty strings. Safe strategy: just be explicit.
  7. It's not so much type inference as type erasure, with runtime stack traces as the consequence.
  8. The logic about when strings are expanded can sometimes be confusing.
  9. You can't paste from Groovy to Java without major engineering work. At least you can do it at all —more than can be said for pasting from Scala to Java— but it makes converting test code to production code harder.
Here's an example of something I like. This is possible in Scala, and Java 8 will make it viable-ish too. It's a test which issues a REST call to get the current Slider app state, pushes out a new value, then spins awaiting a remote state change, passing in a method as the probe parameter.

public void testFlexOperation() {
    // get current state
    def current = appAPI.getDesiredResources()

    // create a guaranteed unique field
    def uuid = UUID.randomUUID()
    def field = "yarn.test.flex.uuid"
    current.set(field, uuid)

    // spin until the change is visible remotely, passing in the
    // probe method and its arguments
    repeatUntilSuccess("probe for resource PUT",
        this.&probeForResolveConfValues,
        5000, 200,
        ["key": field, "val": uuid],
        "Flex resources failed to propagate") {
      def resolved = appAPI.getResolvedResources()
      fail("Did not find field $field=$uuid in\n$resolved")
    }
}

Outcome probeForResolveConfValues(Map args) {
    String key = args["key"]
    String val = args["val"]
    def resolved = appAPI.getResolvedResources()
    return Outcome.fromBool(resolved.get(key) == val)
}

That Outcome class has three values: Success, Retry and Fail. The probe returns an Outcome, and the executor, repeatUntilSuccess, executes the probe closure until the duration of retries exceeds the timeout, the probe succeeds, or a Fail response triggers a fast failure. It lets my tests iterate until success, but if the probe detects an unrecoverable failure, they bail out fast. It's effective at avoiding long sleep() calls, which introduce needless delays on fast systems, and which are incredibly brittle on slow ones.

If you look at a lot of Hadoop test failures, they're of the form "test timeout on Jenkins", with a fix of "increase sleep times". Closure-based probes, with probes that detect unrecoverable failures (e.g. network unreachable is a hard failure, different from a 404, which may go away on retries), are a much better strategy. Anyway: closures are great in tests. Not just for probes, but for functions to call once a system is in a desired state (e.g. web server launched).
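The Outcome/executor pattern described above ports readily to Java 8. Here is a minimal sketch (the names and signatures are mine, not Slider's): the probe is a lambda, and the executor retries until success, timeout, or a fast failure.

```java
import java.util.function.Supplier;

public class ProbeRetry {

  /** Probe result: succeeded, retry later, or fail fast. */
  public enum Outcome { SUCCESS, RETRY, FAIL }

  /**
   * Run the probe until it succeeds, fails fast, or the timeout expires.
   * Returns true on success, false on timeout; throws on a FAIL outcome.
   */
  public static boolean repeatUntilSuccess(Supplier<Outcome> probe,
      long timeoutMillis, long intervalMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      Outcome outcome = probe.get();
      if (outcome == Outcome.SUCCESS) {
        return true;
      }
      if (outcome == Outcome.FAIL) {
        // unrecoverable: bail out now rather than sleep to the timeout
        throw new IllegalStateException("probe reported an unrecoverable failure");
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false; // timed out while the probe kept saying RETRY
  }

  public static void main(String[] args) {
    // a probe that succeeds on its third invocation
    int[] calls = {0};
    boolean ok = repeatUntilSuccess(
        () -> ++calls[0] < 3 ? Outcome.RETRY : Outcome.SUCCESS, 5000, 10);
    System.out.println("succeeded after " + calls[0] + " probes: " + ok);
  }
}
```

The key design point is the three-valued result: a boolean probe can only say "not yet", forcing the executor to sleep all the way to the timeout even when the failure is plainly unrecoverable.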

We use Groovy in Slider for testing; the extra productivity and those assert statements are great. So does Bigtop —indeed, we use some Bigtop code as the basis for our functional tests. But would I advocate it elsewhere? I don't know. It's close enough to Java to co-exist, whereas I wouldn't advocate using Scala, Clojure or similar purely for testing Java code *within the same codebase*.

There's also Java 8 to consider. It is a richer language and gives us those closures, just not the maps or the lists. Or the assertions. Or the inline string expansion. But it's very close to the production code, even if that production code is Java-7 only, and you can make a very strong case for learning Java 8. Accordingly:

  1. I'm happy to continue with Groovy in Slider tests.
  2. For an app where the production code was all Java, I wouldn't advocate adopting Groovy now: Java 8 offers enough benefits to be the way forward.
  3. If anyone wants to add java-8 closure-based tests in Hadoop trunk, I'd be supportive there too.


Prime numbers —you know they make sense

The UK government has just published its new internet monitoring bill.

[Photo: Sepr's Mural and the police]

Apparently one of the great concessions is that they aren't going to outlaw any encryption they can't automatically decrypt. This is not a concession: someone has sat down with them and explained how RSA and elliptic-curve crypto work, and said "unless you ban prime numbers > 256 bits, banning encryption is meaningless". What makes the government think they have any control over what at-rest encryption code gets built into phone operating systems, or into mobile apps written in California? They never had a chance of making this work; all they'd do is be laughed at from abroad while crippling UK developers. As an example, if releasing software with strong encryption were illegal, I'd be unable to make releases of Hadoop —simply due to HDFS encryption.
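The point about primes is easy to demonstrate: a probable prime comfortably over any 256-bit threshold, the raw material of an RSA key, is a one-liner in the stock JDK, no exotic crypto library required. A sketch:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class BigPrime {

  /** A probable prime of the requested bit length. */
  public static BigInteger bigPrime(int bits) {
    return BigInteger.probablePrime(bits, new SecureRandom());
  }

  public static void main(String[] args) {
    // comfortably over a hypothetical 256-bit "ban" threshold
    BigInteger p = bigPrime(257);
    System.out.println(p.bitLength() + "-bit prime: " + p);
  }
}
```

Any ban on "strong" encryption has to somehow outlaw code like this, which ships in every JDK on the planet.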

You may as well assume that nation states already have the ability to read encrypted messages (somehow), and deal with that by not becoming an individual threat to a nation state. The same goes for logging of HTTP(S) requests: if someone really wanted to, they could. Except that until now the technical abilities of various countries had to be kept secret, because once some fact about breaking RC4 or DH forward secrecy becomes known, software changes.

What is this communications bill trying to do, then? Not so much legalise what's possible today as give local police forces access to similar datasets for everyday use. That's probably an unexpected side-effect of the Snowden revelations: police forces round the country saying "ooh, we'd like to know that information", and this time demanding access to it in a way that they can be open about in court.

As it stands, it's mostly moot. Google knows where you were and what your search terms were, and has all your emails, not just the metadata. My TV vendor claims the right to log what I watched on TV and ship it abroad, with no respect for safe-harbour legislation. As for our new car, it's got a modem built in, and if it wanted to report home not just where we've been driving but whether the suspension has been stressed, ABS and stability control engaged, or even what radio channel we were on, I would have no idea whatsoever. The fact that you have to start worrying about the INFOSEC policies of your car manufacturer shows that knowing which web sites you viewed is becoming less and less relevant.

Even so, the plan to record 12 months of every IP address's HTTP(S) requests (and presumably other TCP connections) is a big step change, and not one I'm happy about. It's not that I have any specific need to hide from the state —and if I did, I'd try tunnelling through DNS, using Tor, VPNing abroad, or some other mechanism. VPNs are how you uprate any mobile application to be private, and as they work over mobile networks, they deliver privacy on the move. I'm sure I could ask a US friend to set up a Raspberry Pi with PPTP in their home, just as we do for BBC iPlayer access abroad. And therein lies a flaw: if you really want to identify significant threats to the state and its citizens, you don't go on about how you are logging all TCP connections, as that just motivates people to use VPNs. So we drive the most dangerous people underground, while creating a dataset of more interest to the MPAA than anyone else.

[Photo: police out on Stokes Croft or Royal Wedding Day after the "Second Tesco Riot"]


Concept: Maintenance Debt

We need a new software development term: Maintenance Debt.

Searching for it shows the term only crops up in the phrase "Child Maintenance Debt" —we need it for software.

[Photo: OR Mini Loop Tour]

Maintenance Debt: the Technical Debt a project takes on when they fork an OSS project.

All OSS code offers you the freedom to make a fork of the code: the right to make your own decisions as to the direction of the code. Git makes that branching a one-line action: git checkout -b myfork.

Once you fork, you have taken on the Maintenance Debt of that fork. You now have to:
  1. Build that code against the dependencies of your choice.
  2. Regression test that release.
  3. Decide which contributions to the main branch are worth cherry-picking into your fork.
  4. If you do want to cherry-pick, adapt those patches to your fork. The more you change your fork, the greater the cost of adaptation.
  5. Features you add to your branch which you don't contribute back become yours to maintain forever.
  6. If the OSS code adds a similar feature to yours, you are faced with the choice of adapt or ignore. Ignore it and your branch is likely to diverge fast and cherry-picking off the main branch becomes near-impossible.
That's what Maintenance Debt is, then: the extra work you add when deciding to fork a project. Now, how to keep that cost down?
  1. Don't fork.
  2. Try to get others to add the features you want into the OSS releases.
  3. As people add the features/fixes you desire, provide feedback on that process.
  4. Contribute tests to the project to act as regression tests on the behaviours you need. This is a good one: with the tests inside a project, its patch and Jenkins builds will catch problems early, rather than you finding them later.
  5. If you do write code —fixes and features— work to get them in. It's not free: all of us have to put in time and effort to get patches in, but you gain in the long term.
  6. Set up your Jenkins/CI system so that you can do builds against nightly releases of the OSS projects you depend on (Hadoop publishes snapshots of branch 2 and trunk for this). Then complain when things break.
  7. Test beta releases of your dependencies, and the release candidates, and complain when things break. If you wait until the x.0 releases, the time to get a fix out is a lot longer —or worse, someone can declare that a feature is now "fixed" and cannot be reverted.
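Contributing a regression test upstream (point 4 above) can be tiny. A hypothetical sketch, where both the utility and the behaviour it pins down are invented for illustration:

```java
public class PathNormalizationRegressionTest {

  // hypothetical upstream utility whose behaviour our fork depends on:
  // trailing slashes are stripped from non-root paths
  static String normalize(String path) {
    return path.length() > 1 && path.endsWith("/")
        ? path.substring(0, path.length() - 1)
        : path;
  }

  // the regression guard: once this test lives upstream, their patch
  // and Jenkins builds protect the behaviour for us, for free
  public static void main(String[] args) {
    if (!normalize("/a/b/").equals("/a/b")) {
      throw new AssertionError("trailing slash no longer stripped");
    }
    if (!normalize("/").equals("/")) {
      throw new AssertionError("root path mangled");
    }
    System.out.println("regression guard holds");
  }
}
```

A dozen lines upstream can save you a surprise three releases later; the project's own CI does the watching.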

If you look at that list, testing crops up a lot. That's because compile-and-test runs are how you find regressions. Even if you offload the maintenance debt to others, validating that their work meets your needs is a key thing to automate.

Get your regression testing in early.

[photo: looking west to the Coastal Range, N/W of Rickreall, Willamette Valley, OR, during a 4 day bike tour w/ friends and family]



[Photo: Stokes Croft Graffiti, Sept 2015]

I've been a busy bunny writing what has grown into a fairly large Spark patch: SPARK-1537, integration with the YARN timeline server. What starts as straightforward —POST an event, GET the event list, GET an event— grows once you start taking into account Kerberos, transient failures of the endpoints, handling unparseable events (fail? or skip that one?), and compatibility across versions. Oh, and testing all of this: I've got tests which spin up the YARN ATS and the Spark History server in the same VM, then either generate an event sequence and verify it all works, or even replay some real application runs.

And in the process I have learned a lot of Scala and some bits of Spark.

What do I like?
  • Type inference. And not the pretend inference of Java 5 or groovy
  • The match/case mechanism. This maps nicely to the SML case mechanism, with the bonus of being able to add conditions as filters (a la Erlang).
  • Traits. They took me a while to understand, until I realised that they were just C++ mixins with a structured inheritance/delegation model. Once so enlightened, using them became trivial. For example, in some of my test suites, the traits you mix in define which services are brought up for the test cases.
  • Lists and maps as primary language structures. Too much source is frittered away in Java creating those data structures.
  • Tuples. Again, why exclude them from a language?
  • Getting back to functional programming. I've done it before, see.

What am I less happy about?
  • The Scala collections model. Too much, too complex.
  • The fact that they aren't directly compatible with Java lists and maps. Contrast with Groovy.
  • Scalatest. More the runner than the tests: the ability to use arbitrary strings to name a test case means that I can't run (at least via Maven) a specific test case within a class/suite by name. Instead I've been reduced to commenting out the other tests, which is fundamentally wrong.
  • I think it's gone overboard on various features... it has, how do I say it, the C++ feel.
  • The ability to construct operators using all the symbols on the keyboard may lead to code less verbose than Java, but while you are learning the specific classes in question, it's pretty impenetrable. Again, I feel C++ at work.
  • Having to look at some SBT builds. Never use "Simple" in a spec, it's as short-term as "Lightweight" or "New". I think I'll use "Complicated" in the next thing I build, to save time later.
Now, what's it like going back to doing some Java code? What do I miss?
  • Type inference. Even though its type inference is of a kind Milner wouldn't approve of, it's better than not having any.
  • Semicolons being mostly optional. It's so easy to keep writing broken code.
  • val vs var. I know, Java has "final", but it's so verbose we all ignore it.
  • Variable expansion in strings. Especially debugging ones.
  • Closures. I look forward to the leap to Java 8 coding there.
  • Having to declare exceptions. Note that in Hadoop we tend to say "throws IOException", which is a slightly less blatant way of saying everything "throws Exception". We should consider Java's explicit exception naming an idea not to repeat, on the grounds that it makes maintenance a nightmare and precludes different implementations of an interface from having (explicitly) different failure modes.
You switch back to Java and the code is verbose —and a lot of that is due to having to declare types everywhere, then build up lists and maps one by one, iterating over them equally laboriously. Again, Java 8 will narrow some of that gap.
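As an illustration of the gap Java 8 starts to close, here is the same word-length map built the Java 7 way and as a single Java 8 expression (a contrived example of mine, not from any of the projects above):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Verbosity {

  // Java 7 style: declare, construct, iterate, put, one statement at a time
  public static Map<String, Integer> lengthsJava7(List<String> words) {
    Map<String, Integer> lengths = new HashMap<String, Integer>();
    for (String word : words) {
      lengths.put(word, word.length());
    }
    return lengths;
  }

  // Java 8 style: one expression, with types inferred inside the pipeline
  public static Map<String, Integer> lengthsJava8(List<String> words) {
    return words.stream().collect(Collectors.toMap(w -> w, String::length));
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("groovy", "scala", "java");
    System.out.println(lengthsJava8(words).equals(lengthsJava7(words)));
  }
}
```

Still short of Groovy's inline [k: v] literals, but no longer an order of magnitude away.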

When I go back to java, what don't I miss?
  • A compiler that crawls. I don't know why it is so slow, but it is. I think the sheer complexity of the language is a likely cause.
  • Chained over-terseness. Yes, I can do a t.map.fold.apply chain in Spark, but when you see a stack trace, having one action per line makes it possible to work out what went wrong. It's why I code that way in Java, too. That said, I find myself writing more chained operations, even at the cost of stack-trace debuggability. Terseness is corrupting.
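A sketch of that trade-off in Java 8 streams (example names are mine): the chained and stepwise forms compute the same total, but in the stepwise one each stage has its own source line, so a stack trace from a bad input points at the stage that threw.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ChainStyle {

  // chained: terse, but every stage shares one source line, so a stack
  // trace can't tell you which stage blew up
  public static int totalChained(List<String> values) {
    return values.stream().map(String::trim).mapToInt(Integer::parseInt).sum();
  }

  // one operation per line: the same pipeline, but each stage sits on its
  // own line for the stack trace to point at
  public static int totalStepwise(List<String> values) {
    Stream<String> trimmed = values.stream()
        .map(String::trim);
    IntStream parsed = trimmed
        .mapToInt(Integer::parseInt);
    return parsed.sum();
  }

  public static void main(String[] args) {
    List<String> values = Arrays.asList(" 1", "2 ", " 3 ");
    System.out.println(totalChained(values) == totalStepwise(values));
  }
}
```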
One interesting thing is that even though I've done big personal projects in Standard ML, we didn't do team projects in it. While I may be able to sit down and talk about correctness proofs of algorithms in functional programming, I can't discuss maintenance of someone else's code they wrote in a functional language, or how to write something for others to maintain.

Am I going to embrace Scala as the one-and-true programming language? No. I don't trust it enough yet, and I'd need broader use of the language to be confident I was writing things that were the right architecture.

What about Scala as a data-engineering language? One stricter than Python, but nimble enough to use in notebooks like Zeppelin?

I think from a pure data-science perspective, I'd say "work at the Python level". Python is the new SQL: something which, once learned, can be used broadly. Everyone should know basic Python. But for that engineering code, where you are hooking things up, mixing in existing Java libraries, Hadoop API calls and things like Spark's RDDs and Dataframes, Scala works pretty well.


Time on multi-core, multi-socket servers

[Photo: Stokes Croft Graffiti, Sept 2015]

In Distributed Computing the notion of "when-ness" is fundamental; Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" paper is considered one of the foundational pieces of work.

But what about locally?

In the Java APIs, we have System.currentTimeMillis() and System.nanoTime() to return time.

We experienced developers "know" that currentTimeMillis() is the "wall clock", so that if things happen to that clock —manual/NTP clock shifts, VM migration— the time can suddenly jump to a new value. And for that reason, nanoTime() is the one we should really be using to measure time monotonically.

Except I no longer trust it. I've known for a long time that CPU frequency could change its rate, but as of this week I've discovered that on a multi-socket (and older multi-core) system, the nanoTime() value may be one or more of:

  1. Inconsistent across cores, hence non-monotonic on reads, especially reads likely to trigger thread suspend/resume (anything with sleep(), wait(), IO, accessing synchronized data under load).
  2. Not actually monotonic.
  3. Achieving consistency by querying heavyweight counters, with possibly longer function execution time and lower granularity than the wall clock.
That is: modern NUMA, multi-socket servers are essentially multiple computers wired together, and we have a term for that: a distributed system.

The standard way to read nanoTime() on an x86 part is to read the TSC counter, via the RDTSC opcode. Lightweight, though actually a synchronization barrier opcode.

Except every core in a server may be running at a different speed, and so have a different value for that counter. When code runs across cores, different numbers can come back.

In Intel's Nehalem chipset, the TSC is shared across all cores on the same die and clocked at a rate independent of the CPU: monotonic and consistent across the entire socket. Threads running on any core in the same die will get the same number from RDTSC —something that System.nanoTime() may use.

Fill in that second socket on your server, and you have lost that consistency, even if the parts and their TSC counters are running forwards at exactly the same rate. Any code you had which relied on TSC consistency is now going to break.

This all ignores virtualization: the RDTSC opcode may or may not be virtualized. If it is: you are on your own.

Operating systems are aware of this problem, so they may use alternative mechanisms to return a counter, mechanisms which may be neither monotonic nor fast.

Here then, is some reading on the topic
The conclusion I've reached is that, except for the special case of using nanoTime() in micro-benchmarks, you may as well stick to currentTimeMillis(), knowing that it may sporadically jump forwards or backwards. Because if you switch to nanoTime(), you don't get any monotonicity guarantees, it doesn't relate to human time any more, and it may be more likely to lead you into writing code which assumes a fast call with consistent, monotonic results.
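In code, the defensive habit is to treat every elapsed-time calculation as suspect. A minimal sketch, assuming only that the clock can jump backwards:

```java
public class Elapsed {

  /**
   * Elapsed milliseconds between two System.currentTimeMillis() readings,
   * clamped at zero: the wall clock may jump backwards (NTP correction,
   * VM migration), and on some hardware even nanoTime() is not monotonic
   * across cores, so a negative difference is a real possibility.
   */
  public static long elapsedMillis(long startMillis, long endMillis) {
    return Math.max(0L, endMillis - startMillis);
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    long end = System.currentTimeMillis();
    System.out.println("elapsed: " + elapsedMillis(start, end) + " ms");
    // a backwards clock jump must not produce a negative duration
    System.out.println(elapsedMillis(1000L, 900L));
  }
}
```

Clamping does not make the measurement correct, only harmless: code that divides by an elapsed time, or sleeps for a computed remainder, at least won't go negative.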


iOS 8.4, the Windows Vista of the iPad

I like to listen to music while coding, with my normal strategy being "on a Monday, pick an artist, leave them playing all week". It's a low-effort policy. Except iOS 8.4 has ruined my workflow.

[Photo: iOS 8.4 music player breaking a playlist by building a sequence of the single file in the list]

Now, while I think Ian Curtis's version of Sister Ray is possibly better than the Velvet Underground's, it doesn't mean I want to listen to it on endless repeat, yet this is precisely what the player appears to want to do. Both when I start the playlist, and sometimes even when it's been happily mixing the playlist. Sometimes it just gets into a state where the next (shuffled) track is the same as the current track, forever. And before anyone says "hey, you just hit the repeat-per-track option", here's the fun bit: it switched from repeat-playlist to repeat-track all on its own. That implies a state change, either in some local variable (how?) or in state the app persists and reloads, with the persist/reload changing the value. As a developer, I suspect the latter: it's easier to get a save/restore of an enum wrong.
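To illustrate how a persisted enum can come back wrong (this is pure speculation about Apple's code; the enums below are hypothetical), consider saving the mode by ordinal and then reordering the constants in a later build:

```java
public class RepeatModeBug {

  // the enum as (hypothetically) first shipped
  enum RepeatV1 { OFF, REPEAT_PLAYLIST, REPEAT_TRACK }

  // the same enum after a (hypothetical) refactoring swapped two constants
  enum RepeatV2 { OFF, REPEAT_TRACK, REPEAT_PLAYLIST }

  /** Restore a mode that was persisted by ordinal, against the new enum. */
  public static RepeatV2 restore(int persistedOrdinal) {
    return RepeatV2.values()[persistedOrdinal];
  }

  public static void main(String[] args) {
    // the user picked "repeat playlist", persisted as ordinal 1
    int saved = RepeatV1.REPEAT_PLAYLIST.ordinal();
    // after the update, ordinal 1 silently means "repeat track"
    RepeatV2 restored = restore(saved);
    System.out.println("restored mode: " + restored);
  }
}
```

This is why persisting enums by name rather than by ordinal is the standard advice: names survive reordering; ordinals don't.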

The new UI doesn't help. Apparently using Siri helps, as you can just say "shuffle" and let it do the rest. I couldn't get that far, because every time I asked it to play my music, it warned me that this would break the Up Next list. That's the list that's decided my entire playlist consists of one MP3: Sister Ray, covered by Joy Division.

If one thing is clear: not only is the UI of iOS 8.4 music a retrograde step, it was shipped without enough testing, or with major known bugs. I don't know which is worse.

Siri (and Cortana, OK Google and Alexa) are all showing how speech recognition has improved over time, and we are starting to see more AI-hard applications running on mobile devices (Google Translate on Android is another classic) —but it's moot if the applications that the speech recognition systems are hooked up to are broken.

Which is poignant, isn't it:
  • Cutting edge speech recognition software combining mobile devices & remote server-side processing: working
  • Music player application of a kind which even Apple have been shipping for over a decade, and which AMP and Napster had nailed down earlier: broken.
The usual tactic, rebooting? All the playlists are gone, even after a couple of attempts at resyncing with iTunes.

That's it then: I cannot use the Apple music app on the iPad to listen to my music. Which, given that a key strategic justification for the 8.4 release is the Apple Music service, has to be a complete disaster.

This reminds me so much of the Windows Vista experience: an upgrade that was a downgrade. I had Vista on a laptop for a week before sticking Linux on it. I don't have that option here, only the promise that iOS 9 will make things "better".

I would go back to the iPod nano, except I can't find the cable for it, so I have switched to Google Play and streaming my in-cloud music backup. Which, from an Apple strategic perspective, can't rate very highly —especially if I'm not the only person abandoning the music player for alternatives that actually work.


Book Review, Hadoop Security, and distributed security in general

I've been reading the new O'Reilly book, Hadoop Security, by Ben Spivey and Joey Echeverria. There aren't many reviews up there, so I'll put mine up.

  • reasonable intro to Kerberos on Hadoop clusters
  • covers the identity -> cluster user mapping problem
  • ACLs in HDFS, YARN &c covered nicely —explanation and configuration
  • Shows you pretty much how to configure every Hadoop service for authenticated and authorized access, audit loggings and data & transport encryption.
  • has Sentry coverage, if that matters to you
  • Has some good "putting it all together" articles
  • Index seems OK.
  • Avoids delving into the depths of implementation (strength and weakness)

Overall: good from an ops perspective, for anyone coding in/against Hadoop, background material you should understand —worth buying.

Securing Distributed Systems

I'd bought a copy of the ebook while it was still a work in progress, so I got to see the original Chapter 2, "securing distributed systems: chapter to come". I actually think they should have left that page as it was, on the basis that Distributed System Security is a Work in Progress. And while it's easy for all of us to say "defence in depth", none of us really practices that properly, even at home. Where is the two-tier network, with the fundamentally untrustable IoT layer —TVs, light bulbs, telephones, bittorrent servers— on a separate subnet from the critical household infrastructure: the desktops, laptops and home servers? How many of us keep our ASF, SSH and GitHub credentials on an encrypted USB stick which must be unlocked for use? None of us. Bear that in mind whenever someone talks about security infrastructure: ask them how they lock down their house. (*)

Kerberos is the bit I worry about day to day, so how does it stack up?

I do think it covers the core concepts-as-a-user well, and has a workflow diagram which presents time quite nicely. It avoids going into the details of the protocol, which, as anyone who has ever read Coulouris & Dollimore will note, is mind-numbingly complex and hits the mathematics layer pretty hard. A good project for learning TLA+ would probably be "specify Kerberos".

ACLs are covered nicely too, while encryption covers HDFS, Linux FS and wire encryption, including the shuffle.

There's coverage of much of the Hadoop stack: core Hadoop, HBase, Accumulo, Zookeeper, Oozie and more. There are some specifics on Cloudera bits —Impala, Sentry— but not exclusively, and all the example configs are text files, not management-tool centric: they'll work everywhere.

Overall then: a pretty thorough book on Hadoop security. For a general overview of security, Kerberos, ACLs and configuring Hadoop, it brings everything together in one place.

If you are trying to secure a Hadoop cluster, invest in a copy.


Now, where is it limited?

1. A lot of the book is configuration examples for N+ services and their audit logs. It's a bit repetitive, and I don't think anybody would sit down and type those things in. Still, there are that many config files in the Hadoop space, and at least configuring all the many services is covered. It just hampers the readability of the book.

2. I'd have liked to have seen the HDFS encryption mechanism illustrated, especially KMS integration. It's not something I've sat down to understand, and the same UML sequence diagram style used for Kerberos would have gone down well.

3. It glosses over precisely how hard it is to get Kerberos working, how your life will be frittered away staring at error messages which make no sense whatsoever, only for you to discover later they mean "java was auto updated and the new version can't do long-key crypto any more". There's nothing serious in this book about debugging a Hadoop/Kerberos integration which isn't working.

4. Its bit on coding against Kerberos is limited to a couple of code snippets around UGI login and doAs. Given how much pain it takes to get Kerberos to work client-side —ticket renewal, delegation token creation, delegation token renewal, debugging, etc.— one and a half pages isn't even a start.

Someone needs to document Hadoop & Kerberos for developers —this book isn't it.

I assume that's a conscious decision by the authors, for a number of valid reasons:
  • It would significantly complicate the book.
  • It's a niche product, being for developers within the Hadoop codebase.
  • It'd make maintenance of the book significantly harder.
  • To write it, you need to have experienced the pain of adding a new Hadoop IPC, writing client tests against in-VM Zookeeper clusters locked down with MiniKDC instances, or trying to debug why Jersey+SPNEGO was failing 18 hours into test runs.
The good news is that I have experienced the suffering of getting code to work on a secure Hadoop cluster, and want to spread that suffering more broadly.

For that reason, I would like to announce the work in progress, gitbook-toolchained ebook:

Kerberos and Hadoop: The Madness beyond the Gate

This is an attempt to write down things I've learned, using a Lovecraftian context to make clear that this is forbidden knowledge that will drive the reader insane (**). Which is true. Unfortunately, if you are trying to write code to work in a Hadoop cluster —especially YARN applications, or anything acting as a service for callers, be they REST or IPC— you need to know this stuff.

It's less relevant for anyone else, though the Error Messages to Fear section is one of the things I felt the Hadoop Security book would have benefited from.

As noted, the Madness Beyond the Gate book is a WiP and there's no schedule to extend or complete it —just something written during test runs. I may finish it; I may get bored and distracted. But I welcome contributions from others, together we can have something which will be useful for those people coding in Hadoop —especially those who don't have the luxury of knowing who added Kerberos support to Hadoop, or has some security experts at the end of an email connection to help debug SPNEGO pain.

I've also put my name down for a talk on the same topic at ApacheCon EU Data —let's see if it gets in.

(*) Flash is removed except on Chrome browsers, which I've had to go round and update this week. The two-tier network is coming once I set up a Raspberry Pi as the bridge, though with Ethernet-over-power as the core backbone, life is tricky. And with PCs in the "trust zone", I'm still vulnerable to 0-days and the hazards imposed by other household users and by my use of apt-get, homebrew, and Maven & Ivy in builds. I should really move to developing in VMs I destroy at the end of each week.

(**) plus it'd make for fantastic cover page art in an ORA book.