Build tools as Proof Engines


Someone has put up a thoughtful post on whether you can view make as a proof system. I thought about it and have come up with a conclusion: maybe in the past —but not any more.

As background, it's worth remembering
  1. I did "write the book on Ant", which is a build system in which you declare the transformational functions required to get your application into a state which you consider shippable/deployable, a set of functions which you define in a DAG for the Ant execution engine to generate and order from and then apply to your environment
  2. Early last year, in my spare time, I formally specified the Hadoop FS APIs in Spivey's Z notation, albeit in a Python syntax.
  3. In YARN-913 I went on to specify the desired behaviour of the YARN registry as a TLA+ specification.
And while I don't discuss it much, during my undergraduate work on Formal Specification of and implementation Microprocessors, I was using Gordon's HOL theorem prover. The latter based on Standard ML, just to fend off anyone who believed me when I was claiming not to understand functional programming earlier this week. I didn't meant it, it just entertained the audience. Oh, and did I mention I'm happy to write Prolog too?

This means that (a) I believe I understand about specifications, (b) I vaguely remember what a proof engine is, and (b) I understand how Prolog resolves things, and specifically, why "!" is the most useful operator when you are trying to write applications.

Now I note that the author of that first post, Bob Atkey, does not only has a background of formal logic, SML and possibly even worked with the same people as me, his knowledge is both greater and more up to date than mine. I just have more experience of breaking builds and getting emails from jenkins telling me this.

Now, consider a software project's build
  1. A set of source artifacts, S, containing artifacts s1..sn
  2. A declaration of the build process, B
  3. A set of external dependencies, libraries, L.
  4. A build environment, E., comprising the tools needed for the build, and the computer environment around them, including the filesystem, OS, etc.
The goal is to reach a desired state of a set of redistributables, R, such that you are prepared to ship or deploy them.

The purpose of the build system, then, is to generate R from S through a series of functions applied to (S, L) with tools T within the environment E. The build process description, B, defines or declares that process.

There's many ways to do that; a single line bash file cc *.c && cc *.o could be enough to compile the lexical analyser example from Ch02 of the dragon book.

Builds are more complex than that, which is where tools like make come in.

Make essentially declares that final list of redistributables, R, and a set of transformations from inputs artifacts to output artifacts, including the functions (actions by the tools) to generate the output artifacts from the input artifacts.

The make runtime looks at what artifacts exist, what are missing, and what are out of date somehow builds a chain of operations that are hypothesised to produce a complete set of output artifact whose timestamp in the filesystem is later than that of the source files.

It is interesting to look at it formally, with a rule saying that to derive .o from .c, a function "cc -c" is applied to the source. Make looks at the filesystem for the existence of that .o file, its absence or "out-of-dateness" and, if needed applies the function. If multiple rules are used to derive a sequence of transformations then make will build that chain then execute them.

One interesting question is "how does make build that series of functions, f1..fn, such that:

R = fn(fn-1(fn-2(fn-3..f1(S, L)

I believe it backchains from the redistributes  to build a series of rules which can be applied, then runs those rules forwards to create the final output.

If you view the final redistributables as a set of predicates whose existence is taken as truth, absence as false, and all rules are implies operators the define a path to truth, not falsehood (i.e. we are reasoning over Horn Clauses, then yes, I think you could say "make is a backward chaining system to build a transformation resulting in "truth".

The problem is this: I'm not trying to build something for the sake of building it. I'm trying to build something to ship. I'm running builds and tests repeatedly, from my laptop on my desk, from my laptop in a conference auditorium, a hotel room, and a train doing 190 mph between Brussels and London. A that's all just this week.

Most of the time, those tests have been failing. There are three possible (and non-disjoint) causes of this
  • (a) the null hypothesis: I can't write code that works.
  • (b) a secondary hypothesis: I am better at writing tests to demonstrate the production code is broken than I am at fixing the production code itself.
  • (c) as the external environment changes, so does the outcome of the build process.

Let's pretend that (a) and (b) are false; that I can actually write code that works first time, with the tests intended to show that this condition is not met being well written and correct themselves. Even if such a case held, my build would have been broken for a significant fraction of the time it was this week.

Here's some real examples of fun problems, for "type 3 fun" on the Ordnance Survey Fun Scale.
  1. Build halting as the particular sequence of operations it had chosen depended on maven artifacts which were not only absent, but non-retrievable from a train somewhere under the English Channel.
  2. Connection Reset exceptions talking to an in-VM webapp from within a test case. A test case that worked last week. I never did find the cause of this. Though I eventually concluded that it last worked before I installed a critical OS.X patch (last weeks, not this week's pending one). The obvious action was "reboot the mac' —and lo, it did fix it. I just spent a lot of time on hypotheses (a) and (b) before settling on cause (c)
  3. Tests hanging in the keynote sessions because while my laptop had an internet connection, GET requests against java.sun.com were failing. It turns out that when Jersey starts some servlets up, it tries to do a GET for some XSD file in order to validate some web.xml XML document. If the system is offline, it skips that. But if DNS resolves java.sun.com, then it does a GET and blocks until the document is returned or the GET fails. As as the internet in keynote was a bit overloaded, tests just hung. Fix: edit /etc/hosts to put java.sun.com == or turn off the wifi.
  4. A build at my hotel failing as the run crossed the midnight marker and maven decided to pull down some -SNAPSHOT binaries from mvn central, which were not the binaries I'd just built locally, during the ongoing build and test run.
What do all these have in common? Differences in the environment of the build, primarily networking, except case (4), which was due to the build taking place at a different time from previous builds.

Which brings me to a salient point
The environment of a build includes not only the source files and build rules, it includes the rest of the Internet and the connections to it. furthermore, as the rule engine uses not just the presence/absence of intermediate and final artifacts as triggers for actions, time is an implicit part of the reasoning.

You made could make explict that temporal logic, and have a build tool which look at the timestamp of newly created files in /tmp and flagged up when your filesystem was in a different timezone (Oh, look ant -diagnostics does that! I wonder who wrote that probe?) But it wouldn't be enough, because we've reached a point in which builds are dependent upon state external to even the local machine.

Our builds are therefore, irretrievably nondeterministic.


InfoSec risks of android travel applications

I was offline for 95% of the xmas break, instead investing my keyboard time into: (a) the exercises in Structure and Interpretation of Computer Programs and (b) writing some stuff on the implications of the Sony debacle for my home network security architecture.

I'm going to start posting the latter articles in an out-of-order sequence, with this post: InfoSec risks of android travel applications

1. Airline checkin & Travel apps- demand so many privileges that you can't trust corporate calendar/contact data to stay on the devices. Nor, in the absence of audit logs, can you tell if the information has leaked.

2. Budget Airline applications are the least invasive, "premium" airlines demand access to confidential calendar info.

3. Even train timetable apps like to know things like your contact list.

However hard you lock down your network infrastructure, mandate 26 digit high-unicode passwords rolled monthly, mandate encrypted phones and pin-protected SIM cards, if those phones say "android" when they boot you can't be confident that sensitive corporate data isn't leaking out of those phones if the users expect to be able to use their phones to check on buses, trains or airplanes.


Normally the fact that Android apps can ask and get near-unlimited data access is viewed as a privacy concern. It is for home users. Once you do any of the following, it becomes an InfoSec issue:
  • Synchronise calendar with a work email service.
  • Maintain a contact list which includes potentially confidential contact/customers
  • Bond to a work Wifi network which offers network access to HTTP(S) sites without some form of auth.
  • Do the same via VPN
What is fascinating is that apps asking for access to calendar info -especially "confidential event information"- is something that mainstream airline travel apps demand in exchange for giving you the ability to check in to a flight on your phone, look at schedules and your tickets. Android does not provide a way to directly prevent this.

Demands of Applications

Noticing that one application update needed to want more information than I was expected, I went through all the travel apps on my android phone and looked at what permissions they demanded. These weren't explicitly installed for the experiment, simply what I use to fly on airlines, and some train and bus ones in the UK. I'm excluding tripit on the basis that their web infrastructure requests (optional) access to your google emails to autoscan for trip plans, which is in a different league from these.

British Airwaysconfidential, participantsNoYesPrecise
United Airlinesconfidential, participantsNoYes; view network connectionsPrecise
National RailAdd, modify, participantsNoYesPrecise
National Express CoachNoYesYes; view network connections & wifiPrecise
First Great Western trainsNoNoYesPrecise
trainlineNoNoYes; view network connectionsPrecise
First BusNoNoYes; view network connectionsPrecise

When you look at this list, its appalling. Why does the company that I use to get a bus to LHR need to know my contact list? Why does BA need my confidential appointment data? Why does the UK National Rail app need to be able to enumerate the calendar and send emails to participants without the owner's knowledge?

British Airways: wants access to confidential calendar info and full network access. What could possibly go wrong?

United: wants to call numbers, take photos and access confidential calendar info

National Express Bus Service
This is a bus company. How can they justify reading my contact list -business as well as personal?

UK National Rail
Pretty much total phone control, though not confidential appointment info. Are event titles considered confidential though?

Google's business model is built on knowing everything about your personal life -but this isn't about privacy, it is about preventing data leakage from an organisation. If anyone connects to your email services from an android, your airline checkin apps get to see the title, body and participants in all calendar appointments, whether that is "team meeting" or "plans for takeover of Walmart" where the participants include Jim Bezos and Donald Trump(*).

What could be done?
  1. Log accesses. I can't see a way to do this today, yet it would seem a core feature IT security teams would like to know. Without it you can't tell what information apps have read.
  2. Track provenance of calendar events and restrict calendar access only to events created by the airline apps themselves. This would require the servers to add event metadata; as google own gmail they could add a new BigTable column with ease.
  3. Restrict network access HTTPS sites on specific subdomains. Requiring HTTPS is good for general wifi security, and stops (most) organisations from playing DNS games to get behind the firewall.
Above and beyond that: allow users to easily restrict what privileges applications actually get. Don't want to give an app access to the contacts? Flip a switch and have the API call return an empty list. Want to block confidential calendar access? Another switch, another empty payload. Apple lets me do that with foreground/background location data collection on their devices through a simple list of apps, but google doesn't.

In the absence of that feature, if you want to be able to check in on your android phone on a non-budget airline, you have to give up expectations of the security of your confidential calendar data and contact list.

And in a world of BYOD, where the IT dept doesn't have control of the apps on a phone, that means they can't stop sensitive calendar/contact data leaking at all.

(*) FYI, there are no appointments in my calendar discussing taking over Walmart that include both Jim Besos and Donald Trump. I cannot confirm or deny any other meetings with these participants or plans for Walmart involving other participants. Ask British Airways or UAL if you don't believe me.

"It is not necessary to have experience of Hadoop"

I don't normally post LinkedIn approaches, especially from our competitors, but this one was so painful it blew my "do your research" criteria so dramatically it merits coverage.

FWIW my reply was: this is some kind of spoof, no?

Col du Barrioz

On 01/16/15 2:56 AM, Jessica <surname omitted to avoid embarrassment> wrote:    
Hi Steve,

I hope you are well?

We are currently hiring at Cloudera to expand our Customer Operations Engineering team.

We are looking to build this team significantly over the coming months and this is a rare opportunity to become involved in Cloudera's Engineering department.

The role is home based with very little travel required (just for training).

We are looking for people with strong Linux backgrounds and good experience with programming languages. It is not necessary to have experience of Hadoop - we will teach you !!

For the chance to be part of this team please send me your CV to <email omitted to avoid embarrassment>@cloudera.com alternatively we can organise a time to speak for me to tell you more about the role?



As an aside, I am always curious why recruiter emails always start with "I hope you are well?".

a) We both know that the recruiter doesn't care about my health as long as it doesn't impact my ability to work with colleagues, customers and, once my training in Hadoop is complete, maybe even to learn understand how to use things like DelegationTokenAuthenticatedURL —that being what I am staring at right now.(*)

b) We both know that she doesn't actually want details like "well, the DVLA consider my neurological issues under control enough for me to drive again —even down to places like the Alps, and the ripped up tendon off my left kneecap is manageable enough for me to do hill work when I get there"

(*) If anyone has got Jersey+SPNEGO hooked up to UserGroupInformation, I would love that code.


What have we learned this week?

The Sony/N Korea spat is fascinating in its implications

Ape of ninetree

  1. Never make an enemy of a nation state
  2. Any sufficiently large organisation is probably vulnerable to attack. I even worry about the attack surface of two adults and child, and I don't know who is the greater risk: the other adult or the child. The latter I am teaching foundational infosec to, primarily as he learns to break household security to boostrap access to infrastructure facilities to which he is denied access (e.g. the password to the 5GHz wifi network that doesn't go off at 21:00).
  3. Always encrypt HDDs with per-user keys. Any IT keys need to be locked down extra-hard.
  4. Never store passwords in plaintext files. At the very least, encrypt your word documents.
  5. Never email passwords to others. That goes for wifi passwords, incidentally, as children may come across unattended gmail inboxes and search for the words "wifi password"
  6. Never write anything in an email that you would be embarrassed to see public. Not confidential, simply unprofessional stuff that would make you look bad.
  7. The US considers a breach of security of a global organisation possibly by a nation state an act of terrorism.
The final one is something to call out. Nobody died here. It's cost money and has restricted the right of people round the world to watch something mediocre, but no lives were lost. Furthermore, and is salient, *it was not an attack on any government or national infrastructure*. This was not an attack on the US itself.

In comparison, the Olympic Games/Stuxnet attack on the Iranian nuclear enrichment facility was a deliberate, superbly executed attack on the Iranian government, to their "peaceful enrichment project"/stage 1 nuclear weapons program. That was a significantly more strategic asset than emails criticising Adam Sandler (*).

By inference, if an information-leak attack on a corporate entity is terrorism, mechanical sabotage of a nation's nuclear program must be viewed as an act of war.

That doesn't mean it wasn't justified, any less than the Israeli bombing of a Syrian facility a few years back. And at least here the government(s) in question did actually target a state building WMDs rather than invade one that didn't, leave it in a state of near-civil-war and so help create the mess we get today (**).

Yet what -someone- did, was commit an act of war of war against an other country, during "peacetime". And got away with it.

Which is profound. Whether it is an attack or Iranian nuclear infrastructure, or a data grab and dump at Sony, over-internet-warfare is something that is taking place today, in peacetime. It's the internet's equivalent of UAV attacks: small scale operations with no risk to the lives of your own-side, hence politically acceptable. Add in deniability and it is even better. Just as the suspects of the Olympic Games actions, apparently the US & Israel, deny that project while being happy with the outcome, so here can N. Korea say "we laud their actions even though we didn't do it"

Well, the US govt. probably set the precendet in Operation Olympic Games. It can't now look at what happened to Sony and say "this isn't fair" or "this is an act of war". As if it is, we are already at war with Iran -and before long, likely to be at war with other countries.

(*) Please can Team Netflix add a taste-preferences option that asks about specific actors, so i can say "never recommend anything by Adam Sandler" without having to 1-* everyone they throw up and so let it learn indirectly that I despise his work?

(**) On that topic, why is Tony Blair still a middle east peace envoy?


Distributed Computing Papers #1: Implementing Remote Procedure Calls

Col de la Machine

Stonebreaker is at it again, publishing another denunciation of the Hadoop stack, with the main critique being its reimplentations of old stuff at google. Well, I've argued in the past that we should be thinking past google, but not for the reason Stonebraker picks up.

My issue is that hardware and systems are changing and that we should be designing for those future designs: mixed-moving-to-total SSD, many-core NUMA, faster networking, v18n as deployment. Which is what people are working on in different parts of Hadoop-land, I'm pleased to say.

But yes, we do like our google papers —while incomplete, google still goes to a lot of effort to describe what they have built. We don't get to read much about, say, Amazon's infrastructure, by comparison. Snowden seems to have some slideware on large government projects, but they don't go into the implementation details. And given the goal the goal of one of the projects was to sample Yahoo! chat frames then we can take it as a given that those apps are pretty dated too. That leaves: Microsoft, Yahoo! Labs, and what LinkedIn and Netflix are doing —the latter all appearing in source form and integrating with Hadoop from day one. No need to read and implement when you can just git clone and build locally.

There's also the very old work, and if everyone is going to list their favourite papers —I'm going to look at some other companies: DEC, Xerox, Sun and Apollo.

What the people there did then was profound at the time, and much of it forms the foundation of what we are building today. Some aspects of the work have fallen by the wayside —something we need to recognise, and consider whether that happened because it was ultimately perceived as a wrongness on the face of the planet (CORBA), or because some less-powerful-yet-adequate interim solution became incredibly successful. We also need to look at what we have inherited from that era, whether the assumptions and goals of that period hold today, and consider the implications of those decisions as applied to today's distributed systems.

Starting with Birrell and Nelson, 1984, Implementing Remote Procedure calls.

This is it: the paper that said "here is how to make calling a remote machine look like calling some subroutine on a local system".
The primary purpose of our RPC project was to make distributed computation easy. Previously, it was observed within our research community that the construction of communicating programs was a difficult task, undertaken only by members of a select group of communication experts.
Prior to this, people got to write their own communication mechanism, whatever it was. Presumably some low-level socket-ish link between two processes and hand-marshalled data.

After Birrell's paper, RPC became the way. It didn't just explain the idea, it showed the core implementation architecture: stubs on the client, a service "exporting" a service interface by way of a server-side stub and the RPC engine to receive requests, unmarshall them and hand them off to the server code.

It also did service location with Xerox Grapevine, in which services were located by type and instance. This allowed instances to be moved around, and for sets of services to be enumerated. It also removed any hard-coding of service to machine, something that classic DNS has tended to reinforce. Yes, we have a global registry, but now we have to hard-code services to machines and ports "namenode1:50070", then try to play games with floating IP addresses which can be cached by apps (JVMs by default), or pass lists of the set of all possible hosts down to the clients...tricks we have to do because DNS is all we are left with.

Ignoring that, RPC has become one of the core ways to communicate between machines. For people saying "REST!", if the client-side implementation is making what appear to be blocking procedure calls, I'm viewing it as a descendent of RPC...same if you are using JAX-RS to implement a back end that maps down to a blocking method call. That notion of "avoid a state machine by implementing state implicitly in the stack of the caller and its PC register" is there. And as Lamport makes clear: a computer is a state machine.

Hence: RPC is the dominant code-side metaphor for client and server applications to talk to each other, irrespective of the inter-machine protocols.

There's a couple of others that spring to mind:

message passing. Message passing comes into and falls out of fashion. What is an Enterprise Service Bus but a very large message delivery system? And as you look at Erlang-based services, messages are a core design. Then there's Akka: message based passing within the application, which is starting to appeal to me as an architecture for complex applications.

shared state. entries in a distributed filesystem, zookeeper znodes, tuple-spaces and even RDBMs tables cover this. Programs talk indirectly by publishing something to a (possibly persistent) location, recipients poll or subscribe to changes.

We need to look at those more.

Meanwhile, returning to the RPC paper, another bit of the abstract merits attention
RPC will, we hope, remove unnecessary difficulties, leaving only the fundamental difficulties of building distributed systems: timing, independent failure of components, and the coexistence of independent execution environments.
As the authors call out, all you need do do is handle "things happening in parallel" and "things going wrong". They knew these issues existed from the outset, yet the way RPC makes the fact that you are calling stuff remotely "transparent", it's easy for us developers to forget about the failure modes and the delays.

Which is the wonderful and terrible thing about RPC calls: they look just like local calls until things go wrong, and then they either fail unexpectedly, or block for an indeterminate period. Which, if you are not careful, propagates.

Case in point, HDFS-6803, "Documenting DFSClient#DFSInputStream expectations reading and preading in concurrent context". That's seems such a tiny little detail, wondering if "should the positioned read operation readFully(long position, byte[] buffer, int offset, int length) be thread safe and, if the implementation is up to it, reentrant?".

To which my current view is yes, but we shouldn't make it invisible to others. Why? If we try to hide that a seek/read/seek sequence in progress, you either have to jump through hoops caching and serving up a previous position, or make getPos() synchronize, such as proposed in HDFS-6813. Which effectively means that getPos() has been elevated from a lightweight variable fetch, to something that can block until an RPC call sequence in a different thread has completed, successfully or otherwise. And that's a very, very different world. People calling that getPos() operation may not expect an in-progress positioned read to surface, but nor do they expect it to take minutes —which it can now do.

And where do we have such potentially-blocking RPC calls inside synchronized method calls? All over the place. At least in my code. Because most of the time RPC calls work -they work so well we've gotten complacent.

(currently testing scheduled-window-reset container failure counting)
(photo: looking ahead on Col de La Machine, Vercours Massif — one of the most significantly exposed roads I've ever encountered)


Modern Alpinism

I am back from a 2+ week vacance in the French Alps —the first since 2009. More specifically, the first since the DVLA took away my driving license on the basis that I was medically unfit to drive. They now consider me no more dangerous than anyone else, so we could drive down there. My main issue with driving now is that I've forgotten things: motorway awareness, the width of my car (very important in bristol), what it feels like to go round corners fast enough for car to indicate through the steering wheel that it doesn't want to —and worst of all, how to parallel park into small gaps.

Two weeks in the French Alps, along with 2-day approaches/retreats is a good revision there, giving me the learning matrix of
Day Night Bad Weather
French village
French urban (excluding Paris)
Windy country roads
Alpine passes

There's a few left, and I didn't try any overtakes, but otherwise that's reasonable coverage of the configuration space.

While there: MTB work, some on-road work including a 100 mile epic, a bit of basic Alpine hut hiking/scrambling —as well as staying with friends, eating too many cheese-based products (tartiflettes &c), consuming french beer and cidre, catching the Tour de France Chamrousse Stage and generally having fun.

I got to review some of the places last visited in 2009:
break time

Where I'm pleased to see that I don't seem significantly rounder

Col du Barrioz

The main difference this time was that someone was rinsing their clothes in this village fountain half-way up Col du Barrioz, so we had to sprint off before we got harassed, crossing the pass and sheltering from some rain in a bar/tabac where I opted to not drink that day

Col du Barrioz

a policy which the camera's EXIF data implies lasted exactly nine minutes

Col du Barrioz

Anyway: an excellent holiday —it was great to be back in the Alps.


Can I have password for my hotel room's shower?

A conversation you never get at a hotel when you check in:
"how many people will be having showers?"
"oh, three of us"
"OK, here are three vouchers for hot water. Keep them handy as you'll need to retype them at random points in the day"
"thank you. Is the login screen in a random EU language and in a font that looks really tiny when I try to enter it, with a random set of characters that are near impossible to type reliably on an on-screen keyboard especially as the UI immediately converts them to * symbols out of a misguided fear that someone will be looking over my shoulder trying to steal some shower-time?"
"Why, yes -how very perceptive of you. Oh, one more thing -hot water quotas"
"hot water quotas?"
"yes, every voucher is good for 100 Litres of water/day. If you go over that rate then you will be billed at 20 c/Litre."
That's a lot!
"Yes, we recommend you only have quick showers. But don't worry, the flow rate of the shower is very low on this hot water scheme, so you can still have three minutes worth of showering without having to worry"
"'this' hot water scheme?"
"yes -you can buy a premium-hot-water-upgrade that not only gives you 500L/day, it doubles the flow rate of the shower.
"oh, I think I will just go to the cafe round the corner -they have free hot water without any need for a login"
"if that is what you want. Is there anything else?"
"Yes, where is my room?"
"It's on the 17 floor -the stairs are over there. With your luggage you could get everything up in two goes -it will only take about fifteen minutes"
"17 floors! Fifteen Minutes! Don't you have a lift?"
"Ah -do you mean our premium automated floor-transport service?  Why yes, we do have one. It won't even add much to your bill. Would you like to buy a login? First -how many people will plan on using the lift every day -and how many times?