2015-05-21

It's OK to submit patches without tests: just show the correctness proofs instead


Bruxelles

In a window adjacent to the browser in which I'm typing this, I'm doing a clean build of trunk prior to adding a two-line patch to Hadoop, and the 20+ lines needed to test that patch; IntelliJ burbles away for 5 minutes, as it does on any change to the POMs of a project which has a significant fraction of the Hadoop stack loaded.

The patch I'm going to write is to fix a bug introduced by a previous three-line patch, one that didn't come with tests because it was "trivial".

It may have been a trivial patch, but it was a broken trivial patch. It's not so much that the test cases would have found this, but that they'd have forced the author to think more about the inputs to the two-line method, and what outputs would be expected. Then we'd get some tests that verify the desired outputs for the different inputs, ones that guarantee that over time the behaviour stays constant.
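As a sketch of what I mean (the method and the cases below are hypothetical, not the actual patch), even a two-line helper can get a table of inputs and expected outputs as a parameterized JUnit test:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.Collection;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class TestTrimToEmpty {

  // The "trivial" method under test: the kind of two-liner that
  // usually ships without tests.
  static String trimToEmpty(String s) {
    return s == null ? "" : s.trim();
  }

  @Parameters(name = "{index}: trimToEmpty({0}) = {1}")
  public static Collection<Object[]> cases() {
    return Arrays.asList(new Object[][] {
        {null, ""},     // the edge case a "trivial" patch tends to forget
        {"", ""},
        {"  ", ""},
        {" x ", "x"},
        {"x", "x"},
    });
  }

  @Parameterized.Parameter(0)
  public String input;

  @Parameterized.Parameter(1)
  public String expected;

  @Test
  public void testExpectedOutput() {
    assertEquals(expected, trimToEmpty(input));
  }
}
```

Writing the table is where the thinking happens; the assertions are almost incidental.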

Instead we have a colleague spending a day trying to track down a failure in a remote functional test run, one that has been reduced to a multi-hop stack trace problem. The big functional test suites did find that bug (good), but because the cost of debugging and isolating that failure is higher, handling it is more expensive. With better distributed test tooling, especially log aggregation and analysis, that cost should be lower —but it's still pretty needless for something that could have been avoided simply by thinking through the inputs.

Complex functional system tests should not be used as a substitute for unit tests on isolated bits of code. 

I'm not going to highlight the issue, or identify who wrote that patch, because it's not fair: it could be any of us, and I am just as guilty of submitting "trivial" patches. If something is 1-2 lines long, it's really hard to justify in your brain the effort of writing the more complex tests to go with it.


If the code actually works as intended, you've saved time and all is well. But if it doesn't, the failure shows up later, in the full-stack tests (cost & time) or in the field (very expensive), and either way it ends up being fixed the way the original change should have been done in the first place.

And as documented many times before: it's that thinking about inputs & outputs that forces you to write good code.

Anyway: I have tests to write now, before turning to the kind of problem where those functional tests are justified, such as the Jersey client not working on Java 8. (I can replicate that in a unit test, but only in my Windows Server/Java 8 VM.)

In future, if I see anyone use "trivial patch" as a reason to not write tests, I'll be wheeling out the -1 veto.

I do, however, offer an exception: if people can prove their code works, I'll be happy.

(photo: wall in Brussels)

2015-05-17

Distributed System Testing: where now, where next?

Confluent have announced they are looking for someone to work on an open source framework for distributed system testing.

I am really glad that they are sitting down to do this. Indeed, I've thought about sitting down to do it myself; the main reason I've been inactive there is "too much other stuff to do".

Crayola

Distributed System Testing is the unspoken problem of Distributed Computing. In single-host applications, all you need to do is show that the application "works" on the target system, with its OS, environment (timezone, locale, ...), installed dependencies and application configuration.

In modern distributed computing you need to show that the distributed application works across a set of machines, in the presence of failures.

Equally importantly: when your tests fail, you need the ability to determine why they failed.

I think there is much scope to improve here, as well as on the fundamental problem: defining "works" in the context of distributed computing.

I should write a post on that in future. For now, my current stance is: we need stricter specification of desired observable behaviour and implementation details. While I have been doing some Formal Specification work within the Hadoop codebase, there's a lot more work to be done there —and I can't do it all myself.

Assuming that there is a good specification of behaviour, you can then go on to defining tests which observe the state of the system, within the configuration space of the system (now including multiple hosts and the network), during a time period in which failures occur. The observable state should continue to match the specification, and if not, you want to get the logs to determine why not. Note here that "observed state" can be pretty broad, and includes
  • Correct processing of requests
  • The ability to serialize an incoming stream of requests from multiple clients (or at least, to not demonstrate non-serialized behaviour)
  • Time to execute operations (performance)
  • Ability to support the desired request rate (scalability)
  • Persistence of state, where appropriate
  • Resilience to failures of dependent services, network, hosts
  • Reporting of detected failure conditions to users and machines (operations needs)
  • Ideally: ability to continue in the presence of byzantine failures. Or at least detect them and recover.
  • Ability to interact with different versions of software (clients, servers, peers)
  • Maybe: ability to interact with different implementations of the same protocol.
I've written some slides on this topic, way back in 2006, Distributed Testing with SmartFrog. There's even a sub-VGA video to go with it from the 2006 Google Test Automation Conference.

My stance there was
  1. Tests themselves can be viewed as part of a larger distributed system
  2. They can be deployed with your automated deployment tools, bonded to the deployed system via the configuration management infrastructure
  3. You can use the existing unit test runners as a gateway to these tests, but reporting and logging needs to be improved.
  4. Data analysis is a critical area to be worked on.
I didn't look at system failures; I don't think I was worried enough about that, which shows we weren't deploying things at scale, and that this was before cloud computing took failures mainstream. Nowadays nobody can avoid thinking about VM loss, at the very least.

Given I did those slides nine years ago, have things improved? Not much, no
  • Test runners are still all generating the Ant XML test reports written, along with the matching XSLT transforms, by Stephane Balliez back in 2000/2001
  • Continuous Integration servers have got a lot better, but even Jenkins, wonderful as it is, presents results as if they were independent builds, rather than a matrix of (app, environment, time). We may get individual build happiness, but we don't get reports by test, showing that Hadoop/TestATSIntegrationFailures is only working intermittently on all Debian systems but has been reliable elsewhere. The data is all there, but the reporting isn't.
  • Part of the problem is that they are still working with that XML format, one which, due to its use of XML attributes to summarise the run, buffers things in memory until the test case finishes, then writes out the results. stdout and stderr may get reported, but only for the test client, and even then there's no awareness of the structure of log messages.
  • Failure conditions aren't usually being explicitly generated. Sometimes they happen anyway, but then it's treated as a complaint about the build or the target host being broken.
  • Email reports from the CI tooling are also pretty terse. You may get the "build broken, test XYZ, with commits N1-N2" message, but again, you get one per build, rather than a summary of overall system health.
  • With a large dependent graph of applications (hello, Hadoop stack!), there's a lot of regression testing that needs to take place —and fault tracking when something downstream fails. 
  • Those big system tests generate many, many logs, but they are often really hard to debug. If you haven't spent time with 3+ windows trying to sync up log events, you've not been doing test runs.
  • In a VM world, those VMs are often gone by the time you get told there's a problem.
  • Then there's the extended life test runs, the ones where we have to run things for a few days with Kerberos tokens set to expire hourly, while a set of clients generate realistic loads and random servers get restarted.
Things have got harder: bigger systems, more failure modes, a whole stack of applications —yet testing hasn't kept up.

In Slider I did sit down to do something that would work within the constraints of the current test runner infrastructure yet still let us do functional tests against remote Hadoop clusters of variable size. Our functional test suite, funtests, uses Apache Bigtop's script launcher to start Slider via its shell/py scripts. This tests those scripts on all test platforms (though, it turns out, not in enough locales), and forces us to have a meaningful set of exit codes —enough to distinguish the desired failure conditions from unexpected ones. Those tests can deploy Slider applications on secure/insecure clusters (I keep my VM configs on github, for the curious), deploy test containers for basic operations, run upgrade tests and failure-handling tests. For failure generation our IPC protocol includes messages to kill a container, and to have the AM kill itself with a chosen exit code.

For testing slider-deployed HBase and accumulo we go one step further. Slider deploys the application, and we run the normal application functional test suites with slider set up to generate failures.

How do we do that? With the Slider Integral Chaos Monkey. That's something which can run in the AM and, at a configured interval, roll some virtual dice to see if any of the enabled failure events should be triggered: currently container kill and AM kill (we make sure the AM isn't chosen in the container-kill action, and have a startup delay to let the test run settle in before the monkey starts acting).
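Stripped down to its essentials, that dice-rolling loop is something like this (names illustrative, not Slider's actual chaos monkey classes):

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Minimal chaos monkey sketch: at a fixed interval, roll dice for each
 * enabled failure action and trigger it if the roll comes up.
 */
public class ChaosMonkeySketch {

  /** A failure action: kill a container, kill the AM, etc. */
  public interface ChaosTarget {
    void chaosAction();
  }

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final Random dice = new Random();

  /**
   * @param target       the action to (maybe) trigger
   * @param probability  chance of triggering per interval, 0.0 to 1.0
   * @param startupDelay delay before the monkey starts, so the test run
   *                     can settle in first
   * @param interval     how often to roll the dice
   */
  public void schedule(ChaosTarget target, double probability,
                       long startupDelay, long interval, TimeUnit unit) {
    scheduler.scheduleAtFixedRate(() -> {
      if (dice.nextDouble() < probability) {
        target.chaosAction();
      }
    }, startupDelay, interval, unit);
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
```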

Does it work? Yes. Which is good, because if things don't work, we've got the logs of all the machines in the cluster that ran slider to go through. Ideally, YARN-aggregated logs would suffice, but not if there's something up between YARN and the OS.

So: a test runner I'm happy with. Remote deployment, failure injection, both structured and random. Launchable from my desktop and CI tooling; tests can be designed to scale. For testing rolling upgrades (a Slider 0.80-incubating feature), we run the same client app while upgrading the system. Again: silence is golden.

Where I think much work needs to be done is what I've mentioned before: the reporting of problems and the tooling to determine why a test has failed.

We have the underlying infrastructure to stream logs out to things like HDFS or other services; there's nothing to stop us writing code to collect and aggregate those, with the recipient using the order of arrival to place an approximate time on events (not a perfect ordering, obviously, but better than log events from hosts whose clocks are wrong). We can collect those entire test run histories, along with as much environment information as we can grab and preserve. JUnit: system properties. My ideal: VM snapshots & virtual network configs.
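A sketch of the receiving side, assuming log events arrive as lines over some transport; the arrival counter is what gives that approximate ordering:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a log aggregator that stamps each incoming event with an
 * arrival sequence number: not a true global order, but monotonic and
 * immune to hosts whose clocks are wrong.
 */
public class ArrivalOrderedLogCollector {

  public static final class Event {
    public final long arrivalSequence;     // order of arrival at the collector
    public final long collectorTimeMillis; // collector-side timestamp
    public final String sourceHost;
    public final String line;              // the raw log line as emitted

    Event(long seq, String host, String line) {
      this.arrivalSequence = seq;
      this.collectorTimeMillis = System.currentTimeMillis();
      this.sourceHost = host;
      this.line = line;
    }
  }

  private final AtomicLong counter = new AtomicLong();
  private final ConcurrentLinkedQueue<Event> events =
      new ConcurrentLinkedQueue<>();

  /** Called by whatever transport delivers log lines from remote hosts. */
  public void onLogLine(String sourceHost, String line) {
    events.add(new Event(counter.incrementAndGet(), sourceHost, line));
  }

  /** Snapshot for later analysis, e.g. writing out to HDFS for bulk queries. */
  public Iterable<Event> snapshot() {
    return events;
  }
}
```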

Then we'd go beyond XSLT reports of test runs and move to modern big data analysis tools. I'm going to propose Spark here. Why? Because you can do local scripts, things in Jenkins & JUnit, and larger bulk operations. And for that get-your-hands-dirty test-debug festival, a notebook like Apache Zeppelin (incubating) can then be used as the front end.

We should be using our analysis tools for the automated analysis and reporting of test runs, and the data science tooling for the debugging process.

Like I said, I'm not going to do all this. I will point to a lovely bit of code by Andrew Or @ Databricks, spark-test-failures, which gets Jenkins's JSON-formatted test run history, determines flaky tests and posts the results on Google Docs. That's just a hint of what is possible —yet it shows the path forwards.
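The core of that analysis is simple once the run history is in hand. Here's a sketch, with a made-up record type standing in for whatever the CI server's JSON actually gives you:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FlakyTestFinder {

  /** One test result from one build in the CI history (hypothetical type). */
  public static final class TestRun {
    public final String testName;
    public final boolean passed;
    public TestRun(String testName, boolean passed) {
      this.testName = testName;
      this.passed = passed;
    }
  }

  /**
   * A test is "flaky" if, across the build history, it has both passed
   * and failed: neither reliably green nor reliably broken.
   */
  public static Set<String> findFlaky(List<TestRun> history) {
    Map<String, Set<Boolean>> outcomes = new HashMap<>();
    for (TestRun run : history) {
      outcomes.computeIfAbsent(run.testName, k -> new HashSet<>())
              .add(run.passed);
    }
    Set<String> flaky = new HashSet<>();
    for (Map.Entry<String, Set<Boolean>> e : outcomes.entrySet()) {
      if (e.getValue().size() > 1) {   // both true and false observed
        flaky.add(e.getKey());
      }
    }
    return flaky;
  }
}
```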

(Photo: "crayola", St Pauls:work commissioned by the residents)

2015-05-12

Dynamic Datacentre Applications: The placement problem


We're just in the wrap-up for Slider 0.80-incubating; a Windows VM nearby is busy downloading the source .zip file and verifying that it builds & tests on Windows.

Before people feel sorry for me, consider this: given a choice between Windows C++ code and debugging Kerberos, I think I'd rather bring up MSDN in a copy of IE6 while playing edit-and-continue games in Visual Studio than try and debug the obscure error messages you get with Kerberos on Java. Not only are they obscure, my latest Java 7 update even changed the text of the error which means "now we've upgraded the JVM, you don't have the encryption settings to use Kerberos".

Which is a shame, because I have to make sure everything works with kerberos.

Anyway: build works and the tests are running —which keeps me happy. If all goes to plan, Slider 0.80-incubating will be there for download within a week.

Some of the new features include
  • SLIDER-780: Ability to deploy docker packages
  • SLIDER-663 zero-package cluster definition. (It has a different name, but it essentially means "no need to build a redistributable zip file for each application".) While the zip-based distribution is essential for things you want to share, for things you are developing or using yourself this is lighter weight.
  • SLIDER-733 Ability to install a package on top of another one. This addresses "the coprocessor problem": how to add a new HBase coprocessor JAR without rebuilding everything. And, with SLIDER-663, you can define that new package easily.
Notably, all these features are by people other than me —specifically, by colleagues.

 slider team

As anyone who knows me will realise: that's because my Hortonworks colleagues are really great people who know more about computing than I ever have or will and are better at applying that knowledge than myself —someone who uses "test driven development" as a way of hiding his inability to get anything to work right first time.

And yet these people still allow me to work with them —showing there are still things that they consider me up to handling.

What have I been up to? Placement, specifically SLIDER-611: Über-JIRA - placement phase 2

Right from the outset, Hadoop HDFS and MR have been built on some assumptions about failure, availability and bandwidth
  1. Disks will fail: deal with it by using replication, rather than 2-hour replacement-part support contracts and RAID-array rebuilding
  2. Servers will fail: app developers have to plan for that.
  3. Some servers are just unreliable; apps may work this out without even needing to be told. Example: stragglers during a map phase identify servers whose disks may be in trouble.
  4. If an application fails a few times it's unreliable and should be killed.
  5. The placement/scheduling of work is primarily an optimisation to conserve bandwidth. That is: you place work for proximity to the desired data.
  6. If the desired placement cannot be obtained, placement elsewhere is usually acceptable.
  7. It's better to have work come up fast somewhere near the data than wait for minutes for it to potentially come up on the actual node requested.
  8. If work is placed on a different machine, the cost to other apps (i.e. the opportunity cost) is acceptable.
  9. It's OK if two containers get allocated to the same machines ("affinity is not considered harmful")
  10. The actions of one app running on a node are isolated from the others (e.g. CPU throttling and vmem limiting is sufficient to limit resource conflict).
  11. Work has a finite duration, and containers know when they are finished. Nothing else needs to make that decision for them.
For analytics workloads where the data is known, the placement strategy is ideal. It ensures that work comes up fast at the expense of locality: bandwidth is precious, but time even more so. A nearby placement will run the work at a cost of network capacity, and at the implicit risk to locality placement for other work trying to run on the (now in use) node.

But what about long lived services, such as HBase and Kafka?

HBase uses short-circuit reads for maximum performance when working with local data. Although region servers don't have to be co-located with the data, until there's a full HBase compaction most of the data for an RS is likely to be remote: for each block b, P(data-local) is roughly a function of 3/nodes (roughly, because the 3-replica-2-rack policy complicates the equation, as blocks are not spread completely randomly).

Therefore: restarting on the wrong node can slow down that region server's performance for an extended period of time.

HBase is often used in low-latency apps; if other things are running there then they can impact performance. That is, you'd rather not have all the CPU + storage capacity taken up by lower-priority work if that work tangibly impacts network and disk IO.

If you've configured the HBase rest and thrift servers with hard coded ports, they are at risk of conflicting for port numbers with other services.


Kafka? If you bring up a Kafka instance on another node, all its local data is lost and it has to be rebuilt from the last snapshot + stream. This is expensive: the cost is O(elapsed-time-since-snapshot * event arrival rate). Once rebuilt, performance recovers.

Kafka loves anti-affinity in placement; it reduces the number of instances that die on a node failure, and the impact on the system until the rebuild is complete.

Both these apps then are things you want to deploy under YARN, but they have very different scheduling and placement requirements.

Long-lived services in general
  • May be willing to wait for an extended period to come up on the same server as before. That is, even if a server has crashed and is rebooting, or is down for a quick hardware fix —it's better to wait before giving up.
  • But they can recover, and ultimately giving up is desirable.
  • Unreliability is not a simple metric of total failures; it's failures in a recent time period that matter. That holds for the entire application, as well as for its distributed components.
  • Can fail in interesting ways.

With slider's goal "support long-lived services in YARN without making you rewrite them", we see that difference and get to address it.

YARN Labels
A key feature arrived in Hadoop 2.6: labels (YARN-796). Admins can give nodes labels, give queues shared/exclusive access to sets of labelled nodes, and grant users those rights.

Slider picks this up by allowing you to assign different components to different labels. The region servers in a production HBase cluster could all be given the property yarn.label=production;

We're using labels to isolate bits of the cluster for performance. They all share HDFS, so there is some cross-contamination, but IO-heavy analytics work can be kept off the nodes. We'd really like HDFS prioritisation for even better isolation, such as giving short-circuit reads priority over TCP traffic. Future work.

You can also split up a heterogeneous cluster, with GPU or SSD labels, high-RAM nodes, etc. Even if an app isn't coded for label awareness, you can (in the Capacity Scheduler) get different queues to manage labels, and so grant different users access to those nodes.

One thing that's interesting to consider, in an EC2 cluster, is labelling nodes as full-price vs spot-priced. You could place some work on spot-priced nodes, other work on the full-price ones. Not only does this give better guarantees of existence; if HDFS is only running on the full-price nodes, it gives different performance characteristics too. I'd be interested to know of any experiences here.

Placement Escalation.
SLIDER-799 AM to decide when to relax placement policy from specific host to rack/cluster

This kept me busy in March; a fun piece of code.

As mentioned earlier, YARN schedulers like to schedule work fast, even if non-locally. An AM can ask for "do-not-relax" placement, but then there's no relaxation even if a node never comes back.

What we've done is take the choice about when to relax out of YARN's hands and put it into the AM's. By doing so, you can specify a time delay of minutes to hours, rather than relying on YARN to find a space and having it back off after a few tens of seconds at most.

This is easier to summarise than to go through in detail. For the curious, the logic to pick a location is in RoleHistory; escalation is in OutstandingRequestTracker. Note that this code is all part of our model; we don't let it interact with YARN directly, as that's the controller's task. The model builds up a list of Operations which are then processed afterwards. This really helps testing: we can test the entire model against a mock YARN cluster, taking the actions and simulating their outcome, then add failure events, restarts, etc. Fast tests for fast dev cycles.
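Roughly, the shape of that model/controller split looks like this (the types below are illustrative sketches, not Slider's actual classes beyond the names already mentioned):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the model/controller split: the model reasons about placement
 * and emits a list of operations; only the controller talks to YARN.
 * In tests, a mock controller applies the operations to a simulated cluster.
 */
public class PlacementModelSketch {

  /** An abstract action the model wants performed against YARN. */
  public interface Operation { }

  /** Request a container, preferring a host, optionally relaxing locality. */
  public static final class RequestContainer implements Operation {
    public final String preferredHost;   // may be null: anywhere
    public final boolean relaxLocality;  // true once escalation kicks in
    public RequestContainer(String preferredHost, boolean relaxLocality) {
      this.preferredHost = preferredHost;
      this.relaxLocality = relaxLocality;
    }
  }

  /** Pure placement logic: no YARN classes, so it's trivially unit-testable. */
  public List<Operation> reviewOutstandingRequest(String lastHost,
      long requestedAtMillis, long escalationTimeoutMillis, long now) {
    List<Operation> ops = new ArrayList<>();
    boolean escalate = (now - requestedAtMillis) > escalationTimeoutMillis;
    // Before the timeout: insist on the old host. After it: let the
    // placement relax to rack, then cluster.
    ops.add(new RequestContainer(lastHost, escalate));
    return ops;
  }

  /** The controller applies operations; a mock one simulates the outcomes. */
  public interface Controller {
    void execute(List<Operation> ops);
  }
}
```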

Node reliability tracking
We've had this for a while; not explicit blacklisting, but basic greylisting: building up a list of nodes we don't trust and never asking for them explicitly. What's changed is the sophistication of that listing and how we react to it.
  1. We differentiate failure types: node failure counters discard those which are node-independent (example: container memory limits exceeded) and those which are simply pre-emption events (SLIDER-856).
  2. Similarly, component role failure counters don't count node failures or pre-emption in the reliability statistics of a role.
  3. The counters of role & node failures, used for deciding respectively whether an app is failing or a node is unreliable, are reset on a regular, scheduled basis (every few hours, tunable).
  4. If we don't trust a node, we don't ask for containers on it, even if it is the last place where a component ran. (Exception: if you declare a component's placement policy to be "strict", the same node is always asked for again, and there is no escalation.)
Reliability tracking should make a difference if a node is playing up. There's a lot more we could do here —we just have to be pragmatic and build things up as we go along.
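As a sketch of that failure-type-aware, windowed counting (class and method names invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of node reliability tracking: count only failures that say
 * something about the node, and reset the counters on a schedule so
 * old sins are eventually forgiven.
 */
public class NodeReliabilityTracker {

  public enum FailureKind {
    NODE_FAILURE,        // node-related: counts against the node
    PREEMPTED,           // scheduler took the container back: ignore
    MEMORY_LIMIT,        // the app's own fault, node-independent: ignore
    OTHER
  }

  private final Map<String, Integer> failuresPerNode = new HashMap<>();
  private final int threshold;

  public NodeReliabilityTracker(int threshold) {
    this.threshold = threshold;
  }

  public void onContainerFailure(String node, FailureKind kind) {
    if (kind == FailureKind.PREEMPTED || kind == FailureKind.MEMORY_LIMIT) {
      return;   // says nothing about the node's health
    }
    failuresPerNode.merge(node, 1, Integer::sum);
  }

  /** Don't ask for containers here, unless the placement policy is "strict". */
  public boolean isUntrusted(String node) {
    return failuresPerNode.getOrDefault(node, 0) >= threshold;
  }

  /** Invoked on a regular, tunable schedule (every few hours). */
  public void resetCounters() {
    failuresPerNode.clear();
  }
}
```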

Where next?

Absolutely key is anti-affinity. I don't see YARN-1042 coming soon —but that's OK. Now that we do our own escalation, we can integrate it with anti-affinity.

How? Ask for one container at a time, blacklisting all those nodes where we've already got an instance. Ramp-up time will be slower, especially taking into account that escalation may result in container allocations taking minutes before a dead/overloaded node is given up on.

Maybe it could be something like
  1. Initial request: blacklist all but the nodes we last ran on.
  2. Escalation: relax to all nodes except those with outstanding requests or allocated containers, or those considered too unreliable.
  3. Do this in parallel, discarding allocations which assign more than one instance to the same node.
  4. If, after a certain time, nodes are still unallocated, maybe consider relaxing the restriction (as usual: configurable policy and timeouts per role).
(for "nodes", read "nodes within the label set allowed for that role")

I need to think about this a bit more to see if it would work, estimate ramp-up times, etc.
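Sketched as code, it might look something like the following; all the names and hooks here are hypothetical, and the real thing would have to be asynchronous against the AM-RM protocol rather than a blocking loop:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of anti-affine ramp-up: request one container at a time,
 * blacklisting every node that already hosts an instance, so no two
 * instances land on the same node.
 */
public abstract class AntiAffinePlacerSketch {

  // --- hooks a real implementation would wire to the AM's state and YARN ---
  protected abstract Set<String> nodesWithInstances();
  protected abstract Set<String> unreliableNodes();
  /** Requests a container avoiding the blacklist; returns the granted node,
   *  or null if nothing acceptable was allocated before a timeout. */
  protected abstract String requestContainerAvoiding(Set<String> blacklist);

  /** Ramp up to the desired number of instances, one request at a time. */
  public void rampUp(int desired, List<String> lastUsedNodes) {
    int placed = 0;
    while (placed < desired) {
      Set<String> blacklist = new HashSet<>(nodesWithInstances());
      blacklist.addAll(unreliableNodes());
      // An initial pass could instead prefer lastUsedNodes for affinity with
      // the previous run, escalating later; omitted here for brevity.
      String node = requestContainerAvoiding(blacklist);
      if (node == null) {
        break;   // nothing left satisfies the constraints; maybe relax policy
      }
      placed++;
    }
  }
}
```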

Anything else?

Look at Über-JIRA : placement phase 3 for some ideas.

Otherwise, SLIDER-109, Detect and report application liveness. Agents could report in URLs they've built for liveness probes; either they check them themselves or the AM hits them all on a schedule (with monitor threads designed to detect the probes themselves hanging). All the code for this comes from the Hadoop 1 HA monitor work I did in 2012; it's checked in, waiting to be wired up. All we need is someone to do the wiring. Which is where the fact that Slider is an OSS project comes into play.

  1. All of the implemented features listed here are available to anyone who wants to download and run Slider.
  2. All the new ideas are there for someone to implement. Join the team! Check out the source, enhance it, write those simulation tests, submit patches!
This is not just me being too busy; we need a broader developer community to keep slider going and get it out of incubation. The needs of the many, the code of the many, improves the product for all.

And it's really interesting. I could get distracted putting in time on this. Indeed, SLIDER-856 kept me busy two weekends ago to the extent that I got told off for irresponsible role modelling (parental, not slider codebase). Apparently spending a weekend in front of a monitor is a bad example for a teenage boy. But the placement problem is not just something I find interesting. Read the Borg paper and notice how they call out placement across failure domains and availability zones. They've hit the same problems. Add it to slider and collect real world data and you've got insight into scheduling and placing cluster workloads that even Google would be curious about.

So: come and play.

2015-04-18

Build tools as Proof Engines


Bristol@300mm

Someone has put up a thoughtful post on whether you can view make as a proof system. I thought about it and have come up with a conclusion: maybe in the past —but not any more.

As background, it's worth remembering
  1. I did "write the book on Ant", which is a build system in which you declare the transformational functions required to get your application into a state which you consider shippable/deployable, a set of functions which you define in a DAG for the Ant execution engine to generate and order from and then apply to your environment
  2. Early last year, in my spare time, I formally specified the Hadoop FS APIs in Spivey's Z notation, albeit in a Python syntax.
  3. In YARN-913 I went on to specify the desired behaviour of the YARN registry as a TLA+ specification.
And while I don't discuss it much, during my undergraduate work on the formal specification and implementation of microprocessors, I was using Gordon's HOL theorem prover. The latter is based on Standard ML, just to fend off anyone who believed me when I was claiming not to understand functional programming earlier this week. I didn't mean it; it just entertained the audience. Oh, and did I mention I'm happy to write Prolog too?

This means that (a) I believe I understand about specifications, (b) I vaguely remember what a proof engine is, and (c) I understand how Prolog resolves things, and specifically why "!" is the most useful operator when you are trying to write applications.

Now, I note that the author of that first post, Bob Atkey, not only has a background in formal logic and SML, and possibly even worked with the same people as me; his knowledge is both greater and more up to date than mine. I just have more experience of breaking builds and getting emails from Jenkins telling me so.

Now, consider a software project's build
  1. A set of source artifacts, S, containing artifacts s1..sn
  2. A declaration of the build process, B
  3. A set of external dependencies, libraries, L.
  4. A build environment, E, comprising the tools needed for the build and the computer environment around them, including the filesystem, OS, etc.
The goal is to reach a desired state of a set of redistributables, R, such that you are prepared to ship or deploy them.

The purpose of the build system, then, is to generate R from S through a series of functions applied to (S, L) with tools T within the environment E. The build process description, B, defines or declares that process.

There are many ways to do that; a single-line bash file, cc -c *.c && cc *.o, could be enough to compile the lexical analyser example from Ch. 2 of the dragon book.

Builds are more complex than that, which is where tools like make come in.

Make essentially declares that final list of redistributables, R, and a set of transformations from input artifacts to output artifacts, including the functions (actions by the tools) to generate the output artifacts from the input artifacts.

The make runtime looks at which artifacts exist, which are missing, and which are out of date, then somehow builds a chain of operations hypothesised to produce a complete set of output artifacts whose timestamps in the filesystem are later than those of the source files.

It is interesting to look at this formally, with a rule saying that to derive a .o from a .c, a function "cc -c" is applied to the source. Make looks at the filesystem for the existence of that .o file, its absence or "out-of-dateness", and, if needed, applies the function. If multiple rules are used to derive a sequence of transformations, then make will build that chain and then execute it.

One interesting question is: how does make build that series of functions, f1..fn, such that

R = fn(fn-1(fn-2(...f1(S, L)...)))?

I believe it backchains from the redistributables to build a series of rules which can be applied, then runs those rules forwards to create the final output.

If you view the final redistributables as a set of predicates whose existence is taken as truth and absence as falsehood, and all rules as implication operators that define a path to truth, not falsehood (i.e. we are reasoning over Horn clauses), then yes, I think you could say "make is a backward-chaining system to build a transformation resulting in truth".
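To make that concrete, here's a toy backward-chaining evaluator in Java: it recurses from a target through its prerequisites, then schedules an action wherever the filesystem says the target is missing or out of date. A sketch of the idea, not of how GNU make is actually implemented:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Toy make-like planner: a rule says how to derive a target from its
 * prerequisites; we recurse over prerequisites first (backchaining),
 * then plan the action if the target is absent or out of date.
 */
public class MiniMake {

  public static final class Rule {
    final String target;
    final List<String> prerequisites;
    final Runnable action;   // e.g. invoke "cc -c foo.c"
    public Rule(String target, List<String> prerequisites, Runnable action) {
      this.target = target;
      this.prerequisites = prerequisites;
      this.action = action;
    }
  }

  private final Map<String, Rule> rules;

  public MiniMake(Map<String, Rule> rules) {
    this.rules = rules;
  }

  /** Returns the ordered chain of actions needed to bring target up to date. */
  public List<Runnable> plan(String target) {
    List<Runnable> chain = new ArrayList<>();
    planInto(target, chain);
    return chain;
  }

  /** Plans one target; returns true if that target will be (re)built. */
  private boolean planInto(String target, List<Runnable> chain) {
    Rule rule = rules.get(target);
    if (rule == null) {
      return false;                      // a source file: nothing to derive
    }
    boolean outOfDate = false;
    for (String prereq : rule.prerequisites) {
      // backchain: a prerequisite being rebuilt forces a rebuild here too
      outOfDate |= planInto(prereq, chain);
    }
    File out = new File(target);
    if (!out.exists()) {
      outOfDate = true;
    } else {
      for (String prereq : rule.prerequisites) {
        File in = new File(prereq);
        // "truth" here is filesystem state: existence and timestamps
        if (in.exists() && in.lastModified() > out.lastModified()) {
          outOfDate = true;
        }
      }
    }
    if (outOfDate) {
      chain.add(rule.action);
    }
    return outOfDate;
  }
}
```

Call plan() on the final redistributable, then execute the returned chain in order: a forward run over a backward-chained plan.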

The problem is this: I'm not trying to build something for the sake of building it. I'm trying to build something to ship. I'm running builds and tests repeatedly, from my laptop on my desk, from my laptop in a conference auditorium, a hotel room, and a train doing 190 mph between Brussels and London. And that's all just this week.

Most of the time, those tests have been failing. There are three possible (and non-disjoint) causes of this
  • (a) the null hypothesis: I can't write code that works.
  • (b) a secondary hypothesis: I am better at writing tests to demonstrate the production code is broken than I am at fixing the production code itself.
  • (c) as the external environment changes, so does the outcome of the build process.

Let's pretend that (a) and (b) are false: that I can actually write code that works first time, and that the tests intended to show when this condition is not met are themselves well written and correct. Even if that held, my build would still have been broken for a significant fraction of the time this week.

Here are some real examples of fun problems, "type 3 fun" on the Ordnance Survey Fun Scale.
  1. The build halting because the particular sequence of operations it had chosen depended on Maven artifacts which were not only absent, but non-retrievable from a train somewhere under the English Channel.
  2. Connection Reset exceptions talking to an in-VM webapp from within a test case; a test case that worked last week. I never did find the cause of this, though I eventually concluded that it last worked before I installed a critical OS X patch (last week's, not this week's pending one). The obvious action was "reboot the Mac" —and lo, it did fix it. I just spent a lot of time on hypotheses (a) and (b) before settling on cause (c).
  3. Tests hanging in the keynote sessions because, while my laptop had an internet connection, GET requests against java.sun.com were failing. It turns out that when Jersey starts some servlets up, it tries to do a GET for some XSD file in order to validate some web.xml XML document. If the system is offline, it skips that. But if DNS resolves java.sun.com, then it does a GET and blocks until the document is returned or the GET fails. And as the internet in the keynote was a bit overloaded, tests just hung. Fix: edit /etc/hosts to map java.sun.com to 127.0.0.1, or turn off the wifi.
  4. A build at my hotel failing as the run crossed the midnight marker and Maven decided to pull down some -SNAPSHOT binaries from Maven Central, binaries which were not the ones I'd just built locally, during the ongoing build and test run.
What do all these have in common? Differences in the environment of the build, primarily networking, except case (4), which was due to the build taking place at a different time from previous builds.

Which brings me to a salient point
The environment of a build includes not only the source files and build rules; it includes the rest of the Internet and the connections to it. Furthermore, as the rule engine uses not just the presence/absence but also the timestamps of intermediate and final artifacts as triggers for actions, time is an implicit part of the reasoning.

You could make that temporal logic explicit, and have a build tool which looked at the timestamp of newly created files in /tmp and flagged up when your filesystem was in a different timezone. (Oh look, ant -diagnostics does that! I wonder who wrote that probe?) But it wouldn't be enough, because we've reached a point in which builds are dependent upon state external even to the local machine.

Our builds are, therefore, irretrievably nondeterministic.

2015-01-16

InfoSec risks of android travel applications


I was offline for 95% of the xmas break, instead investing my keyboard time into: (a) the exercises in Structure and Interpretation of Computer Programs and (b) writing some stuff on the implications of the Sony debacle for my home network security architecture.

I'm going to start posting the latter articles in an out-of-order sequence, with this post: InfoSec risks of android travel applications

Summary:
1. Airline check-in & travel apps demand so many privileges that you can't trust corporate calendar/contact data to stay on the device. Nor, in the absence of audit logs, can you tell if the information has leaked.

2. Budget airline applications are the least invasive; "premium" airlines demand access to confidential calendar info.

3. Even train timetable apps like to know things like your contact list.

However hard you lock down your network infrastructure, mandate 26-digit high-unicode passwords rolled monthly, and require encrypted phones and PIN-protected SIM cards, if those phones say "android" when they boot you can't be confident that sensitive corporate data isn't leaking out of them, not if the users expect to be able to use their phones to check on buses, trains or airplanes.

Introduction

Normally the fact that Android apps can ask for, and get, near-unlimited data access is viewed as a privacy concern. It is, for home users. Once you do any of the following, it becomes an InfoSec issue:
  • Synchronise calendar with a work email service.
  • Maintain a contact list which includes potentially confidential contacts/customers
  • Bond to a work Wifi network which offers network access to HTTP(S) sites without some form of auth.
  • Do the same via VPN
What is fascinating is that access to calendar info -especially "confidential event information"- is something that mainstream airline travel apps demand in exchange for giving you the ability to check in to a flight on your phone and look at schedules and your tickets. Android does not provide a way to directly prevent this.

Demands of Applications

Noticing that one application update wanted more information than I expected, I went through all the travel apps on my Android phone and looked at what permissions they demanded. These weren't explicitly installed for the experiment; they are simply what I use to fly on airlines, plus some train and bus apps in the UK. I'm excluding TripIt on the basis that their web infrastructure requests (optional) access to your Google email to autoscan for trip plans, which is in a different league from these.


Entity                     | Calendar                    | Contacts | Network                                | Location
British Airways            | confidential, participants  | No       | Yes                                    | Precise
United Airlines            | confidential, participants  | No       | Yes; view network connections          | Precise
Easyjet                    | No                          | No       | Yes                                    | Precise
Ryanair                    | No                          | No       | Yes                                    | Precise
National Rail              | Add, modify, participants   | No       | Yes                                    | Precise
National Express Coach     | No                          | Yes      | Yes; view network connections & wifi   | Precise
First Great Western trains | No                          | No       | Yes                                    | Precise
trainline                  | No                          | No       | Yes; view network connections          | Precise
First Bus                  | No                          | No       | Yes; view network connections          | Precise

When you look at this list, it's appalling. Why does the company that I use to get a bus to LHR need to know my contact list? Why does BA need my confidential appointment data? Why does the UK National Rail app need to be able to enumerate the calendar and send emails to participants without the owner's knowledge?

British Airways: wants access to confidential calendar info and full network access. What could possibly go wrong?



United: wants to call numbers, take photos and access confidential calendar info


National Express Bus Service
This is a bus company. How can they justify reading my contact list -business as well as personal?

 
UK National Rail
Pretty much total phone control, though not confidential appointment info. Are event titles considered confidential though?


Google's business model is built on knowing everything about your personal life, but this isn't about privacy; it is about preventing data leakage from an organisation. If anyone connects to your email services from an Android device, your airline check-in apps get to see the title, body and participants of all calendar appointments, whether that is "team meeting" or "plans for takeover of Walmart" where the participants include Jim Bezos and Donald Trump (*).
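To be concrete about what that permission hands over, here is a minimal sketch, using the standard Android CalendarContract API, of what any app granted READ_CALENDAR can do:

```java
import android.content.ContentResolver;
import android.database.Cursor;
import android.provider.CalendarContract;
import android.util.Log;

/**
 * Sketch of what any app holding READ_CALENDAR can do: enumerate every
 * event's title and description, however confidential, across every
 * synced account on the device.
 */
public class CalendarLeakSketch {

  public static void dumpAllEvents(ContentResolver resolver) {
    String[] projection = {
        CalendarContract.Events.TITLE,
        CalendarContract.Events.DESCRIPTION,
        CalendarContract.Events.DTSTART
    };
    Cursor cursor = resolver.query(
        CalendarContract.Events.CONTENT_URI, projection, null, null, null);
    if (cursor == null) {
      return;
    }
    try {
      while (cursor.moveToNext()) {
        String title = cursor.getString(0);
        String description = cursor.getString(1);
        long start = cursor.getLong(2);
        // a hostile, or merely careless, app could now send this anywhere
        Log.d("CalendarLeakSketch", start + " " + title + " " + description);
      }
    } finally {
      cursor.close();
    }
  }
}
```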

What could be done?
  1. Log accesses. I can't see a way to do this today, yet it would seem a core feature IT security teams would want. Without it you can't tell what information apps have read.
  2. Track provenance of calendar events and restrict calendar access only to events created by the airline apps themselves. This would require the servers to add event metadata; as google own gmail they could add a new BigTable column with ease.
  3. Restrict network access to HTTPS sites on specific subdomains. Requiring HTTPS is good for general wifi security, and stops (most) organisations from playing DNS games to get behind the firewall.
Above and beyond that: allow users to easily restrict what privileges applications actually get. Don't want to give an app access to the contacts? Flip a switch and have the API call return an empty list. Want to block confidential calendar access? Another switch, another empty payload. Apple lets me do that with foreground/background location data collection on their devices through a simple list of apps, but google doesn't.

In the absence of that feature, if you want to be able to check in on your android phone on a non-budget airline, you have to give up expectations of the security of your confidential calendar data and contact list.

And in a world of BYOD, where the IT dept doesn't have control of the apps on a phone, that means they can't stop sensitive calendar/contact data leaking at all.

(*) FYI, there are no appointments in my calendar discussing taking over Walmart that include both Jim Bezos and Donald Trump. I cannot confirm or deny any other meetings with these participants, or plans for Walmart involving other participants. Ask British Airways or UAL if you don't believe me.

"It is not necessary to have experience of Hadoop"


I don't normally post LinkedIn approaches, especially those from our competitors, but this one was so painful, blowing past my "do your research" criteria so dramatically, that it merits coverage.

FWIW my reply was: this is some kind of spoof, no?

Col du Barrioz


On 01/16/15 2:56 AM, Jessica <surname omitted to avoid embarrassment> wrote:    
--------------------
Hi Steve,

I hope you are well?

We are currently hiring at Cloudera to expand our Customer Operations Engineering team.

We are looking to build this team significantly over the coming months and this is a rare opportunity to become involved in Cloudera's Engineering department.

The role is home based with very little travel required (just for training).

We are looking for people with strong Linux backgrounds and good experience with programming languages. It is not necessary to have experience of Hadoop - we will teach you !!

For the chance to be part of this team please send me your CV to <email omitted to avoid embarrassment>@cloudera.com alternatively we can organise a time to speak for me to tell you more about the role?

Regards
Jessica

--------------------

As an aside, I am always curious why recruiter emails always start with "I hope you are well?".

a) We both know that the recruiter doesn't care about my health as long as it doesn't impact my ability to work with colleagues, customers and, once my training in Hadoop is complete, maybe even to learn how to use things like DelegationTokenAuthenticatedURL —that being what I am staring at right now. (*)

b) We both know that she doesn't actually want details like "well, the DVLA consider my neurological issues under control enough for me to drive again —even down to places like the Alps, and the ripped up tendon off my left kneecap is manageable enough for me to do hill work when I get there"

(*) If anyone has got Jersey+SPNEGO hooked up to UserGroupInformation, I would love that code.

2014-12-24

What have we learned this week?

The Sony/N Korea spat is fascinating in its implications

Ape of ninetree

  1. Never make an enemy of a nation state
  2. Any sufficiently large organisation is probably vulnerable to attack. I even worry about the attack surface of two adults and a child, and I don't know who is the greater risk: the other adult or the child. The latter I am teaching foundational infosec to, primarily as he learns to break household security to bootstrap access to infrastructure facilities to which he is denied access (e.g. the password to the 5GHz wifi network that doesn't go off at 21:00).
  3. Always encrypt HDDs with per-user keys. Any IT keys need to be locked down extra-hard.
  4. Never store passwords in plaintext files. At the very least, encrypt your word documents.
  5. Never email passwords to others. That goes for wifi passwords, incidentally, as children may come across unattended gmail inboxes and search for the words "wifi password"
  6. Never write anything in an email that you would be embarrassed to see public. Not confidential, simply unprofessional stuff that would make you look bad.
  7. The US considers a breach of security of a global organisation, possibly by a nation state, an act of terrorism.
The final one is something to call out. Nobody died here. It's cost money and has restricted the right of people round the world to watch something mediocre, but no lives were lost. Furthermore, and this is salient, *it was not an attack on any government or national infrastructure*. This was not an attack on the US itself.

In comparison, the Olympic Games/Stuxnet attack on the Iranian nuclear enrichment facility was a deliberate, superbly executed attack on the Iranian government, on their "peaceful enrichment project"/stage 1 nuclear weapons program. That was a significantly more strategic asset than emails criticising Adam Sandler (*).

By inference, if an information-leak attack on a corporate entity is terrorism, mechanical sabotage of a nation's nuclear program must be viewed as an act of war.

That doesn't mean it wasn't justified, any less than the Israeli bombing of a Syrian facility a few years back. And at least here the government(s) in question did actually target a state building WMDs rather than invade one that didn't, leave it in a state of near-civil-war and so help create the mess we get today (**).

Yet what -someone- did was commit an act of war against another country, during "peacetime". And got away with it.

Which is profound. Whether it is an attack on Iranian nuclear infrastructure, or a data grab and dump at Sony, over-the-internet warfare is something that is taking place today, in peacetime. It's the internet's equivalent of UAV attacks: small-scale operations with no risk to the lives of your own side, hence politically acceptable. Add in deniability and it is even better. Just as the suspects of the Olympic Games actions, apparently the US & Israel, deny that project while being happy with the outcome, so here can N. Korea say "we laud their actions even though we didn't do it".

Well, the US govt. probably set the precedent in Operation Olympic Games. It can't now look at what happened to Sony and say "this isn't fair" or "this is an act of war". Because if it is, we are already at war with Iran -and before long, likely to be at war with other countries.

(*) Please can Team Netflix add a taste-preferences option that asks about specific actors, so I can say "never recommend anything by Adam Sandler" without having to one-star everything they throw up, and so let it learn indirectly that I despise his work?

(**) On that topic, why is Tony Blair still a middle east peace envoy?