2017-11-06

I do not fear Kerberos, but I do fear Apple Itunes billing

I laugh at Kerberos messages. When I see a stack trace with a meaningless network error I go "that's interesting". I even learned PowerShell in a morning to fix where I'd managed to break our Windows build and tests.

But there is now one piece of software I do not ever want to approach, ever again. Apple icloud billing.

So far, since Saturday's warnings on my phone telling me that there was a billing problem
  1. Tried and repeatedly failed to update my card details
  2. Had my VISA card seemingly blocked by my bank,
  3. Been locked out of our Netflix subscription on account of them failing to bill a card which has been locked out by my may
  4. Had a chat line with someone on Apple online, who finally told me to phone an 800 number.
  5. Who are closed until office hours tomorrow
What am I trying to do? Set up iCloud family storage so I get a full resolution copy of my pics shared across devices, also give the other two members of our household lots of storage.

What have I achieved? Apart from a card lockout and loss of Netflix, nothing.

If this was a work problem I'd be loading debug level log files oftens of GB in editors, using regexps to delete all lines of noise, then trying to work backwards from the first stack trace in one process to where something in another system went awry. Not here though here I'm thinking "I don't need this". So if I don't get this sorted out by the end of the week, I won't be. I will have been defeated.

Last month I opted to pay £7/month for 2TB of iCloud storage. This not only looked great value for 2TB of storage, the fact I could share it with the rest of the family meant that we got a very good deal for all that data. And, with integration with iphotos, I could use to upload all my full resolution pictures. So sign up I did

My card is actually bonded to Bina's account, but here I set up the storage, had to reenter it. Where the fact that the dropdown menu switched to finnish was most amusing
finnish

With hindsight I should have taken "billing setup page cannot maintain consistency of locales between UI, known region of user, and menus" as a warning sign that something was broken.

Other than that, everything seemed to work. Photo upload working well. I don't yet keep my full photoset managed by iPhotos; it's long been a partitionedBy(year, month) directory tree built up with the now unmaintained Picasa, backed up at full res to our home server, at lower res to google photos. The iCloud experience seemed to be going smoothly; smoothly enough to think about the logistics of a full photo import. One factor there iCloud photos downloader works great as a way of downloading the full res images into the year/month layout, so I can pull images over to the server, so giving me backup and exit strategies.

That was on the Friday. On the Saturday a little alert pops up on the phone, matched by an email
Apple "we will take away all your photos"

Something has gone wrong. Well, no problem, over to billing. First, the phone UI. A couple of attempts and no, no joy. Over to the web page

Ths time, the menus are in german
appleID can't handle payment updates

"Something didn't work but we don't know what". Nice. Again? Same message.

Never mind, I recognise "PayPal" in german, lets try that:
And they can't handle paypal
No: failure.

Next attempt: use my Visa credit card, not the bank debit card I normally use. This *appears* to take. At least, I haven't got any more emails, and the photos haven't been deleted. All well to the limits of my observability.

Except, guess what ends up in my inbox instead? Netflix complaining about billing
Netflix "there was a problem"
Hypothesis: repeated failures of apple billing to set things up have caused the bank to lock down the card, it just so happens that Netflix bill the same day (does everyone do the first few days of each month?), and so: blocked off. That is, Apple Billing's issues are sufficient to break Netflix.

Over to the bank, review transactions, drop them a note.

My bank is fairly secure and uses 2FA with a chip-and-pin card inserted into a portable card reader. You can log in without it, but then cannot set up transfers to any new destination. I normally use the card reader and card. Not today though, signatures aren't being accepted. Solution, fall back to the "secrets" and then compose a message

Except of course, the first time I try that, it fails
And I can't talk to my bank about it

This is not a good day. Why can't I just have "Unknown failure at GSS API level". That I can handle. Instead what I am seeing here is a cross-service outage choreographed by Apple, which, if it really does take away my photos, will even go into my devices.

Solution: log out, log in. Compose the message in a text editor for ease of resubmission. Paste and submit. Off it goes.

Sunday: don't go near a computer. Phone still got a red marker "billing issues", though I can't distinguish from "old billing issues" from new billing issues. That is: no email to say things are fixed. At the same time, no emails to say "things are still broken". Same from netflix, neither a success message, or a failure one. Nothing from the bank either.

Monday: not worrying about this while working. No Kerberos errors there either. Today is a good day, apart from the thermostat on the ground floor not sending "turn the heating" on messages to the boiler, even after swapping the batteries.

After dinner, netflix. Except the TV has been logged out. Log in to netflix on the web and yes, my card is still not valid. Go to the bank, no response there yet. Go back to netflix, insert Visa credit card: its happy. This is good, as if this card started failing too, I'd be running out of functional payment mechanisms.

Now, what about apple?
apple id payment method; chrome
No, not english, or indeed, any language I know how to read. What now?

Apple support, in the form of a chat
Screen Shot 2017-11-06 at 21.10.51
After a couple of minutes wait I as talking to someone. I was a bit worried that the person I'm talking to was "allen". I know Allen. Sometimes he's helpful. Let's see.

After explaining my problem and sharing my appleId, Allen had a solution immediately: only the nominated owner of the family account can do the payment, even if the icloud storage account is in the name of another. So log in as them and try and sort stuff out there.

So: log out as me, long in as B., edit the billing. Which is the same card I've been using. Somehow, things went so wrong with Amazon billing trying to charge the system off my user ID and failing that I've been blocked everywhere. Solution: over to the VISA credit card. All "seems" well.

But how can I be sure? I've not got any emails from Apple Billing. The little alert in the settings window is gone, but I don't trust it. Without notification from Apple confirming that all is well, I have to assume that things are utterly broken. How can I trust a billing system which has managed to lock me out of my banking or netflix?

I raised this topic with Allen. After a bit of backwards and forwards, he gave me an 800 number to call. Which I did. They are closed after 19:00 hours, so I'll have to wait until tomorrow. I shall be calling them. I shall also be in touch with my bank.

Overall: this has been, so far, an utter disaster. Its not just that the system suffers from broken details (prompts in random languages), and deeply broken back ends (whose card is charged), but it manages to escalate the problem to transitively block out other parts of my online life.

If everything works tomorrow, I'll treat this as a transient disaster. If, on the other hand, things are not working tomorrow, I'm going to give up trying to maintain an iCloud storage account. I'll come up with some other solution. I just can't face having the billing system destroy the rest of my life.

2017-10-23

ROCA breaks my commit process

Since January I've been signing my git commits to the main Hadoop branches; along with Akira, Aaron and Allen we've been leading the way in trying to be more rigorous about authenticating our artifacts, in that gradual (and clearly futile) process to have some form of defensible INFOSEC policy on the development laptop (and ignoring homebrew, maven, sbt artifact installation,...).

For extra rigorousness, I've been using a Yubikey 4 for the signing: I don't have the secret key on my laptop *at all*, just the revocation secrets. To sign work, I use "git commit -S", the first commit of the day asks me to press the button and enter a PIN, from then on all I have to do to sign a commit is just press the button on the dongle plugged into a USB port on the monitor. Simple, seamless signing.

Yubikey rollout

Until Monday October 16, 2017.

There was some news in the morning about a WPA2 vulnerability. I looked at the summary and opted not to worry; the patch status of consumer electronics on the WLAN is more significant a risk than the WiFI password. No problem there, more a moral "never trust a network". As for a hinted at RSA vulnerability, it was going to inevitably be of two forms "utter disaster there's no point worrying about" or "hypothetical and irrelevant to most of us." Which is where I wasn't quite right.

Later on in the afternoon, glancing at fb on the phone and what I should I see but a message from facebook.

Your OpenPGP public key is weak. Please revoke it and generate a replacement

"Your OpenPGP public key is weak. Please revoke it and generate a replacement"

That's not the usual FB message. I go over to a laptop, log in to facebook and look at my settings: yes, I've pasted the public key in there. Not because I want encrypted comms with FB, but so that people can see if they really want to; part of my "publish the key broadly" program, as I've been trying to cross-sign/cross-trust other ASF committers' keys.

Then over to twitter and computing news sites and yes, there is a bug in a GPG keygen library used in SoC parts, from Estonian ID cards to Yubikeys like mine. And as a result, it is possible for someone to take my public key and generate the private one. While the vulnerability is public, the exact algorithm to regenerate the private key isn't so I have a bit of time left to kill my key. Which I do, and place an order for a replacement key (which has arrived)

And here's the problem. Git treats the revocation of a key as a sign that every single signature most now be untrusted.

Before, a one commit-per-line log of branch-2 --show-signature

git log --1show-signatures branch-2
After

git log1 --show-signature after revocation picked up

You see the difference? All my commits are now considered suspect. Anyone doing a log --show-signature will now actually get more warnings about the commits I signed than all those commits by other people which are not signed at all. Even worse, if someone were to ever try to do a full validation of the commit path at any time in the future is now going to see this. For the entire history of the git repo, those commits of mine are going to show up untrusted.

Given the way that git overreacts to key revocation, I didn't do this right.

What I should have done is simpler: force-expired the key by changing its expiry date the current date/time and pushing up the updated public key to the servers. As people update their keytabs from the servers, they'll see that the key isn't valid for signing new data, but that all commits issued by the user are marked as valid-at-the-time-but-with-an-expired-key. Key revocation would be reserved for the real emergency, "someone has my key and is actively using it".

I now have a new key, and will be rolling it out, This time I'm thinking ofr rolling my signing key every year, so that if I ever do have to revoke a key, it's only the last year's worth of commits which will be invalidated.

2017-09-09

Stocator: A High Performance Object Store Connector for Spark


Behind Picton Street

IBM have published a lovely paper on their Stocator 0-rename committer for Spark

Stocator is:
  1. An extended Swift client
  2. magic in their FS to redirect mkdir and file PUT/HEAD/GET calls under the normal MRv1 __temporary paths to new paths in the dest dir
  3. generating dest/part-0000 filenames using the attempt & task attempt ID to guarantee uniqueness and to ease cleanup: restarted jobs can delete the old attempts
  4. Commit performance comes from eliminating the COPY, which is O(data),
  5. And from tuning back the number of HTTP requests (probes for directories, mkdir 0 byte entries, deleting them)
  6. Failure recovery comes from explicit names of output files. (note, avoiding any saving of shuffle files, which this wouldn't work with...spark can do that in memory)
  7. They add summary data in the _SUCCESS file to list the files written & so work out what happened (though they don't actually use this data, instead relying on their swift service offering list consistency). (I've been doing something similar, primarily for testing & collection of statistics).

Page 10 has their benchmarks, all; of which are against an IBM storage system, not real amazon S3 with its different latencies and performance.

Table 5: Average run time


Read-Only 50GB
Read-Only 500GB
Teragen
Copy
Wordcount
Terasort
TPC-DS
Hadoop-Swift Base
37.80±0.48
393.10±0.92
624.60±4.00
622.10±13.52
244.10±17.72
681.90±6.10
101.50±1.50
S3a Base
33.30±0.42
254.80±4.00
699.50±8.40
705.10±8.50
193.50±1.80
746.00±7.20
104.50±2.20
Stocator
34.60±0.56
254.10±5.12
38.80±1.40
68.20±0.80
106.60±1.40
84.20±2.04
111.40±1.68
Hadoop-Swift Cv2
37.10±0.54
395.00±0.80
171.30±6.36
175.20±6.40
166.90±2.06
222.70±7.30
102.30±1.16
S3a Cv2
35.30±0.70
255.10±5.52
169.70±4.64
185.40±7.00
111.90±2.08
221.90±6.66
104.00±2.20
S3a Cv2 + FU
35.20±0.48
254.20±5.04
56.80±1.04
86.50±1.00
112.00±2.40
105.20±3.28
103.10±2.14

The S3a is the 2.7.x version, which has the stabilisation enough to be usable with Thomas Demoor's fast output stream (HADOOP-11183). That stream buffers in RAM & initiates the multipart upload once the block size threshold is reached. Provided you can upload data faster than you run out of RAM, it avoids the log waits at the end of close() calls, so has significant speedup. (The fast output stream has evolved into the S3ABlockOutput Stream (HADOOP-13560) which can buffer off heap and to HDD, and which will become the sole output stream once the great cruft cull of HADOOP-14738 goes in)

That means in the doc, "FU" == fast upload, == incremental upload & RAM storage. The default for S3A will become HDD storage, as unless you have a very fast pipe to a compatible S3 store, it's easy to overload the memory

Cv2 means MRv2 committer, the one which does  single rename operation on task commit (here the COPY), rather than one in task commit to promote that attempt, and then another in job commit to finalise the entire job. So only: one copy of every byte PUT, rather than 2, and the COPY calls can run in parallel, often off the critical path

 Table 6: Workload speedups when using Stocator



Read-Only 50GB
Read-Only 500GB
Teragen
Copy
Wordcount
Terasort
TPC-DS
Hadoop-Swift Base
x1.09
x1.55
x16.09
x9.12
x2.29
x8.10
x0.91
S3a Base
x0.96
x1.00
x18.03
x10.33
x1.82
x8.86
x0.94
Stocator
x1
x1
x1
x1
x1
x1
x1
Hadoop-Swift Cv2
x1.07
x1.55
x4.41
x2.57
x1.57
x2.64
x0.92
S3a Cv2
x1.02
x1.00
x4.37
x2.72
x1.05
x2.64
x0.93
S3a Cv2 + FU
x1.02
x1.00
x1.46
x1.27
x1.05
x1.25
x0.93


Their TCP-DS benchmarks show that stocator & swift is slower than TCP-DS Hadoop 2.7 S3a + Fast upload & MRv2 commit. Which means that (a) the Hadoop swift connector is pretty underperforming and (b) with fadvise=random and columnar data (ORC, Parquet) that speedup alone will give better numbers than swift & stocator. (Also shows how much the TCP-DS Benchmarks are IO heavy rather than output heavy the way the tera-x benchmarks are).

As the co-author of that original swift connector then, what the IBM paper is saying is "our zero rename commit just about compensates for the functional but utterly underperformant code Steve wrote in 2013 and gives us equivalent numbers to 2016 FS connectors by Steve and others, before they started the serious work on S3A speedup". Oh, and we used some of Steve's code to test it, removing the ASF headers.

Note that as the IBM endpoint is neither the classic python Openstack swift or Amazon's real S3, it won't exhibit the real issues those two have. Swift has the worst update inconsistency I've ever seen (i.e repeatable whenever I overwrote a large file with a smaller one), and aggressive throttling even of the DELETE calls in test teardown. AWS S3 has its own issues, not just in list inconsistency, but serious latency of HEAD/GET requests, as they always go through the S3 load balancers. That is, I would hope that IBM's storage offers significantly better numbers than you get over long-haul S3 connections. Although it'd be hard (impossible) to do a consistent test there, I 'd fear in-EC2 performance numbers to be actually worse than that measures.

I might post something faulting the paper, but maybe I'll should to do a benchmark of my new committer first. For now though, my critique of both the swift:// and s3a:// clients is as follows

Unless the storage services guarantees consistency of listing along with other operations, you can't use any of the MR commit algorithms to reliably commit work. So performance is moot. Here IBM do have a consistent store, so you can start to look at performance rather than just functionality. And as they note, committers which work with object store semantics are the way to do this: for operations like this you need the atomic operations of the store, not mocked operations in the client.

People who complain about the performance of using swift or s3a as a destination are blisfully unaware of the key issue: the risk of data loss due inconsistencies. Stocator solves both issues at once.

Anyway, means we should be planning a paper or two on our work too, maybe even start by doing something about random IO and object storage, as in "what can you do for and in columnar storage formats to make them work better in a world where a seek()+ read is potentially a new HTTP request."

(picture: parakeet behind Picton Street)






2017-05-22

Dissent is a right: Dissent is a duty. @Dissidentbot

It looks like the Russians interfered with the US elections, not just from the alleged publishing of the stolen emails, or through the alleged close links with the Trump campaign, but in the social networks, creating astroturfed campaigns and repeating the messages the country deemed important.

Now the UK is having an election. And no doubt the bots will be out. But if the Russians can do bots: so can I.

This then, is @dissidentbot.
Dissidentbot

Dissident bot is a Raspbery Pi running a 350 line ruby script tasked with heckling politicans
unrelated comments seem to work, if timely
It offers:
  • The ability to listen to tweets from a number of sources: currently a few UK politicians
  • To respond pick a random responses from a set of replies written explicitly for each one
  • To tweet the reply after a 20-60s sleep.
  • Admin CLI over Twitter Direct Messaging
  • Live update of response sets via github.
  • Live add/remove of new targets (just follow/unfollow from the twitter UI)
  • Ability to assign a probability of replying, 0-100
  • Random response to anyone tweeting about it when that is not a reply (disabled due to issues)
  • Good PUE numbers, being powered off the USB port of the wifi base station, SSD storage and fanless naturally cooled DC. Oh, and we're generating a lot of solar right now, so zero-CO2 for half the day.
It's the first Ruby script of more than ten lines I've ever written; interesting experience, and I've now got three chapters into a copy of the Pickaxe Book I've had sitting unloved alongside "ML for the working programmer".  It's nice to be able to develop just by saving the file & reloading it in the interpreter...not done that since I was Prolog programming. Refreshing.
Strong and Stable my arse
Without type checking its easy to ship code that's broken. I know, that's what tests are meant to find, but as this all depends on the live twitter APIs, it'd take effort, including maybe some split between Model and Control. Instead: broken the code into little methods I can run in the CLI.

As usual, the real problems surface once you go live:
  1. The bot kept failing overnight; nothing in the logs. Cause: its powered by the router and DD-WRT was set to reboot every night. Fix: disable.
  2. It's "reply to any reference which isn't a reply itself" doesn't work right. I think it's partly RT related, but not fully tracked it down.
  3. Although it can do a live update of the dissident.rb script, it's not yet restarting: I need to ssh in for that.
  4. I've been testing it by tweeting things myself, so I've been having to tweet random things during testing.
  5. Had to add handling of twitter blocking from too many API calls. Again: sleep a bit before retries.
  6. It's been blocked by the conservative party. That was because they've been tweeting 2-4 times/hour, and dissidentbot originally didn't have any jitter/sleep. After 24h of replying with 5s of their tweets, it's blocked.
The loopback code is the most annoying bug; nothing too serious though.

The DM CLI is nice, the fact that I haven't got live restart something which interferes with the workflow.
  Dissidentbot CLI via Twitter DM
Because the Pi is behind the firewall, I've no off-prem SSH access.

The fact the conservatives have blocked me, that's just amusing. I'll need another account.

One of the most amusing things is people argue with the bot. Even with "bot" in the name, a profile saying "a raspberry pi", people argue.
Arguing with Bots and losing

Overall the big barrier is content.  It turns out that you don't need to do anything clever about string matching to select the right tweet: random heckles seems to blend in. That's probably a metric of political debate in social media: a 350 line ruby script tweeting random phrases from a limited set is indistinguishable from humans.

I will accept Pull Requests of new content. Also: people are free to deploy their own copies. without the self.txt file it won't reply to any random mentions, just listen to its followed accounts and reply to those with a matching file in the data dir.

If the Russians can do it, so can we.

2017-05-15

The NHS gets 0wned

Friday's news was full of breaking panic about an "attack" on the NHS, making it sound like someone had deliberately made an attempt to get in there and cause damage.

It turns out that it wasn't an attack against the NHS itself, just a wide scale ransomware attack which combined click-through installation and intranet propagation by way of a vulnerability which the NSA had kept for internal use for some time.

Laptops, Lan ports and SICP

The NHS got decimated for a combination of issues:
  1. A massive intranet for SMB worms to run free.
  2. Clearly, lots of servers/desktops running the SMB protocol.
  3. One or more people reading an email with the original attack, bootstrapping the payload into the network.
  4. A tangible portion of the machines within some parts of the network running unpatched versions of Windows, clearly caused in part by the failure of successive governments to fund a replacement program while not paying MSFT for long-term support.
  5. Some of these systems within part of medical machines: MRI scanners, VO2 test systems, CAT scanners, whatever they use in the radiology dept —to name but some of the NHS machines I've been through in the past five years.
The overall combination then is: a large network/set of networks with unsecured, unpatched targets were vulnerable to a drive-by attack, the kind of attack, which, unlike a nation state itself, you may stand a chance of actually defending against.

What went wrong?

Issue 1: The intranet. Topic for another post.

Issue 2: SMB.

In servers this can be justified, though it's a shame that SMB sucks as a protocol. Desktops? It's that eternal problem: these things get stuck in as "features", but which sometimes come to burn you. Every process listening on a TCP or UDP port is a potential attack point. A 'netstat -a" will list running vulnerabilities on your system; enumerating running services "COM+, Sane.d? mDNS, ..." which you should review and decide whether they could be halted. Not that you can turn mDNS off on a macbook...

Issue 3: Email

With many staff, email clickthrough is a function of scale and probability: someone will, eventually. Probability always wins.

Issue 4: The unpatched XP boxes.

This is why Jeremy Hunt is in hiding, but it's also why our last Home Secretary, tasked with defending the nation's critical infrastructure, might want to avoid answering questions. Not that she is answering questions right now.

Finally, 5: The medical systems.

This is a complication on the "patch everything" story because every update to a server needs to be requalified. Why? Therac-25.

What's critical here is that the NHS was 0wned, not by some malicious nation state or dedicated hacker group: it fell victim to drive-by ransomware targeted at home users, small businesses, and anyone else with a weak INFOSEC policy This is the kind of thing that you do actually stand a chance of defending against, at least in the laptop, desktop and server.

Untitled

Defending against malicious nation state is probably near-impossible given physical access to the NHS network is trivial: phone up at 4am complaining of chest pains and you get a bed with a LAN port alongside it and told to stay there until there's a free slot in the radiology clinic.

What about the fact that the NSA had an exploit for the SMB vulnerability and were keeping quiet on it until the Shadow Brokers stuck up online? This is a complex issue & I don't know what the right answer is.

Whenever critical security patches go out, people try and reverse engineer them to get an attack which will work against unpatched versions of: IE, Flash, Java, etc. The problems here were:
  • the Shadow Broker upload included a functional exploit, 
  • it was over the network to enable worms, 
  • and it worked against widely deployed yet unsupported windows versions.
The cost of developing the exploit was reduced, and the target space vast, especially in a large organisation. Which, for a worm scanning and attacking vulnerable hosts, is a perfect breeding ground.

If someone else had found and fixed the patch, there'd still have been exploits out against it -the published code just made it easier and reduced the interval between patch and live exploit

The fact that it ran against an old windows version is also something which would have existed -unless MSFT were notified of the issue while they were still supporting WinXP. The disincentive for the NSA to disclose that is that a widely exploitable network attack is probably the equivalent of a strategic armament, one step below anything that can cut through a VPN and the routers, so getting you inside a network in the first place.

The issues we need to look at are
  1. How long is it defensible to hold on to an exploit like this?
  2. How to keep the exploit code secure during that period, while still using it when considered appropriate?
Here the MSFT "tomahawk" metaphor could be pushed a bit further. The US govt may have tomahawk missiles with nuclear payloads, but the ones they use are the low-damage conventional ones. That's what got out this time.

WMD in the Smithsonia

One thing that MSFT have to consider is: can they really continue with the "No more WinXP support" policy? I know they don't want to do it, the policy of making customers who care paying for the ongoing support is a fine way to do it, it's just it leaves multiple vulnerabilites. People at home, organisations without the money and who think "they won't be a target", and embedded systems everywhere -like a pub I visited last year whose cash registers were running Windows XP embedded; all those ATMs out there, etc, etc.

Windows XP systems are a de-facto part of the nation's critical infrastructure.

Having the UK and US governments pay for patches for the NHS and everyone else could be a cost effective way of securing a portion of the national infrastructure, for the NHS and beyond.

(Photos: me working on SICP during an unplanned five day stay and the Bristol Royal Infirmary. There's a LAN port above the bed I kept staring at; Windows XP Retail packaging, Smithsonian aerospace museum, the Mall, Washington DC)

2017-05-05

Is it time to fork Guava? Or rush towards Java 9?

Lost Crew WiP

Guava problems have surfaced again.

Hadoop 2.x has long-shipped Guava 14, though we have worked to ensure it runs against later versions, primarily by re-implementing our own classes of things pulled/moved across versions.


Hadoop trunk has moved up to Guava 21.0, HADOOP-10101.This has gone and overloaded the Preconditions.checkState() method, such that: if you compile against Guava 21, your code doesn't link against older versions of Guava. I am so happy about this I could drink some more coffee.

Classpaths are the gift that keeps on giving, and any bug report with the word "Guava" in it is inevitably going to be a mess. In contrast, Jackson is far more backwards compatible; the main problem there is getting every JAR in sync.

What to do?

Shade Guava Everywhere
This is going too be tricky to pull off. Andrew Wang has taken on this task. this is one of those low level engineering projects which doesn't have press-release benefits but which has the long-term potential to reduce pain. I'm glad someone else is doing it & will keep an eye on it.

Rush to use Java 9
I am so looking forward to this from an engineering perspective:

Pull Guava out
We could do our own Preconditions, our own VisibleForTesting attribute. More troublesome are the various cache classes, which do some nice things...hence they get used. That's a lot of engineering.

Fork Guava
We'd have to keep up to date with all new Guava features, while reinstating the bits they took away. The goal: stuff build with old Guava versions still works.

I'm starting to look at option four. Biggest issue: cost of maintenance.

There's also the fact that once we use our own naming "org.apache.hadoop/hadoop-guava-fork" then maven and ivy won't detect conflicting versions, and we end up with > 1 version of the guava JARs on the CP, and we've just introduced a new failure mode.

Java 9 is the one that has the best long-term potential, but at the same time, the time it's taken to move production clusters onto Java 8 makes it 18-24 months out at a minimum. Is that so bad though?

I actually created the "Move to Java 9": JIRA in 2014. It's been lurking there, Akira Ajisaka doing the equally unappreciated step-by-step movement towards it.

Maybe I should just focus some spare-review-time onto Java 9; see what's going on, review those patches and get them in. That would set things up for early adopters to move to Java 9, which, for in-cloud deployments, is something where people can be more agile and experimental.

(photo: someone painting down in Stokes Croft. Lost Crew tag)

2017-04-12

Mocking: an enemy of maintenance

Bristol spring

I'm keeping myself busy right now with HADOOP-13786, an O(1) committer for job output into S3 buckets. The classic filesystem relies on rename() for that, but against S3 rename is a file-by-file copy whose time is O(data) and whose failure mode is "a mess", amplified by the fact that an inconsistent FS can create the illusion that destination data hasn't yet been deleted: false conflict.
. This creates failures like SPARK-18512., FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A, as well as long commit delays.

I started this work a while back, making changes into the S3A Filesystem to support it. I've stopped focusing on that committer, and instead pulled in the version which Netflix have been using, which has the advantages of a thought out failure policy, and production testing. I've been busy merging that with the rest of the S3A work, and am now at the stage where I'm switching it over to the operations I've written for the first attempt, the "magic committer". These are in S3A, where they integrate with S3Guard state updates, instrumentation and metrics, retry logic, etc etc. All good.

The actual code to do the switchover is straightforward. What is taking up all my time is fixing the mock tests. These are failing with false positives "I've broken the code", when really the cause is "these mock tests are too brittle". In particular, I've had to rework how the tracking of operations goes, as a Mock Amazon S3Ciient is no longer used by the committer, instead its associated with the FS instance, which then is shared by all operations in a single test method. And the use of S3AFS methods shows up where its failing due to the mock instance not initing properly. I ended up spending most of Tuesday simply implementing the abort() call, now I'm doing the same on commit(). The production code switches fine, it's just the mock stuff.

This has really put me off mocking. I have used it sporadically in the past, and I've occasionally had to work other people's. Mocking has some nice features
  • Can run in unit tests which don't need AWS credentials, so Yetus/Jenkins can run them on patches.
  • Can be used to simulate failures and validate outcomes.
But the disadvantage is I just think they are too high maintenance. One test I've already migrated to being an integration test against an object store; I retained the original mock one, but just deleted that yesterday as it was going to be too expensive to migrate, and, with
that IT test, obsolete.

The others, well: the changes for abort() should help, but every new S3A method that gets called triggers new problems which I need to address. This is, well, "frustrating".

It's really putting me off mocking. Ignoring the Jenkins aspect, the key benefit is structure fault injection. I believe I could implement that in the IT tests too, at least in those tests which run in the same JVM. If I wanted to, I could probably even do it in the forked VMs by f propagating details on the desired failures to the processes. Or, if I really wanted to be devious, by running an HTTP proxy in the test VM and simulating network failures for the AWS client code itself to hit. That wouldn't catch all real-world problems (DNS, routing), but I could raise authentication, transient HTTP failures, and of course, force in listing inconsistencies. This is tempting, because it will help me qualify the AWS SDK we depend on, and could be re-used for testing the Azure storage too. Yes, it would take effort —but given the cost of maintaining those Mock tests after some minor refactoring of the production code, it's starting to look appealing.

(photo: Garage door, Greenbank, Bristol)