2018-01-10

Berlin Buzzwords: CFP with an offer of abstract review

Berlin Buzzwords CFP is open, which, along with Dataworks Summit in April, is going to make Berlin the place for technical conferences in 2018.
Berlin
As with last year, I'm offering to review people's abstracts before they're submitted; help edit them to get the text to be more in the style that reviewers to tend to go for.

When we review the talks, we look for interesting things in the themes of the conference, try and balance topics, pick the fun stuff. And we measure that (interesting, fun) on the prose of the submissions, knowing that they get turned into the program for the attendees: we want the text to be compelling for the audience.

The target audiences for submissions then are twofold. The ultimate audience is the attendees. The reviewers? We're the filter in the way.

But alongside that content, we want a diverse set of speakers, including people who have never spoken before. Otherwise it gets a bit repetitive (oh, no, stevel will talk on something random, again), and that's no good for the audience. But how do we regulars get in, given that the submission process is anonymous?

We do it by writing abstracts which we know the reviewers are looking for.

The review process, then, is a barrier to getting new speakers into the talk, which is dangerous: we all miss out on the insights from other people. And for the possible speakers, they miss out on the fun you have being a speaker at a conf, trying to get your slides together, discovering an hour in advance that you only have 20 min and not 30 for your talk and picking 1/3 of the slides to hide. Or on a trip to say, Boston, having your laptop have a hardware fault and you being grateful you snapshotted it onto a USB stick before you set off. Those are the bad points. The good bits? People coming up to you afterwards and getting into discussion about how they worked on similar stuff but came up with a better answer, how you learn back from the audience about related issues, how you can spend time in Berlin in cafes and wandering round, enjoying the city in early summer, sitting outside at restaurants with other developers from around Europe and the rest of the world, sharing pizza and beer in the the evening. Berlin is a fun place for conferences.

Which is why people should submit a talk, even if they've never presented before. And to help them, feel free to stick a draft up on google docs & then share with edit rights to my gmail address, steve.loughran@ ;  send me a note and I'll look through.

yes, I'm one of the reviewers, but in my reviews I call out that I helped with the submission: fairness is everything.

Last year only one person, Raam Rosh Hai, took this offer up, And he got in, with his talk How to build a recommendation system overnight! This means that so far, all drafts which have been through this pre-review of submissions process, has a 100% success rate. And, if you look at the video, you'll see its a good talk: he deserved that place.


Anyway, Submission deadline: Feb 14. Conference June 10-12.  Happy to help with reviewing draft abstracts.

2018-01-08

Trying to Meltdown in Java -failing. Probably

Meltdown has made for an "interesting" week in computing, as everyone is learning about/revising their knowledge of Speculative Execution. FWIW, I'd recommend the latest version of Patterson and Hennessey, Computer Architecture A Quantitative Approach. Not just for its details on speculative execution, but because it is the best book on microprocessor architecture and design that anyone has ever written, and lovely to read. I could read it repeatedly and not get bored.(And I see I need to get the 6th edition!)

Stokes Croft drugs find

This weekend, rather than read Patterson and Hennessey(*) I had a go to see if you could implement the meltdown attack in Java, hence in mapreduce, spark, or other non-native JAR

My initial attempt failed provided the part only speculates one branch in.

More specifically "the range checking Java does on all array accesses blocks the standard exploit given steve's assumptions". You can speculatively execute the out of bounds query, but you can't read the second array at an offset which will trigger $L1 cache loading.

If there's a way to do a reference to two separate memory locations which doesn't trigger branching range checks, then you stand a chance of pulling it off. I tried that using the ? : operator pair, something like

String ref = data ? refA : ref B;

which I hoped might compile down to something like


mov ref, refB
cmp data, 0
cmovnz ref, refB

This would do the move of the reference in the ongoing speculative branch, so, if "ref" was referenced in any way, trigger the resolution

In my experiment (2009 macbook pro with OSX Yosemite + latest java 8 early access release), a branch was generated ... but there are some refs in the open JDK JIRA to using CMOV, including the fact that hotspot compiler may be generating it if it things the probability of the move taking place is high enough.

Accordingly, I can't say "the hotspot compiler doesn't generate exploitable codepaths", only "in this experiment, the hotspot compiler didn't appear to generate an exploitable codepath".

Now the code is done, I might try on a Linux VM with Java 9 to see what is emitted
  1. If you can get the exploit in, then you'd have access to other bits of the memory space of the same JVM, irrespective of what the OS does. That means one thread with a set of Kerberos tickets could perhaps grab the secrets of another. IT'd be pretty hard, given the way the JVM manages objects on the heap: I wouldn't know where to begin, but it would become hypothetically possible.
  2. If you can get native code which you don't trust loaded into the JVM, then it can do whatever it wants. The original meltdown exploit is there. But native code running in JVM is going to have unrestricted access to the entire address space of the JVM -you don't need to use meltdown to grab secrets from the heap. All meltdown would do here is offer the possibility of grabbing kernel space data —which is what the OS patch does.

Anyway, I believe my first attempts failed within the context of this experiment.

Code-wise, this kept me busy on Sunday afternoon. I managed to twist my ankle quite badly on a broken paving stone on the way to patisserie on Saturday, so sat around for an hour drinking coffee in Stokes Croft, then limped home, with all forms of exercise crossed off the TODO list for the w/e. Time for a bit of Java coding instead, as a break for what I'd been doing over the holiday (C coding a version of Ping which outputs CSV data and a LaTeX paper on the S3A committers)

It took as much time trying get hold of the OS/X disassembler for generated code as it did coding the exploit. Why so? Oracle have replaced all links in Java.sun.net which would point to the reference dynamic library with a 302 to the base Java page telling you how lucky you are that Java is embedded in cars. Or you see a ref to on-stack-replacement on a page in Project Kenai, under a URL which starts with https://kenai.com/, point your browser there and end up on http://www.oracle.com/splash/kenai.com/decommissioning/index.html and the message "We're sorry the kenai.com site has closed."

All the history and knowledge on JVM internals and how to work there is gone. You can find the blog posts from four years ago on the topic, but the links to the tools are dead.

This is truly awful. It's the best argument I've seen for publishing this info as PDF files with DOI references, where you can move the artifact around, but citeseer will always find it. If the information doesn't last five years, then

The irony is, it means that because Oracle have killed all those inbound links to Java tools, they're telling the kind of developer who wants to know these things to go away. That's strategically short-sighted. I can understand why you'd want to keep the cost of site maintenance down, but really, breaking every single link? It's a major loss to the Java platform —especially as I couldn't even find a replacement.

I did manage to find a copy of the openjdk tarball people send you could D/L and run make on, but it was on a freebsd site, and even after a ./Configure && make, it broke trying to create a bsd dynlib. Then I checked out the full openjdk source tree, branch -8, installed the various tools and tried to build there. Again, some error. I ended up finding a copy of the needed hsdis-amd64.dylib library on Github, but I had to then spend some time looking at evolvedmicrobe's work &c to see if I could trust this to "probably" not be malware itself. I've replicated the JAR in the speculate module, BTW.

Anyway, once the disassembler was done and the other aspects of hotspot JIT compilation clear (if you can't see the method you wrote, run the loop a few thousand more times), I got to see some well annotated x86-64 assembler. Leaving me with a new problem: x86-64 assembler. It's a lot cleaner than classic 32 bit x86: having more registers does that, especially as it gives lots of scope for improving how function parameters and return values are managed.

What next? This is only a spare time bit of work, and now I'm back from my EU-length xmas break, I'm doing other things. Maybe next weekend I'll do some more. At least now I know that exploiting meltdown from the JVM is not going be straightforward.

Also I found it quite interesting playing with this, to see when the JVM kicks out native code, what it looks like. We code so far from the native hardware these days, its too "low level". But the new speculation-side-channel attacks have shown that you'd better understand modern CPU architectures, including how your high-level code gets compiled down.

I think I should submit a berlin buzzwords talk on this topic.

(*) It is traditional to swap the names of the author on every use. If you are a purist you have to remember the last order you used.

2018-01-04

Speculation


Speculative execution has been intel's strategy for keeping the x86 architecture alive since the P6/Pentium Pro part shipped in '95.

I remember coding explicitly for the P6 in a project in 1997; HPLabs was working with HP's IC Division to build their first CMOS-camera IC, which was an interesting problem. Suddenly your IC design needs to worry about light, aligning the optical colour filter with the sensors, making sure it all worked.

Eyeris

I ended up writing the code to capture the raw data at full frame rate, streaming to HDD, with an option to alternatively render it with/without the colour filtering (algorithms from another bit HPL team). Which means I get to nod knowingly when people complain about "raw" data. Yes, it's different for every device precisely because its raw.

The data rates of the VGA-resolution sensor via the PCI boards used to pull this off meant that a both cores of a multiprocessor P6 box were needed. It was the first time I'd ever had a dual socket system, but both sockets were full with the 150MHz parts and with careful work we could get away with the "full link rate" data capture which was a core part of the qualification process. It's not enough to self test the chips any more see, you need to look at the pictures.

Without too many crises, everything came together, which is why I have a framed but slightly skewed IC part to hand. And it's why I have memories of writing multithreaded windows C++ code with some of the core logic in x86 assembler. I also have memories of ripping out that ASM code as it turned out that it was broken, doing it as C pointer code and having it be just as fast. That's because: C code compiled to x86 by a good compiler, executed on a great CPU, is at least performant as hand-written x86 code by someone who isn't any good at assembler, and can be made to be correct more easily by the selfsame developer.

150 MHz may be a number people laugh at today, but the CPU:RAM clock ratios weren't as bad as they are today: cache misses are less expensive in terms of pipeline stalls, and those parts were fast. Why? Speculative and out of order execution, amongst other things
  1. The P6 could provisionally guess which way a branch was going to go, speculatively executing that path until it became clear whether or not the guess was correct -and then commit/abort that speculative code path.
  2. It uses a branch predictor to make that guess on the direction a branch was taken, based on the history of previous attempts, and a default option (FWIW, this is why I tend to place the most likely outcome first in my if() statements; tradition and superstition).
  3. It could execute operations out of order. That is, it's predecessor, the P5, was the last time mainstream intel desktop/server parts executed x86 code in the order the compiler generated them, or the human wrote them.
  4. register renaming meant that even though the parts had a limited set of registers, those OOO operations could reuse the same EAX, EBX, ECX registers without problems.
  5. It had caching to deal with the speed mismatch between that 150 MHz CPU & RAM.
  6. It supported dual CPU desktops, and I believe quad-CPU servers too. They'd be called "dual core" and "quad core" these days and looked down at.

Being the first multicore system I'd ever used, it was a learning experience. First was learning how too much windows NT4 code was still not stable in such a world. NTFS crashes with all all volumes corrupted? check. GDI rendering triggering kernel crash? check. And on a 4-core system I got hold of, everything crashed more often. Lesson: if you want a thread safe OS, give your kernel developers as many cores as you can.

OOO forced me to learn about the x86 memory model itself: barrier opcodes, when things could get reordered and when they wouldn't. Summary: don't try and be clever about synchronization, as your assumptions are invalid.

Speculation is always an unsatisfactory solution though. Every mis-speculation is lost cycles. And on a phone or laptop, that's wasted energy as much as time. And failed reads could fill up the cache with things you didn't want. I've tried to remember if I ever tried to use speculation to preload stuff if present, but doubt it. The CMOV command was a non-branching conditional assignment which was better, even if you had to hand code it.  The PIII/SSE added the PREFETCH opcode so you could a non-faulting hinted prefetch which you could stick into your non-branching code, but that was a niche opcode for people writing games/media codecs &c. And as Linus points out, what was clever for one CPU model turns out to be a stupid idea a generation later. (arguably, that applies to Itanium/IA-64, though as it didn't speculate, it doesn't suffer from the Spectre & Meltdown attacks).

Speculation, then: a wonderful use of transistors to compensate for how we developers write so many if() statements in our code. Wonderful, it kept the x86 line alive and so helped Intel deliver shareholder value and keep the RISC CPU out of the desktop, workstation and server businesses. Terrible because :"transistors" is another word for "CPU die area" with its yield equations and opportunity cost, and also for "wasted energy on failed speculations". If we wrote code which had fewer branches in it, and that got compiled down to CMOV opcodes, life would be better. But we have so many layers of indirection these days; so many indirect references to resolve before those memory accesses. Things are probably getting worse now, not better.

This week's speculation-side-channel attacks are fascinating then. These are very much architectural issues about speculation and branch prediction in general, rather than implementation details. Any CPU manufacturer whose parts do speculative execution has to be worried here, even if there's no evidence that your shipping parts aren't vulnerable to the current set of attacks. The whole point about speculation is to speed up operation based on the state of data held in registers or memory, so the time-to-execute is always going to be a side-channel providing information about the data used to make a branch.


The fact that you can get at kernel memory, even from code running under a hypervisor, means, well, a lot. It means that VMs running in cloud infrastructure could get at the data of the host OS and/or those of other VMs running on the same host (those S3 SSE-C keys you passed up to your VM? 0wned, along with your current set of IAM role credentials). It potentially means that someone else's code could be playing games with branch prediction to determine what codepaths your code is taking. Which, in public cloud infrastructure is pretty serious, as the only way to stop people running their code alongside yours is currently to pay for the top of the line VMs and hope they get a dedicated part. I'm not even sure that dedicated cores in a multicore CPU are sufficient isolation, not for anything related to cache-side-channel attacks (they should be good for branch prediction, I think, if the attacker can't manipulate the branch predictor of the other cores).

I can imagine the emails between cloud providers and CPU vendors being fairly strained, with the OEM/ODM teams on the CC: list. Even if the patches being rolled out mitigate things, if the slowdown on switching to kernelspace is as expensive as hinted, then that slows down applications, which means that the cost of running the same job in-cloud just got more expensive. Big cloud customers will be talking to their infrastructure suppliers on this, and then negotiating discounts for the extra CPU hours, which is a discount the cloud providers will expected to recover when they next buy servers. I feel as sorry for the cloud CPU account teams as I do for the x86 architecture group.

Meanwhile, there's an interesting set of interview questions you could ask developers on this topic.
  1. What does the generated java assembly for the Ival++ on a java long look like?
  2. What if the long is marked as volatile?
  3. What does the generated x86 assembler for a Java Optional<AtomicLong> opt.map(AtomicLong::addAndGet(1)) look like?
  4. What guarantees do you get about reordering?
  5. How would you write code which attempted to defend against speculation timing attacks?

I don't have the confidence to answer 1-4 myself, but I could at least go into detail about what I believed to be the case for 1-3; for #4 I should do some revision.

As for #5, defending. I would love to see what others suggest. Conditional CMOV ops could help against branch-prediction attacks, by eliminating the branches. However, searching for references to CMOV and the JDK turns up some issues which imply that branch prediction can sometimes be faster...", including "JDK-8039104. Don't use Math.min/max intrinsic on x86" it may be that even CMOV gets speculated on; with the CPU prefetching what is moved and keeping the write uncommitted until the state of the condition is known.

I suspect that the next edition of Hennessy and Patterson, "Computer Architecture, a Quantitative Approach" will be covering this topic.I shall look forward to with even greater anticipation than I have had for all the previous, beloved, versions.

As for all those people out there panicking about this, worrying if their nearly-new laptop is utterly exposed? You are running with Flash enabled on a laptop you use in cafe wifis without a VPN and with the same password, "k1tten",  you use for gmail and paypal. You have other issues.

2017-11-23

How to play with the new S3A committers

Untitled

Following up from yesterday's post on the S3A committers, here's what you need for picking up the committers.
  1. Apache Hadoop trunk; builds to 3.1.0-SNAPSHOT:  
  2. The documentation on use.
  3. An AWS keypair, try not to commit them to git. Tip for the Uber team: git-secrets is something you can add as a checkin hook. Do as I do: keep them elsewhere.
  4. If you want to use the magic committer; turn S3Guard on. Initially I'd use the staging committer, specificially the "directory" on.
  5. switch s3a:// to use that committer: fs.s3a.committer.name =  partitioned
  6. Run your MR queries
  7. look in _SUCCESS for committer info. 0-bytes long: classic FileOutputCommitter. Bit of JSON naming committer, files committed and some metrics (SuccessData) and you are using an S3 committer.
If you do that: I'd like to see the numbers comparing FileOutputCommitter (which must have S3Guard) and the new committers. For benchmark consistency, leave S3Guard on.

If you can't get things to work because the docs are wrong: file a JIRA with a patch. If the code is wrong: submit a patch with the fix & tests.

Spark?
  1. Spark Master has a couple of patches to deal with integration issues (FNFE on magic output paths, Parquet being over-fussy about committers, I think the committer binding has enough workarounds for these to work with Spark 2.2 though.
  2. Checkout my cloud-integration for Apache Spark repo, and its production-time redistributable, spark-cloud-integration.
  3. Read its docs and use
  4. If you want to use Parquet over other formats, use this committer.  
  5. Again,. check _SUCCESS to see what's going on.
  6. There's a test module with various (scaleable) tests as well as a copy and paste of some of the Spark SQL test.
  7. Spark can work with the Partitioned committer. This is a staging committer which only worries about file conflicts in the final partitions. This lets you do in-situ updates of existing datasets, adding new partitions or overwriting existing ones, while leaving the rest alone. Hence: no need to move the output of a job into the reference datasets.
  8. Problems. File an issue. I've just seen Ewan has a couple of PRs I'd better look at, actually.
Committer-wise, that spark-cloud-integration module is ultimately transient. I think we can identify those remaining issues with committer setup in spark core, after which a hadoop 3.0+ specific module should be able to work out the box with the new committers.

There's still other things there, like
  • Cloud store optimised file input stream source
  • ParallizedWithLocalityRDD: and RDD which lets you provide custom functions to declare locality on a row-by-row basis. Used in my demo of implementing DistCp in Spark. Every row is a filename, which gets pushed out to a worker close to the data, it does the upload. This is very much a subset of distCP, but it shows this: you can have with with RDDs and cloud storage.
  • + all the tests
I think maybe Apache Bahir would be the ultimate home for this. For now, a bit too unstable.

(photo: spices on sale in a Mombasa market)

2017-11-22

subatomic

Untitled

I've just committed HADOOP-13786 Add S3A committer for zero-rename commits to S3 endpoints.. Contributed by Steve Loughran and Ryan Blue.

This is a serious and complex piece of work; I need to thank:
  1. Thomas Demoor and Ewan Higgs from WDC for their advice and testing. They understand the intricacies of the S3 protocol to the millimetre.
  2. Ryan Blue for his Staging-based S3 committer. The core algorithms and code will be in hadoop-aws come Hadoop 3.1.
  3. Colleagues for their support, including the illustrious Sanjay Radia, and Ram Venkatesh for letting me put so much time into this.
  4. Reviewers, especially Ryan Blue, Ewan Higgs, Mingliang Liu and extra especially Aaron Fabbri @ cloudera. It's a big piece of code to learn. First time a patch of mine has ever crossed the 1MB source barrier
I now understand a lot about commit protocols in Hadoop and Spark, including the history of interesting failures encountered, events which are reflected in the change logs of the relevant classes. Things you never knew about the Hadoop MapReduce commit protocol
  1. The two different algorithms, v1 and v2 have very different semantics about the atomicity of task and job commits, including when output becomes visible in the destination directory.
  2. Neither algorithm is atomic in both task and job commit.
  3. V1 is atomic in task commits, but O(files) in its non-atomic job commit. It can recover from any job failure without having rerun all succeeded tasks, but not from a failure in job commit.
  4. V2's job commit is a repeatable atomic O(1) operation, because it is a no-op. Task commits do the move/merge, which are O(file), make the output immediately visible, and as a consequence, mean that failure of a job means the output directory is in an unknown state.
  5. Both algorithms depend on the filesystem having consistent listings and Create/Update/Delete operations
  6. The routine to merge the output of a task to the destination is a real-world example of a co-recursive algorithm. These are so rare most developers don't even know the term for them -or have forgotten it.
  7. At-most-once execution is guaranteed by having the tasks and AM failing when they recognise that they are in trouble.
  8. The App Master refuses to commit a job if it hasn't had a heartbeat with the YARN Resource Manager within a specific time period. This stops it committing work if the network is partitioned and the AM/RM protocol fails...YARN may have considered the job dead and restarted it.
  9. tasks commit iff they get permission from the AM; thus they will not attempt to commit if the network partitions.
  10. if a task given permission to commit does not report a successful commit to the AM; the V1 algorithm can rerun the task; v2 must conclude its in an unknown state and abort the job.
  11. Spark can commit using the Hadoop FileOutputCommitter; its Parquet support has some "special" code which refuses to work if the committer is not a subclass of ParquetOutputCommitter
  12. . That is: it's special code makes it the hardest thing to bind to this: ORC, CSV, Avro: they all work out the box.
  13. It adds the ability for tasks to provide extra data to its job driver for use in job commit; this allows committers to explicitly pass commit information directly to the driver, rather than indirectly via the (consistent) filesystem.
  14. Everyone's code assumes that abort() completes in a bounded time, and does not ever throw that IOException its signature promises it can.
  15. There's lots of cruft in the MRv2 codebase to keep the MRv1 code alive, which would be really good to delete
This means I get to argue the semantics of commit algorithms with people, as I know what the runtimes "really do", rather than believed by everyone who has neither implemented part of it or stepped throught the code in a debugger.

If we had some TLA+ specifications of filesystems and object stores, we could perhaps write the algorithms as PlusCal examples, but that needs someone with the skills and the time. I'd have to find the time to learn TLA+ properly as well as specify everything, so it won't be me.

Returning to the committers, what do they do which is so special?

They upload task output to the final destination paths in the tasks, but don't make the uploads visible until the job is committed.

No renames, no copies, no job-commit-time merges, and no data visible until job commit. Tasks which fail/fail to commit do not have any adverse side effects on the destination directories.

First, read S3A Committers: Architecture and Implementation.

Then, if that seems interesting look at the source.

A key feature is that we've snuck in to FileOutputFormat a mechanism to allow you to provide different committers for different filesystem schemas.

Normal file output formats (i.e. not Parquet) will automatically get the committer for the target filesystems, which, for S3A, can be changed from the default FileOutputCommitter to an S3A-specific one. And any other object store which also offers delayed materialization of uploaded data can implement their own and run it alongside the S3A ones, which will be something to keep the Azure, GCS and openstack teams busy, perhaps.

For now though: users of Hadoop can use Amazon S3 (or compatible services) as the direct destination of Hadoop and Spark workloads without any overheads of copying the data, and the ability to support failure recovery and speculative execution. I'm happy with that as a good first step.

(photo: street vendors at the Kenya/Tanzania Border)

2017-11-06

I do not fear Kerberos, but I do fear Apple Itunes billing

I laugh at Kerberos messages. When I see a stack trace with a meaningless network error I go "that's interesting". I even learned PowerShell in a morning to fix where I'd managed to break our Windows build and tests.

But there is now one piece of software I do not ever want to approach, ever again. Apple icloud billing.

So far, since Saturday's warnings on my phone telling me that there was a billing problem
  1. Tried and repeatedly failed to update my card details
  2. Had my VISA card seemingly blocked by my bank,
  3. Been locked out of our Netflix subscription on account of them failing to bill a card which has been locked out by my may
  4. Had a chat line with someone on Apple online, who finally told me to phone an 800 number.
  5. Who are closed until office hours tomorrow
What am I trying to do? Set up iCloud family storage so I get a full resolution copy of my pics shared across devices, also give the other two members of our household lots of storage.

What have I achieved? Apart from a card lockout and loss of Netflix, nothing.

If this was a work problem I'd be loading debug level log files oftens of GB in editors, using regexps to delete all lines of noise, then trying to work backwards from the first stack trace in one process to where something in another system went awry. Not here though here I'm thinking "I don't need this". So if I don't get this sorted out by the end of the week, I won't be. I will have been defeated.

Last month I opted to pay £7/month for 2TB of iCloud storage. This not only looked great value for 2TB of storage, the fact I could share it with the rest of the family meant that we got a very good deal for all that data. And, with integration with iphotos, I could use to upload all my full resolution pictures. So sign up I did

My card is actually bonded to Bina's account, but here I set up the storage, had to reenter it. Where the fact that the dropdown menu switched to finnish was most amusing
finnish

With hindsight I should have taken "billing setup page cannot maintain consistency of locales between UI, known region of user, and menus" as a warning sign that something was broken.

Other than that, everything seemed to work. Photo upload working well. I don't yet keep my full photoset managed by iPhotos; it's long been a partitionedBy(year, month) directory tree built up with the now unmaintained Picasa, backed up at full res to our home server, at lower res to google photos. The iCloud experience seemed to be going smoothly; smoothly enough to think about the logistics of a full photo import. One factor there iCloud photos downloader works great as a way of downloading the full res images into the year/month layout, so I can pull images over to the server, so giving me backup and exit strategies.

That was on the Friday. On the Saturday a little alert pops up on the phone, matched by an email
Apple "we will take away all your photos"

Something has gone wrong. Well, no problem, over to billing. First, the phone UI. A couple of attempts and no, no joy. Over to the web page

Ths time, the menus are in german
appleID can't handle payment updates

"Something didn't work but we don't know what". Nice. Again? Same message.

Never mind, I recognise "PayPal" in german, lets try that:
And they can't handle paypal
No: failure.

Next attempt: use my Visa credit card, not the bank debit card I normally use. This *appears* to take. At least, I haven't got any more emails, and the photos haven't been deleted. All well to the limits of my observability.

Except, guess what ends up in my inbox instead? Netflix complaining about billing
Netflix "there was a problem"
Hypothesis: repeated failures of apple billing to set things up have caused the bank to lock down the card, it just so happens that Netflix bill the same day (does everyone do the first few days of each month?), and so: blocked off. That is, Apple Billing's issues are sufficient to break Netflix.

Over to the bank, review transactions, drop them a note.

My bank is fairly secure and uses 2FA with a chip-and-pin card inserted into a portable card reader. You can log in without it, but then cannot set up transfers to any new destination. I normally use the card reader and card. Not today though, signatures aren't being accepted. Solution, fall back to the "secrets" and then compose a message

Except of course, the first time I try that, it fails
And I can't talk to my bank about it

This is not a good day. Why can't I just have "Unknown failure at GSS API level". That I can handle. Instead what I am seeing here is a cross-service outage choreographed by Apple, which, if it really does take away my photos, will even go into my devices.

Solution: log out, log in. Compose the message in a text editor for ease of resubmission. Paste and submit. Off it goes.

Sunday: don't go near a computer. Phone still got a red marker "billing issues", though I can't distinguish from "old billing issues" from new billing issues. That is: no email to say things are fixed. At the same time, no emails to say "things are still broken". Same from netflix, neither a success message, or a failure one. Nothing from the bank either.

Monday: not worrying about this while working. No Kerberos errors there either. Today is a good day, apart from the thermostat on the ground floor not sending "turn the heating" on messages to the boiler, even after swapping the batteries.

After dinner, netflix. Except the TV has been logged out. Log in to netflix on the web and yes, my card is still not valid. Go to the bank, no response there yet. Go back to netflix, insert Visa credit card: its happy. This is good, as if this card started failing too, I'd be running out of functional payment mechanisms.

Now, what about apple?
apple id payment method; chrome
No, not english, or indeed, any language I know how to read. What now?

Apple support, in the form of a chat
Screen Shot 2017-11-06 at 21.10.51
After a couple of minutes wait I as talking to someone. I was a bit worried that the person I'm talking to was "allen". I know Allen. Sometimes he's helpful. Let's see.

After explaining my problem and sharing my appleId, Allen had a solution immediately: only the nominated owner of the family account can do the payment, even if the icloud storage account is in the name of another. So log in as them and try and sort stuff out there.

So: log out as me, long in as B., edit the billing. Which is the same card I've been using. Somehow, things went so wrong with Amazon billing trying to charge the system off my user ID and failing that I've been blocked everywhere. Solution: over to the VISA credit card. All "seems" well.

But how can I be sure? I've not got any emails from Apple Billing. The little alert in the settings window is gone, but I don't trust it. Without notification from Apple confirming that all is well, I have to assume that things are utterly broken. How can I trust a billing system which has managed to lock me out of my banking or netflix?

I raised this topic with Allen. After a bit of backwards and forwards, he gave me an 800 number to call. Which I did. They are closed after 19:00 hours, so I'll have to wait until tomorrow. I shall be calling them. I shall also be in touch with my bank.

Overall: this has been, so far, an utter disaster. Its not just that the system suffers from broken details (prompts in random languages), and deeply broken back ends (whose card is charged), but it manages to escalate the problem to transitively block out other parts of my online life.

If everything works tomorrow, I'll treat this as a transient disaster. If, on the other hand, things are not working tomorrow, I'm going to give up trying to maintain an iCloud storage account. I'll come up with some other solution. I just can't face having the billing system destroy the rest of my life.

2017-10-23

ROCA breaks my commit process

Since January I've been signing my git commits to the main Hadoop branches; along with Akira, Aaron and Allen we've been leading the way in trying to be more rigorous about authenticating our artifacts, in that gradual (and clearly futile) process to have some form of defensible INFOSEC policy on the development laptop (and ignoring homebrew, maven, sbt artifact installation,...).

For extra rigorousness, I've been using a Yubikey 4 for the signing: I don't have the secret key on my laptop *at all*, just the revocation secrets. To sign work, I use "git commit -S", the first commit of the day asks me to press the button and enter a PIN, from then on all I have to do to sign a commit is just press the button on the dongle plugged into a USB port on the monitor. Simple, seamless signing.

Yubikey rollout

Until Monday October 16, 2017.

There was some news in the morning about a WPA2 vulnerability. I looked at the summary and opted not to worry; the patch status of consumer electronics on the WLAN is more significant a risk than the WiFI password. No problem there, more a moral "never trust a network". As for a hinted at RSA vulnerability, it was going to inevitably be of two forms "utter disaster there's no point worrying about" or "hypothetical and irrelevant to most of us." Which is where I wasn't quite right.

Later on in the afternoon, glancing at fb on the phone and what I should I see but a message from facebook.

Your OpenPGP public key is weak. Please revoke it and generate a replacement

"Your OpenPGP public key is weak. Please revoke it and generate a replacement"

That's not the usual FB message. I go over to a laptop, log in to facebook and look at my settings: yes, I've pasted the public key in there. Not because I want encrypted comms with FB, but so that people can see if they really want to; part of my "publish the key broadly" program, as I've been trying to cross-sign/cross-trust other ASF committers' keys.

Then over to twitter and computing news sites and yes, there is a bug in a GPG keygen library used in SoC parts, from Estonian ID cards to Yubikeys like mine. And as a result, it is possible for someone to take my public key and generate the private one. While the vulnerability is public, the exact algorithm to regenerate the private key isn't so I have a bit of time left to kill my key. Which I do, and place an order for a replacement key (which has arrived)

And here's the problem. Git treats the revocation of a key as a sign that every single signature most now be untrusted.

Before, a one commit-per-line log of branch-2 --show-signature

git log --1show-signatures branch-2
After

git log1 --show-signature after revocation picked up

You see the difference? All my commits are now considered suspect. Anyone doing a log --show-signature will now actually get more warnings about the commits I signed than all those commits by other people which are not signed at all. Even worse, if someone were to ever try to do a full validation of the commit path at any time in the future is now going to see this. For the entire history of the git repo, those commits of mine are going to show up untrusted.

Given the way that git overreacts to key revocation, I didn't do this right.

What I should have done is simpler: force-expired the key by changing its expiry date the current date/time and pushing up the updated public key to the servers. As people update their keytabs from the servers, they'll see that the key isn't valid for signing new data, but that all commits issued by the user are marked as valid-at-the-time-but-with-an-expired-key. Key revocation would be reserved for the real emergency, "someone has my key and is actively using it".

I now have a new key, and will be rolling it out, This time I'm thinking ofr rolling my signing key every year, so that if I ever do have to revoke a key, it's only the last year's worth of commits which will be invalidated.