2017-11-23

How to play with the new S3A committers


Following up from yesterday's post on the S3A committers, here's what you need for picking up the committers.
  1. Apache Hadoop trunk, which builds to 3.1.0-SNAPSHOT.
  2. The documentation on use.
  3. An AWS keypair; try not to commit it to git. Tip for the Uber team: git-secrets is something you can add as a check-in hook. Do as I do: keep them elsewhere.
  4. If you want to use the magic committer, turn S3Guard on. Initially I'd use the staging committer, specifically the "directory" one.
  5. Switch s3a:// to use that committer: fs.s3a.committer.name = directory (see the config sketch below).
  6. Run your MR queries.
  7. Look in _SUCCESS for committer info. If it's 0 bytes long, that's the classic FileOutputCommitter. If it's a bit of JSON naming the committer, the files committed and some metrics (SuccessData), you are using an S3A committer.
If you do that, I'd like to see the numbers comparing the FileOutputCommitter (which needs S3Guard for consistency) and the new committers. For benchmark consistency, leave S3Guard on.
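Here's a minimal sketch of the core-site.xml settings involved, assuming the "directory" staging committer and the DynamoDB S3Guard store; the committer documentation has the authoritative list of options:

    <!-- Choose the committer: file, directory, partitioned or magic. -->
    <property>
      <name>fs.s3a.committer.name</name>
      <value>directory</value>
    </property>

    <!-- Only needed for the magic committer: a consistent view of the store. -->
    <property>
      <name>fs.s3a.committer.magic.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>fs.s3a.metadatastore.impl</name>
      <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
    </property>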

If you can't get things to work because the docs are wrong: file a JIRA with a patch. If the code is wrong: submit a patch with the fix & tests.

Spark?
  1. Spark master has a couple of patches to deal with integration issues (FNFE on magic output paths, Parquet being over-fussy about committers); I think the committer binding has enough workarounds for these to work with Spark 2.2, though.
  2. Check out my cloud-integration for Apache Spark repo, and its production-time redistributable, spark-cloud-integration.
  3. Read its docs and use it; the spark-defaults binding is sketched after this list.
  4. If you want to use Parquet, as opposed to the other formats, you need this module's committer binding; Parquet is the fussy one.
  5. Again, check _SUCCESS to see what's going on.
  6. There's a test module with various (scalable) tests, as well as a copy-and-paste of some of the Spark SQL tests.
  7. Spark can work with the Partitioned committer. This is a staging committer which only worries about file conflicts in the final partitions. This lets you do in-situ updates of existing datasets, adding new partitions or overwriting existing ones, while leaving the rest alone. Hence: no need to move the output of a job into the reference datasets.
  8. Problems? File an issue. I've just seen Ewan has a couple of PRs I'd better look at, actually.
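That promised sketch of the spark-defaults.conf binding. The class names here are the ones from the spark-hadoop-cloud packaging of this work; the spark-cloud-integration module ships its own equivalents, so treat the exact names as assumptions and check the repo's docs:

    spark.hadoop.fs.s3a.committer.name directory
    spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

The first line switches the S3A committer; the latter two deal with Spark SQL and the Parquet fussiness mentioned above.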
Committer-wise, that spark-cloud-integration module is ultimately transient. I think we can identify the remaining issues with committer setup in Spark core, after which a Hadoop 3.0+ specific module should be able to work out of the box with the new committers.

There are still other things there, like:
  • Cloud store optimised file input stream source
  • ParallizedWithLocalityRDD: an RDD which lets you provide custom functions to declare locality on a row-by-row basis. Used in my demo of implementing DistCp in Spark: every row is a filename, which gets pushed out to a worker close to the data, and that worker does the upload (sketched below). This is very much a subset of DistCp, but it shows what you can do with RDDs and cloud storage.
  • + all the tests
I think maybe Apache Bahir would be the ultimate home for this. For now, it's a bit too unstable.
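The locality trick can be demonstrated with stock Spark, which exposes per-element preferred locations through SparkContext.makeRDD. A minimal sketch of the upload pattern, with invented hostnames, paths and an upload() stub:

    import org.apache.spark.{SparkConf, SparkContext}

    object UploadWithLocality {
      // Hypothetical uploader: copy one file up to the object store.
      def upload(filename: String): Unit = {
        // open the file, PUT it to the store, close it
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("distcp-sketch"))
        // Each element carries its preferred hosts, so the scheduler tries
        // to place each upload on a worker close to that file's data.
        val files: Seq[(String, Seq[String])] = Seq(
          ("/data/2017/11/0001.avro", Seq("host1")),
          ("/data/2017/11/0002.avro", Seq("host2")))
        sc.makeRDD[String](files).foreach(upload _)
        sc.stop()
      }
    }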

(photo: spices on sale in a Mombasa market)

2017-11-22

subatomic


I've just committed HADOOP-13786, Add S3A committer for zero-rename commits to S3 endpoints. Contributed by Steve Loughran and Ryan Blue.

This is a serious and complex piece of work; I need to thank:
  1. Thomas Demoor and Ewan Higgs from WDC for their advice and testing. They understand the intricacies of the S3 protocol to the millimetre.
  2. Ryan Blue for his Staging-based S3 committer. The core algorithms and code will be in hadoop-aws come Hadoop 3.1.
  3. Colleagues for their support, including the illustrious Sanjay Radia, and Ram Venkatesh for letting me put so much time into this.
  4. Reviewers, especially Ryan Blue, Ewan Higgs, Mingliang Liu and extra especially Aaron Fabbri @ cloudera. It's a big piece of code to learn. First time a patch of mine has ever crossed the 1MB source barrier.
I now understand a lot about commit protocols in Hadoop and Spark, including the history of interesting failures encountered, events which are reflected in the change logs of the relevant classes. Things you never knew about the Hadoop MapReduce commit protocol:
  1. The two different algorithms, v1 and v2, have very different semantics about the atomicity of task and job commits, including when output becomes visible in the destination directory.
  2. Neither algorithm is atomic in both task and job commit.
  3. V1 is atomic in task commits, but O(files) in its non-atomic job commit. It can recover from any job failure without having to rerun all succeeded tasks, but not from a failure in job commit.
  4. V2's job commit is a repeatable atomic O(1) operation, because it is a no-op. Task commits do the move/merge, which is O(files), make the output immediately visible, and as a consequence mean that failure of a job leaves the output directory in an unknown state.
  5. Both algorithms depend on the filesystem having consistent listings and create/update/delete operations.
  6. The routine to merge the output of a task to the destination is a real-world example of a co-recursive algorithm (its shape is sketched after this list). These are so rare most developers don't even know the term for them, or have forgotten it.
  7. At-most-once execution is guaranteed by having the tasks and the AM fail when they recognise that they are in trouble.
  8. The App Master refuses to commit a job if it hasn't had a heartbeat with the YARN Resource Manager within a specific time period. This stops it committing work if the network is partitioned and the AM/RM protocol fails... YARN may have considered the job dead and restarted it.
  9. Tasks commit iff they get permission from the AM; thus they will not attempt to commit if the network partitions.
  10. If a task given permission to commit does not report a successful commit to the AM, the v1 algorithm can rerun the task; v2 must conclude it's in an unknown state and abort the job.
  11. Spark can commit using the Hadoop FileOutputCommitter; its Parquet support has some "special" code which refuses to work if the committer is not a subclass of ParquetOutputCommitter.
  12. That special-case code makes Parquet the hardest format to bind to this; ORC, CSV and Avro all work out of the box.
  13. Spark's commit protocol adds the ability for tasks to provide extra data to the job driver for use in job commit; this allows committers to explicitly pass commit information directly to the driver, rather than indirectly via the (consistent) filesystem.
  14. Everyone's code assumes that abort() completes in a bounded time, and does not ever throw that IOException its signature promises it can.
  15. There's lots of cruft in the MRv2 codebase to keep the MRv1 code alive, which would be really good to delete.
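For the curious, here's the shape of that merge routine as a local-filesystem sketch in Scala, reconstructed from memory rather than lifted from the Hadoop source: rename when the destination is absent, recurse when both sides are directories, replace when a file is in the way. It also shows why a rename-based job commit is O(files) and not atomic.

    import java.nio.file.{Files, Path}
    import scala.collection.JavaConverters._

    object MergeSketch {
      // Merge the tree at src into dest: one rename per conflict-free
      // subtree, a recursive descent everywhere else.
      def merge(src: Path, dest: Path): Unit = {
        if (Files.isDirectory(src) && Files.isDirectory(dest)) {
          // Both are directories: merge the children one by one.
          val listing = Files.list(src)
          val children = try listing.iterator().asScala.toList
                         finally listing.close()
          children.foreach(c => merge(c, dest.resolve(c.getFileName)))
          Files.delete(src) // now empty
        } else {
          // dest is absent, or a file is in the way: clear it, then rename.
          deleteRecursively(dest)
          Files.move(src, dest)
        }
      }

      private def deleteRecursively(p: Path): Unit = {
        if (Files.isDirectory(p)) {
          val listing = Files.list(p)
          val children = try listing.iterator().asScala.toList
                         finally listing.close()
          children.foreach(deleteRecursively)
        }
        Files.deleteIfExists(p)
      }
    }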
This means I get to argue the semantics of commit algorithms with people, as I know what the runtimes "really do", rather than what is believed by everyone who has neither implemented part of it nor stepped through the code in a debugger.

If we had some TLA+ specifications of filesystems and object stores, we could perhaps write the algorithms as PlusCal examples, but that needs someone with the skills and the time. I'd have to find the time to learn TLA+ properly as well as specify everything, so it won't be me.

Returning to the committers, what do they do which is so special?

They upload task output to the final destination paths in the tasks, but don't make the uploads visible until the job is committed.

No renames, no copies, no job-commit-time merges, and no data visible until job commit. Tasks which fail/fail to commit do not have any adverse side effects on the destination directories.
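The mechanism underneath is S3's multipart upload API: tasks can upload all their blocks, but nothing appears at the destination until something calls complete with the collected part etags, and issuing that call is what job commit does. A sketch against the AWS Java SDK, with invented bucket, key and file names:

    import java.io.File
    import scala.collection.JavaConverters._
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.{CompleteMultipartUploadRequest,
      InitiateMultipartUploadRequest, UploadPartRequest}

    object DelayedMaterialization {
      def main(args: Array[String]): Unit = {
        val s3 = AmazonS3ClientBuilder.defaultClient()
        val (bucket, key) = ("example-bucket", "dataset/part-0000")
        val data = new File("part-0000.bin")

        // Task side: start the upload and push the data; the object
        // is not yet visible under s3://example-bucket/dataset/part-0000.
        val uploadId = s3.initiateMultipartUpload(
          new InitiateMultipartUploadRequest(bucket, key)).getUploadId
        val part1 = s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket).withKey(key)
          .withUploadId(uploadId)
          .withPartNumber(1)
          .withFile(data)
          .withPartSize(data.length()))

        // Job-commit side: only now does the object materialize.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
          bucket, key, uploadId, List(part1.getPartETag).asJava))
      }
    }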

First, read S3A Committers: Architecture and Implementation.

Then, if that seems interesting look at the source.

A key feature is that we've snuck into FileOutputFormat a mechanism to allow you to provide different committers for different filesystem schemas.

Normal file output formats (i.e. not Parquet) will automatically get the committer for the target filesystem, which, for S3A, can be changed from the default FileOutputCommitter to an S3A-specific one. And any other object store which also offers delayed materialization of uploaded data can implement its own and run it alongside the S3A ones, which will be something to keep the Azure, GCS and OpenStack teams busy, perhaps.
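The binding is keyed on the URL scheme. A sketch of the factory declaration, as I recall it from the committer docs:

    <property>
      <name>mapreduce.outputcommitter.factory.scheme.s3a</name>
      <value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
    </property>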

For now though: users of Hadoop can use Amazon S3 (or compatible services) as the direct destination of Hadoop and Spark workloads, without the overhead of copying the data, and with support for failure recovery and speculative execution. I'm happy with that as a good first step.

(photo: street vendors at the Kenya/Tanzania Border)

2017-11-06

I do not fear Kerberos, but I do fear Apple iTunes billing

I laugh at Kerberos messages. When I see a stack trace with a meaningless network error I go "that's interesting". I even learned PowerShell in a morning to fix where I'd managed to break our Windows build and tests.

But there is now one piece of software I do not ever want to approach, ever again: Apple iCloud billing.

So far, since Saturday's warnings on my phone telling me that there was a billing problem, I have:
  1. Tried and repeatedly failed to update my card details
  2. Had my VISA card seemingly blocked by my bank.
  3. Been locked out of our Netflix subscription, on account of them failing to bill a card which has been locked out by my bank.
  4. Had an online chat with someone at Apple, who finally told me to phone an 800 number.
  5. Who are closed until office hours tomorrow.
What am I trying to do? Set up iCloud family storage so I get a full-resolution copy of my pics shared across devices, and give the other two members of our household lots of storage.

What have I achieved? Apart from a card lockout and loss of Netflix, nothing.

If this was a work problem I'd be loading debug-level log files of tens of GB into editors, using regexps to delete all lines of noise, then trying to work backwards from the first stack trace in one process to where something in another system went awry. Not here though: here I'm thinking "I don't need this". So if I don't get this sorted out by the end of the week, I won't be. I will have been defeated.

Last month I opted to pay £7/month for 2TB of iCloud storage. This not only looked great value for 2TB of storage; the fact I could share it with the rest of the family meant that we got a very good deal for all that data. And, with integration with iPhotos, I could use it to upload all my full-resolution pictures. So sign up I did.

My card is actually bonded to Bina's account, but here, setting up the storage, I had to re-enter it. The fact that the dropdown menu switched to Finnish was most amusing.

With hindsight I should have taken "billing setup page cannot maintain consistency of locales between UI, known region of user, and menus" as a warning sign that something was broken.

Other than that, everything seemed to work. Photo upload working well. I don't yet keep my full photoset managed by iPhotos; it's long been a partitionedBy(year, month) directory tree built up with the now unmaintained Picasa, backed up at full res to our home server and at lower res to Google Photos. The iCloud experience seemed to be going smoothly; smoothly enough to think about the logistics of a full photo import. One factor there: iCloud photos downloader works great as a way of downloading the full-res images into the year/month layout, so I can pull images over to the server, giving me backup and exit strategies.

That was on the Friday. On the Saturday a little alert pops up on the phone, matched by an email:
Apple "we will take away all your photos"

Something has gone wrong. Well, no problem, over to billing. First, the phone UI. A couple of attempts and no, no joy. Over to the web page

This time, the menus are in German.
appleID can't handle payment updates

"Something didn't work but we don't know what". Nice. Again? Same message.

Never mind, I recognise "PayPal" in German, let's try that:
And they can't handle paypal
No: failure.

Next attempt: use my Visa credit card, not the bank debit card I normally use. This *appears* to take. At least, I haven't got any more emails, and the photos haven't been deleted. All well to the limits of my observability.

Except, guess what ends up in my inbox instead? Netflix complaining about billing
Netflix "there was a problem"
Hypothesis: repeated failures of Apple Billing to set things up have caused the bank to lock down the card; it just so happens that Netflix bill the same day (does everyone do the first few days of each month?), and so: blocked off. That is, Apple Billing's issues are sufficient to break Netflix.

Over to the bank, review transactions, drop them a note.

My bank is fairly secure and uses 2FA with a chip-and-PIN card inserted into a portable card reader. You can log in without it, but then cannot set up transfers to any new destination. I normally use the card reader and card. Not today though: signatures aren't being accepted. Solution: fall back to the "secrets" and then compose a message.

Except of course, the first time I try that, it fails
And I can't talk to my bank about it

This is not a good day. Why can't I just have "Unknown failure at GSS API level"? That I can handle. Instead what I am seeing here is a cross-service outage choreographed by Apple, which, if it really does take away my photos, will even reach into my devices.

Solution: log out, log in. Compose the message in a text editor for ease of resubmission. Paste and submit. Off it goes.

Sunday: don't go near a computer. The phone has still got a red "billing issues" marker, though I can't distinguish old billing issues from new ones. That is: no email to say things are fixed. At the same time, no emails to say "things are still broken". Same from Netflix: neither a success message nor a failure one. Nothing from the bank either.

Monday: not worrying about this while working. No Kerberos errors there either. Today is a good day, apart from the thermostat on the ground floor not sending "turn the heating on" messages to the boiler, even after swapping the batteries.

After dinner, Netflix. Except the TV has been logged out. Log in to Netflix on the web and yes, my card is still not valid. Go to the bank: no response there yet. Go back to Netflix, insert the Visa credit card: it's happy. This is good, as if this card started failing too, I'd be running out of functional payment mechanisms.

Now, what about Apple?
apple id payment method; chrome
No, not English or, indeed, any language I know how to read. What now?

Apple support, in the form of a chat
After a couple of minutes' wait I was talking to someone. I was a bit worried that the person I was talking to was "allen". I know Allen. Sometimes he's helpful. Let's see.

After explaining my problem and sharing my appleId, Allen had a solution immediately: only the nominated owner of the family account can do the payment, even if the icloud storage account is in the name of another. So log in as them and try and sort stuff out there.

So: log out as me, log in as B., edit the billing. Which is the same card I've been using. Somehow, things went so wrong with Apple billing trying to charge the system off my user ID and failing, that I've been blocked everywhere. Solution: over to the Visa credit card. All "seems" well.

But how can I be sure? I've not got any emails from Apple Billing. The little alert in the settings window is gone, but I don't trust it. Without notification from Apple confirming that all is well, I have to assume that things are utterly broken. How can I trust a billing system which has managed to lock me out of my banking and Netflix?

I raised this topic with Allen. After a bit of backwards and forwards, he gave me an 800 number to call. Which I did. They are closed after 19:00 hours, so I'll have to wait until tomorrow. I shall be calling them. I shall also be in touch with my bank.

Overall: this has been, so far, an utter disaster. It's not just that the system suffers from broken details (prompts in random languages) and deeply broken back ends (whose card is charged?), but it manages to escalate the problem to transitively block out other parts of my online life.

If everything works tomorrow, I'll treat this as a transient disaster. If, on the other hand, things are not working tomorrow, I'm going to give up trying to maintain an iCloud storage account. I'll come up with some other solution. I just can't face having the billing system destroy the rest of my life.