Following up from yesterday's post on the S3A committers, here's what you need for picking up the committers.
- Apache Hadoop trunk, which builds as 3.1.0-SNAPSHOT.
- The documentation on how to use them.
- An AWS keypair; try not to commit it to git. Tip for the Uber team: git-secrets is something you can add as a checkin hook. Do as I do: keep the keys elsewhere.
- If you want to use the magic committer, turn S3Guard on. Initially I'd use the staging committer, specifically the "directory" one.
- Switch s3a:// to use that committer: fs.s3a.committer.name = directory
- Run your MR queries.
- Look in _SUCCESS for committer info. If it is zero bytes long, you got the classic FileOutputCommitter; if it is a bit of JSON naming the committer, listing the files committed and some metrics (SuccessData), you are using an S3A committer.
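The committer-selection step above comes down to a couple of Hadoop configuration entries. A minimal sketch for core-site.xml, using the property name from the S3A committer docs; swap the value for "partitioned" or "magic" for the other committers:

```xml
<!-- Select the committer used for s3a:// output:
     "directory", "partitioned" or "magic". -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>directory</value>
</property>
```

Remember that the magic committer additionally needs S3Guard enabled on the bucket.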
If you can't get things to work because the docs are wrong, file a JIRA with a patch. If the code is wrong, submit a patch with the fix and tests.
- Spark master has a couple of patches to deal with integration issues (FNFE on magic output paths, Parquet being over-fussy about committers); I think the committer binding has enough workarounds for these to work with Spark 2.2, though.
- Check out my cloud-integration for Apache Spark repo, and its production-time redistributable, spark-cloud-integration.
- Read its docs and use it.
- If you want to use Parquet over other formats, use this committer.
- Again, check _SUCCESS to see what's going on.
- There's a test module with various (scalable) tests, as well as a copy-and-paste of some of the Spark SQL tests.
- Spark can work with the Partitioned committer. This is a staging committer which only worries about file conflicts in the final partitions. This lets you do in-situ updates of existing datasets, adding new partitions or overwriting existing ones, while leaving the rest alone. Hence: no need to move the output of a job into the reference datasets.
- Problems? File an issue. I've just seen Ewan has a couple of PRs I'd better look at, actually.
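Wiring Spark up to the committers is a handful of spark-defaults.conf entries. A sketch; the binding class names below are assumptions based on the spark-hadoop-cloud committer binding, so check the repo's docs for the exact ones in your build:

```
# Route s3a:// output through the chosen S3A committer
spark.hadoop.fs.s3a.committer.name          directory

# Commit protocol + Parquet committer binding (class names: check the docs)
spark.sql.sources.commitProtocolClass       org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class    org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

# For the partitioned committer: what to do when a destination partition
# already has data: fail, append or replace
spark.hadoop.fs.s3a.committer.staging.conflict-mode  replace
```

The conflict-mode setting is what gives you the in-situ partition updates described above: "replace" overwrites just the partitions the job writes into, "append" adds to them, "fail" rejects the job.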
There are still other things there, like:
- Cloud store optimised file input stream source.
- ParallizedWithLocalityRDD: an RDD which lets you provide custom functions to declare locality on a row-by-row basis. Used in my demo of implementing DistCp in Spark: every row is a filename, which gets pushed out to a worker close to the data, and that worker does the upload. This is very much a subset of DistCp, but it shows this: you can do it with RDDs and cloud storage.
- Plus all the tests.
(photo: spices on sale in a Mombasa market)