I'm keeping myself busy right now with HADOOP-13786, an O(1) committer for job output into S3 buckets. The classic filesystem relies on rename() for that, but against S3 rename is a file-by-file copy whose time is O(data) and whose failure mode is "a mess", amplified by the fact that an inconsistent FS can create the illusion that destination data hasn't yet been deleted: false conflict.
This creates failures like SPARK-18512, "FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A", as well as long commit delays.
I started this work a while back, making changes to the S3A Filesystem to support it. I've stopped focusing on that committer, and instead pulled in the version which Netflix have been using, which has the advantages of a thought-out failure policy and production testing. I've been busy merging that with the rest of the S3A work, and am now at the stage where I'm switching it over to the operations I've written for the first attempt, the "magic committer". These are in S3A, where they integrate with S3Guard state updates, instrumentation and metrics, retry logic, etc etc. All good.
The actual code to do the switchover is straightforward. What is taking up all my time is fixing the mock tests. These are failing with false positives, "I've broken the code", when really the cause is "these mock tests are too brittle". In particular, I've had to rework how the tracking of operations goes, as a Mock Amazon S3Client is no longer used by the committer; instead it's associated with the FS instance, which is then shared by all operations in a single test method. And the use of S3AFS methods shows up where it's failing due to the mock instance not initing properly. I ended up spending most of Tuesday simply implementing the abort() call; now I'm doing the same on commit(). The production code switches fine, it's just the mock stuff.
This has really put me off mocking. I have used it sporadically in the past, and I've occasionally had to work with other people's. Mocking has some nice features:
- Can run in unit tests which don't need AWS credentials, so Yetus/Jenkins can run them on patches.
- Can be used to simulate failures and validate outcomes.
that IT test, obsolete.
The others, well: the changes for abort() should help, but every new S3A method that gets called triggers new problems which I need to address. This is, well, "frustrating".
It's really putting me off mocking. Ignoring the Jenkins aspect, the key benefit is structured fault injection. I believe I could implement that in the IT tests too, at least in those tests which run in the same JVM. If I wanted to, I could probably even do it in the forked VMs by propagating details on the desired failures to the processes. Or, if I really wanted to be devious, by running an HTTP proxy in the test VM and simulating network failures for the AWS client code itself to hit. That wouldn't catch all real-world problems (DNS, routing), but I could raise authentication and transient HTTP failures, and of course, force in listing inconsistencies. This is tempting, because it will help me qualify the AWS SDK we depend on, and could be re-used for testing the Azure storage too. Yes, it would take effort, but given the cost of maintaining those mock tests after some minor refactoring of the production code, it's starting to look appealing.
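For the same-JVM case, here is a rough sketch of what fault injection without a mocking framework might look like. It wraps a store client in a java.lang.reflect.Proxy which fails the first few calls to a named method and then passes everything through. The ObjectStore interface, InMemoryStore and the failing() helper are all made up for illustration; none of them are S3A or AWS SDK classes, and a real version would have to wrap whatever client the filesystem actually uses.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FaultInjectionSketch {

  /** Hypothetical, minimal view of an object store; NOT the S3A or AWS SDK API. */
  public interface ObjectStore {
    void put(String key, String data) throws IOException;
    String get(String key) throws IOException;
  }

  /** Trivial in-memory store standing in for the real client (an assumption). */
  public static class InMemoryStore implements ObjectStore {
    private final Map<String, String> objects = new ConcurrentHashMap<>();

    @Override
    public void put(String key, String data) {
      objects.put(key, data);
    }

    @Override
    public String get(String key) throws IOException {
      String value = objects.get(key);
      if (value == null) {
        throw new FileNotFoundException(key);
      }
      return value;
    }
  }

  /**
   * Wrap a store so that the first {@code failures} calls to the named
   * method throw an IOException; every other call goes to the real store.
   */
  public static ObjectStore failing(ObjectStore real, String method, int failures) {
    AtomicInteger remaining = new AtomicInteger(failures);
    InvocationHandler handler = (proxy, m, args) -> {
      // Count down only while there are injected failures left.
      boolean inject = m.getName().equals(method)
          && remaining.getAndUpdate(n -> n > 0 ? n - 1 : 0) > 0;
      if (inject) {
        throw new IOException("injected failure in " + method);
      }
      try {
        return m.invoke(real, args);
      } catch (InvocationTargetException e) {
        // Rethrow the real store's own exception, not the reflection wrapper.
        throw e.getCause();
      }
    };
    return (ObjectStore) Proxy.newProxyInstance(
        ObjectStore.class.getClassLoader(),
        new Class<?>[] {ObjectStore.class},
        handler);
  }
}
```

A test would then call something like failing(store, "put", 1) and check that the code under test retries or cleans up as expected. The point of the design is that the wrapper only knows about the interface, so adding a new method to the production path doesn't need any new stubbing.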
(photo: Garage door, Greenbank, Bristol)