One thing I've been working on with my colleagues is improving performance of Hadoop, Hive and Spark against S3, one exists() or getFileStatus() call at a time.
Why? This is a log of a test run showing how long it takes to query S3 over a long haul link. This is midway through the test, so the HTTPS connection pool is up, DNS has already resolved the hostnames. So these should be warm links to S3 US-east. Yet it takes over a second just for one probe.
2016-12-01 15:47:10,359 - op_exists += 1 -> 6 2016-12-01 15:47:10,360 - op_get_file_status += 1 -> 20 2016-12-01 15:47:10,360 (S3AFileSystem.java:getFileStatus) - Getting path status for s3a://hwdev-stevel/numbers_rdd_tests 2016-12-01 15:47:10,360 - object_metadata_requests += 1 -> 39 2016-12-01 15:47:11,068 - object_metadata_requests += 1 -> 40 2016-12-01 15:47:11,241 - object_list_requests += 1 -> 21 2016-12-01 15:47:11,513 (S3AFileSystem.java:getFileStatus) - Found path as directory (with /)The way we check for a path p in Hadoop's S3 Client(s) is
HEAD p HEAD p/ LIST prefix=p, suffix=/, count=1A simple file: one HEAD. A directory marker, two, a path with no marker but 1+ child: three. In this run, it's an empty directory, so two of the probes are executed:
HEAD p => 708ms HEAD p/ => 445ms LIST prefix=p, suffix=/, count=1 => skippedThat's 1153ms from invocation of the exists() call to it returning true —long enough for you to see the log pause during the test run. Think about that: determining which operations to speed up not through some fancy profiler, but watching when the log stutters. That's how dramatic the long-haul cost of object store operations are. It's also why a core piece of the S3Guard work is to offload that metadata storage to DynamoDB. I'm not doing that code, but I am doing the committer to go with. To be ruthless, I'm not sure we can reliably do that O(1) rename, massively parallel rename being the only way to move blobs around, and the committer API as it stands precluding me from implementing a single-file-direct-commit committer. We can do the locking/leasing in dynamo though, along with the speedup.
What it should really highlight is that an assumption in a lot of code "getFileStatus() is too quick to measure" doesn't hold once you move into object stores, especially remote ones, and that any form of recursive treewalk is potentially pathologically bad.
Remember that that next time you edit your code.