
2022-08-02

Transitive Issues

I am not going to discuss anything sensitive which gets discussed on the hadoop security list, but I do want to post my reply to someone who gave us a list of those artifacts with known CVEs, either directly or in their own shaded packaging of dependent libraries (jackson, mainly), and then effectively demanded an immediate fix of them all.

My response is essentially "how do we do that without breaking everything downstream to the extent that nobody will upgrade?". Which is not me trying to dismiss their complaint, rather "if anyone has the answer to this problem I would really love to know".

I do regret the failure to get OSGi support into hadoop 0.1x; then we could have had that level of isolation. But OSGi does have its own issues, hence the lack of enthusiasm. But would it be worse than the state we have today?

The message. I am trying to do as much as I can via macOS dictation, which invariably requires a review and fixup afterwards; if things are confused in places, it means I didn't review properly. As to who sent the email, that doesn't matter. It's good that transitive dependency issues are viewed as a significant concern, bad that there's no obvious solution here apart from "I say we dust off and nuke the site from orbit".

(Photo: winter mountaineering in the Brecon Beacons, 1996 (?). Wondering how best to ski down the North Face of Pen y Fan)

Thank you for this list.

Except in the special case of "our own CVEs", all hadoop development is in public, including issue tracking, which can be searched under https://issues.apache.org/jira/

I recommend you search for upgrades of hadoop, hdfs, mapreduce and yarn, identify the dependencies you are worried about, follow the JIRA issues and, ideally, help get them in by testing, if not actually contributing the code. If there are no relevant JIRAs, please create them.

I have just made a release candidate for hadoop 3.3.4, for which I have attached the announcement. Please look at the changelog to see what has changed.

We are not upgrading everything in that list, and there is a fundamental reason for this: many of these upgrades are not compatible. While we can modify the hadoop code itself to support those changes, it means the release becomes transitively incompatible. That is: even if we did everything we could to make sure our own code does not break anything, it is still going to break things because of those dependencies. And as a result people aren't going to upgrade.

Take one example: HADOOP-13386, Upgrade Avro to 1.9.2.

This is marked as an incompatible update: "Java classes generated from previous versions of avro will need to be recompiled". If we ship that, all your applications are going to break. As will everyone else's.

jersey updates are another source of constant pain, as the update to v2 breaks all v1 apps, and the two artifacts don't coexist. We had to fix that by using a custom release of jersey which doesn't use jackson.

HADOOP-15983 Use jersey-json that is built to use jackson2

So what do we do? 

We upgrade everything and issue an incompatible release? Because if we do that we know that many applications will not upgrade and we will end up having to maintain the old version anyway. I'm 100% confident that this is true because we still have to do releases of Hadoop 2 with fixes for our own CVEs. 

Or, do we try and safely upgrade everything we can and work with the downstream projects to help them upgrade their versions of Apache Hadoop so at least the attack surface is reduced?

This is not hypothetical. Look at two pieces of work I have been involved in recently, or at least been tracking.

PARQUET-2158. Upgrade Hadoop dependency to version 3.2.0. 

That moves parquet's own dependency from hadoop 2.10 to 3.2.0, so it will actually compile and run against the 3.x line. People will be able to run it against 3.3.4 too... but at least this way we have set the bare minimum to a branch which has security fixes on it.

HIVE-24484. Upgrade Hadoop to 3.3.1 And Tez to 0.10.2

This is an example of a team doing a major update; again, it helps bring them more up to date with all their dependencies as well as our own CVEs. From the github pull request you can see how things break, both from our own code (generally unintentionally) and from changes in those transitive dependencies. As a result of those breakages, hive and tez have held back for a long time.

One of the patches which is in 3.3.4 is intended to help that team:

HADOOP-18332. Remove rs-api dependency by downgrading jackson to 2.12.7.

This is where we downgraded jackson from the 2.13.2.2 version of Hadoop 3.3.3 to version 2.12.7. This is still up to date with jackson CVEs, but by downgrading we can exclude its transitive dependency on the javax.ws.rs-api library, so Tez can upgrade, and thus Hive. Once Hive works against Hadoop 3.3.x, we can get Apache Iceberg onto that version as well. But if the release were incompatible in ways that they considered a blocker, that wouldn't happen.
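For downstream builds hitting a similar conflict, the blunt instrument is a Maven dependency exclusion, which cuts a transitive artifact out of your own dependency tree. A hypothetical fragment (versions and the choice of excluded artifact are illustrative, not a recommendation):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.3.4</version>
  <exclusions>
    <!-- drop a transitive artifact your application replaces or cannot tolerate -->
    <exclusion>
      <groupId>javax.ws.rs</groupId>
      <artifactId>javax.ws.rs-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Of course, this only works if nothing on your runtime classpath actually needs the excluded classes, which is exactly the thing that is so hard to verify.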

It really is a losing battle. Given your obvious concerns in this area, I would love to have your suggestions as to how the entire Java software ecosystem -for that is what it is- can address the inherent conflict between the need to keep the entire transitive set of dependencies up to date for security reasons and the need not to break everything downstream.

A key challenge is the fact that often these updates break things two steps away -a detail you often do not discover until you ship. The only mitigation which has evolved is shading: having your own private copy of the binaries. Which, as you note, makes it impossible for downstream projects to upgrade those dependencies themselves.
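Shading, for reference, is usually done with the maven-shade-plugin's relocation support, which rewrites bytecode references so the private copy lives in its own package. A minimal sketch (the org.example shaded package name is illustrative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- private copy: all references are rewritten to the shaded package -->
          <relocation>
            <pattern>com.fasterxml.jackson</pattern>
            <shadedPattern>org.example.shaded.com.fasterxml.jackson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Which fixes your CVE scan reports right up until the scanner learns to look inside shaded JARs, at which point the report comes back and nobody downstream can fix it.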

What can you and your employers do to help? 

All open source projects depend on the contributions of developers and users. Anything your company's engineering teams can do to help here will be very welcome. At the very least, know that you have three days to qualify that 3.3.4 release to make sure that it does not break your deployed systems. If it does work, you should update all production systems ASAP. If an incompatibility does turn up during this RC phase, we will hold the build and do our best to address it. If you discover a problem after Thursday, it will not be addressed until the next release, which you cannot expect to see until September, October or later. You can still help then by providing engineering resources to help validate that release. If you have any continuous integration tooling set up: check out and build the source tree, then try to compile and test your own products against those builds of hadoop and any other parts of the Apache Open Source and Big Data stack on which you depend.

To conclude then, I'd like to invite you to participate in the eternal challenge of trying to keep those libraries up to date. Please join in. I would note that we are also looking for people with JavaScript skills, as the yarn UI needs work and that is completely beyond my level of expertise.

If you aren't able to do this and yet you still require all dependencies to be up to date, I'm going to suggest you build and test your own software stack using Hadoop 3.4.0 as part of it. You would of course need to start with up-to-date versions of Jersey, Jackson, Google Guava, the Amazon AWS SDK and the like before you even get that far. However, the experience you gain in trying to make this all work will again be highly beneficial to everyone.


Thanks,


Steve Loughran.


-----


[VOTE] Release Apache Hadoop 3.3.4


I have put together a release candidate (RC1) for Hadoop 3.3.4


The RC is available at:

https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.4-RC1/


The git tag is release-3.3.4-RC1, commit a585a73c3e0


The maven artifacts are staged at

https://repository.apache.org/content/repositories/orgapachehadoop-1358/


You can find my public key at:

https://dist.apache.org/repos/dist/release/hadoop/common/KEYS


Change log

https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.4-RC1/CHANGELOG.md


Release notes

https://dist.apache.org/repos/dist/dev/hadoop/hadoop-3.3.4-RC1/RELEASENOTES.md


There's a very small number of changes, primarily critical code/packaging issues and security fixes.


See the release notes for details.


Please try the release and vote. The vote will run for 5 days.


2018-12-20

Isolation is not participation



First
  1. I speak only for myself as an individual, not representing any current or previous employer, ASF ...etc.
  2. I'm not going anywhere near business aspects; out of scope.
  3. I am only looking at Apache Hadoop, which is of course the foundation of Amazon EMR. Which also means: if someone says "yes but projects X, Y & Z, ..." my response is "that's nice, but coming back to the project under discussion, ..."
  4. lots of people I know & respect work for Amazon. I am very much looking at the actions of the company, not the individuals.
  5. And I'm not making any suggestions about what people should do, only arguing that the current stance is harmful to everyone.
  6. I saw last week that EMR now has a reimplementation of the S3A committers, without any credit whatsoever for something I consider profound. This means I'm probably a bit sensitive right now. I waited a couple of days before finishing this post.
With that out the way:-

As I type this a nearby terminal window runs MiniMR jobs against a csv.gz file listing AWS landsat photos, stored somewhere in a bucket.

The tests run on a macbook, a distant descendant of BSD Unix, the Mach kernel and its incomplete open source sibling, Darwin. Much of the dev tooling I use is open source, downloaded via homebrew. The network connection is via a router running DD-WRT.


That Landsat file, s3a://landsat-pds/scene_list.gz, is arguably the single main contribution from the AWS infra for that Hadoop testing.

It's a few hundred MB of free-to-use data, so I've used it for IO seek/read performance tests, spark dataframe queries, and now, SQL statements direct to the storage infrastructure. Those tests are also where I get to explore the new features of the java language, and LambdaTestUtils, which is my lifting of what I consider to be the best bits of scalatest. Now I'm adding async IO operations to the Hadoop FileSystem/FileContext classes, and in the tests I'm learning about the java 8+ completable future stuff, and how to get them to run IOException-raising code (TL;DR: it hurts).
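For the curious, the pain is that CompletableFuture's functional interfaces can't throw checked exceptions, so IOException-raising code has to be tunnelled through an unchecked wrapper. A minimal sketch of the pattern (class and method names are my own, not the Hadoop helper classes):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;

public class IOFutures {

    /** An operation which may raise IOException, unlike java.util.function.Supplier. */
    @FunctionalInterface
    interface IOCallable<T> {
        T call() throws IOException;
    }

    /**
     * Run an IOException-raising operation in a CompletableFuture by
     * tunnelling the checked exception inside UncheckedIOException;
     * callers unwrap it again when they join()/get() the future.
     */
    static <T> CompletableFuture<T> supplyIO(IOCallable<T> operation) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return operation.call();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) {
        // trivial stand-in for an async FS operation such as an open/read
        System.out.println(supplyIO(() -> "file-contents").join());
    }
}
```

The catch-wrap-rethrow-unwrap dance has to happen at every async boundary, which is where the "it hurts" comes from.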

While I wait for my tests to finish, I see there's a lot of online discussion about cloud providers and open source projects, especially post AWS re:Invent (re:Package?), so I thought I'd join in.


Of all the bits of recent writing on the topic, one I really like is Roman's, which focuses a lot on community over code.

That is a key thing: open source development is a community. And members of that community can participate by
  1. writing code
  2. reviewing code
  3. writing docs
  4. reviewing docs
  5. testing releases, nightly builds
  6. helping other people with their problems
  7. helping other projects who are having problems with your project's code.
  8. helping other projects take ideas from your code and make use of it
  9. filing bug reports
  10. reviewing, commenting on, maybe even fixing bug reports.
  11. turning up at conferences, talking about what you are doing, sharing.
  12. listening. Not just to people who pay you money, but people who want to use the stuff you've written.
  13. helping build that community by encouraging the participation of others, nurturing their contributions along, trying to get them to embrace your code and testing philosophy, etc.
There are more, but those are some of the key ones.

A key, recurrent theme is that community, where you can contribute in many ways, but you do have to be proactive to build that community. And the best ASF projects are ones which have a broad set of contributors.

Take for example the grand Java 9, 10, 11 project: [HADOOP-11123, HADOOP-11423, HADOOP-15338]. That's an epic of suffering, primarily taken on by Akira Ajisaka and Takanobu Asanuma at Yahoo! Japan, and a few other brave people. This isn't some grand "shout about this at keynotes" kind of feature, but it's a critical contribution by people who rely on Hadoop, have a pressing need to run on Java 9+, and are prepared to put in the effort. I watch their JIRAs with awe.

That's a community effort, driven by users with needs.

Another interesting bit of work: Multipart Upload from HDFS to other block stores. I've been the specification police there; my reaction to "adding fundamental APIs without strict specification and tests" was predictable, so I don't know why they tried to get it past me. Who did that work? Ewan Higgs at Western Digital did a lot -they can see real benefit for their enterprise object store. So did Virajith Jalaparti at Microsoft. These are people who want HDFS to be able to use their storage systems for the block store. And there's another side-effect: that multipart upload API essentially provides a standard API for multipart-upload based committers. For the S3A committers we added our own private API to the S3A FS, "WriteOperationHelper"; this will give it to every FS which supports it. And you can do a version of DistCp which writes blocks to the store in parallel from across the filesystem... a very high performance blockstore-to-block-or-object-store copy mechanism.

This work is all being done by people who sell storage in one form or another, who see value in the feature, and are putting in the effort to develop it in the open, encourage participation from others, and deliver something independent of the underlying storage.

This bit of work highlights something else: the Hadoop FS API is broadly used way beyond HDFS, and we need to evolve it to deal with things HDFS offers but keeps hidden, but also for object stores, whose characteristics include:
  • very expensive cost of seek(), especially given ORC and Parquet know their read plan way beyond the next read. Fix: HADOOP-11867: Vectorized Read Operations,
  • very expensive cost of mimicking hierarchical directory trees, treewalking is not ideal,
  • slow to open a file, as even the existence check can take hundreds of milliseconds.
  • Interesting new failure modes. Throttling/503 responses if you put too much load on a shard of the store, for example. Which can surface anywhere.
  • Variable rename performance O(data) for mimicked S3 rename, O(files) for GCS, O(1) with some occasional pauses on Azure.
There are big challenges here and it goes all the way through the system. There's no point adding a feature in a store if the APIs used to access it don't pick it up; there's no point changing an API if the applications people use don't adopt it.

Which is why input from the people who spent time building object stores and hooking their applications up to them is so important. That includes:
  • Code
  • Bug reports
  • Insight from their experiences
  • Information about the problems they've had
  • Problems their users are having
And that's also why the decision of the EMR team to isolate themselves from the OSS development holds us back.

We end up duplicating effort, like S3Guard, which is the ASF equivalent of EMR consistent view. The fact that EMR shipped with the feature long before us could be viewed as the benefit of having a proprietary S3 connector. But S3Guard, like the EMR team's feature, is modelled on Netflix's S3mper. That's code which one of the EMR team's largest customers wrote, code which Netflix had to retrofit onto the EMR closed-source filesystem using AspectJ.

And that time-to-market edge? Well, is not so obvious any more
  1. The EMR S3-optimised committer shipped in November 2018.
  2. Its open source predecessor, the S3A committers, shipped in Hadoop 3.1, March 2018. That's 7-8 months ahead of the EMR implementation.
  3. And it shipped in HDP-3.0 in June 2018. That's 5 months ahead.
I'm really happy with that committer: the first bit of CS-hard code I've done for a long time, it got me into the depths of how committers really work, and got me an unpublished paper, "A Zero Rename Committer", from it. And, in writing the TLA+ spec of the object store I used on the way to convincing myself things worked, I was corrected by Lamport himself.

Much of that commit code was written by myself, but it depended utterly on some insights from Thomas Demoor of WDC. It also contains a large donation of code from Netflix: their S3 committer. They'd bypassed emrfs completely and were using the S3A client direct; we took this, hooked it up to what we'd already started to write, incorporated their mockito tests, and now their code is the specific committer I recommend. Because Netflix were using it and it worked.

A heavy user of AWS S3 wanting to fix their problem, sharing the code, having it pulled into the core code so that anyone using the OSS releases gets the benefit of their code *and their testing*.

We were able to pick that code up because Netflix had gone around emrfs and were writing things themselves. That is: they had to bypass the EMR team. And once they'd done that, we could take it, integrate it and ship it eight months before the EMR team did. With proofs of correctness(-ish).

Which is why I don't believe that isolationism is good for whoever opts out of the community. Nor, by the evidence, is it good for their customers.

I don't even think it helps the EMR team's colleagues with their own support calls. Because really, if you aren't active in the project, those colleagues end up depending on the goodwill of others.


(photo: building in central Havana. There'll be a starbucks there eventually)

2017-11-23

How to play with the new S3A committers


Following up from yesterday's post on the S3A committers, here's what you need for picking up the committers.
  1. Apache Hadoop trunk; builds to 3.1.0-SNAPSHOT.
  2. The documentation on use.
  3. An AWS keypair; try not to commit it to git. Tip for the Uber team: git-secrets is something you can add as a checkin hook. Do as I do: keep them elsewhere.
  4. If you want to use the magic committer, turn S3Guard on. Initially I'd use the staging committer, specifically the "directory" one.
  5. switch s3a:// to use that committer: fs.s3a.committer.name =  partitioned
  6. Run your MR queries
  7. look in _SUCCESS for committer info. Zero bytes long: classic FileOutputCommitter. A bit of JSON naming the committer, the files committed and some metrics (SuccessData): you are using an S3 committer.
If you do that, I'd like to see the numbers comparing the FileOutputCommitter (which must have S3Guard) and the new committers. For benchmark consistency, leave S3Guard on.

If you can't get things to work because the docs are wrong: file a JIRA with a patch. If the code is wrong: submit a patch with the fix & tests.

Spark?
  1. Spark Master has a couple of patches to deal with integration issues (FNFE on magic output paths, Parquet being over-fussy about committers); I think the committer binding has enough workarounds for these to work with Spark 2.2, though.
  2. Checkout my cloud-integration for Apache Spark repo, and its production-time redistributable, spark-cloud-integration.
  3. Read its docs and use
  4. If you want to use Parquet over other formats, use this committer.  
  5. Again, check _SUCCESS to see what's going on.
  6. There's a test module with various (scaleable) tests as well as a copy-and-paste of some of the Spark SQL tests.
  7. Spark can work with the Partitioned committer. This is a staging committer which only worries about file conflicts in the final partitions. This lets you do in-situ updates of existing datasets, adding new partitions or overwriting existing ones, while leaving the rest alone. Hence: no need to move the output of a job into the reference datasets.
  8. Problems. File an issue. I've just seen Ewan has a couple of PRs I'd better look at, actually.
Committer-wise, that spark-cloud-integration module is ultimately transient. I think we can identify those remaining issues with committer setup in spark core, after which a hadoop 3.0+ specific module should be able to work out of the box with the new committers.

There are still other things there, like
  • Cloud store optimised file input stream source
  • ParallizedWithLocalityRDD: an RDD which lets you provide custom functions to declare locality on a row-by-row basis. Used in my demo of implementing DistCp in Spark: every row is a filename, which gets pushed out to a worker close to the data, which does the upload. This is very much a subset of DistCp, but it shows that you can do this with RDDs and cloud storage.
  • + all the tests
I think maybe Apache Bahir would be the ultimate home for this. For now, it's a bit too unstable.

(photo: spices on sale in a Mombasa market)

2017-11-22

subatomic


I've just committed HADOOP-13786, Add S3A committer for zero-rename commits to S3 endpoints. Contributed by Steve Loughran and Ryan Blue.

This is a serious and complex piece of work; I need to thank:
  1. Thomas Demoor and Ewan Higgs from WDC for their advice and testing. They understand the intricacies of the S3 protocol to the millimetre.
  2. Ryan Blue for his Staging-based S3 committer. The core algorithms and code will be in hadoop-aws come Hadoop 3.1.
  3. Colleagues for their support, including the illustrious Sanjay Radia, and Ram Venkatesh for letting me put so much time into this.
  4. Reviewers, especially Ryan Blue, Ewan Higgs, Mingliang Liu and extra-especially Aaron Fabbri @ Cloudera. It's a big piece of code to learn. First time a patch of mine has ever crossed the 1MB source barrier.
I now understand a lot about commit protocols in Hadoop and Spark, including the history of interesting failures encountered, events which are reflected in the change logs of the relevant classes. Things you never knew about the Hadoop MapReduce commit protocol:
  1. The two different algorithms, v1 and v2, have very different semantics about the atomicity of task and job commits, including when output becomes visible in the destination directory.
  2. Neither algorithm is atomic in both task and job commit.
  3. V1 is atomic in task commits, but O(files) in its non-atomic job commit. It can recover from any job failure without having rerun all succeeded tasks, but not from a failure in job commit.
  4. V2's job commit is a repeatable atomic O(1) operation, because it is a no-op. Task commits do the move/merge, which are O(files), make the output immediately visible, and as a consequence mean that failure of a job leaves the output directory in an unknown state.
  5. Both algorithms depend on the filesystem having consistent listings and Create/Update/Delete operations
  6. The routine to merge the output of a task to the destination is a real-world example of a co-recursive algorithm. These are so rare most developers don't even know the term for them -or have forgotten it.
  7. At-most-once execution is guaranteed by having the tasks and AM failing when they recognise that they are in trouble.
  8. The App Master refuses to commit a job if it hasn't had a heartbeat with the YARN Resource Manager within a specific time period. This stops it committing work if the network is partitioned and the AM/RM protocol fails...YARN may have considered the job dead and restarted it.
  9. tasks commit iff they get permission from the AM; thus they will not attempt to commit if the network partitions.
  10. if a task given permission to commit does not report a successful commit to the AM, the V1 algorithm can rerun the task; v2 must conclude it's in an unknown state and abort the job.
  11. Spark can commit using the Hadoop FileOutputCommitter; its Parquet support has some "special" code which refuses to work if the committer is not a subclass of ParquetOutputCommitter
  12. That is: its special code makes it the hardest thing to bind to this; ORC, CSV and Avro all work out of the box.
  13. It adds the ability for tasks to provide extra data to its job driver for use in job commit; this allows committers to explicitly pass commit information directly to the driver, rather than indirectly via the (consistent) filesystem.
  14. Everyone's code assumes that abort() completes in a bounded time, and does not ever throw that IOException its signature promises it can.
  15. There's lots of cruft in the MRv2 codebase to keep the MRv1 code alive, which would be really good to delete
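The v1/v2 contrast in points 3 and 4 above can be sketched in pseudocode (illustrative only; the real FileOutputCommitter has much more going on):

```
v1.commitTask():  rename(taskAttemptDir, jobAttemptDir)   // atomic on HDFS
v1.commitJob():   for f in files(jobAttemptDir):          // O(files), not atomic,
                      mergeInto(destDir, f)               //   but recoverable before job commit

v2.commitTask():  for f in files(taskAttemptDir):         // O(files); output becomes visible
                      mergeInto(destDir, f)               //   as each task commits
v2.commitJob():   no-op                                   // repeatable, O(1)
```

The trade-off is visible straight away: v1 pays at job commit for the ability to recover; v2 pays for its fast job commit by making partial output visible and job failure unrecoverable.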
This means I get to argue the semantics of commit algorithms with people, as I know what the runtimes "really do", rather than what is believed by everyone who has neither implemented part of it nor stepped through the code in a debugger.

If we had some TLA+ specifications of filesystems and object stores, we could perhaps write the algorithms as PlusCal examples, but that needs someone with the skills and the time. I'd have to find the time to learn TLA+ properly as well as specify everything, so it won't be me.

Returning to the committers, what do they do which is so special?

They upload task output to the final destination paths in the tasks, but don't make the uploads visible until the job is committed.

No renames, no copies, no job-commit-time merges, and no data visible until job commit. Tasks which fail/fail to commit do not have any adverse side effects on the destination directories.

First, read S3A Committers: Architecture and Implementation.

Then, if that seems interesting look at the source.

A key feature is that we've snuck into FileOutputFormat a mechanism to allow you to provide different committers for different filesystem schemas.

Normal file output formats (i.e. not Parquet) will automatically get the committer for the target filesystems, which, for S3A, can be changed from the default FileOutputCommitter to an S3A-specific one. And any other object store which also offers delayed materialization of uploaded data can implement their own and run it alongside the S3A ones, which will be something to keep the Azure, GCS and openstack teams busy, perhaps.
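The per-schema binding is configuration-driven; if I have the property name right from the S3A committer documentation, it looks something like this (treat the exact names as illustrative):

```xml
<property>
  <name>mapreduce.outputcommitter.factory.scheme.s3a</name>
  <value>org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory</value>
</property>
```

The factory then picks the concrete committer (directory, partitioned, magic...) from the filesystem's own configuration, so output to hdfs:// and s3a:// destinations in the same job can get different committers.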

For now though: users of Hadoop can use Amazon S3 (or compatible services) as the direct destination of Hadoop and Spark workloads without any overheads of copying the data, and the ability to support failure recovery and speculative execution. I'm happy with that as a good first step.

(photo: street vendors at the Kenya/Tanzania Border)

2017-09-09

Stocator: A High Performance Object Store Connector for Spark


Behind Picton Street

IBM have published a lovely paper on their Stocator 0-rename committer for Spark

Stocator is:
  1. An extended Swift client
  2. magic in their FS to redirect mkdir and file PUT/HEAD/GET calls under the normal MRv1 __temporary paths to new paths in the dest dir
  3. generating dest/part-0000 filenames using the attempt & task attempt ID to guarantee uniqueness and to ease cleanup: restarted jobs can delete the old attempts
  4. Commit performance comes from eliminating the COPY, which is O(data),
  5. And from tuning back the number of HTTP requests (probes for directories, mkdir 0 byte entries, deleting them)
  6. Failure recovery comes from explicit names of output files. (note, avoiding any saving of shuffle files, which this wouldn't work with...spark can do that in memory)
  7. They add summary data in the _SUCCESS file to list the files written & so work out what happened (though they don't actually use this data, instead relying on their swift service offering list consistency). (I've been doing something similar, primarily for testing & collection of statistics).

Page 10 has their benchmarks, all of which are against an IBM storage system, not real Amazon S3 with its different latencies and performance.

Table 5: Average run time


Run                 Read-Only     Read-Only     Teragen       Copy          Wordcount     Terasort      TPC-DS
                    50GB          500GB
Hadoop-Swift Base   37.80±0.48    393.10±0.92   624.60±4.00   622.10±13.52  244.10±17.72  681.90±6.10   101.50±1.50
S3a Base            33.30±0.42    254.80±4.00   699.50±8.40   705.10±8.50   193.50±1.80   746.00±7.20   104.50±2.20
Stocator            34.60±0.56    254.10±5.12   38.80±1.40    68.20±0.80    106.60±1.40   84.20±2.04    111.40±1.68
Hadoop-Swift Cv2    37.10±0.54    395.00±0.80   171.30±6.36   175.20±6.40   166.90±2.06   222.70±7.30   102.30±1.16
S3a Cv2             35.30±0.70    255.10±5.52   169.70±4.64   185.40±7.00   111.90±2.08   221.90±6.66   104.00±2.20
S3a Cv2 + FU        35.20±0.48    254.20±5.04   56.80±1.04    86.50±1.00    112.00±2.40   105.20±3.28   103.10±2.14

The S3a tested is the 2.7.x version, which has stabilised enough to be usable with Thomas Demoor's fast output stream (HADOOP-11183). That stream buffers in RAM and initiates the multipart upload once the block size threshold is reached. Provided you can upload data faster than you run out of RAM, it avoids the long waits at the end of close() calls, so has a significant speedup. (The fast output stream has evolved into the S3ABlockOutputStream (HADOOP-13560), which can buffer off-heap and to HDD, and which will become the sole output stream once the great cruft cull of HADOOP-14738 goes in.)

That means in the doc, "FU" == fast upload == incremental upload & RAM storage. The default for S3A will become HDD storage, as unless you have a very fast pipe to a compatible S3 store, it's easy to overload the memory.
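If I remember the option names right, the buffering mechanism is selected with fs.s3a.fast.upload.buffer, with disk being the safe choice (treat the exact values as illustrative of the option, not gospel):

```xml
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <!-- "disk" buffers blocks on HDD; "array"/"bytebuffer" buffer in on/off-heap RAM -->
  <value>disk</value>
</property>
```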

Cv2 means the MRv2 commit algorithm, the one which does a single rename operation on task commit (here the COPY), rather than one rename in task commit to promote that attempt and then another in job commit to finalise the entire job. So: only one COPY of every byte PUT, rather than two, and the COPY calls can run in parallel, often off the critical path.

 Table 6: Workload speedups when using Stocator



Run                 Read-Only   Read-Only   Teragen   Copy      Wordcount   Terasort   TPC-DS
                    50GB        500GB
Hadoop-Swift Base   x1.09       x1.55       x16.09    x9.12     x2.29       x8.10      x0.91
S3a Base            x0.96       x1.00       x18.03    x10.33    x1.82       x8.86      x0.94
Stocator            x1          x1          x1        x1        x1          x1         x1
Hadoop-Swift Cv2    x1.07       x1.55       x4.41     x2.57     x1.57       x2.64      x0.92
S3a Cv2             x1.02       x1.00       x4.37     x2.72     x1.05       x2.64      x0.93
S3a Cv2 + FU        x1.02       x1.00       x1.46     x1.27     x1.05       x1.25      x0.93


Their TPC-DS benchmarks show that stocator & swift is slower than Hadoop 2.7 S3a + fast upload & MRv2 commit. Which means that (a) the Hadoop swift connector is pretty underperforming and (b) with fadvise=random and columnar data (ORC, Parquet) that speedup alone will give better numbers than swift & stocator. (It also shows how much the TPC-DS benchmarks are IO-heavy rather than output-heavy the way the tera-x benchmarks are.)

As the co-author of that original swift connector then, what the IBM paper is saying is "our zero rename commit just about compensates for the functional but utterly underperformant code Steve wrote in 2013 and gives us equivalent numbers to 2016 FS connectors by Steve and others, before they started the serious work on S3A speedup". Oh, and we used some of Steve's code to test it, removing the ASF headers.

Note that as the IBM endpoint is neither the classic python Openstack swift nor Amazon's real S3, it won't exhibit the real issues those two have. Swift has the worst update inconsistency I've ever seen (i.e. repeatable whenever I overwrote a large file with a smaller one), and aggressive throttling even of the DELETE calls in test teardown. AWS S3 has its own issues, not just in list inconsistency, but serious latency of HEAD/GET requests, as they always go through the S3 load balancers. That is, I would hope that IBM's storage offers significantly better numbers than you get over long-haul S3 connections, although it'd be hard (impossible) to do a consistent test there. I'd fear that in-EC2 performance numbers would actually be worse than those measured.

I might post something faulting the paper, but maybe I should do a benchmark of my new committer first. For now though, my critique of both the swift:// and s3a:// clients is as follows:

Unless the storage services guarantees consistency of listing along with other operations, you can't use any of the MR commit algorithms to reliably commit work. So performance is moot. Here IBM do have a consistent store, so you can start to look at performance rather than just functionality. And as they note, committers which work with object store semantics are the way to do this: for operations like this you need the atomic operations of the store, not mocked operations in the client.

People who complain about the performance of using swift or s3a as a destination are blissfully unaware of the key issue: the risk of data loss due to inconsistencies. Stocator solves both issues at once.

Anyway, it means we should be planning a paper or two on our work too, maybe even start by doing something about random IO and object storage, as in "what can you do for and in columnar storage formats to make them work better in a world where a seek() + read is potentially a new HTTP request?"

(picture: parakeet behind Picton Street)






2017-05-05

Is it time to fork Guava? Or rush towards Java 9?

Lost Crew WiP

Guava problems have surfaced again.

Hadoop 2.x has long-shipped Guava 14, though we have worked to ensure it runs against later versions, primarily by re-implementing our own classes of things pulled/moved across versions.


Hadoop trunk has moved up to Guava 21.0, HADOOP-10101. This has gone and overloaded the Preconditions.checkState() method, such that if you compile against Guava 21, your code doesn't link against older versions of Guava. I am so happy about this I could drink some more coffee.
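The link breakage is a general Java hazard: javac binds a call to the most specific applicable overload at compile time, preferring fixed-arity methods over varargs. Here's an illustrative sketch simulating the two Guava versions with two local overloads (names and bodies are mine, not Guava's):

```java
public class OverloadLinkDemo {

    // Guava 14 only had the varargs form of checkState.
    static String checkState(boolean b, String msg, Object... args) {
        return "varargs";
    }

    // Guava 21 added fixed-arity overloads like this, to avoid the
    // varargs array allocation on every call.
    static String checkState(boolean b, String msg, Object a1) {
        return "fixed-arity";
    }

    public static void main(String[] args) {
        // javac binds this call to the one-argument overload, because
        // fixed-arity beats varargs in overload resolution. Run the
        // resulting bytecode against a jar which lacks that overload and
        // you get NoSuchMethodError, even though the varargs form could
        // have handled the call.
        System.out.println(checkState(true, "x=%s", "y"));
    }
}
```

So the breakage happens at link time, not compile time, which is exactly why it only surfaces on clusters with an older Guava on the classpath.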

Classpaths are the gift that keeps on giving, and any bug report with the word "Guava" in it is inevitably going to be a mess. In contrast, Jackson is far more backwards compatible; the main problem there is getting every JAR in sync.

What to do?

Shade Guava Everywhere
This is going to be tricky to pull off. Andrew Wang has taken on this task. This is one of those low-level engineering projects which doesn't have press-release benefits but which has the long-term potential to reduce pain. I'm glad someone else is doing it & will keep an eye on it.

Rush to use Java 9
I am so looking forward to this from an engineering perspective:

Pull Guava out
We could do our own Preconditions, our own VisibleForTesting attribute. More troublesome are the various cache classes, which do some nice things...hence they get used. That's a lot of engineering.
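A sketch of what "our own Preconditions" could look like; the class, method and annotation names here are hypothetical, chosen only to mirror the Guava signatures most commonly used:

```java
import java.lang.annotation.Documented;

// Hypothetical in-house replacement for the @VisibleForTesting marker.
@Documented
@interface VisibleForTesting { }

// Hypothetical in-house replacement for the two most-used checks.
final class Preconditions {
    private Preconditions() { }

    static void checkState(boolean expression, String message) {
        if (!expression) {
            throw new IllegalStateException(message);
        }
    }

    static <T> T checkNotNull(T reference, String message) {
        if (reference == null) {
            throw new NullPointerException(message);
        }
        return reference;
    }
}

public class PreconditionsDemo {
    @VisibleForTesting
    static int parsePort(String s) {
        int port = Integer.parseInt(s);
        Preconditions.checkState(port > 0 && port < 65536, "bad port " + port);
        return port;
    }

    public static void main(String[] args) {
        System.out.println(parsePort("8020"));
    }
}
```

The point being: these two pieces are trivial to own. It's the caches and collections which would be the real engineering.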

Fork Guava
We'd have to keep up to date with all new Guava features, while reinstating the bits they took away. The goal: stuff built with old Guava versions still works.

I'm starting to look at option four. Biggest issue: cost of maintenance.

There's also the fact that once we use our own naming "org.apache.hadoop/hadoop-guava-fork" then maven and ivy won't detect conflicting versions, and we end up with > 1 version of the guava JARs on the CP, and we've just introduced a new failure mode.

Java 9 is the one that has the best long-term potential, but at the same time, the time it's taken to move production clusters onto Java 8 makes it 18-24 months out at a minimum. Is that so bad though?

I actually created the "Move to Java 9" JIRA in 2014. It's been lurking there, Akira Ajisaka doing the equally unappreciated step-by-step movement towards it.

Maybe I should just focus some spare-review-time onto Java 9; see what's going on, review those patches and get them in. That would set things up for early adopters to move to Java 9, which, for in-cloud deployments, is something where people can be more agile and experimental.

(photo: someone painting down in Stokes Croft. Lost Crew tag)

2017-04-12

Mocking: an enemy of maintenance

Bristol spring

I'm keeping myself busy right now with HADOOP-13786, an O(1) committer for job output into S3 buckets. The classic filesystem relies on rename() for that, but against S3 rename is a file-by-file copy whose time is O(data) and whose failure mode is "a mess", amplified by the fact that an inconsistent FS can create the illusion that destination data hasn't yet been deleted: false conflict.
This creates failures like SPARK-18512, FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A, as well as long commit delays.

I started this work a while back, making changes into the S3A Filesystem to support it. I've stopped focusing on that committer, and instead pulled in the version which Netflix have been using, which has the advantages of a thought out failure policy, and production testing. I've been busy merging that with the rest of the S3A work, and am now at the stage where I'm switching it over to the operations I've written for the first attempt, the "magic committer". These are in S3A, where they integrate with S3Guard state updates, instrumentation and metrics, retry logic, etc etc. All good.

The actual code to do the switchover is straightforward. What is taking up all my time is fixing the mock tests. These are failing with false positives of "I've broken the code", when really the cause is "these mock tests are too brittle". In particular, I've had to rework how the tracking of operations goes, as a mock Amazon S3 client is no longer used by the committer; instead it's associated with the FS instance, which is then shared by all operations in a single test method. And the use of S3AFS methods shows up where it's failing due to the mock instance not initializing properly. I ended up spending most of Tuesday simply implementing the abort() call; now I'm doing the same on commit(). The production code switches fine, it's just the mock stuff.

This has really put me off mocking. I have used it sporadically in the past, and I've occasionally had to work with other people's. Mocking has some nice features:
  • Can run in unit tests which don't need AWS credentials, so Yetus/Jenkins can run them on patches.
  • Can be used to simulate failures and validate outcomes.
But the disadvantage is I just think they are too high maintenance. One test I've already migrated to being an integration test against an object store; I retained the original mock one, but just deleted that yesterday as it was going to be too expensive to migrate, and, with that IT test, obsolete.

The others, well: the changes for abort() should help, but every new S3A method that gets called triggers new problems which I need to address. This is, well, "frustrating".
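The brittleness has a simple shape: mock tests tend to assert on which client methods were invoked, so any refactoring of the production code's call pattern fails the test even when behaviour is preserved. A hand-rolled illustration, with the interface and names invented for the example (not the real AWS SDK):

```java
import java.util.Arrays;
import java.util.List;

// A hand-rolled stand-in for a mock: records which client calls happen.
interface BlobClient {
    void delete(String key);
    void deleteBatch(List<String> keys);
}

class RecordingClient implements BlobClient {
    int singleDeletes;
    int batchDeletes;
    public void delete(String key) { singleDeletes++; }
    public void deleteBatch(List<String> keys) { batchDeletes++; }
}

public class MockBrittlenessDemo {
    // Production code after a refactoring: same outcome (keys removed),
    // but now done in one batch call instead of N single calls.
    static void cleanup(BlobClient client, List<String> keys) {
        client.deleteBatch(keys);
    }

    public static void main(String[] args) {
        RecordingClient client = new RecordingClient();
        cleanup(client, Arrays.asList("a", "b"));
        // A mock-style assertion pinned to the old implementation detail
        // ("delete() called twice") now fails, even though the observable
        // behaviour is unchanged. Assertions on outcomes, or an
        // integration test against a real store, survive the refactoring.
        System.out.println("single=" + client.singleDeletes
            + " batch=" + client.batchDeletes);
    }
}
```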

It's really putting me off mocking. Ignoring the Jenkins aspect, the key benefit is structured fault injection. I believe I could implement that in the IT tests too, at least in those tests which run in the same JVM. If I wanted to, I could probably even do it in the forked VMs by propagating details on the desired failures to the processes. Or, if I really wanted to be devious, by running an HTTP proxy in the test VM and simulating network failures for the AWS client code itself to hit. That wouldn't catch all real-world problems (DNS, routing), but I could raise authentication failures, transient HTTP failures, and of course, force in listing inconsistencies. This is tempting, because it will help me qualify the AWS SDK we depend on, and could be re-used for testing the Azure storage too. Yes, it would take effort —but given the cost of maintaining those mock tests after some minor refactoring of the production code, it's starting to look appealing.

(photo: Garage door, Greenbank, Bristol)

2016-07-19

Gardening the commons

It's been a warm-but-damp summer in Bristol, and the vegetation in the local woods has been growing well. This means the bramble and hawthorn branches have been making their way out of the undergrowth and into the light —more specifically the light available in the mountain bike trails.

As both these plants' branches have spiky bits on them, the fact that they are growing onto the trails hurts, especially if you are trying to get round corners fast. And if anyone is going round the trail without sunglasses, they run a risk of getting hurt in or near the eye.

I do always wear sunglasses, but the limitations on taking the fast line through the trails hurt, and as there are a lot of families out right now, I don't want the kids to get too scraped.

So on Saturday morning, much to the amazement of my wife, I picked up the gardening shears. Not to do anything in our garden though —to take to the woods with me.

Gardening

Findings
  1. Those Kevlar backed gloves that the Clevedon police like are OK on the outside for working with spiky vegetation, but the fingertips are vulnerable.
  2. Gardening gets boring fast.
  3. When gardening an MTB trail, look towards the direction oncoming riders will take.
  4. A lot of people walking dogs get lost and ask for directions back to the car park.
  5. Someone had already gardened about 1/3 of the Yertiz trail (pronunciation based on "yeah-it-is").
  6. Nobody appreciates your work.
I appreciate the outcome of my own work: I can now go round at speed, only picking up scrapes on the forearms on the third of the trail that nobody has trimmed yet. I actually cut back on the inside of the corners there for less damage on the racing line, while cutting the outside and face-height bits for the families. Now they can have more fun on the weekends; I can do my fast work weekday lunchtimes.
Gardening

They can live their lives with fewer wailing children, and I've partially achieved my goal of less blood emitted per iteration. I'll probably finish it off with the final 1/3 at some point, maybe mid-August, as I can't rely on anyone else.

There are no trail pixies but what we make.

Alsea Trail Pixie sighting

Which brings me to OSS projects, especially patches and bug reports in Hadoop.

I really hate it when my patches get completely ignored, big or small.

Take YARN-679 for example. The wrap-up piece of YARN-117, its code has celebrated its third birthday. Number of reviewers: zero. I know it's a big patch, but it's designed to produce a generic entry point for YARN services with reflection-based loading of config files (you can ask for HiveConfig and HBaseConfig too, see), interrupt handling which even remembers if it's been called once, so on the second control-C it bypasses the shutdown hooks (assumption: they've blocked on some IPC-retry chatter with a now-absent NN) and bails out fast. Everything designed to support exit codes, subclass points for testability. This should be the entry point for every single YARN service, and it hasn't had a single comment by anyone. How does that make me feel? Utterly ignored —even by colleagues. I do, at least, understand what else they are working on... it's not like there is a room full of patch reviewers saying "there are no patches to review —let's just go home early". All the people with the ability to review the patches have so many commitments of their own that the time to review they can allocate is called "weekends".

And as a result, I have a list of patches awaiting review and commit, patches which are not only big diffs, they are trivial ones which fix things like NPEs in reporting errors returned over RPC. That's a 3KB patch, reaching the age where, at least with my own child, we were looking at nursery schools. Nothing strategic, something useful when things go wrong. Ignored.

That's what really frustrates me: small problems, small patches, nobody looks at it.

And I'm as guilty as the rest. We provide feedback on some patch-in-progress, then get distracted and never do the final housekeeping. I feel bad here, precisely because I understand the frustration.

Alongside "old, neglected patches", there are "old, neglected bugs". Take, for example, HADOOP-3733, "s3:" URLs break when Secret Key contains a slash, even if encoded. Stuart Sierra gave a good view of the history from his perspective.

The bug was filed in 2008
  1. it was utterly ignored
  2. Lots of people said they had the same problem
  3. Various hadoop developers said "Cannot reproduce"
  4. It was eventually fixed on 2016-06-16 with a patch by one stevel@apache.
  5. On 2016-06-16 cnauroth@apache filed HADOOP-13287 saying "TestS3ACredentials#testInstantiateFromURL fails if AWS secret key contains '+'.".
Conclusions
  • The Hadoop developers neglect things
  • if we'd fixed things earlier, similar problems wouldn't arise.


I mostly concur. Especially in the S3 support, where historically the only FTEs working on it were Amazon, and they have their own codebase. In ASF Hadoop, Tom White started the code, and it's slowly evolved, but it's generally been left to various end users to work on.

Patch submission is also complicated by the fact that for security reasons, Jenkins doesn't test the stuff. We've had enough problems of people under-testing their patches here that there is a strictly enforced policy of "tell us which infrastructure you tested against". The calling out of "name the endpoint" turns out to be far better at triggering honest responses than "say that you tested this". And yes, we are just as strict with our colleagues. A full test run of the hadoop-aws module takes 10-15 minutes, much better than the 30 minutes it used to take, but still means that any review of a patch is time consuming.

I would normally estimate the time to review an S3 patch to take 1-2 hours. And, until a few of us sat down to work on S3A functionality and performance against Hive and Spark, those 1-2 hours were going weekend time only. Which is why I didn't do enough reviewing.

Returning to the S3 "/" problem:
  1. This whole thing was related to AWS-generated secrets. Those of us whose AWS secrets didn't have a "/" in them couldn't replicate the problem. Thus it was a configuration-space issue rather than something visible to all.
  2. There was a straightforward workaround, "generate new credentials", so it wasn't a blocker.
  3. That related issue, HADOOP-13287, is actually highlighting a regression caused by the fix for HADOOP-3733. In the process of allowing URLs to contain "/" symbols, we managed to break the ability to use "+" in them.
  4. The regression was caught because the HADOOP-3733 patch included some tests which played with the tester's real credentials. Fun problem: writing tests to do insecure things which don't leak secrets in logs and assert statements.
  5. HADOOP-13287 is not an example of "there are nearby problems" so much as "every bug fix moves the bug", something noted in Brooks's "The Mythical Man-Month" in his coverage of IBM OS patches.
  6. And again, this is a configuration-space problem; it was caught because Chris had a "+" in his secret.
Finally, the reason why it didn't surface with many of us, even though we had "/" in the secret, is that the problem only arises if you put your AWS secrets in the URL itself, as s3a://AWSID:secret-aws-key@bucket

That is: if your filesystem URI contains secrets which, if leaked, threaten the privacy and integrity of your data and put you at risk of running up large bills, then, if the secret has a "/", the URL doesn't authenticate.

This is not actually an action I would recommend. Why? Because throughout the Hadoop codebase we assume that filesystem URIs do not contain secrets. They get logged, they get included in error messages, they make their way into stack traces that can go into bug reports. AWS credentials are too important to be sticking in URLs.
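The parsing failure itself is easy to show with java.net.URI. A sketch with obviously fake credentials: a raw "/" in the userinfo section terminates the authority early, so the host can't be found; percent-encoding fixes that, but a naive URL-decode of the secret afterwards is what turns a literal "+" into a space —the follow-on regression:

```java
import java.net.URI;
import java.net.URLDecoder;

public class SecretUriDemo {
    public static void main(String[] args) throws Exception {
        // Fake credentials for illustration only; never put real ones in URIs.
        // The "/" in the secret ends the authority section early: the URI
        // parser falls back to a registry-based authority and getHost()
        // returns null.
        URI broken = URI.create("s3a://AKID:ab/cd@bucket/path");
        System.out.println("host=" + broken.getHost());

        // Percent-encoding the "/" keeps the authority intact, and
        // getUserInfo() hands back the decoded form.
        URI encoded = URI.create("s3a://AKID:ab%2Fcd@bucket/path");
        System.out.println("userinfo=" + encoded.getUserInfo());

        // ...but URL-decoding a secret which legitimately contains "+"
        // converts it to a space: a different secret, auth failure.
        System.out.println(URLDecoder.decode("ab+cd", "UTF-8"));
    }
}
```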

Once I realised people were doing this, I did put aside a morning to fix things. Not so much fixing the encoding of  "/" in the secrets (and accidentally breaking the encoding of "+" in the process), but:
  1. Pulling out the auth logic for s3, s3n and s3a into a new class, S3xLoginHelper.
  2. Having that code strip out user:pass from the FS URL before the S3 filesystems pass it up to their superclass.
  3. Doing test runs and seeing if that is sufficient to keep those secrets out of the logs (it isn't).
  4. Having S3xLoginHelper print a warning whenever somebody tries to use secrets in URLs.
  5. Edit the S3 documentation file to tell people not to do this —and warn that the feature may be removed.
  6. Edit the Hadoop S3 wiki page telling people not to do this.
  7. Finally: fix the encoding for /, adding tests
  8. Later, fix the test for +
That's not just an idle "may be removed" threat. In HADOOP-13252, you can declare which AWS credential providers to support in S3A, be it your own, conf-file, env var, IAM, and others. If you start doing this, your ability to embed secrets in s3a URLs goes away. Assumption: people who know what they are doing shouldn't be doing anything so insecure.
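For reference, that declaration is a normal Hadoop configuration property; a minimal sketch (the property name is the HADOOP-13252 one; the provider class shown is one plausible choice, and you can list several, comma-separated):

```xml
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <!-- Pick up credentials from fs.s3a.access.key / fs.s3a.secret.key
       in the configuration; URL-embedded secrets are no longer consulted. -->
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
```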

Anyway, I'm glad my effort fixing the issue is appreciated. I also share everyone's frustration with neglected patches, as it wastes my effort and leaves the bugs unfixed, features ignored.

We need another bug bash. And I need to give that final third of the Yertiz trail a trim.

2016-05-28

Fear of Dependencies

There are some things to be scared of; some things to view as a challenge and embrace anyway.

peter ended up walking

Here, Hardknott Pass falls into the challenge category —at least in summertime. You know you'll get up, the only question is "cycling" or "walking".

Hardknott in winter is a different game; it's a "should I be trying to get up here at all" kind of issue. Where, for reference, the answer is usually: no. Find another way around.

Upgrading dependencies to Hadoop jitters between the two, depending on what upgrade is being proposed.

And, as the nominal assignee of HADOOP-9991, "upgrade dependencies", I get to see this.

We regularly get people submitting one-line patches, "upgrade your dependency so you can work with my project" —and they are such tiny diffs people think "what a simple patch, it's easy to apply".

The problem is they are one line patches that can lead to the HBase, Hive or Spark people cornering you and saying things like "why do you make my life so hard?"

Before making a leap to Java 9, we're trapped whatever we do. Upgrade: things downstream break. Don't upgrade: things downstream break when they update something else, or pull in a dependency which has itself updated.

While Hadoop has been fairly good at keeping its own services stable, where it causes problems is in applications that pull in the Hadoop classpath for their own purposes: HBase, Hive, Accumulo, Spark, Flink, ...

Here's my personal view on the risk factor of various updates.

Critical:

We know things will be trouble —and upgrades are full cross-project epics

  • protobuf. This will probably never be updated during the lifespan of Hadoop 2, given how Google broke its ability to link to previously generated code.
  • Guava. Google cut things. Hadoop ships with Guava 11 but has moved off all deleted classes so runs happily against Guava 16+. I think it should be time just to move up, on the basis of Java 8 compatibility alone.
  • Jackson. The last time we updated, everything worked in Hadoop, but broke HBase. This makes everyone very sad.
  • In Hive and Spark: Kryo. Hadoop core avoids that problem; I did suggest adding it purely for the pain it would cause the Hive team (HADOOP-12281) —they knew it wasn't serious but as you can see, others got a bit worried. I suspect it was experience with my other POM patches that made them worry.
I think a Jackson update is probably due, but will need conversations with the core downstream projects. And perhaps bump up Guava, given how old it is.

High Risk

Failures are traumatic enough we're just scared of upgrading unless there's a good reason.
  • jetty/servlets. Jetty has been painful (threads in the Datanodes to perform liveness monitoring of Jetty is an example of the workarounds), but it was a known and managed problem. Plan is to move off Jetty entirely and -> jersey + grizzly.
  • Servlet API.
  • jersey. HADOOP-9613 shows how hard that's been
  • Tomcat. Part of the big webapp set
  • Netty —again, a long standing sore point (HADOOP-12928, HADOOP-12927)
  • httpclient. There's a plan to move off Httpclient completely, stalled on hadoop-openstack. I'd estimate 2-3 days there, more testing than anything else. Removing a dependency entirely frees downstream projects from having to worry about the version Hadoop comes with.
  • Anything which has JNI bindings. Examples: leveldb, the codecs
  • Java. Areas of trauma: Kerberos, java.net, SASL,


With the move of trunk to Java 8, those servlet/webapp versions all need to be rolled.

Medium Risk

These are things where we have to be very cautious about upgrading, either because of a history of brittleness, or because failures would be traumatic
  • Jets3t. Every upgrade of jets3t moved the bugs. It's effectively frozen as "trouble, but a stable trouble", with S3a being the future
  • Curator 2.x ( see HADOOP-11612 ; HADOOP-11102) I had to do a test rebuild of curator 2.7 with guava downgraded to Hadoop's version to be confident that there were no codepaths that would fail. That doesn't mean I'm excited by Curator 3, as it's an unknown.
  • Maven itself
  • Zookeeper -for its use of guava.
Here I'm for leaving Jets3t alone; and, once that Guava is updated, curator and ZK should be aligned.

Low risk:

Generally happy to upgrade these as later versions come out.
  • SLF4J yes, repeatedly
  • log4j 1.x (2.x is out as it doesn't handle log4j.properties files)
  • avro as long as you don't propose picking up a pre-release.
    (No: Avro 1.7 to 1.8 update is incompatible with generated compiled classes, same as protobuf.)
  • apache commons-lang (minor: yes; major: no)
  • Junit

I don't know which category the AWS and Azure SDKs fall into. Their Jackson dependency flags them as a transitive trouble spot.

Life would be much easier if (a) the Guava team stopped taking things away and (b) either Jackson stopped breaking things or someone else produced a good JSON library. I don't know of any; I have only encountered worse.

2016-05-31 Update: ZK doesn't use Guava. That's Curator I'm thinking of. Correction by Chris Nauroth.


2016-04-26

Distributed Testing: making use of the metrics



3Dom, St Werburghs

Summary

In this article I introduce the concept of Metrics-first Testing, and show how instrumenting the internals of classes, enabling them to be published as metrics, enables better testing of distributed systems, while also offering potential to provide more information in production.

Exporting instrumented classes in the form of remotely accessible metrics permits test runners to query the state of the System Under Test, both to make assertions about its state, and to collect histories and snapshots of its state for post-run diagnostics.

This same observable state may be useful in production —though there is currently no evidence to support this hypothesis.

There are a number of issues with the concept. A key one is that if these metrics do prove useful in production, then they become part of the public API of the system, and must be supported across future versions.

Introduction: Metrics-first Testing


I've been doing more scalatest work, as part of SPARK-7889, SPARK-1537 and SPARK-7481. Alongside that, SLIDER-82, anti-affine work placement across a YARN cluster. And, most recently, wrapping up S3A performance and robustness for Hadoop 2.8, HADOOP-11694, where the cost of an HTTP reconnect appears on a par with reading 800KB of data, meaning: you are better off reading ahead than breaking a connection on any forward seek under ~900KB. (That's transatlantic to an 80MB FTTC connection; setup time is fixed, and TCP slow start also means that the longer the connection is held, the better the bandwidth gets.)

On these projects, I've been exploring the notion of metrics-first testing. That is: your code uses metric counters as a way of exposing the observable state of the core classes, and then tests can query those metrics, either at the API level or via web views.

Here's a test for HADOOP-13047: S3a Forward seek in stream length to be configurable

@Test
  public void testReadAheadDefault() throws Throwable {
    describe("Verify that a series of forward skips within the readahead" +
        " range do not close and reopen the stream");
    executeSeekReadSequence(32768, 65536);
    assertEquals("open operations in " + streamStatistics,
        1, streamStatistics.openOperations);
  }

Here's the output

testReadAheadDefault: Verify that a series of forward skips within the readahead
  range do not close and reopen the stream

2016-04-26 11:54:25,549 INFO  Reading 623 blocks, readahead = 65536
2016-04-26 11:54:29,968 INFO  Duration of Time to execute 623 seeks of distance 32768
 with readahead = 65536: 4,418,524,000 nS
2016-04-26 11:54:29,968 INFO  Time per IOP: 7,092,333 nS
2016-04-26 11:54:29,969 INFO  Effective bandwidth 0.000141 MB/S
2016-04-26 11:54:29,970 INFO  StreamStatistics{OpenOperations=1, CloseOperations=0,
  Closed=0, Aborted=0, SeekOperations=622, ReadExceptions=0, ForwardSeekOperations=622,
  BackwardSeekOperations=0, BytesSkippedOnSeek=20381074, BytesRead=20381697,
  BytesRead excluding skipped=623, ReadOperations=0, ReadsIncomplete=0}

I'm collecting internal metrics of a stream, and using that to make assertions about the correctness of the code. Here, that if I set the readahead range to 64K, then a series of seek and read operations stream through the file, rather than break and reconnect the HTTPS link.

This matters a lot, as shown by one of the other tests, which times an open() call as well as that to actually read the data

testTimeToOpenAndReadWholeFileByByte: Open the test file
  s3a://landsat-pds/scene_list.gz and read it byte by byte

2016-04-26 11:54:47,518 Duration of Open stream: 181,732,000 nS
2016-04-26 11:54:51,688 Duration of Time to read 20430493 bytes: 4,169,079,000 nS
2016-04-26 11:54:51,688 Bandwidth = 4.900481  MB/S
2016-04-26 11:54:51,688 An open() call has the equivalent duration of
  reading 890,843 bytes
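That "equivalent duration" figure is just arithmetic over the measured numbers. A quick sanity check, with the figures copied from the run above (rounding differences against the logged 890,843 are expected, as the logged value comes from the raw, unrounded timings):

```java
public class BreakEvenDemo {
    public static void main(String[] args) {
        // Figures from the timed run above.
        double openNanos = 181_732_000.0;   // cost of one open()
        double bytesRead = 20_430_493.0;    // bytes streamed...
        double readNanos = 4_169_079_000.0; // ...in this much time

        // Sustained read rate in bytes per nanosecond.
        double bytesPerNano = bytesRead / readNanos;

        // How many bytes could have been read in the time one open() took?
        // Forward seeks shorter than this are cheaper to read through
        // than to abort and reopen.
        System.out.println("break-even bytes = "
            + Math.round(openNanos * bytesPerNano));
    }
}
```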

Now here's a Spark test using the same source file and s3a connector

ctest("CSVgz", "Read compressed CSV", "") {
    val source = sceneList
    sc = new SparkContext("local", "test", newSparkConf(source))
    val sceneInfo = getFS(source).getFileStatus(source)
    logInfo(s"Compressed size = ${sceneInfo.getLen}")
    val input = sc.textFile(source.toString)
    val (count, started, time) = duration2 {
      input.count()
    }
    logInfo(s" size of $source = $count rows read in $time nS")
    assert(ExpectedSceneListLines <= count)
    logInfo(s"Filesystem statistics ${getFS(source)}")
  }

Which produces, along with the noise of a local spark run, some details on what the FS got up to
2016-04-26 12:08:25,901  executor.Executor Running task 0.0 in stage 0.0 (TID 0)
2016-04-26 12:08:25,924  rdd.HadoopRDD Input split: s3a://landsat-pds/scene_list.gz:0+20430493
2016-04-26 12:08:26,107  compress.CodecPool - Got brand-new decompressor [.gz]
2016-04-26 12:08:32,304  executor.Executor Finished task 0.0 in stage 0.0 (TID 0). 
  2643 bytes result sent to driver
2016-04-26 12:08:32,311  scheduler.TaskSetManager Finished task 0.0 in stage 0.0 (TID 0)
  in 6434 ms on localhost (1/1)
2016-04-26 12:08:32,312  scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks
  have all completed, from pool 
2016-04-26 12:08:32,315  scheduler.DAGScheduler ResultStage 0 finished in 6.447 s
2016-04-26 12:08:32,319  scheduler.DAGScheduler Job 0 finished took 6.560166 s
2016-04-26 12:08:32,320  s3.S3aIOSuite  size of s3a://landsat-pds/scene_list.gz = 464105
  rows read in 6779125000 nS

2016-04-26 12:08:32,324 s3.S3aIOSuite Filesystem statistics
  S3AFileSystem{uri=s3a://landsat-pds,
  workingDir=s3a://landsat-pds/user/stevel,
  partSize=104857600, enableMultiObjectsDelete=true,
  multiPartThreshold=2147483647,
  statistics {
    20430493 bytes read,
     0 bytes written,
     3 read ops,
     0 large read ops,
     0 write ops},
     metrics {{Context=S3AFileSystem}
      {FileSystemId=29890500-aed6-4eb8-bb47-0c896a66aac2-landsat-pds}
      {fsURI=s3a://landsat-pds/scene_list.gz}
      {streamOpened=1}
      {streamCloseOperations=1}
      {streamClosed=1}
      {streamAborted=0}
      {streamSeekOperations=0}
      {streamReadExceptions=0}
      {streamForwardSeekOperations=0}
      {streamBackwardSeekOperations=0}
      {streamBytesSkippedOnSeek=0}
      {streamBytesRead=20430493}
      {streamReadOperations=1488}
      {streamReadFullyOperations=0}
      {streamReadOperationsIncomplete=1488}
      {files_created=0}
      {files_copied=0}
      {files_copied_bytes=0}
      {files_deleted=0}
      {directories_created=0}
      {directories_deleted=0}
      {ignored_errors=0} 
      }}

What's going on here?

I've instrumented S3AInputStream, instrumentation which is then returned to its S3AFileSystem instance.
This instrumentation can not only be logged, it can be used in assertions.

And, as the FS statistics are actually Metrics2 data, they can be collected from running applications.

By making the observable state of object instances real metric values, I can extend their observability from unit tests to system tests —all the way to live clusters.

  1. This makes assertions on the state of remote services a simple matter of GET /service/metrics/$metric + parsing.
  2. It ensures that the internal state of the system is visible for diagnostics of both test failures and production system problems. Here: how is the file being accessed? Is the spark code seeking too much —especially backwards? Were there any transient IO problems which were recovered from?
    These are things which the ops team may be grateful for in the future, as now there's more information about what is going on.
  3. It encourages developers such as myself to write those metrics early, at the unit test time, because we can get immediate tangible benefit from their presence. We don't need to wait until there's some production-side crisis and then rush to hack in some more logging. Classes are instrumented from the outset. Indeed, in SPARK-11373 I'm actually implementing the metrics publishing in the Spark History server —something the SPARK-7889 code is ready for.
Metrics-first testing, then, is instrumenting the code and publishing it for assertions in unit tests, and for downstream test suites.
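As a sketch of the first point, here is a self-contained mock-up: a tiny HTTP server standing in for the System Under Test's metrics servlet, and a test-runner-side probe asserting on the value. All paths, names and the JSON shape are invented for the example (real code would use a JSON parser, and a Coda Hale servlet's payload looks different):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetricsProbeDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the System Under Test: one metric exposed as JSON.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/metrics/streamOpened", exchange -> {
            byte[] body = "{\"streamOpened\": 1}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            URL url = new URL("http://localhost:" + port + "/metrics/streamOpened");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String json;
            try (InputStream in = conn.getInputStream()) {
                json = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
            // The test-runner side: query remote state, assert on it.
            if (!json.contains("\"streamOpened\": 1")) {
                throw new AssertionError("expected one stream open, got " + json);
            }
            System.out.println("assertion passed: " + json);
        } finally {
            server.stop(0);
        }
    }
}
```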

I'm just starting to experiment with this metrics-first testing.

I have ambitions to make metric capture and monitoring a more integral part of test runs. In particular, I want test runners to capture those metrics. That's either by setting up the services to feed the metrics to the test runner itself, capturing the metrics directly by polling servlet interfaces, or capturing them indirectly via the cluster management tools.

Initially that'll just be a series of snapshots over time, but really, we could go beyond that and include in test reports the actual view of the metrics: what happened to various values over time? When the YARN timeline server says its average CPU was at 85%, what was the Spark history server saying its cache eviction rate was?

Similarly, those S3A counters are just initially for microbenchmarks under hadoop-tools/hadoop-aws, but they could be extended up the stack, through Hive and Spark queries, to full applications. It'll be noisy, but hey, we've got tooling to deal with lots of machine-parseable noise: Apache Zeppelin.

What are the flaws in this idea?

 

Relevance of metrics beyond tests.


There's the usual issue: the metrics we developers put in aren't what the operations team need. That's inevitable, but at least we are adding lots of metrics into the internal state of the system, and once you start instrumenting your code, you are more motivated to continue to add the others.

 

Representing Boolean values


I want to publish a boolean metric: has the slider App Master had a node map update event from the YARN RM? That's a bool, not the usual long value metrics tools like. The fix there is obvious for anyone who has programmed in C:
public class BoolMetric extends AtomicBoolean implements Metric, Gauge<Integer> {

  @Override
  public Integer getValue() {
    return get() ? 1 : 0;
  }
}
It's not much use as a metric, except in the case that you are trying to look at system state and see what's wrong. It actually turns out that you don't get an initial map —something which GETs against the Coda Hale JSON metrics servlet did pick up in a minicluster test. It's already paid for itself. I'm happy. It's just that it shows the mismatch between what is needed to monitor a running app, things you can have triggers and graphs of, and a simple bool state view.

 

Representing Time


I want to track when an update happened, especially relative to other events across the system. I don't see (in the Coda Hale metrics) any explicit notion of time other than histograms of performance. I want to publish a wall time, somehow. Which leaves me with two options. (a) A counter listing the time in milliseconds *when* something happened. (b) A counter listing the time in milliseconds *since* something happened. From a monitoring perspective, (b) is better: you could set an alarm if the counter value went over an hour.

From a developer perspective, absolute values are easier to test with. They also support the value "never" better, with -1 being a good marker here. I don't know what value of "never" in a time-since-event metric couldn't be misinterpreted by monitoring tools. A value of -1 could be construed as good, though if it had been in that state for long enough, it becomes bad. Similarly, starting off with LONG_MAX as the value would set alarms off immediately. Oh, and either way, the time isn't displayed as a human-readable string. In this case I'm using absolute times.

I'm thinking of writing a timestamp class that publishes an absolute time on one path, and a relative time on an adjacent path. Something for everyone.
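A minimal sketch of that dual-path idea (a hypothetical class, not anything in the Coda Hale library): one stored event time, readable both as an absolute wall time and as time-since-event, with -1 and Long.MAX_VALUE as the respective "never" markers:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one event timestamp, published two ways.
// A metrics registry could expose absolute() and sinceEvent()
// as two gauges on adjacent paths.
public class TimestampMetric {
  private final AtomicLong lastEvent = new AtomicLong(-1); // -1 == "never"

  public void touch(long nowMillis) {
    lastEvent.set(nowMillis);
  }

  /** Absolute wall time of the event in millis, or -1 for "never". */
  public long absolute() {
    return lastEvent.get();
  }

  /** Millis since the event; Long.MAX_VALUE for "never". */
  public long sinceEvent(long nowMillis) {
    long t = lastEvent.get();
    return t < 0 ? Long.MAX_VALUE : nowMillis - t;
  }
}
```

The absolute path is the one tests assert on; the relative path is the one a monitoring tool would alarm on, with all the caveats about "never" values discussed above.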

 

 The performance of AtomicLongs


Java volatile variables are slightly more expensive than C++ ones, as they act as barrier operations rather than just telling the compiler never to cache them. But they are still simple types.

In contrast, the Atomic* classes are big bits of Java code, with lots of contention if many threads try to update some metric. This is why Coda Hale uses an AtomicAccumulator class, one that eventually surfaces in Java 8.

But while that reduces contention, it's still a piece of Java code trying to acquire and release locks.

It would only take a small change in the JRE for volatile, or perhaps some new variant, to implement atomic ++ and += operations at the machine-code level, so that the cost of incrementing a volatile would be almost the same as setting it.

We have to assume that Sun didn't do that in 1995-6 as they were targeting 8 bit machines, where even incrementing a 16 bit short value was not something all CPUs could guarantee to do atomically.

Nowadays, even watches come with 32 bit CPUs; phones are 64 bit. It's time for Oracle to look ahead and conclude that even 64 bit volatile addition should be made atomic.

For now, I'm making some of the counters which I know are only being updated within thread-safe code (or code that says "should only be used in one thread") volatile; querying them won't hold up the system.
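As an illustrative sketch of that pattern (not code from any Hadoop module): a counter with exactly one writer thread, where the increment is a plain volatile read-modify-write, safe only because of the single-writer guarantee, while readers always see a current value:

```java
// Hypothetical single-writer counter. inc() must only ever be called
// from the owning thread; volatile guarantees readers see the latest
// value without the CAS/lock cost of AtomicLong on the write path.
public class SingleWriterCounter {
  private volatile long value;

  /** Not atomic: safe only because there is exactly one writer. */
  public void inc() {
    value++;
  }

  public long get() {
    return value;
  }
}
```

With two or more writers, increments can be lost; that's the "only updated within thread-safe code" precondition in action.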

 

 Metrics are part of your public API


This is the troublesome one: If you start exporting information which your ops team depends on, then you can't remove it. (Wittenauer, a reviewer of a draft of this article, made that point quite clearly). And of course, you can't really tell which metrics end up being popular. Not unless you add metrics for that, and, well, you are down a slippery slope of meta-metrics at that point.

The real issue here becomes not exposing more information about the System Under Test, but exposing internal state which may change radically across versions.

What I'm initially thinking of doing here is having a switch to enable/disable registration of some of the more deeply internal state variables. The internal state of the components is not automatically visible in production, but can be turned on with a switch. That should at least make clear that some state is private.
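That switch could look something like this (a hypothetical wrapper, not Hadoop's metrics2 API or the Coda Hale registry): internal metrics are simply dropped at registration time unless the flag is set:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a registry wrapper which only accepts
// "deeply internal" metrics when an explicit switch is enabled,
// keeping private state invisible in production by default.
public class SwitchedMetricRegistry {
  private final Map<String, Object> metrics = new HashMap<>();
  private final boolean internalEnabled;

  public SwitchedMetricRegistry(boolean internalEnabled) {
    this.internalEnabled = internalEnabled;
  }

  /** Register a metric; internal ones are dropped unless enabled. */
  public void register(String name, Object metric, boolean internal) {
    if (internal && !internalEnabled) {
      return; // private state stays private
    }
    metrics.put(name, metric);
  }

  public boolean isRegistered(String name) {
    return metrics.containsKey(name);
  }
}
```

Anything registered through the internal path is then visibly opt-in, which at least signals "don't build your runbooks on this".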

However, it may turn out that the metrics end up being invaluable during troubleshooting; something you may not discover until you take them away.

Keeping an eye on troubleshooting runbooks and being involved in support calls will keep you honest there.

 

Pressure to align your counters into a bigger piece of work


For the S3a code, this surfaces in HDFS-10175; a proposal to make more of those FS level stats visible, so that at the end of an SQL query run, you can get aggregate stats on what all filesystems have been up to. I do think this is admirable, and with the costs of an S3 HTTP reconnect being 0.1s, it's good to know how many there are.

At the same time, these overarching goals shouldn't be an excuse to hold up the low-level counters and optimisations which can be done at a micro level —what they do say is "don't make this per-class stuff public until we can do it consistently". The challenge then becomes technical: how to collect metrics which would segue into the bigger piece of work, are useful on their own, and which don't create a long-term commitment of API maintenance.

 

Over-instrumentation


As described by Jakob Homan: "Large distributed systems can overwhelm metrics aggregators. For instance, Samza jobs generated so many metrics LI's internal system blanched and we had to add a feature to blacklist whole metric classes."

These low-level metrics may be utterly irrelevant to most processes, yet, if published and recorded, will add extra load to the monitoring infrastructure.

Again, this argues for making the low-level metrics off by default, unless explicitly enabled by a debugging switch.

In fact, it almost argues for having some metric-enabling profile similar to Log4J settings, where you could turn on, say, the S3a metrics at DEBUG level for a run, leaving them off elsewhere. That could be something to investigate further.

Perhaps I could start by actually using the log level of the classes as the cue to determine which metrics to register:
if (LOG.isDebugEnabled()) {
    registerInternalMetrics();
}

Related work

I've been trying to find out who else has done this, and what worked or didn't, but there doesn't seem to be much published work. There's a lot of coverage of performance testing —but this isn't that. This is about a philosophy of instrumenting code for unit and system tests, using metrics as that instrumentation —and in doing so not only enabling better assertions to be made about the state of the System Under Test, but hopefully providing more information for production monitoring and diagnostics.

Conclusions

In Distributed Testing, knowing more about state of the System Under Test aids both assertions and diagnostics. By instrumenting the code better, or simply making the existing state accessible as metrics, it becomes possible to query that state during test runs. This same instrumentation may then be useful in the System In Production —though that it is something which I currently lack data about.

Acknowledgements

It's not often (ever?) that I get people to review blog posts before I publish them: this one I did as it's introducing concepts in system testing which impacts everything from code to production system monitoring. Thank you to the reviewers: Jakob Homan, Chris Douglas, Jay Kreps, Allen Wittenauer, Martin Kleppman.

I don't think I've addressed all their feedback, especially Chris's (security, scope, and others), and Jay went into detail on how structured logging would be superior —something I'll let him expound on in the Confluent blog.

Interestingly, I am exposing the s3a metrics as log data —it lets me keep those metrics internal, and lets me see their values in Spark tests without changing that code.

AW pointed out that I was clearly pretty naive in terms of what modern monitoring tools could do, and should do more research there: "On first blush, this really feels naïve as to the state of the art of monitoring tools, especially in the commercial space where a lot of machine learning is starting to take shape (e.g., Netuitive, Circonus, probably Rocana, etc, etc)." Clearly I have to do this...

(Artwork: 3Dom in St Werburgh's)


2015-07-16

Book Review, Hadoop Security, and distributed security in general

I've been reading the new ORA book, Hadoop Security, by Ben Spivey and Joey Echeverria. There aren't many reviews up there, so I'll put mine up.


Summary
  • reasonable intro to kerberos hadoop clusters
  • covers the identity -> cluster user mapping problem
  • ACLs in HDFS, YARN &c covered nicely —explanation and configuration
  • Shows you pretty much how to configure every Hadoop service for authenticated and authorized access, audit loggings and data & transport encryption.
  • has Sentry coverage, if that matters to you
  • Has some good "putting it all together" articles
  • Index seems OK.
  • Avoids delving into the depths of implementation (strength and weakness)

Overall: good from an ops perspective, for anyone coding in/against Hadoop, background material you should understand —worth buying.

Securing Distributed Systems

I'd bought a copy of the ebook while it was still a work in progress, so I got to see the original Chapter 2, "securing distributed systems: chapter to come". I actually think they should have left that page as it is, on the basis that Distributed System Security is a Work in Progress. And while it's easy for all of us to say "defence in depth", none of us really practice that properly even at home. Where is the two-tier network with the fundamentally untrustable IoT layer (TVs, light bulbs, telephones, bittorrent servers) on a separate subnet from the critical household infrastructure of desktops, laptops and home servers? How many of us keep our ASF, SSH and github credentials on an encrypted USB stick which must be unlocked for use? None of us. Bear that in mind whenever someone talks about security infrastructure: ask them how they lock down their house. (*)

Kerberos is the bit I worry about day to day, so how does it stack up?

I do think it covers the core concepts-as-a-user, and has a workflow diagram which presents time quite nicely. It avoids going into the details of the protocol which, as anyone who has ever read Coulouris & Dollimore will note, are mindnumbingly complex and hit the mathematics layer pretty hard. A good project for learning TLA+ would probably be "specify Kerberos".

ACLs are covered nicely too, while encryption covers HDFS, Linux FS and wire encryption, including the shuffle.

There's coverage of lots of the Hadoop stack: core Hadoop, HBase, Accumulo, Zookeeper, Oozie & more. There's some specifics on Cloudera bits, Impala and Sentry, but not exclusively, and all the example configs are text files, not management-tool centric: they'll work everywhere.

Overall then: a pretty thorough book on Hadoop security; for a general overview of security, Kerberos, ACLs and configuring Hadoop, it brings everything together into one place.

If you are trying to secure a Hadoop cluster, invest in a copy

Limitations

Now, where is it limited?

1. A lot of the book is configuration examples for N+ services & audit logs. It's a bit repetitive, and I don't think anybody would sit down and type those things in. However, there are so many config files in the Hadoop space, and at least how to configure all the many services is covered. It just hampers the readability of the book.

2. I'd have liked to have seen the HDFS encryption mechanism illustrated, especially KMS integration. It's not something I've sat down to understand, and the same UML sequence diagram style used for Kerberos would have gone down well.

3. It glosses over precisely how hard it is to get Kerberos working, how your life will be frittered away staring at error messages which make no sense whatsoever, only for you to discover later they mean "java was auto updated and the new version can't do long-key crypto any more". There's nothing serious in this book about debugging a Hadoop/Kerberos integration which isn't working.

4. Its bit on coding against Kerberos is limited to a couple of code snippets around UGI login and doAs. Given how much pain it takes to get Kerberos to work client side, including ticket renewal, delegation token creation, delegation token renewal, debugging, etc, one and a half pages isn't even a start.
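For reference, the kind of thing those snippets cover is roughly this (a sketch against the real UserGroupInformation API; the principal and keytab path are placeholders, and everything around renewal and token management is exactly the part that goes undocumented):

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only: log in from a keytab, then execute work as that principal.
// Placeholder principal/keytab values; error handling and ticket renewal
// are deliberately omitted, and are where the real pain lives.
public class KerberosLoginSketch {
  public static void main(String[] args) throws Exception {
    UserGroupInformation ugi =
        UserGroupInformation.loginUserFromKeytabAndReturnUGI(
            "service/host@EXAMPLE.COM", "/etc/security/keytabs/service.keytab");

    ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // Hadoop IPC and filesystem calls made in here run as the
      // logged-in principal and pick up its Kerberos credentials.
      return null;
    });
  }
}
```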

Someone needs to document Hadoop & Kerberos for developers —this book isn't it.

I assume that's a conscious decision by the authors, for a number of valid reasons
  • It would significantly complicate the book.
  • It's a niche product, being for developers within the Hadoop codebase.
  • It'd make maintenance of the book significantly harder.
  • To write it, you need to have experienced the pain of adding a new Hadoop IPC, writing client tests against in-VM zookeeper clusters locked down with MiniKDC instances, or tried to debug why Jersey+SPNEGO was failing after 18 hours on test runs.
The good news is that I have experienced the suffering of getting code to work on a secure Hadoop cluster, and want to spread that suffering more broadly.

For that reason, I would like to announce the work in progress, gitbook-toolchained ebook:

Kerberos and Hadoop: The Madness beyond the Gate

This is an attempt to write down things I've learned, using a Lovecraftian context to make clear this is forbidden knowledge that will drive the reader insane**. Which is true. Unfortunately, if you are trying to write code to work in a Hadoop cluster —especially YARN applications or anything acting as a service for callers, be they REST or IPC, you need to know this stuff.

It's less relevant for anyone else, though the Error Messages to Fear section is one of the things I felt the Hadoop Security book would have benefited from.

As noted, the Madness Beyond the Gate book is a WiP and there's no schedule to extend or complete it —just something written during test runs. I may finish it; I may get bored and distracted. But I welcome contributions from others, together we can have something which will be useful for those people coding in Hadoop —especially those who don't have the luxury of knowing who added Kerberos support to Hadoop, or has some security experts at the end of an email connection to help debug SPNEGO pain.

I've also put down for a talk on the same topic at Apachecon EU Data —let's see if it gets in.


(*) Flash removed except on Chrome browsers, which I've had to go round and update this week. The two-tier network is coming once I set up a Raspberry Pi as the bridge, though with Ethernet-over-power as the core backbone, life is tricky. And with PCs in the "trust zone", I'm still vulnerable to 0-days and the hazard imposed by other household users and my use of apt-get, homebrew and maven & ivy in builds. I should really move to developing in VMs I destroy at the end of each week.

(**) plus it'd make for fantastic cover page art in an ORA book.