That's why git-secrets is something you should preinstall on every repository you work in. Unfortunately, its pattern matching produces a lot of false matches in the Hadoop codebase, and a few in Spark. In the absence of a decent pattern to scan only text files, after installing I edit the regexps out of .git/config and rely on it scanning purely for the strings in ~/.aws/credentials.
That keeps the keys out of SCM, if everyone is set up that way.
Which leaves the next problem: if you save your Hadoop configuration files to SCM, how do you get AWS keys into system configurations?
The answer: XInclude
This is one of my configuration files, specifically hadoop-tools/hadoop-aws/src/test/resources/auth-keys.xml
This is a special file, listed in .gitignore to keep it out of the repos. Yet the keys are still in that source tree, still at risk of sneaking out.
To deal with this, stick an absolute XInclude reference into the file, pointing to the configuration file where the keys really live.
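As a minimal sketch, assuming the real keys live in a hypothetical file /home/alice/.aws/hadoop-keys.xml (that path and filename are placeholders, not my actual ones), an auth-keys.xml for the DOM-based parser of Hadoop releases before 2.9 could contain nothing but the reference:

```xml
<configuration>
  <!-- Assumption: /home/alice/.aws/hadoop-keys.xml is a placeholder;
       substitute the absolute path of your own out-of-tree keys file. -->
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="file:///home/alice/.aws/hadoop-keys.xml"/>
</configuration>
```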
This tells the Hadoop config loader to grab the keys from a file in ~/.aws; one which lives carefully out of the SCM-managed space.
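The file being pulled in is an ordinary Hadoop configuration file. A sketch of what it might hold for S3A, with obviously fake values: fs.s3a.access.key and fs.s3a.secret.key are the standard S3A credential properties, and the other S3 connectors have their own equivalents.

```xml
<configuration>
  <!-- Placeholder values only: replace with your own keys, and keep
       this file outside any SCM-managed directory. -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```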
Provided the contents of that directory are kept private, my keys will not get checked in.
They can, however, leak in various ways, including:
- In the /config URL of a service.
- Code which accidentally logs them.
- Being saved to files/datasets used in bug reports.
- Malicious code running in your system which grabs the keys and exports/logs them. This is why no OSS Jenkins servers are set up with the keys needed to test against object stores.
There are some patches under the S3A phase 2 JIRA to support credentials better [HADOOP-12723, HADOOP-12537]. This is an area where anyone who can test these patches is invaluable. I'm currently doing the review-then-commit in S3, but only at weekends, when I have some spare time; even then, as a full test run takes 2+ hours, I'm not reviewing very much.
Anyone can review Hadoop patches: confirm whether they worked or not, and show how they didn't. It's the way to verify that forthcoming code from other people works for you, and a way of contributing time and effort to the community. Other people have done the coding; time to help with the testing.
Especially now that I've just documented how to keep the keys safe when you set up S3 for a test run. Once the auth-keys.xml file is in the src/test/resources directory, the Maven tests cover s3, s3n and s3a, depending on the specific properties set.
Update: 2016-06-20: I've discovered that if you use the id:secret in a URL for an S3x filesystem, e.g. s3a://AWS01:sec/ret@mybucket, then the secrets get logged everywhere, because the Hadoop code assumes there aren't secrets in the Path or Filesystem URIs. We've cranked back a bit on the leakage (by stripping it from the FS URIs), but it still gets out everywhere. Fix: don't do that. Hadoop 2.8 will explicitly tell you off whenever you do so.
Update: 2018-02-07: Hadoop 2.9+ uses a more efficient StAX XML parser in place of the DOM-based one. That broke this secret-sequestering somewhat dramatically. For Hadoop 2.9+, do not use the file: prefix in the XInclude reference.
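A sketch of what that looks like, again using the hypothetical /home/alice/.aws/hadoop-keys.xml as a stand-in for wherever your keys file really lives:

```xml
<configuration>
  <!-- Hadoop 2.9+ form: a plain absolute path, no file: prefix.
       The path is a placeholder, not a real location. -->
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="/home/alice/.aws/hadoop-keys.xml"/>
</configuration>
```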
FWIW, it's possible to have a base auth-keys.xml whose XIncludes support both parsers: put both references in the file and add a <fallback/> element to each, so an unresolvable reference is a no-op. Worth knowing if you need to test across all branches.
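Roughly, a dual-parser file could look like the sketch below. The path is a hypothetical placeholder, and I'd treat the exact behaviour on each branch as something to verify with a test run rather than take on trust:

```xml
<configuration>
  <!-- Form with the file: prefix, for the older DOM-based parser.
       The empty fallback turns a failed include into a no-op. -->
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="file:///home/alice/.aws/hadoop-keys.xml">
    <fallback/>
  </include>
  <!-- Form without the prefix, for the StAX parser in Hadoop 2.9+. -->
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="/home/alice/.aws/hadoop-keys.xml">
    <fallback/>
  </include>
</configuration>
```

If a parser manages to resolve both references, the same properties are simply loaded twice with identical values.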
[photo: Mina Road Park, St Werburghs]