I've been reading the new ORA book, Hadoop Security, by Ben Spivey and Joey Echeverria. There aren't many reviews up there, so I'll put mine up.
Summary
- Reasonable intro to Kerberized Hadoop clusters
- covers the identity -> cluster user mapping problem
- ACLs in HDFS, YARN &c covered nicely: explanation and configuration
- Shows you pretty much how to configure every Hadoop service for authenticated and authorized access, audit logging, and data & transport encryption.
- has Sentry coverage, if that matters to you
- Has some good "putting it all together" articles
- Index seems OK.
- Avoids delving into the depths of implementation (strength and weakness)
Overall: good from an ops perspective; for anyone coding in or against Hadoop, it's background material you should understand. Worth buying.
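To make the identity -> cluster user mapping problem from the summary concrete, here is a minimal sketch of the simplest case: turning a Kerberos principal into a short cluster username. This is my own illustration, not code from the book and not Hadoop's implementation; Hadoop's real mechanism is the `hadoop.security.auth_to_local` rule set, which is far more flexible. The class name, method name and the single-trusted-realm policy are all assumptions made for the example.

```java
/** Hypothetical helper: a simplified stand-in for principal-to-user mapping. */
final class PrincipalToUser {

    /**
     * Maps "user@REALM" or "service/host@REALM" to the part before the
     * first '/'. A principal naming any realm other than trustedRealm is
     * rejected; a principal with no realm is treated as already local.
     */
    static String shortName(String principal, String trustedRealm) {
        int at = principal.lastIndexOf('@');
        String name = principal;
        if (at >= 0) {
            String realm = principal.substring(at + 1);
            if (!realm.equals(trustedRealm)) {
                throw new IllegalArgumentException("untrusted realm: " + realm);
            }
            name = principal.substring(0, at);
        }
        // Drop the host/instance component of a service principal, if any.
        int slash = name.indexOf('/');
        return slash < 0 ? name : name.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(shortName("alice@EXAMPLE.COM", "EXAMPLE.COM"));
        System.out.println(shortName("hdfs/nn1.example.com@EXAMPLE.COM", "EXAMPLE.COM"));
    }
}
```

Even this toy version shows where the trouble starts: cross-realm trust, service principals and local account names all have to agree, on every node.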
I'd bought a copy of the ebook while it was still a work in progress, so I got to see the original Chapter 2, "securing distributed systems: chapter to come". I actually think they should have left that page as it was, on the basis that Distributed System Security is a Work in Progress. And while it's easy for all of us to say "defence in depth", none of us really practise that properly, even at home. Where is the two-tier network, with the fundamentally untrustable IoT layer (TVs, light bulbs, telephones, bittorrent servers) on a separate subnet from the critical household infrastructure: the desktops, laptops and home servers? How many of us keep our ASF, SSH and GitHub credentials on an encrypted USB stick which must be unlocked for use? None of us. Bear that in mind whenever someone talks about security infrastructure: ask them how they lock down their house. (*)
Kerberos is the bit I worry about day to day, so how does it stack up?
I do think it covers the core concepts-as-a-user, and has a workflow diagram which presents time quite nicely. It avoids going into the details of the protocol, which, as anyone who has ever read Coulouris & Dollimore will note, is mind-numbingly complex and does hit the mathematics layer pretty hard. A good project for learning TLA+ would probably be "specify Kerberos".
ACLs are covered nicely too, while encryption covers HDFS, Linux FS and wire encryption, including the shuffle.
There's coverage of lots of the Hadoop stack: core Hadoop, HBase, Accumulo, ZooKeeper, Oozie & more. There are some specifics on Cloudera bits (Impala, Sentry), but not exclusively, and all the example configs are text files, not management-tool-centric: they'll work everywhere.
Overall then: a pretty thorough book on Hadoop security. For a general overview of security, Kerberos, ACLs and configuring Hadoop, it brings everything together into one place.
If you are trying to secure a Hadoop cluster, invest in a copy.
Limitations
Now, where is it limited?
1. A lot of the book is configuration examples for N+ services & audit logs. It's a bit repetitive, and I don't think anybody would sit down and type those things in. However, there are so many config files in the Hadoop space that at least it covers how to configure all the many services. It just hampers the readability of the book.
2. I'd have liked to have seen the HDFS encryption mechanism illustrated, especially KMS integration. It's not something I've sat down to understand, and the same UML sequence diagram style used for Kerberos would have gone down well.
3. It glosses over precisely how hard it is to get Kerberos working, how your life will be frittered away staring at error messages which make no sense whatsoever, only for you to discover later they mean "Java was auto updated and the new version can't do long-key crypto any more". There's nothing serious in this book about debugging a Hadoop/Kerberos integration which isn't working.
4. Its bit on coding against Kerberos is limited to a couple of code snippets around UGI login and doAs. Given how much pain it takes to get Kerberos to work client side, including ticket renewal, delegation token creation, delegation token renewal, debugging, etc, one and a half pages isn't even a start.
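As an example of the kind of debugging aid point 3 is asking for, the "can't do long-key crypto" failure can at least be checked for directly. The sketch below is mine, not the book's: it asks the JVM for the maximum AES key length its crypto policy allows, using the standard `javax.crypto.Cipher.getMaxAllowedKeyLength` call. The class and method names are made up for illustration, and the check assumes your tickets use an AES-256 encryption type.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

/** Hypothetical diagnostic: does this JVM's crypto policy allow AES-256? */
final class CryptoStrengthCheck {

    /** Max AES key size in bits permitted by the JVM's policy; 0 if AES is unavailable. */
    static int maxAesKeyBits() {
        try {
            // Integer.MAX_VALUE under an unlimited-strength policy,
            // 128 under the old "limited" default policy files.
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            return 0;
        }
    }

    /** True if 256-bit AES keys can be used on this JVM. */
    static boolean supportsAes256() {
        return maxAesKeyBits() >= 256;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("max AES key bits = " + maxAesKeyBits());
        System.out.println("AES-256 usable = " + supportsAes256());
    }
}
```

Run it with the same JVM that the failing service or client uses; a different Java on the path will happily give you a different answer.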
Someone needs to document Hadoop & Kerberos for developers —this book isn't it.
I assume that's a conscious decision by the authors, for a number of valid reasons:
- It would significantly complicate the book.
- It's a niche product, being for developers within the Hadoop codebase.
- It'd make maintenance of the book significantly harder.
- To write it, you need to have experienced the pain of adding a new Hadoop IPC, writing client tests against in-VM ZooKeeper clusters locked down with MiniKDC instances, or tried to debug why Jersey+SPNEGO was failing after 18 hours on test runs.
The good news is that I have experienced the suffering of getting code to work on a secure Hadoop cluster, and want to spread that suffering more broadly.
For that reason, I would like to announce the work in progress, gitbook-toolchained ebook, Madness Beyond the Gate.
This is an attempt to write down things I've learned, using a Lovecraftian context to make clear this is forbidden knowledge that will drive the reader insane**. Which is true. Unfortunately, if you are trying to write code to work in a Hadoop cluster (especially YARN applications or anything acting as a service for callers, be they REST or IPC), you need to know this stuff.
It's less relevant for anyone else, though the Error Messages to Fear section is one of the things I felt the Hadoop Security book would have benefited from.
As noted, the Madness Beyond the Gate book is a WiP and there's no schedule to extend or complete it; it's just something written during test runs. I may finish it; I may get bored and distracted. But I welcome contributions from others: together we can have something which will be useful for those people coding in Hadoop, especially those who don't have the luxury of knowing who added Kerberos support to Hadoop, or of having some security experts at the end of an email connection to help debug SPNEGO pain.
I've also put down for a talk on the same topic at ApacheCon EU Data; let's see if it gets in.
(*) Flash has been removed except on Chrome browsers, which I've had to go round and update this week. The two-tier network is coming in once I set up a Raspberry Pi as the bridge, though with Ether-over-power as the core backbone, life is tricky. And with PCs in the "trust zone", I'm still vulnerable to 0-days and the hazard imposed by other household users and my use of apt-get, homebrew and maven & ivy in builds. I should really move to developing in VMs I destroy at the end of each week.
(**) plus it'd make for fantastic cover page art in an ORA book.