One of the things I've been busy doing this week is tightening the filesystem contract for Hadoop, as defined in FileSystemContractBaseTest: there are some assumptions in there that were clearly "so obvious" that nobody thought them worth testing.
- that isFile() is true for a file of size zero & size n for n>0
- that you can't rename a directory into one of its own children, whether at depth 1 or depth n>1
- that you can't rename the root directory to anything
- that if you write a file in one case, you can't read it using a filename in a different case (this breaks on the local filesystem with NTFS and HFS+)
- that if you have a file in one case and write a file with the same name in a different case, you end up with two files whose names differ only in case.
- that you can successfully read back a file whose contents span the full 0-255 byte range.
- that when you overwrite an existing file with new data and read that file back, you get the new data (see the sketch after this list).
- that you can have long filenames and paths
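To make that concrete, here's a sketch of what the overwrite check might look like, written in the FileSystemContractBaseTest style. It assumes it lives in a subclass where the fs field is already bound to the filesystem under test; the method and helper names and the path are illustrative, not the actual contract test code.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

// Sketch only: assumes this sits in a subclass of FileSystemContractBaseTest,
// where `fs` is already bound to the FileSystem under test.
public void testOverwriteReplacesData() throws IOException {
  Path p = new Path("/test/overwrite.txt");
  byte[] first = dataset(256);       // covers the full 0-255 byte range
  byte[] second = dataset(512);

  writeDataset(p, first);
  writeDataset(p, second);           // overwrite the same path

  assertTrue("Not a file: " + p, fs.isFile(p));
  byte[] readBack = new byte[second.length];
  FSDataInputStream in = fs.open(p);
  try {
    in.readFully(0, readBack);       // positioned read of the whole file
  } finally {
    in.close();
  }
  assertTrue("Overwritten file returned stale data",
      Arrays.equals(second, readBack));
}

// illustrative helper: create/overwrite a file with the given bytes
private void writeDataset(Path p, byte[] data) throws IOException {
  FSDataOutputStream out = fs.create(p, true);   // overwrite = true
  try {
    out.write(data);
  } finally {
    out.close();
  }
}

// illustrative helper: generate test data cycling through every byte value
private byte[] dataset(int len) {
  byte[] data = new byte[len];
  for (int i = 0; i < len; i++) {
    data[i] = (byte) (i & 0xff);
  }
  return data;
}
```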
The others? Things I thought of as I went along, looking at assumptions in the tests (byte ranges of generated test data), and assumptions so implicit nobody bothered to specify them: case logic, mv dir dir/subdir being banned, etc. And experience with Windows NT filesystems.
This is important, because FileSystemContractBaseTest is effectively the definition of the expected behaviour of a Hadoop-compatible filesystem.
It's in a medium-level procedural language, but it can be automatically verified by machines, at least for the test datasets provided, and we can hook it up to Jenkins for automated tests.
And when an implementation of FileSystem fails any of the tests, we can point to it and say "what are you going to do about that?"
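For illustration, binding an implementation into the contract tests is roughly a matter of subclassing the base test and pointing fs at your filesystem before the tests run. The class name and URI below are placeholders, and the exact setUp wiring may differ from what the real subclasses do.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileSystemContractBaseTest;

// Placeholder class and URI: shows the pattern of binding a FileSystem
// implementation into the contract test suite, not a real filesystem.
public class TestMyFSContract extends FileSystemContractBaseTest {

  @Override
  protected void setUp() throws Exception {
    Configuration conf = new Configuration();
    // resolve whatever implementation is registered for the myfs:// scheme
    fs = FileSystem.get(URI.create("myfs://testbucket/"), conf);
    super.setUp();
  }
}
```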
If there is a weakness, it's dataset size. HDFS will let you create a file of size >1PB if you have a datacentre with the capacity and the time, but our tests don't go anywhere near that big. Even the tests against S3 & OpenStack (currently) don't try to push up files >4GB to see how that gets handled. I think I'll add a test-time property to let you choose the file size for a new test, testBigFilesAreHandled().
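Something like this is what I have in mind; the property name, default size and path are only illustrative, and again it assumes the fs field from the contract test base class:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

// Sketch only: property name, default size and path are illustrative.
public void testBigFilesAreHandled() throws IOException {
  // choose the target size at test time, e.g.
  //   mvn test -Dtest.fs.contract.big.file.size=4294967296   (a 4GB upload)
  long size = Long.getLong("test.fs.contract.big.file.size",
      16L * 1024 * 1024);                       // modest default for routine runs
  Path p = new Path("/test/bigfile.dat");
  byte[] block = new byte[64 * 1024];

  FSDataOutputStream out = fs.create(p, true);
  try {
    long written = 0;
    while (written < size) {
      int len = (int) Math.min(block.length, size - written);
      out.write(block, 0, len);
      written += len;
    }
  } finally {
    out.close();
  }
  assertEquals("Wrong length for " + p,
      size, fs.getFileStatus(p).getLen());
}
```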
The point is: tests are specifications. If you write the tests after the code, they may simply be verifying your assumptions; but if you write them first, or spend some time asking "what are the foundational expectations here - byte ranges, case sensitivity, etc.?", you can come up with more ideas about what is wanted - and more specifications to write.
Your tests can't prove that your implementation really, really matches all the requirements of the specifications, and it's really hard to test some of the concurrency aspects (how do you simulate the deletion or renaming of a sub-tree of a directory that is itself in the process of being renamed or deleted?). Code walkthroughs are the best we can do in Java today.
Despite those limits, for the majority of the code in an application, tests + reviews are mostly adequate. Which is why I've stated before: the tests are the closest we have to a specification of Hadoop's behaviour, other than the de facto behaviour of what gets released by the ASF as a non-alpha, non-beta release of the Apache Hadoop source tree (with some binaries created for convenience).
Where are we weak?
- testing of concurrency handling.
- failure modes, especially in the distributed bit.
- behaviour in systems and networks that are in some way "wrong" - Hadoop contains some implicit expectations about the skills of whoever installs it.
Suggestions?
[photo: van crashed into a wall due to snow and ice on nine-tree hill, tied to a bollard with some tape to stop it sliding any further]