Steve Loughran: Defending HDFS

It seems like everyone is picking on HDFS this week.

Why?

Some possibilities

There are some fundamental limitations of HDFS that suddenly everyone has noticed.
People who develop filesystems have noticed that Hadoop is becoming much more popular and wish to help by contributing code, tests and documentation to the open source platform, passing on their experiences running the Hadoop application stack and hardening Hadoop to the specific failure modes and other quirks of their filesystems.
Everyone whose line of business is selling storage infrastructure has realised that not only are they not getting new sales deals for hadoop clusters
Hadoop HDFS is making it is harder to justify the prices of "Big Iron" storage.

If you look at the press releases, action two, "test and improve the Hadoop stack" isn't being done by the "legacy" DFS vendors. These are the existing filesystems that are having Hadoop support retrofitted - usually by adding locality awareness to SAN-hosted location independence and a new filesystem driver with topology information for Hadoop. A key aid to making this possible is Hadoop's conscious decision to not support full Posix semantics, so it's easier to flip in new filesystems (a key one being Amazon S3's object store, which is also non-Posix).

I've looked at NetApp and Lustre before. Whamcloud must be happy that Intel bought them this week, and I look forward to consume beer with Eric Barton some time this week. I know they were looking at Hadoop integration -and have no idea what will happen now.

GPFS, well, I'll just note that they don't quote a price, instead say "have an account team contact you". If the acquisition process involves an account team, you know it wont be cents per GB. Client-side licensing is something I thought went away once you moved off Windows Server, but clearly not.

CleverSafe. This uses erasure coding as a way of storing data efficiently; it's a bit like parity encoding in RAID but not quite, because instead of the data being written to disks with parity blocks, the data gets split up into blocks and scattered through the DFS. Reading in the blocks involves pulling in multiple blocks and merging them. If you over-replicate the blocks you can get better IO bandwidth -grab the first ones coming in and discard the later ones.

Of course, you then end up with the bandwidth costs of pulling in everything over the network -you're in 10GbE territory and pretending there aren't locality issues, as well as worrying about bisection bandwidth between racks.

Or you go to some SAN system with its costs and limitations. I don't know what CleverSafe say here -potential customers should ask that. Some of the cloud block stores use e-coding; it keeps costs down and latency is something the customers have to take the hit on.

I actually think there could be some good opportunities to do something like this for cold data or stuff you want to replicate across sites: you'd spread enough of the blocks over 3 sites that you could rebuild them from any two, ideally.

Ceph: Readingthe papers, it's interesting. I haven't played with it or got any knowledge of its real-world limitations.

MapR. Not much to say there, except to note the quoted Hadoop scalability numbers aren't right. Today, Hadoop is in use in clusters up to 4000+ servers and 45+PB of storage (Yahoo!, Facebook). Those are real numbers, not projections from a spreadsheet.

There are multiple Hadoop clusters running at scales of 10-40 PB clusters, as well as lots of little ones from gigabytes up. From those large clusters, we in the Hadoop dev world have come across problems, problems we are, as an open source project, perfectly open about.

This does make it easy for anyone to point at the JIRA and say "look, the namenode can't...", or "look, the filesystem doesn't..." That' something we just have to recognise and accept.

Fine: other people can point to the large HDFS clusters and say "it has limits", but remember this: they are pointing at large HDFS clusters. Nobody is pointing at large Hadoop-on-filesystem-X clusters, for X != HDFS, -because there aren't any public instances of those.

All you get are proof of concepts, powerpoint and possibly clusters of a few hundred nodes -smaller than the test cluster I have access to.

If you are working on a DFS, well, Hadoop MapReduce is another use case you've got to deal with -and fast. The technical problem is straightforward -a new filesystem client class. The hard part is solving the economics problem of a filesystem that is designed to not only store data on standard servers and disks -but to do the computation there.

Any pure-storage story has to explain why you also need a separate rack or two of compute nodes, and why SAN Failures aren't considered a problem.

Then they have to answer a software issue: how can they be confident that the entire Hadoop software stack runs well on their filesystem? And if it doesn't, what processes have they in place to get errors corrected -including cases where the Hadoop-layer applications and libraries aren't working as expected?
Another issue is that some of the filesystems are closed source. That may be good from their business model perspective, but it means that all fixes are at the schedule of the sole organisation with access to the source. Not having any experience of those filesystems, I don't know whether or not that is an issue. All I can do is point out that it took Apple three years to make the rename() operation atomic, and hence compliant with POSIX.. Which is scary as I do use that OS on my non-Linux boxes. And before that, I used NTFS, which is probably worse.

Hadoop's development is in the open; security is already in HDFS (remember when that was a critique? A year ago?), HA is coming along nicely in the 1.x and 2.x lines. Scale limits? Most people aren't going to encounter them, so don't worry about that. Everyone who points to HDFS and says "HDFS can't" is going to have to find some new things to point too soon.

For anyone playing with alternate filesystems to hdfs:// file:// and s3://, then -here are some things to ask your vendor:

How do you qualify the Hadoop stack against your filesystem?
If there is an incompatibility, how does it get fixed?
Can I get the source, or is there an alternative way of getting an emergency fix out in a hurry?
What are the hardware costs for storage nodes?
What are the hardware costs for compute nodes?
What are the hardware costs for interconnect.
How can I incrementally expand the storage/compute infrastructure.
What are the licencing charges for storage and for each client wishing to access it?
What is required in terms of hardware support contracts (replacement disks on site etc), and cost of any non-optional software support contracts?
What other acquisition and operational costs are there?

I don't know the answers to those questions -they are things to ask the account teams. From the Hadoop perspective:

Qualification is done as part of the release process of the Hadoop artifacts.
Fix it in the source, convince someone else too (support contracts, etc)
Source? See http://hadoop.apache.org/
Server Hardware? SATA storage -servers depend on CPU and RAM you want.
Compute nodes? See above.
Interconnect? Good q. 2x1GbE getting more popular, I hear. 10 GbE still expensive
Adding new servers is easy, expanding the network may depend on the switches you have.
Licensing? Not for the Open Source bits.
H/W support: you need a strategy for the master nodes, inc. Namenode storage.
There's support licensing (which from Hortonworks is entirely optional), and the power budget of the servers.

Server power budget is something nobody is happy about. It's where reducing the space taken up by cold data would have add-on benefits -there's a Joule/bit/year cost for all data kept on spinning-media. The trouble is: there's no easy solution.

I look forward to a time in the future when solid state storage competes with HDD on a bit by bit basis, and that cold data can be moved to it -where wear levelling matters less precisely because it is cold- and warm data can live on it for speed of lookup as well as power. I don't know when that time will be -or even if.

[Artwork, Limited Press on #4 Hurlingham Road. A nice commissioned work.]

1 comment:

MilindB19 Jul 2012, 06:08:00
Based on my interactions with actual decision-makers, "future-proofing" their investments is extremely important to them. So, No one would buy 2x1GbE today. They would rather go to 10GbE. So, I would love to see this "future-proofing" in HDFS architecture.

Unfortunately, since it is based on the GFS (old, pre-colossus, based on the old network design assumptions) architecture, and somehow that has become gospel, I don't see it as the filesystem of choice that is futureproof. I have had several of these discussions with the HDFS honchos, and they are still living in the 100Mbit ethernet world. I would really like to have them justify why they see a single spinning rust-based media delivering 1 GBps. I wish they just look at the actual numbers they get in real deployments (20MBps per disk), rather than the benchmarks, and decide based on that.

Comments are usually moderated -sorry.

2012-07-19

Defending HDFS

1 comment: