Solving the Netapp Open Solution for Hadoop Solutions Guide
I have been staring at the Netapp Open Solution for Hadoop Solutions Guide. I now have a headache.
Where do I begin? It's interesting to read a datasheet about Hadoop that is targeted at enterprise filesystem people. Not management "Hadoop solves all your problems", not developer "you have to rewrite all your apps", but something designed to convince people who worry about file systems that HDFS is a wrongness on the face of the planet. That's the only reason it would make sense. And like anyone trying to position a competing product to HDFS they have to completely discredit HDFS.
This paper, which tries to rejustify splitting the storage and compute platforms -the whole thing that the HDFS/Hadoop is designed to eliminate on cost grounds- has to try pretty hard.
In fact, I will say that after reading this paper that the MapR/EMC story makes a lot more sense. As what the Netapp paper is trying to say -unlike EMC- is "the Big Data platform that Hadoop is wonderful, provided you ignore the cost model of SATA-on-server that HDFS and equivalent filesystems offer for topology-aware applications".
This must have been a hard article to write, so I must give the marketing team some credit for the attempt, even if they got it so wrong technically.
First: did we really need new acronyms? Really? By Page 4 there is a new one, "REPCOUNT", that I have never come across before. Then the acronym "HDFS3" that I first thought meant version 3 of HDFS, but before I could switch to JIRA and see if someone had closed an issue "design, implement and release V3 of HDFS", I see it really means "A file saved to HDFS with the block.replication.factor attribute set to 3", or more simply, 3x replicated blocks. No need to add another acronym to a document that is knee deep in them.
Now, for some more serious criticisms, starting at page 5:
"For each 1 petabyte of data, 3 terabytes of raw capacity are needed".
If that were true, this would not be a criticism. No, it would be a sign of someone getting such a fantastic compression ration that Facebooks leading edge 30 PB server would fit into 30x3TB LFF HDDs, which you could fit into about 3U's worth of rack. If only. And don't forget that RAID uses 1.6X raw storage: it replicates too, just more efficiently.
Then there's the ingest problem. A whole section of this paper is dedicated to scaring people that they won't be able to pull in data fast enough because the network traffic will overload the system. We are not -outside my home network- using Ether over Power as the backplane any more. You can now get 46 servers with 12x3TB HDDs in them onto a single rack, with a 10GbE switch that run "with all the lights on". On a rack like that - which offers 500+TB of storage- you can sustain 10 GbE between any two servers. If the ingress and egress servers are hooked up to the same switch you could in theory move a Gigabyte per second between any two nodes or in and out the cluster. Nobody has an ingress right in that range, except maybe the physicists and their "odd" experiments. Everyone else can predict their ingress rate fairly simply, it is "look around at the amount of data you discard today". That's your ingress rate. Even a terabyte per day is pretty significant -yet on a 10GbE switched you could possibly import that from a single in-centre-off-HDFS-node in under 20 minutes. For anything out of the datacentre, your bandwidth will probably be less than 10 Gigabits, unless your name is Facebook, Yahoo!, Google or Microsoft.
Summary: ingress rate is unlikely to be bottlenecked by in-rack bandwidth even with 3x replication.
Next: Intra-HDFS bandwidth issues.
The paper says today that "A typical Hadoop node has 2GbE interfaces". No. A typical node tends to have a single 1GbE interface; bonded 2x1 GbE is rarer, as it's harder to commission.
Go that way and add on separate ToR switches and all concerns about "network switch failure" go away. You'd have to lose two network switches or power to the entire rack to trigger re-replication. At least NetApp did notice that 10GbE is going on the mainboard and said "Customers building modest clusters (of less than 128 nodes) with modern servers should consider 10GbaseT network topologies."
I'd say something different, which is "128 nodes is not a modest cluster". If you are building your rack from paired six-core CPUs with 12 LFF HDDs each with 3TB of SATA (how's that for acronyms?), then your cluster has 1536 individual cores. It will have 12*3*128 = 4608TB of storage: four petabytes. That's is not modest. That is something that can sort a petabyte of data with. Fit that up with a good 10GbE switch and enough RAM -and the latest 0.23 alpha release with Todd's CRC32c-in-SSE patch- and you could probably announce you have just won the petabyte sorting record.
Summary: "a 128 node cluster built of current CPUs and SATA storage is not modest".
It's a modest size -three racks- but it will probably be the largest file system your organisation owns. Think about that. And think about the fact that when people say "big data" what they mean is "lots of low value data". If you don't store all of it (aggressively compressed), you will regret it later.
Page 6: Drive failures. This page and its successors wound me up so much I had to stop. You see, I've been slowly doing some changes to trunk and 0.23 to tune HDFS's failure handling here, based on papers on failures by Google, Microsoft and -um- Netapp. This means I am starting to know more about the subject, and defend my statements.
The Netapp paper discusses the consequences of HDD Failure rates using the failure rate on 5K-node cluster to scare people. Yes, there will be a high failure rate there, but it's a background noise. The latest version of Hadoop 0.20.20x doesn't require a DataNode restart when a disk fails -take it away and only that 1-3 TB of data fails. When a disk fails -and this is where whoever wrote the paper really misses the point - that 2TB of missing data is still scattered across all other clusters in the rack.
If you have 128 servers in your "modest" cluster, 2TB disks and a block size of 256MB, then there were about 8000 blocks in the disk, which are now scattered across (128-1) servers. Sixtyish blocks per server. With twelve disks per server, that's about five blocks per SATA disk (=2500 MB). Even if -as claimed- the throughput of a single SATA disk is 30MB/s, those five blocks will take under two minutes to be read off disk. I'm not going to follow this chain through to network traffic as you'd also have to factor in the fact that servers are reading network packets from blocks being replicated to it at the same time and saving them to disk too (it should expect 60 blocks incoming), but the key point is this: on a 128 node cluster, the loss of a single 2TB cluster will generate a blip of network traffic, and your cluster will carry on as normal. No need for "24-hour staffing of data center technicians". No need for the ops teams to even get texted when a disk fails. Yes, loss of a full 36TB server is a bit more dramatic, but with 10 GbE it's still manageable. This paper is just trying to scare you based on a failure to understand how block replication is implicit striping of a file across the entire cluster, and they haven't looked at Hadoop's source to see what it does on a failure. Hint: look at BlockManager.java. I have.
The infrastructure handles failures, the apps are designed for it. This is not your old enterprise infrastructure. We have moved out of the enterprise world of caring about every disk that fails, and into a world of statistics.
I'm going to stop here. I can't be bothered to read the low level details if the high level stuff is so painfully wrong, either through inadequate technical review of the marketing team's world view, or deliberately through a failure to understand HDFS.
[Update 21:57 GMT]. Actually it is weirder than I first thought. This is still HDFS, just running on more expensive hardware. You get the (current) HDFS limitations: no native filesystem mounting, a namenode to care about, security on a par with NFS, without the cost savings of pure-SATA-no-licensing-fees. Instead you have to use RAID everywhere, which not only bumps up your cost of storage, puts you at risk of RAID controller failure and errors in the OS drivers for those controller (hence their strict rules about which Linux releases to trust). If you do follow their recommendations and rely on hardware for data integrity, you've cut down the probability of node-local job execution, so all FUD about replication traffic is now moot as at least 1/3 more of your tasks will be running remote -possibly even with the Fair Scheduler, which waits for a bit to see if a local slot becomes free. What they are doing then is adding some HA hardware underneath a filesystem that is designed to give strong availability out of medium availability hardware. I have seen such a design before, and thought it sucked then too. Information week says this is a response to EMC, but it looks more like NetApp's strategy to stay relevant, and Cloudera are partnering with them as NetApp offered them money and if it sells into more "enterprise customers" then why not? With the extra hardware costs of NetApp the cloudera licenses will look better value, and clearly both NetApp and their customers are in need of the hand-holding that Cloudera can offer.
I just wish someone from Cloudera had reviewed that solutions paper for technical validity before it got published.
[Artwork: ARYZ, Nelson Street]