My other computer is a datacentre. A very small one.

Once upon a time, every computer filled a whole room, and many people had to look after it. The people who wanted to use it had to submit work on punched cards, then wait an indeterminate period of time to see the results of that work.

Then minicomputers came, which were only the size of a cabinet in a room. These could be shared among fewer people and, with terminals, be more interactive. And from these machines came Unix, time_t, and buffer overflows in sprintf() statements.

It was still shared; you still had to compete with other people for finite resources. This is why, when the microcomputer came along, and then the interactive workstation, something profound happened: you could do more. You could write interactive programs and be more productive without the long waits. That was before web browser updates, interactive downloading of emergency Flash updates and AV scanners got in the way of productive work, so desktop workstations were actually useful. Of course, all these machines were not that well connected on their own, so people would run round holding floppy disks, until networking products became available. Ethernet itself dates from the era of the workstation, though apparently its frame length was partly driven by DEC's need to keep memory costs down, hence its pitiful size today. Ethernet, Netware, email and the like evolved, so now people can send me emails inviting me to phone conferences at 01:00 UK time, invitations that will somehow be tentatively stuck in my shared calendar without my knowledge, then synchronised to my phone, so that it can wake me from my sleep to let me know that I am late for a phone conference that I didn't know about. Truly, we are blessed.

And yet the wheel goes round. What is fashionable again? The Datacentre. A room full of boxes turning electrons into waste heat the way thermodynamic entropy requires, routing many of them through the CPUs in the process, so doing useful things. The first big datacentres used many racks of machines, each box with 1-2 CPUs, a couple of 512MB HDDs and 1 GbE between them. Now you can build a single rack with the same storage capacity and compute power.

But you'd still have to share it. Which is why I'm pleased to show off a little addition to our facility: a very small Hadoop cluster.

My other computer is a datacentre

These are four SL390s servers in 2U of rack; the two bits below are just expansion slots in the 4U chassis.

Each is the same basic node used in one of the top-10 supercomputers, though that machine has many more units, Infiniband interconnect and a total power budget of 1.4 MW, which is not something I'd want.

The front of the boxes contains all the interconnects: 2x10GbE on the motherboard, and a management port hooked up to 100 MbE for iLO management. Having the ports at the front is something that Allen W has complained about. It does make sense if your switches work that way, and if you can set your hot/cold aisles up so that the fronts are accessible. If your switch has its ports at the back, well, it's "suboptimal".

Round the back: power only. Shared PSUs and fans for a bit more resilience.

My other computer is a datacentre

The twin-socketed Xeon E-series parts have a relatively low power budget for x86-64 servers, though not in ARM terms; the multiple SFF HDDs you can fit into each unit give pretty good bandwidth and 4TB of storage. If you opt for the 3.5" HDDs your bandwidth drops, but you get 6TB/node.

Then there's the RAM: up to 192 GB/node. These have a bit less than that in them, but there's still more per core than the entire RAM supply in my house.

From a storage perspective, there's not that much capacity: 16-24 TB. The ratio of storage to compute and drives to compute is pretty good though, and as you can also sneak in a GPU, these four machines make a nice little setup if you have compute-intensive work. And given that those TB of storage don't need to be shared with anyone else, it's not so bad.

This then is capable of storing and working through a reasonable amount of data, building up complex in-memory structures and being as responsive mid-afternoon as it is on a weekend, as nobody else is trying to do stuff on it.

At this scale HDFS makes no sense. You don't have the capacity to handle a server failure; 3X replication is too expensive. Better to RAID everything and NFS cross mount the filesystems.
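A minimal sketch of what that layout could look like: RAID each node's disks, export the volume over NFS, and cross-mount every peer. Hostnames and paths here are invented for illustration.

```
# /etc/fstab fragment on node1 -- hypothetical hostnames and mount points.
# Each node exports its RAIDed volume over NFS and mounts its three peers.
node2:/export/data  /data/node2  nfs  defaults,noatime  0 0
node3:/export/data  /data/node3  nfs  defaults,noatime  0 0
node4:/export/data  /data/node4  nfs  defaults,noatime  0 0
```

With four nodes this gives every machine a view of all the storage without paying the 3x replication tax.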

I know the big cluster people will look at these boxes with bemusement, but think about this:
  1. It's not the only cluster I have access to. This one is free for me to play with new versions and code on without causing problems.
  2. I'm sure the mainframe people didn't think much of minicomputers, minicomputer aficionados looked down on desktop computers, and -as we can see- desktop computers are having to accept the growing functionality of phones and other devices.
This then, is my own personal datacentre.


Alternate Hadoop filesystems

Hillgrove Porter House: Smoke room

My coverage of NetApp's Hadoop story not only generated a blip in network traffic to the site, but a blip in people from NetApp viewing my LinkedIn profile (remember: the mining cuts both ways), including one Val Bercovici -I hope nobody was too offended. That's why I stuck in a disclaimer about my competence*.

I'm not against running MapReduce -or the entire Hadoop stack- against alternate filesystems. There are some good cases where it makes sense. Other filesystems offer security, NFS mounting, the ability to be used by other applications and other features. HDFS is designed to scale well on "commodity" hardware (where servers containing Xeon E5 series parts with 64+GB RAM, 10GbE and 8-12 SFF HDDs are considered a subset of "commodity"). HDFS makes some key assumptions:
  1. The hardware is unreliable --replication addresses this.
  2. Disk and network bandwidth is limited --again, replication addresses this.
  3. Failures are mostly independent, therefore replication is a good strategy for avoiding data loss.
  4. Sub-POSIX semantics are all you need --this makes handling partitioning, locking, etc. easier.
The independence of failure is something I will have to look at some other time; the key thing being that it doesn't extend to virtualised "cloud" infrastructures, where all your VMs may be hosted on the same physical machine. A topic for another day.
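That reliance on replication is just configuration: the cluster-wide default replica count is a single property. A minimal sketch of the override -the property name, dfs.replication, is the standard one, and 3 is already the default:

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Clients can also set it per file, which is how "HDFS3" in the NetApp paper discussed below comes about.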

What I will point to instead is a paper by Xyratex looking at running Hadoop MR jobs over Lustre. I'm not actually going to argue with this paper (much), as it does appear to show substantial performance benefits from the combination of Lustre + Infiniband. They then cost it out and argue that for the same amount of storage, the RAIDed Lustre installation needs less raw capacity, and that delivers long-term power savings as well as a lower purchase price.

I don't know if the performance numbers stand up, but at the very least Lustre does offer other features that HDFS doesn't: it's more general purpose, NFS mountable, supports many small files, and, with the right network, delivers lots of data to anywhere in the cluster. Also Eric Barton of Whamcloud bought us beer at the Hillgrove at our last Hadoop get together.

Cost-wise, I look at those numbers and don't know what to make of them. On page 12 of their paper (#14 in the PDF file) they say that you only need to spend 100x$7500 for the nodes in the cluster, so the extra costs of Infiniband networking and $104K for the storage are justifiable, as the total cluster capital cost comes in at half, and the power budget would be half too, leading to lower running costs. These tables I would question.

They've achieved much of the cost saving by saying "oh look, you only need half as many servers!" True, but that's cut your computation power in half too. You'd have to look hard at the overhead imposed by the datanode work on your jobs to be sure that you really can go down to half as many disks. Oh, and each node would still gain from having a bit of local FS storage for logs and overspill data: that data costs more to keep on the big storage servers, and local disk is fine for it.

IBM have stood Hadoop up against GPFS. This makes sense if you have an HPC cluster with a GPFS filesystem nearby and want an easier programming model than MPI --or the ability to re-use the Hadoop++ layers. GPFS delivers fantastic bandwidth to any node in the cluster; it just makes the cost of storage high. You may want to consider having HDDs in the nodes, using those for the low-value bulk data, and using GPFS for output data, commonly read data, and anything where your app does want to seek() a lot. There's nothing in the DFS APIs that stops you having the job input or output FS separate from your fs.default.name, after all.
When running Hadoop in AWS EC2, again, the S3 FS is a good ingress/egress FS. Nobody uses it for the intermediate work, but the pay-by-the-kilo storage costs are lower than the blockstore rates, and it's where you want to keep the cold data.
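As a sketch of that (paths purely illustrative): nothing stops a stock example job reading from one filesystem and writing to another, since the filesystem is simply part of each path's URI:

```
# Hypothetical paths: read job input from S3, write the output to HDFS.
hadoop jar hadoop-examples.jar wordcount \
    s3n://mybucket/raw-logs/ \
    hdfs://namenode:8020/results/wordcount
```

The default filesystem only determines what a relative path means; fully-qualified URIs go wherever they say.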

There's also the work my colleague Johannes did on deploying Hadoop inside the IBRIX filesystem:
This design doesn't try and deploy HDFS above an existing RAIDed storage system; it runs location-aware work inside the filesystem itself. What does that give you?
  • The existing FS features: POSIX, mounting, etc.
  • Different scalability behaviours.
  • Different failure modes: bits of the FS can go offline, and if one server fails another can take over that storage
  • RAID-level replication, not 3X.
  • Much better management tooling.
I don't know what the purchasing costs would be -talk to someone in marketing there- but I do know the technical limits.
  • Fewer places to run code next to the data, so more network traffic.
  • Rack failures would take data offline completely until it returned. And as we know, rack failures are correlated.
  • Classic filesystems are very fussy about OS versions and testing, which may create conflict between the needs of the Hadoop layers and the OS/FS.
Even so: all these designs can make sense, depending on your needs. What I was unimpressed by was trying to bring up HDFS on top of hardware that offered good availability guarantees anyway -because HDFS is designed to expect and recover from failure of medium availability hardware.

(*) Anyone who doubts those claims about my competence and the state of the kitchen after I bake anything would have changed their opinion after seeing what happened when I managed to drop a glass pint bottle of milk onto a counter in the middle of the kitchen. I didn't know milk could travel that far.

[Photo: Hillgrove Porter Stores. A fine drinking establishment on the hill above Stokes Croft. Many beers are available, along with cheesy chips]


Attack Surface

Pedestrians: push button for a cup of tea

I allege that
  1. Any program that is capable of parsing untrusted content is vulnerable to exploitation by malicious entities.
  2. Programs which parse binary file contents are particularly vulnerable, due to the tendency of such formats to use offset pointers and other values which the parsers assume are valid -and against which "fuzzing" can be used to discover new exploits.
  3. Programs that treat the data as containing executable content -even within a sandbox- are vulnerable to any exploit that permits code to escape the sandbox, or simply to Denial of Service attacks in which resources in the sandbox such as memory, CPU time and network bandwidth are consumed.
  4. Recent incidents involving the generation of false certificates for popular web sites have shown that signed web sites and signed content cannot be consistently trusted.
  5. Any program that can process content downloaded by an HTTP client application is capable of parsing untrusted content, and so vulnerable to exploitation.
  6. Any program that can process content received in the payload of email messages is capable of parsing untrusted content, and so vulnerable to exploitation.
  7. Any program that can process content delivered on untrusted physical media is capable of parsing untrusted content, and so vulnerable to exploitation.
The last of these items concerns me today. The Stuxnet worm has shown how a well-engineered piece of malware can propagate between Windows machines, and across an airgap from Windows machines to SCADA infrastructure by way of USB infection. The Sony rootkit scandal showed that even large businesses were willing to use this technique to sneak anti-copying features onto what pretended to be a music CD.

A key conclusion from the CERT advisory on the Sony incident was "Disable automatically running CD-ROMs by editing the registry to change the Autorun value to 0 (zero)", as described in Microsoft Article 155217.

I've long disabled autorun on Windows boxes, and I also remember testing, a long time ago, whether or not you could autorun a CD while the screen was locked. Answer: no. Windows waits until the screen is unlocked before an autorun application is executed. This provides some defence against malicious CDs, though not as much as disabling autorun completely.
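For reference, the change that CERT advisory points at is a single registry value; a fragment along these lines should do it (double-check Microsoft Article 155217 against your Windows version before applying):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Cdrom]
"Autorun"=dword:00000000
```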

I don't worry too much about Windows security these days because my main machines are Linux boxes, which keep themselves up to date with apt-get. This keeps browsers, Flash, JVMs and OpenOffice current, which reduces their vulnerability. I also keep an eye on the SANS newsfeed to see if there are any 0-day exploits out, and take proactive action if anything we are at risk from gets out into the wild. I've stopped fielding service requests from family members saying "there is something on my machine -can you clean it up?". If I am asked, my response is "I will reformat your HDD and install Ubuntu on it". FWIW, I have done this for one relative and it works really well.

I am confident that I do keep my machines locked down, and know that the primary vulnerability on a Linux system on which I do open source development is that I implicitly have to trust everyone who has commit rights to every application that I use on it.

If there is one issue I have with Linux, it is that it is over-paranoid. For example, once the screen is locked I can't suspend the system; I have to unlock it first and then hit the suspend button. Somewhere on Launchpad there was a bug report about that, but people who looked after servers were saying "no!", despite the fact that if I had physical access to the power button, the machine would be only four seconds away from a hard power off, and then, unless the system had CD-ROM, USB and PXE boot disabled, one to two minutes from being my box. It's annoying, but a sign that security is taken over-seriously.

Imagine my surprise then, last week, when on inserting a CD-ROM from an external source the system suddenly started grinding away, with a tiny thumbnail of each PDF file appearing one after the other. For that to happen the desktop had to have scanned through the list of files, handed each PDF file off to some parser and got a 64x64 graphic back. It wasn't the wasted effort that concerned me -it was the way it did this without me even asking.

I did some research and discovered an article on the topic. Essentially, the Linux desktop standard allows for autostart.sh files to be executed on mounting a media device. The thumbnailing is something else: Nautilus defaults to doing this for local files -and views all files on mounted local media as "local", rather than "untrusted content on an untrusted filesystem". Given that exploits have been found in the thumbnailing code, having Nautilus thumbnail files on an external device is insane.

The best bit: this happens even while the desktop is locked.

There I am, regularly saying rude words because I thought I'd suspended my laptop, but as the screen was locked it's just stayed powered up, slowly killing the battery and/or overheating in a bag. I accepted this inconvenience as I thought it was a security feature for my own good. It turns out that it was irrelevant, as all you needed to do was plug in a USB stick with any .dvi, .png or .pdf file designed to exploit some security hole, and my box is 0wned. Which, if done while I am logged in, offers access to the encrypted bits of the box.

There is a workaround -disable all thumbnailing, even of local content, as well as the usual "do nothing" action on handling media insertions. I've done that everywhere. Yet what annoys me is this: what were the dev team thinking? On the one hand there is the power management group saying "we won't let people suspend while the desktop is locked, for security reasons", while some other people are thinking "let's generate thumbnails whenever new media is inserted". The strength of OSS is that anyone can contribute. The weakness: a lot of them aren't devious enough to recognise that what they think of as a feature is someone else's route of access to your systems.
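On the GNOME 2 desktops of that era, the workaround comes down to two gconf settings; something like the following (verify the key names on your own distribution before trusting them):

```shell
# Never act automatically on media insertion, and disable all
# thumbnailers, even for local files. GConf key names as of GNOME 2.x.
gconftool-2 --type bool --set /apps/nautilus/preferences/media_autorun_never true
gconftool-2 --type bool --set /desktop/gnome/thumbnailers/disable_all true
```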

[Photo: Park Street]


Disclaimer: I reserve the right to not know what I'm talking about

break time

There have been some links to this blog with the implication that because I am a researcher at HP Laboratories with commit rights to a number of projects -including Apache Hadoop- my opinions may have some value.

Not so.

Everything I say is obviously my own opinion, does not reflect that of my employer, and may be at odds with corporate strategies of which I am blissfully unaware; if I do mention corporate plans they may change on a whim, etc., etc.

I do work at HP Labs -which is a fun place to work- but think about what that means. It means that I don't do stuff that goes straight into shipping products. The code I write is not considered safe for humanity. I may talk about it, I may write about it, some may get into the field, but only after rigorous review by people who care about such things.

This makes it a bit like a university, except that I don't teach courses. Clearly I am not considered safe near students either, in case I damage their minds.

"Oh well", people say, "he's a Hadoop committer -that must count for something?"

It does, but I don't get to spend enough time there to keep up with the changes, let alone do interesting stuff with my own ideas.

Whenever I create an issue and say "I will do this", other people in the project -Arun, Tom, Todd and others- will stare at their inbox in despair, with the same expression on their faces that my family adopts when I say "I will bake something".

Because then, yes, I may come up with something unusual and possibly even tasty, but they will know that the kitchen will be left with a thin layer of flour over everything, there will be pools of milk, and whatever went in the oven went into the closest bowl-shaped thing I could reach with one hand, rather than anything designed to be used in an oven, let alone watertight. Even if I didn't use flour or milk.

Bear that in mind before believing anything written here.

[Photo: me somewhere in the Belledonne mountains, French Alps. For some reason French villages never like it when people on bicycles come to a halt]


Towards a Topology of Failure

Mt Hood Expedition

The Apache community is not just the mailing lists and the get-togethers: it is the Planet Apache aggregate blog; this lets other committers share their thoughts -and look at yours. This makes for unexpected connections.

After I posted my comments on Availability in Globally Distributed Storage Systems, Phil Steitz posted a wonderful article on the mathematics behind it. This impressed me, not least because of his ability to get TeX-grade equations into HTML. What he did was look at the real theory behind it, and even attempt to implement a dynamic-programming solution to the problem.

I'm not going to be that ambitious, but I will try to link this paper -and the other ones on server failures- into a new concept, "Failure Topology". This is an excessively pretentious phrase, but I like it -if it ever takes off I can claim I was the first person to use it, as I can do with "continuous deployment".

The key concept of Failure Topology is that failures of systems often follow topologies. Rack outages can be caused by rack-level upgrades. Switch-level outages can take out one or more racks and are driven by the network topology. Power outages can be caused by the failure of power supplies to specific servers, sets of servers, specific racks or even quadrants of a datacentre. There's also the notion of specific hardware instances, such as server batches with the same run of HDDs or CPU revisions.

Failure topologies, then, are maps of the computing infrastructure that show how these things are related, and where the risk lies. A power topology would be a map of the power input to the datacentre. We have the switch topology for Hadoop, but it is optimised for network efficiency, rather than looking at the risk of switch failure. A failure-aware topology would need to know which racks were protected by duplicate ToR switches and view them as less at risk than single-switch racks. Across multiple sites you'd need to look at the regional power grids and the different telcos. Then there is the politics overlay: which government controls the datacentre sites; whether or not that government is part of the EU and hence has data protection rights, or whether there are some DMCA-style takedown rules.

You'd also need to look at physical issues: fault lines, whether the sites were downwind of Mt St Helen's class volcanoes. That goes from abstract topologies to physical maps.

What does all this mean? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant, Ganesha Ananthanarayanan argues that as switch and backplane bandwidth increases you don't have to worry about where your code runs relative to the data. I concur: with 10GbE and emerging backplanes, network bandwidth means that switch-local vs switch-remote will become less important. Which means you can stop worrying about Hadoop topology scripts driving code execution policies. Doing this now opens a new possibility:

Writing a script to model the failure topology of the datacentre.

You want to move from a simple "/switch2/node21" map to one that includes power sources, switches, racks, shared-PSUs in servers, something like "/ups1/switch2/rack3/psu2/node21". This models not the network hierarchy, but the failure topology of the site. Admittedly, it assumes that switches and racks share the same UPS, but if the switch power source goes away, the rack is partitioned and effectively offline anyway -so you may as well do that.
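A minimal sketch of what such a script could look like, in Python. The host-to-path mapping here is entirely invented; a real one would be generated from the ops team's asset database. Hadoop hands the configured topology script one or more hostnames or IPs as arguments and expects one path per argument on stdout:

```python
#!/usr/bin/env python
# Sketch of a failure-topology script for Hadoop. Instead of a pure
# network hierarchy, each path encodes UPS, switch, rack and shared PSU.
# The table below is hypothetical; generate it from your ops database.
import sys

FAILURE_TOPOLOGY = {
    "node21": "/ups1/switch2/rack3/psu2",
    "node22": "/ups1/switch2/rack3/psu2",
    "node40": "/ups2/switch5/rack7/psu1",
}

# Fallback for hosts the table doesn't know about.
DEFAULT_PATH = "/ups0/switch0/rack0/psu0"

def resolve(hosts):
    """Map each host to its failure-topology path."""
    return [FAILURE_TOPOLOGY.get(host, DEFAULT_PATH) for host in hosts]

if __name__ == "__main__":
    # Hadoop expects the paths space-separated, in argument order.
    print(" ".join(resolve(sys.argv[1:])))
```

Point the cluster at it via the usual topology-script configuration property; the placement code then sees the deeper hierarchy rather than just switch/node.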

I haven't played with this yet, but as I progress with my patch to allow Hadoop to focus on keeping all blocks on separate racks for availability, this failure topology notion could be the longer term topology that the ops team need to define.

[Photo: sunrise from the snowhole on the (dormant) volcano Mt Hood -not far from Google's Dalles datacentre]


Solving the Netapp Open Solution for Hadoop Solutions Guide

See No Evil

I have been staring at the Netapp Open Solution for Hadoop Solutions Guide. I now have a headache.

Where do I begin? It's interesting to read a datasheet about Hadoop that is targeted at enterprise filesystem people. Not the management pitch "Hadoop solves all your problems", not the developer pitch "you have to rewrite all your apps", but something designed to convince people who worry about file systems that HDFS is a wrongness on the face of the planet. That's the only reason it would make sense. And like anyone trying to position a competing product against HDFS, they have to completely discredit it.

This paper, which tries to rejustify splitting the storage and compute platforms -the whole thing that HDFS/Hadoop is designed to eliminate on cost grounds- has to try pretty hard.

In fact, I will say that after reading this paper, the MapR/EMC story makes a lot more sense. What the NetApp paper is trying to say -unlike EMC- is "the Big Data platform that is Hadoop is wonderful, provided you ignore the cost model of SATA-on-server that HDFS and equivalent filesystems offer for topology-aware applications".

This must have been a hard article to write, so I must give the marketing team some credit for the attempt, even if they got it so wrong technically.

First: did we really need new acronyms? Really? By page 4 there is a new one, "REPCOUNT", that I have never come across before. Then there's the acronym "HDFS3", which I first thought meant version 3 of HDFS, but before I could switch to JIRA and see if someone had closed an issue "design, implement and release V3 of HDFS", I saw it really means "a file saved to HDFS with the block.replication.factor attribute set to 3", or more simply, 3x-replicated blocks. No need to add another acronym to a document that is knee-deep in them.

Now, for some more serious criticisms, starting at page 5

"For each 1 petabyte of data, 3 terabytes of raw capacity are needed".

If that were true, this would not be a criticism. No, it would be a sign of someone getting such a fantastic compression ratio that Facebook's leading-edge 30 PB cluster would fit into 30x3TB LFF HDDs, which you could fit into about 3U's worth of rack. If only. And don't forget that RAID uses 1.6X raw storage: it replicates too, just more efficiently.

Then there's the ingest problem. A whole section of this paper is dedicated to scaring people that they won't be able to pull in data fast enough because the network traffic will overload the system. We are not -outside my home network- using Ethernet-over-power as the backplane any more. You can now get 46 servers with 12x3TB HDDs in them onto a single rack, with a 10GbE switch that runs "with all the lights on". On a rack like that -which offers 500+TB of storage- you can sustain 10 GbE between any two servers. If the ingress and egress servers are hooked up to the same switch, you could in theory move a gigabyte per second between any two nodes, or in and out of the cluster. Nobody has an ingress rate in that range, except maybe the physicists and their "odd" experiments. Everyone else can predict their ingress rate fairly simply: look around at the amount of data you discard today. That's your ingress rate. Even a terabyte per day is pretty significant -yet on a 10GbE switch you could possibly import that from a single in-centre, off-HDFS node in under 20 minutes. For anything out of the datacentre, your bandwidth will probably be less than 10 gigabits, unless your name is Facebook, Yahoo!, Google or Microsoft.
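The arithmetic behind that 20-minute claim is easy to check; a quick sketch, assuming you can sustain one gigabyte per second on the switch (10GbE's theoretical peak is about 1.25 GB/s):

```python
# Sanity check of the ingress arithmetic: time to move data at an
# assumed sustained 1 GB/s over a 10GbE switch.
def transfer_minutes(terabytes, gb_per_second=1.0):
    """Minutes needed to move `terabytes` of data at `gb_per_second`."""
    seconds = terabytes * 1000.0 / gb_per_second
    return seconds / 60.0

# A terabyte-per-day ingress, imported in one burst:
assert transfer_minutes(1.0) < 20  # under twenty minutes, as claimed
```

One terabyte at 1 GB/s is about 1000 seconds: a little under 17 minutes.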

Summary: ingress rate is unlikely to be bottlenecked by in-rack bandwidth even with 3x replication.

Next: Intra-HDFS bandwidth issues.

The paper says that "A typical Hadoop node has 2GbE interfaces". No: a typical node tends to have a single 1GbE interface; bonded 2x1GbE is rarer, as it's harder to commission. Go that way, add on separate ToR switches, and all concerns about "network switch failure" go away -you'd have to lose two network switches, or power to the entire rack, to trigger re-replication. At least NetApp did notice that 10GbE is going onto the mainboard, and said "Customers building modest clusters (of less than 128 nodes) with modern servers should consider 10GbaseT network topologies."

I'd say something different, which is "128 nodes is not a modest cluster". If you are building your racks from paired six-core CPUs with 12 LFF HDDs each with 3TB of SATA (how's that for acronyms?), then your cluster has 1536 individual cores. It will have 12*3*128 = 4608TB of storage: four petabytes. That is not modest. That is something you can sort a petabyte of data with. Fit that up with a good 10GbE switch and enough RAM -and the latest 0.23 alpha release with Todd's CRC32c-in-SSE patch- and you could probably announce you have just won the petabyte sorting record.
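Spelling that arithmetic out, with the same figures as above (two six-core CPUs and twelve 3TB LFF drives per node):

```python
# The "modest" 128-node cluster, by the numbers used in the text.
nodes = 128
cores_per_node = 2 * 6      # paired six-core CPUs
drives_per_node = 12        # LFF SATA drives
tb_per_drive = 3

total_cores = nodes * cores_per_node
total_storage_tb = nodes * drives_per_node * tb_per_drive

assert total_cores == 1536
assert total_storage_tb == 4608   # about 4.6 PB of raw storage
```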

Summary: "a 128 node cluster built of current CPUs and SATA storage is not modest".

It's a modest size -three racks- but it will probably be the largest file system your organisation owns. Think about that. And think about the fact that when people say "big data" what they mean is "lots of low value data". If you don't store all of it (aggressively compressed), you will regret it later.

Page 6: drive failures. This page and its successors wound me up so much I had to stop. You see, I've been slowly making some changes to trunk and 0.23 to tune HDFS's failure handling here, based on papers on failures by Google, Microsoft and -um- NetApp. This means I am starting to know more about the subject, and can defend my statements.

The NetApp paper discusses the consequences of HDD failure rates, using the failure rate on a 5K-node cluster to scare people. Yes, there will be a high failure rate there, but it's background noise. The latest version of Hadoop 0.20.20x doesn't require a DataNode restart when a disk fails -take that disk away and only its 1-3 TB of data goes. When a disk fails -and this is where whoever wrote the paper really misses the point- replicas of that 2TB of missing data are still scattered across all the other servers in the cluster.

If you have 128 servers in your "modest" cluster, 2TB disks and a block size of 256MB, then there were about 8000 blocks on the disk, whose replicas are now scattered across (128-1) servers. Sixtyish blocks per server. With twelve disks per server, that's about five blocks per SATA disk (about 1300 MB). Even if -as claimed- the throughput of a single SATA disk is 30MB/s, those five blocks will take well under a minute to be read off disk. I'm not going to follow this chain through to network traffic, as you'd also have to factor in that each server is reading network packets from blocks being replicated to it at the same time, and saving them to disk too (it should expect 60 blocks incoming), but the key point is this: on a 128-node cluster, the loss of a single 2TB disk will generate a blip of network traffic, and your cluster will carry on as normal. No need for "24-hour staffing of data center technicians". No need for the ops team to even get texted when a disk fails. Yes, loss of a full 36TB server is a bit more dramatic, but with 10 GbE it's still manageable. This paper is just trying to scare you, based on a failure to understand how block replication is implicit striping of a file across the entire cluster -and they haven't looked at Hadoop's source to see what it does on a failure. Hint: look at BlockManager.java. I have.
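A back-of-the-envelope check of those recovery numbers, using the same assumptions (these are the blog's figures, not measurements):

```python
# Re-replication load after losing one 2TB disk in a 128-node cluster.
disk_tb = 2
block_mb = 256
servers = 128
disks_per_server = 12
disk_mb_per_s = 30.0   # the paper's pessimistic single-disk throughput

blocks_lost = disk_tb * 1000 * 1000 // block_mb        # ~8000 blocks
blocks_per_server = blocks_lost / (servers - 1)        # spread over the rest
blocks_per_disk = blocks_per_server / disks_per_server
read_seconds = blocks_per_disk * block_mb / disk_mb_per_s

assert 60 <= blocks_per_server <= 65   # "sixtyish blocks per server"
assert read_seconds < 120              # each disk's share read well inside 2 min
```

Even at 30MB/s per spindle, every surviving disk finishes reading its share of the lost replicas in well under a minute; the cluster-wide recovery is a blip, not an outage.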

To summarise:

The infrastructure handles failures; the apps are designed for it. This is not your old enterprise infrastructure. We have moved out of the enterprise world of caring about every disk that fails, and into a world of statistics.

I'm going to stop here. I can't be bothered to read the low level details if the high level stuff is so painfully wrong, either through inadequate technical review of the marketing team's world view, or deliberately through a failure to understand HDFS.

[Update 21:57 GMT] Actually it is weirder than I first thought. This is still HDFS, just running on more expensive hardware. You get the (current) HDFS limitations: no native filesystem mounting, a namenode to care about, security on a par with NFS -without the cost savings of pure-SATA-no-licensing-fees. Instead you have to use RAID everywhere, which not only bumps up your cost of storage, but puts you at risk of RAID controller failure and of errors in the OS drivers for those controllers (hence their strict rules about which Linux releases to trust). If you do follow their recommendations and rely on hardware for data integrity, you've cut down the probability of node-local job execution, so all the FUD about replication traffic is now moot, as at least 1/3 more of your tasks will be running remote -possibly even with the Fair Scheduler, which waits for a bit to see if a local slot becomes free. What they are doing, then, is adding some HA hardware underneath a filesystem that is designed to give strong availability out of medium-availability hardware. I have seen such a design before, and thought it sucked then too. InformationWeek says this is a response to EMC, but it looks more like NetApp's strategy to stay relevant, and Cloudera are partnering with them because NetApp offered them money, and if it sells into more "enterprise customers" then why not? With the extra hardware costs of NetApp, the Cloudera licences will look better value, and clearly both NetApp and their customers are in need of the hand-holding that Cloudera can offer.

I just wish someone from Cloudera had reviewed that solutions paper for technical validity before it got published.

[Artwork: ARYZ, Nelson Street]


Lost in time

I've been following the Occupy London debacle. And it is a disaster, a PR mess for the City of London and the Church of England.

One thing about the UK is there are lots of historical things about: the castles, the iron-age hill forts, the straight roads built by the Romans, the town names derived from Latin. Some retain their grandeur but have (somewhat) moved on with the times, like here: Oxford, whose curriculum has been updated for new ideas (calculus), and whose buildings are now tourist attractions as well as a place of learning.
Oxford Colleges

Then there's the City of London. I've cycled through it a few times at the weekend: empty. Boring. Almost nobody lives there. What's interesting is that the last fortnight has shown how much of a relic of the Middle Ages it is, how much power it has -and, therefore, how much power the businesses based there have.

The key points are
  • It has special rights w.r.t. Parliament and "the crown", including lobbying powers in Parliament -a parliament which it considers itself mostly exempt from.
  • Its electorate is derived not just from the citizens, but from businesses and organisations based in the City. The citizens have the minority vote.
  • The process for becoming leader of the City is one that doesn't even pretend to be democratic.
What it means today is this: the part of Britain which contains the headquarters of the most powerful businesses in the country is effectively self-governed by those businesses, and independent from the rest of the country.
Night Chill

Which is funny, as the People's Republic of Stokes Croft has taken up that "Passport to Pimlico" theme of an independent state within a city. They just do it as a joke. Well, the joke is on them.


Reading: Availability in Globally Distributed Storage Systems

They are watching you

I am entertaining myself during test runs by reading Ford et al.'s paper, Availability in Globally Distributed Storage Systems, from Google. This is a fantastic paper, and it shows how large datacentre datasets give researchers a great advantage, so it's good that we can all read it now it has been published.

Key points
  1. storage node failures are not independent
  2. most failures last less than 15 minutes, so the liveness protocol should be set up to not worry before then
  3. most correlated failures are rack-local, meaning whole racks, switches, rack UPSs fail.
  4. Managed operations (OS upgrades &c) cause a big chunk of outages.
The Managed Operations point means that ops teams need to think about how they upgrade systems in a way that doesn't cause replication storms. Not directly my problem.
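Point #2 maps to a concrete tunable. As I understand the current HDFS codebase (check your release; property names have moved about), the namenode only declares a datanode dead after two recheck intervals plus ten missed heartbeats, which with the default settings works out comfortably under that 15-minute mark:

```java
public class DeadNodeTimeout {
    public static void main(String[] args) {
        // Assumed HDFS defaults: heartbeat.recheck.interval = 5 minutes,
        // dfs.heartbeat.interval = 3 seconds. Verify against your version.
        long recheckMillis = 5 * 60 * 1000L;
        long heartbeatMillis = 3 * 1000L;

        // Namenode formula: a datanode is treated as dead after
        // 2 recheck intervals + 10 missed heartbeats
        long expiryMillis = 2 * recheckMillis + 10 * heartbeatMillis;

        System.out.println(expiryMillis / 1000.0 / 60.0); // minutes
    }
}
```

With the defaults that is 10.5 minutes: the liveness protocol already sits inside the "most failures are transient and short" window the paper describes, so a node rebooting for an OS patch doesn't trigger a re-replication storm.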

Point #3 is. The paper argues that if you store all copies of the data on different racks, you get far better resilience than storing multiple copies on a single rack -which is exactly what Hadoop does. Hadoop has some hard-coded rules that say "two copies per rack are OK", done to save bandwidth on the backplane.

Now that switches offering fantastic backplane bandwidth are available from vendors like HP, at prices Cisco wouldn't dream of, rack locality matters less. What you save in bandwidth you lose in availability. Lose a single rack and you have to make two more copies of every block that was created in that rack and now only has one copy -and that is where you are vulnerable to data loss.
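A toy model makes the point. This sketch is mine, not Hadoop code; "Hadoop-style" here means the default two-replicas-share-a-rack placement, and it just counts how many replicas of a block survive the loss of a whole rack:

```java
import java.util.Arrays;
import java.util.List;

public class RackLoss {
    /** Replicas surviving the death of failedRack: those on other racks. */
    static int survivors(List<Integer> replicaRacks, int failedRack) {
        int alive = 0;
        for (int rack : replicaRacks) {
            if (rack != failedRack) {
                alive++;
            }
        }
        return alive;
    }

    public static void main(String[] args) {
        // Hadoop-style placement: two replicas on rack 1, one on rack 2
        List<Integer> hadoopStyle = Arrays.asList(1, 1, 2);
        // One-replica-per-rack placement, as the paper recommends
        List<Integer> onePerRack = Arrays.asList(1, 2, 3);

        // Lose rack 1: the Hadoop layout is down to a single copy,
        // one disk failure away from data loss; the other still has two.
        System.out.println(survivors(hadoopStyle, 1)); // 1
        System.out.println(survivors(onePerRack, 1));  // 2
    }
}
```

That single surviving copy is the re-replication window the correlated-failure statistics in the paper say you should worry about.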

That needs to be fixed, either in 0.23 or its successor.


Hadoop work in the new API in Groovy.


I've been doing some actual Map/Reduce work with the new API, to see how it's changed. One issue: not enough documentation. Here then is some more, and different in a very special way: the code is in Groovy.

To use them: get the groovy-all JAR on your Hadoop classpath, use the groovyc compiler to compile your Groovy source (and any Java source nearby) into your JAR, bring up your cluster, and submit the work like anything else.

This pair of operations, part of a test to see how well Groovy MR jobs work, just counts the lines in a source file; about as simple as you can get.

The mapper:

package org.apache.example.groovymr

import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.Mapper

class GroovyLineCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    final static def emitKey = new Text("lines")
    final static def one = new IntWritable(1)

    void map(LongWritable key,
             Text value,
             Mapper.Context context) {
        context.write(emitKey, one)
    }
}

Nice and simple; little different from the Java version except
  • Semicolons are optional.
  • The line ending rules are stricter to compensate, hence lines end with a comma or other half-finished operation.
  • You don't have to be so explicit about type (the def declarations) -and let the runtime sort it out. I have mixed feelings about that.
There's one other quirk: the Context parameter of the map operation (a generic type of the parent class) has to be explicitly declared as Mapper.Context. I have no idea why, except that it won't compile otherwise. The same goes for Reducer.Context.

Not much to see there then. What is more interesting is the reduction side of things.

package org.apache.example.groovymr
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer

class GroovyValueCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    void reduce(Text key,
                Iterable<IntWritable> values,
                Reducer.Context context) {
        int sum = values.collect() { it.get() } .sum()
        context.write(key, new IntWritable(sum));
    }
}

See the collect()/sum() line? It's taking the iterable of values for that key, applying a closure to each element (getting the int values), which returns a list of integers, which is then summed. That is: there is a per-element transform (it.get()) and a merging of all the results (sum()). Which is, when you think about it, almost a Map and a Reduce.
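Strip away the Hadoop types and the idiom is just "transform every element, then merge the results". In closure-free Java you have to spell it out by hand, which is exactly why the Groovy reads better. A standalone sketch with plain ints standing in for the IntWritables:

```java
import java.util.Arrays;
import java.util.List;

public class MapThenSum {
    // The per-element transform: what Groovy's { it.get() } closure does
    interface Transform<T> {
        int apply(T t);
    }

    // Transform each element (the "map"), then merge with + (the "reduce")
    static <T> int mapThenSum(Iterable<T> values, Transform<T> f) {
        int sum = 0;
        for (T v : values) {
            sum += f.apply(v);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Stand-ins for the values a reducer would see for one key
        List<Integer> counts = Arrays.asList(1, 1, 1, 1);
        int total = mapThenSum(counts, new Transform<Integer>() {
            public int apply(Integer i) { return i.intValue(); }
        });
        System.out.println(total); // 4
    }
}
```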

It's turtles all the way down.

[Artwork: Banksy, 2009 Bristol exhibition]

Microsoft and Hadoop: interesting

donkeys like music too

I am probably expected to say negative things about the recent announcement of Microsoft supporting Apache Hadoop(tm) on their Azure Cloud Platform, but I won't. I am impressed and think that it is a good thing.
  1. It shows that the Hadoop ecosystem is becoming more ubiquitous. Yes, it has flaws, which I know as one of my IntelliJ IDEA instances has the 0.24 trunk open in a window and I am staring at the monitoring2 code thinking "why didn't they use ByteBuffer here instead of hacking together XDR records by hand" (FWIW, Todd "ganglia" Lipcon says it ain't his code). Despite these flaws, it is becoming widespread.
  2. That helps the layers on top, the tools that work to the APIs, the applications.
  3. This gives the Hadoop ecosystem more momentum and stops alternatives getting a foothold. In particular the LexisNexis stuff -from a company that talks about "Killing Hadoop". Got some bad news there...
  4. Microsoft have promised to contribute stuff back. This is more than Amazon have ever done -yet AWS must have found truckloads of problems. Everyone else does: we file bugs, then try to fix them. I could pick any mildly-complex source file in the tree, do a line-by-line code review and find something to fix, even if it's just better logging or error handling. (Don't dismiss error handling, BTW.)
  5. If Amazon have forked, they get to keep that fork up to date.
  6. If MS do contribute stuff back, it will make Hadoop work properly under Windows. For now you have to install Cygwin because Hadoop calls out to various unix commands a lot of the time. A windows-specific library for these operations will make Hadoop not only more useful in Windows clusters, it will make it better for developers.
  7. MS will test at scale on Windows, which will find new bugs, bugs that they and Hortonworks will fix. Ideally they will add more functional tests too.
  8. I get to say to @savasp that my code is running in their datacentre. Savas: mine is the networking stuff to get it to degrade better on a badly configured home network.  Your ops team should not encounter this.
It's interesting that Microsoft have done this. Why?
  • It could be indicative of a low takeup of Azure outside the MS enterprise community. I used to do a lot of win32 programming (in fact I once coded on windows/386 2.04); I don't miss it, even though Visual Studio 6 used to be a really good C++ IDE. It is nicer to live in the Unix land that Kernighan and Ritchie created.(*)
  • Any data mining tooling encourages you to keep data, which earns money for all cloud service providers.
  • The layers on top are becoming interesting. That's the extra code layers, the GUI integration, etc.
  • There's no reason why enterprise customers can't also run Hadoop on windows server within their own organisations, so integrate with the rest of their world. (I'm ignoring cost of the OS here, because if you pay for RHEL6 and CDH then the OS costs become noise).
  • If you are trying to run windows code as part of your MR or Pig jobs, you now can. 
  • If you are trying to share the cluster with "legacy" windows code, you can.
Do I have any concerns?
  • Somehow I doubt MSFT will be eating their own dogfood here; this may reduce the rate they find problems, leaving it to the end users. Unless they have a fast upgrade rate it may take a while for those changes to roll out. (Look at AWS's update rate: sluggish. Maybe because they've forked)
  • To date, Hadoop is optimised for Linux; things like the way it execs() are part of this. There is a risk that changes for Windows performance will conflict with Linux performance. What happens then?
  • I foresee a growth in out-of-their-depth people trying to use Hadoop and asking newbie questions, now related to Windows. Though as we get them already, there may be no change.
  • I really wish Windows server had SSH built in rather than telnet. Telnet is dead: move on. We want SSH and SFTP filesystems, and an SFTP filesystem client in both Windows and OS/X. It's the only way to be secure.
  • I hope we don't end up in an argument over which underlying OS is best. The answer is: the one you are happy with.

(*) At least for developers. I changed the sprog's password last week as we were unhappy with him, and when I passwd'd it back he asked me "why do I use the terminal?". One day he'll learn. I'll know that day has come, as he'll have changed my password.

[Artwork: unknown, Moon Street, Stokes Croft]


Oracle and Hadoop part 2: Hardware -overkill?

Montpelier Street Scene

[is the sequel to Part 1]

I've been looking at what Oracle say you should run Hadoop on and thinking "why?"

I don't have the full specs, since all I've seen is slideware on The Register implying this is premium hardware, not just x86 boxes with as many HDDs as you can fit in a 1U, with 1 or 2 multicore CPUs and an imperial truckload of DRAM. In particular, there's mention of InfiniBand in there.

InfiniBand? Why? Is it to spread the story that the classic location-aware schedulers aren't adequate on "commodity" 12-core 64GB 24TB with 10GbE interconnect? Or are there other plans afoot?

Well, one thing to consider is the lead time for new rack-scale products: some of this exascale stuff will predate the Oracle takeover, and the design goals at the time -"run databases fast"- met Larry's needs more than the rest of the Sun product portfolio did. Though he still seems to dream of making a $ or two from every mobile phone on the planet.

The question for Oracle has been "how to get from hardware prototype to shipping what they can say is the best Oracle server?" Well, one tactic is to identify the hardware that runs Oracle better and stop supporting it. It's what they are doing against HP's servers, and will no doubt try against IBM when the opportunity arises. That avoids all debate about which hardware to run Oracle on. It's Oracle's. Next question? Support costs? Wait and see.

While all this hardware development was going on, the massive-low-cost GFS/HDFS filesystem with integrated compute was sneaking up on the side. Yes, it's easy to say -as Stonebraker did- that MapReduce is a step backwards. But it scales, not just technically, but financially. Look at the spreadsheets here. Just as Larry and Hurd -who also seemed over-fond of the Data Warehouse story- are getting excited about Oracle on Oracle hardware, somewhere your data enters but never leaves(*), somebody has to break them the bad news that people have discovered an alternative way to store and process data. One that doesn't need ultra-high-end single server designs, one that doesn't need Oracle licenses, and one that doesn't need you to pay for storage at EMC's current rates. That must have upset Larry, and kept him and the team busy on a response.

What they have done is a defensive action: Hadoop as a way of storing the low value data near the Oracle RDBMS, for you to use as the Extract-Transform-Load part of the story. That is where it does fit in, as you can offload some of the grunge work to lower end machines, and the storage to SATA. It's no different from keeping log data in HDFS but the high value data in HBase on HDFS, or -better yet, IMO- HP Vertica.

For that story to work best, you shouldn't overspec the hardware with things like InfiniBand. So why has that been done?

  • Hypothesis 1: margins are better, helps stop people going to vendors (HP, Rackable), that can sell the servers that work best in this new world.
  • Hypothesis 2: Oracle's plans in the NoSQL world depend on this interconnect.
  • Hypothesis 3: Hadoop MR can benefit from it.

Is Hypothesis 3 valid? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant [Ananthanarayanan2011], Ganesh and colleagues argue that improvements in in-rack and backplane bandwidth will mean you won't care whether your code is running on the same server as your data, or even the same rack as your data. Instead you will worry about whether your data is in RAM or on HDD, as that is what slows you down the most. I agree that even today on 10GbE rack-local is as fast as server-local, but if we could take HDFS out of the loop for server-local FS access, that difference may reappear. And while replacing Ethernet's classic Spanning Tree forwarding algorithm with something like TRILL would be great, it's still going to cost a lot to get a few hundred Terabits/s over the backplane, so in large clusters rack-local may still have an edge. If not, well, there's still the site-local issue that everyone's scared of dealing with today.

Of more immediate interest is Can High-Performance Interconnects Benefit Hadoop Distributed File System? [Sur2010]. This paper looks at what happens today if you hook up a Hadoop cluster over InfiniBand. They showed it could speed things up, even more so if you went to SSD. But go there and -today- you massively cut back on your storage capacity. It is an interesting read, though it irritates me that they fault HDFS for not using the allocateDirect feature of Java NIO, yet didn't file a bug or a fix for it. See a problem? Don't just write a paper saying "our code is faster than theirs": fix the problem in both codebases and show the speedup is still there -as you've just removed one variable from the mix.
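For the curious, the NIO feature they're complaining about is a one-liner in principle: a direct buffer lives outside the Java heap, so the JVM can hand it straight to the OS for I/O without an extra copy. A minimal illustration of the API (not the HDFS code):

```java
import java.nio.ByteBuffer;

public class DirectBuffers {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[], so native I/O needs an extra copy
        ByteBuffer heap = ByteBuffer.allocate(64 * 1024);
        // Direct buffer: native memory the kernel can read/write in place
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
    }
}
```

The catch, and presumably why it isn't a trivial patch: direct buffers are slower to allocate and aren't compacted by the GC, so you have to pool and reuse them rather than allocate one per read.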

Anyway, even with that paper, 10GbE looks pretty good; it'll be built in to the new servers, and if the NIO issue can be fixed, its performance may get even closer to InfiniBand. You'd then have to move to alternate RPC mechanisms to get the latency improvements that InfiniBand promises.

Did the Oracle team have these papers in mind when they did the hardware? Unlikely, but they may have felt that IB offers a differentiator over 10GbE. Which it does -in cost terms, limits of scale and the complexity of bringing up a rack.

They'd better show some performance benefits for that -either in Hadoop or the NoSQL DB offering they are promising.

(*) People call this the "roach motel" model, but I prefer to refer to Northwick Park Hospital in NW London. Patients enter, but they never come out alive.


Oracle and Hadoop part 1

Under the M32

I suppose I should add a disclaimer that I am obviously biased against anything Oracle does, but I will start by praising them for:
  1. Recognising that the Hadoop platform is both a threat and an opportunity. 
  2. Doing something about it.
  3. Not rewriting everything from scratch under the auspices of some JCP committee.

Point #3 is very different from how Sun would work: they'd see something nice and try to suck it into the "Java Community Process", which would either make it overweight from the baggage that Standards Bodies apply to technology (EJB3 vs Hibernate), stall it until it becomes irrelevant (most things), or ruin a good idea by proposing a complete rewrite (Restlet vs JAX-RS). Overspecify the API, underspecify the failure modes, and only provide access to the test suite if you agree to all of Sun's (and now Oracle's) demands.

No, they didn't go there. Which makes me worry: what is their plan? I don't see Larry taking to the idea of "putting all the data that belongs to companies into a filesystem and processing platform that I don't own". I wonder if the current Hadoop+R announcement is a plan to get into the game, but not all of it.

Or it could just be that they realised that if they invited Apache to join some committee on the topic they'd get laughed at. 

Sometime soon I will look at the H/W specs and wonder why that was so overspecified. Not today.

What I will say irritated me about Oracle World's announcements was not Larry Ellison, it was Mark Hurd saying snide things about HP and how Oracle's technologies were better. If that's the case, given the time to market of new systems, isn't that a critique of Hurd himself? That's Mark "money spent on R&D is money wasted" Hurd. That's the Mark whose focus on the next quarter meant that anything long term wasn't anything he believed in. That is the Mark Hurd whose appointed CIO came from Walmart and viewed any "unofficial" IT expenditure as something to clamp down on. Those of us doing agile and advanced stuff -complex software projects- ended up using tools that weren't approved: IntelliJ IDEA, JUnit tests, Linux desktops, Hudson running CI. The tooling we set up to get things done was the complete antithesis of the CIO's world view of one single locked-down Windows image for the entire company, and a choice of two machines: "the approved laptop" and "the approved desktop". Any bit of the company that got something done had to go under the radar and do things without Hurd and his close friends noticing.

That's what annoyed me.

(Artwork: something in Eastville. Not by Banksy, before anyone asks)


A note on distributed computing diagnostics

Sepr @ See No Evil

I've just stuck in my first mildly interesting co-authored Hadoop patch for a while, HADOOP-7466: "Add a standard handler for socket connection problems which improves diagnostics"

It tries to address two problems

Inadequate Diagnostics in the Java Runtime

Despite Java being "the language of the Internet" or whatever Sun used to call it, when you get any kind of networking problem (Connection Refused, No Route to Host, Unknown Host, Socket Timed Out), the standard Java exception messages don't bother to tell you which host isn't responding, which port is refusing connections, or anything else useful. In a room with 2000 machines, it's not that useful to know that one of them can't talk to another. You need to know which machine is having problems, which other machine it is trying to talk to, and whether it's at the HDFS level or something above. But no, the exception text never gets any better; whoever wrote it clearly never read Waldo's A Note on Distributed Computing, and assumed that if two machines are near each other, nothing can possibly go wrong.

Whatever they were thinking, if they tried to submit exception messages like that to the Hadoop codebase today, the review process would probably bounce them back with a "make this vaguely useful". The patch tries to fix this by taking the exception and the (hostname, port) of the source and destination (if known), and including those details in the exception text. This helps people like me know what's gone wrong with our configuration and/or network.
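The idea is trivial; that's rather the point. A rough sketch of the rewrapping (the real code lives in the patch; the method name and message format here are illustrative, not the actual HADOOP-7466 signatures):

```java
import java.io.IOException;
import java.net.ConnectException;

public class WrapExceptions {
    /**
     * Rewrap a low-level socket exception with the destination (host, port)
     * and a pointer to a wiki page explaining the failure mode, keeping the
     * original exception as the cause for stack-trace purists.
     */
    static IOException wrap(IOException e, String destHost, int destPort) {
        IOException wrapped = new IOException(
                "Call to " + destHost + ":" + destPort
                + " failed: " + e.getMessage()
                + "; see http://wiki.apache.org/hadoop/ConnectionRefused");
        wrapped.initCause(e);
        return wrapped;
    }

    public static void main(String[] args) {
        IOException raw = new ConnectException("Connection refused");
        IOException better = wrap(raw, "namenode1", 8020);
        // The useless original vs. something you can actually debug with
        System.out.println(raw.getMessage());
        System.out.println(better.getMessage());
    }
}
```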

Inadequate understanding of the fundamental network error messages

This is something I despair of. There are people out there who haven't done enough homework to know what a ConnectionRefused exception means, and ask for help when they see it. Again and again and again. Same for all the other common error messages.

The people who are trying to set up Hadoop clusters without knowing what these error messages mean are way out of their depth. That should be an appendix to Waldo's paper: the many layers of historical code underneath are not transparent. It helps to have read Tanenbaum's "Computer Networks" book; it helps to spend some time writing code at the socket layer, just to understand what goes wrong at that level. Trying to download the Hadoop artifacts and push them out to a small set of machines without this basic knowledge dooms these people to days of confusion, which inevitably propagates to the mailing lists and bug trackers. Usually someone posts a stack trace to the -user and -dev lists, then repeats it every hour until someone answers; the total cost of wasted time is surprisingly high.
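If you want the homework exercise: the whole of a typical "Hadoop is broken, Connection Refused" bug report can be reproduced in a dozen lines of plain Java, which is why it's a configuration problem and not a Hadoop one. This sketch finds a port nothing is listening on and connects to it (assuming nothing grabs the port in between):

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RefusedDemo {
    public static void main(String[] args) throws IOException {
        // Bind to port 0 to get a free port, then close it again,
        // so we know nothing is listening there
        ServerSocket probe = new ServerSocket(0);
        int port = probe.getLocalPort();
        probe.close();

        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress("localhost", port), 1000);
            System.out.println("connected");
        } catch (ConnectException e) {
            // The OS said "nothing is listening at that address".
            // No Hadoop configuration option changes this; fix the
            // address in the config, or start the service.
            System.out.println("refused");
        } finally {
            s.close();
        }
    }
}
```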

The patch, therefore, also adds references to the wiki pages for ConnectionRefused, UnknownHost, NoRouteToHost, BindException and SocketTimeout. All of these list some possible causes, and some tips on debugging the problem. They also say: this is your network, your configuration; you are going to have to fix it yourself.

Will it stop people asking for help? Unlikely. But it may get them to learn what the messages mean, and why it is a problem on their side. Because it's not my problem.

[Artwork by Sepr]


H work

This system needs a push over the edge

On my todo list, then, is to catch up with what's going on in Hadoop, get some of my minor issues checked in, and get involved with the fun stuff in trunk. That includes the 0.23 YARN stuff, but I'm also starting to think there are some data integrity risks that ought to be addressed.

First step: building everything. I am particularly excited to see that Hadoop-trunk now requires me to download and build protocol buffers [README]. I shall be updating the relevant wiki pages so that I can remember what to do on the other machines.


301 Moved Permanently

I've moved my blog from 1060.org here. Why? Well, the team at 1060research have pushed out a new release of their NetKernel product, on which the blog was running, and if I wanted to retain the URLs I'd have to upgrade the code myself.  Being lazy and all, I opted not to.

What have I been up to since going offline?
  1. On twitter @steveloughran. Idle chatter.
  2. Finishing up a major project at work that has kept me busy for the past 12-18 months. I am feeling more relaxed now.
  3. Coding in Groovy. It's like Java, only better, and it's trivial to switch between the two. There's great IDE support in IntelliJ IDEA too.
  4. Doing some proper Computer Science stuff, as opposed to Software Engineering.
  5. Paper Reading. This may seem dull, but there is a lot of interesting stuff out there. If you spend too much time knee-deep in various projects' codebases you get sucked into various issues (log4j configuration etc), and out of touch with higher level problems. 
  6. Holiday on the south coast of England. Not far; lazy. I lost a camera, which is a pity, but I've replaced it already.
My most recent set of readings was all the big datacentre papers on DRAM and HDD failures. A good way to revise concepts like Poisson distributions, Gaussian distributions, Weibull distributions (which I had never heard of before), and lots of other math-hard problems. It's shocking how much maths I have forgotten; I keep seeing these bits in the papers where they do integration or differentiation and say "clearly then", and the words mean nothing to me. I will have to revise some fundamentals.
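For my own revision notes: the Weibull distribution generalises the exponential with a shape parameter k, which is what makes it useful for disks whose failure rate changes with age (k > 1 models wear-out, k < 1 infant mortality). A quick numeric sanity check I wrote to convince myself (my code, not from any of the papers):

```java
public class Weibull {
    /** Survival function R(t) = exp(-(t/lambda)^k): k shape, lambda scale. */
    static double survival(double t, double k, double lambda) {
        return Math.exp(-Math.pow(t / lambda, k));
    }

    public static void main(String[] args) {
        // With k = 1 this collapses to the exponential distribution
        System.out.println(survival(2.0, 1.0, 1.0) == Math.exp(-2.0)); // true
        // Whatever the shape, R(lambda) = 1/e: lambda is the age by which
        // ~63% of the population has failed
        System.out.println(survival(5.0, 0.7, 5.0)); // e^-1, about 0.368
    }
}
```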

With the work I've been doing wrapping up, I'm hoping to get more involved in Hadoop and Hadoop related work. I don't have a schedule for that -I'm just reading now.