Steve Loughran: 2011-10

2011-10-19

Reading: Availability in Globally Distributed Storage Systems

I am entertaining myself in test runs by reading Ford et al's paper, Availability in Globally Distributed Storage Systems, from Google. This is a fantastic paper and it shows how large datacentre datasets themselves give researchers a great advantage, so it's good that we can all read it now it's been published.

Key points

storage node failures are not independent
most failures last less than 15 minutes, so the liveness protocol should be set up to not worry before then
most correlated failures are rack-local, meaning whole racks, switches, rack UPSs fail.
Managed operations (OS upgrades &c) cause a big chunk of outages.

The Managed Operations point means that ops teams need to think about how they upgrade systems in a way that don't cause replication storms. Not directly my problem.

Point #3 is. The paper argues that if you store all copies of the data on different racks, you get far better resilience than storing multiple copies of the data on a single rack -that being exactly what Hadoop does. Hadoop has some hard-coded rules that say "two copies per rack are OK", which is done to save bandwidth on the backplane.

Now that switches that offer fantastic backplane bandwidth are available from vendors like HP at prices that Cisco wouldn't dream of, rack locality matters less. What you save in bandwidth you lose in availability. Lose a single rack and you have to make 2x copies of every block that was created in that rack and that now only has one copy, and that is where you are vulnerable to data loss.

That needs to be fixed, either in 0.23 or its successor.

2011-10-14

Hadoop work in the new API in Groovy.

I've been doing some actual Map/Reduce work with the new API, to see how it's changed. One issue: not enough documentation. Here then is some more, and different in a very special way: the code is in Groovy.

To use them: get the groovy-all JAR on your Hadoop classpath and use the groovyc compiler to compile your groovy source (and any java source nearby) into your JAR, bring up your cluster and submit the work like anything else.

This pair of operations, part of a test to see how well Groovy MR jobs work just counts the lines in a source file; about as simple as you can get.

The mapper:

package org.apache.example.groovymr

import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.Mapper

class GroovyLineCountMapper extends Mapper {

    final static def emitKey = new Text("lines")
    final static def one = new IntWritable(1)

    void map(LongWritable key,
             Text value,
             Mapper.Context context) {
        context.write(emitKey, one)
    }
}

Nice and simple; little different from the Java version except

Semicolons are optional.
The line ending rules are stricter to compensate, hence lines end with a comma or other half-finished operation.
You don't have to be so explicit about type (the def declarations) -and let the runtime sort it out. I have mixed feelings about that.

There's one other quirk in that the Context parameter for the map operation (which is a generic type of the parent class) has to be explicitly declared as Mapper.Context. I have no idea why, except that it won't compile otherwise. The same goes for the Reducer.Context

Not much to see there then. What is more interesting is the reduction side of things.

package org.apache.example.groovymr
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer

class GroovyValueCountReducer 
        extends Reducer {

    void reduce(Text key,
                Iterable values,
                Reducer.Context context) {
        int sum = values.collect() { it.get() } .sum()
        context.write(key, new IntWritable(sum));
    }
}

See the line in bold? It's taking the iterable of the list of values for that key, applying a closure to them (getting the values), which returns a list of type integer, which is then all summed up. That is: there is a per-element transform (it.get()) and a merging of all the results (sum()). Which is. when you think about it, almost a Map and a Reduce.

It's turtles all the way down.

[Artwork: Banksy, 2009 Bristol exhibition]

Microsoft and Hadoop: interesting

I am probably expected to say negative things about the recent announcement of Microsoft supporting Apache Hadoop(tm) on their Azure Cloud Platform, but I won't. I am impressed and think that it is a good thing.

It shows that the Hadoop-ecosystem is becoming more ubiquitous. Yes it has flaws, which I know as one of my IntelliJ IDEA instances has the 0.24 trunk open in a window and I am staring at the monitoring2 code thinking "why didn't they use ByteBuffer here instead of hacking together XDR records by hand" (FWIW, Todd "ganglia" Lipcon says it aint his code). Despite these flaws, it is becoming widespread,
That helps the layers on top, the tools that work to the APIs, the applications.
This gives the Hadoop ecosystem more momentum and stops alternatives getting a foothold. In particular the LexisNexis stuff -from a company that talk about "Killing Hadoop". Got some bad news there...
Microsoft have promised contribute stuff back. This is more than Amazon have ever done -yet AWS must have found truckloads of problems. Everyone else does: and we file bugs, then try to fix them. I could pick any mildly-complex source file in the tree, do a line-by-line code review and find something to fix, even if its just better logging or error handling. (don't dismiss error handling BTW
If Amazon have forked, they get to keep that fork up to date.
If MS do contribute stuff back, it will make Hadoop work properly under Windows. For now you have to install Cygwin because Hadoop calls out to various unix commands a lot of the time. A windows-specific library for these operations will make Hadoop not only more useful in Windows clusters, it will make it better for developers.
MS will test at scale on Windows, which will find new bugs, bugs that they and Hortonworks will fix. Ideally they will add more functional tests too.
I get to say to @savasp that my code is running in their datacentre. Savas: mine is the networking stuff to get it to degrade better on a badly configured home network. Your ops team should not encounter this.

It's interesting that Microsoft have done this. Why?

It could be indicative of a low takeup of Azure outside the MS enterprise community. I used to do a lot of win32 programming (in fact I once coded on windows/386 2.04); I don't miss it, even though Visual Studio 6 used to be a really good C++ IDE. It is nicer to live in the Unix land that Kernighan and Ritchie created.(*)
Any data mining tooling encourages you to keep data, which earns money for all cloud service providers.
The layers on top are becoming interesting. That's the extra code layers, the GUI integration, etc.
There's no reason why enterprise customers can't also run Hadoop on windows server within their own organisations, so integrate with the rest of their world. (I'm ignoring cost of the OS here, because if you pay for RHEL6 and CDH then the OS costs become noise).
If you are trying to run windows code as part of your MR or Pig jobs, you now can.
If you are trying to share the cluster with "legacy" windows code, you can.

Do I have any concerns?

Somehow I doubt MSFT will be eating their own dogfood here; this may reduce the rate they find problems, leaving it to the end users. Unless they have a fast upgrade rate it may take a while for those changes to roll out. (Look at AWS's update rate: sluggish. Maybe because they've forked)
To date, Hadoop is optimised for Linux; things like the way it execs() are part of this. There is a risk that changes for Windows performance will conflict with Linux performance. What happens then?
I forsee a growth in the out-of-depth people trying to use Hadoop and asking newbie questions now related to Windows. Though as we get them already, there may be no change.
I really wish Windows server had SSH built in rather than telnet. Telnet is dead: move on. We want SSH and SFTP filesystems, and an SFTP filesystem client in both Windows and OS/X. It's the only way to be secure.
I hope we don't end up in an argument over which underlying OS is best. The answer is: the one you are happy with.

(*) At least for developers. I changed the sprog's password last week as were unhappy with him, and when I passwd'd it back he asked me "why do I use the terminal?". One day he'll learn. I'll know that day as I he'll have changed my password.

[Artwork: unknown, Moon Street, Stokes Croft]

2011-10-10

Oracle and Hadoop part 2: Hardware -overkill?

[is the sequel to Part 1]

I've been looking at what Oracle say you should run Hadoop on and thinking "why?"

I don't have the full specs, since all I've seen is slideware on the register implying this is premium hardware, not just x86 boxes with as many HDDs you can fit in a 1U with 1 or 2 multicore CPUs and an imperial truckload of DRAM. In particular, there's mentioning of Infiniband in there.

InfiniBand? Why? Is it to spread the story that the classic location-aware schedulers aren't adequate on "commodity" 12-core 64GB 24TB with 10GbE interconnect? Or are there other plans afoot?

Well, one thing to consider is the lead time for new rack-scale products, some of this exascale stuff will predate the oracle takeover, and the design goals at the time "run databases fast" met Larry's needs more than the rest of the Sun product portfolio -though he still seems to dream of making a $ or two from every mobile phone on the planet.

The question for Oracle has been "how to get from hardware proto to shipping what they can say is the best Oracle server." Well, one tactic is to identify the hardware that runs Oracle better and stop supporting it. It's what they are doing against HP's servers, and will no doubt try against IBM when the opportunity arises. That avoids all debate about which hardware to run Oracle on. It's Oracle's. Next question? Support costs? Wait and see.

While all this hardware development was going on, the massive-low-cost GFS/HDFS filesystem with integrate compute was sneaking up on the sides. Yes, it's easy to say -as Stonebraker did- that MapReduce is a step backwards. But it scales, not just technically, but financially. Look at the spreadsheets here. Just as Larry and Hurd -who also seemed over-fond of the Data Warehouse story- are getting excited about Oracle on Oracle hardware, somewhere your data enters but never leaves(*), somebody has to break them the bad news that people have discovered an alternative way to store and process data. One that doesn't need ultra-high-end single server designs, one that doesn't need oracle licenses, and one that doesn't need you to pay for storage at EMC's current rates. That must have upset Larry, and kept him and the team busy on a response.

What they have done is defensive actions: Hadoop as a way of storing the low value data near Oracle RDBMS, for you to use it as the Extract-Transform-Load part of the story. Where it does fit in, as you can offload some of the grunge work to lower end machines, the storage to SATA. It's no different from keeping log data in HDFS but the high value data in HBase on HDFS, or -better yet IMO- HP Vertica.

For that story to work best, you shouldn't overspec the hardware with things like InfiniBand. So why has that been done?

Hypothesis 1: margins are better, helps stop people going to vendors (HP, Rackable), that can sell the servers that work best in this new world.
Hypothesis 2: Oracle's plans in the NoSQL world depend on this interconnect.
Hypothesis 3: Hadoop MR can benefit from it.

Is Hypothesis 3 valid? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant [Ananthanarayanan2011], Ganesh and colleagues argue that improvements in in-rack and backplane bandwidth will mean you won't care whether your code is running on the same server as your data, or even the same rack as your data. Instead you will worry about whether your data is in RAM or on HDD, as that is what slows you down the most. I agree that even today on 10GbE rack-local is as fast as server-local, but if we could take HDFS out the loop for server-local FS access, that difference may reappear. And while eliminating Ethernet's classic Spanning Tree forwarding algorithm for something like TRILL would be great, it's still going to cost a lot to get a few hundred Terabits/s over the backplane, so in large clusters rack-local may still have an edge. If not, well, there's still the site-local issue that everyone's scared of dealing with today.

Of more immediate interest is Can High-Performance Interconnects Benefit Hadoop Distributed File System?, [Sur2010]. This paper looks at what happens today if you hook up a Hadoop cluster over InfiniBand. They showed it could speed things up, even more if you went to SSD. But go there and -today- you massively cut back on your storage capacity. It is an interesting read, though it irritates me that they fault HDFS for not using the allocateDirect feature of Java NIO, and didn't file a bug or fix for that. See a problem, don't just write a paper saying "our code is faster than theirs". Fix the problem in both codebases and show the speedup is still there -as you've just removed one variable from the mix.

Anyway, even with that paper, 10GbE looks pretty good, it'll be built in to the new servers and if the NIO can be fixed, it's performance may get even closer to InfiniBand. You'd then have to move to alternate RPC mechanisms to get the latency improvements that InfiniBand promises.

Did the Oracle team have these papers in mind when they did the hardware? Unlikely, but they may have felt that IB offers a differentiator over 10GbE. Which it does, in cost terms, limits of scale and complexity of bringing up a rack.

They'd better show some performance benefits for that -either in Hadoop or the NoSQL DB offering they are promising.

(*) People call this the "roach motel" model, but I prefer to refer to Northwick Park Hospital in NW London. Patients enter, but they never come out alive.

2011-10-08

Oracle and Hadoop part 1

I suppose I should add a disclaimer that I am obviously biased against anything Oracle does, but I will start by praising them for:

Recognising that the Hadoop platform is both a threat and an opportunity.
Doing something about it.
Not rewriting everything from scratch under the auspices of some JCP committee.

Point #3 is very different from how Sun would work: they'd see something nice and try and suck it into the "Java Community Program", which would either get it all overweight from the weight that Standards Bodies apply to technology (EJB3 vs Hibernate), stall it until it becomes irrelevant (most things), or ruin a good idea by proposing a complete rewrite (Restlet vs JAX-RS). Overspecify the API, underspecify the failure modes and only provide access to the test suite if you agree to all of Sun's (and now Oracle's) demands.

No, they didn't go there. Which makes me worry: what is their plan. I don't see Larry taking to the idea of "putting all the data that belongs to companies and storing it a filesystem and processing platform that I don't own". I wonder if the current Hadoop+R announcement is a plan to get in to the game, but not all of it.

Or it could just be that they realised that if they invited Apache to join some committee on the topic they'd get laughed at.

Sometime soon I will look at the H/W specs and wonder why that was so overspecified. Not today.

What I will say that irritated me about Oracle World's announcements was not Larry Ellison, it was Mark Hurd saying snide things about HP how Oracle's technologies were better. If that's the case, given the time to market of new systems, isn't a critique of Hurd himself? That's Mark "money spent on R&D is money wasted" Hurd? That's Mark whose focus on the next quarter meant that anything long term wasn't anything he believed in? That is Mark Hurd whose appointed CIO came from Walmart, and viewed any "unofficial" IT expenditure as something to clamp down on? Those of us doing agile and advanced stuff: complex software projects ended up using Tools that weren't approved: IntelliJ IDEA, JUnit tests, Linux desktops, Hudson running CI. The tooling we set up to get things done were the complete antithesis of the CIO's world view of one single locked down windows image for the entire company, a choice of two machines: "the approved laptop" and "the approved desktop". Any bit of the company that got something done had to go under the radar and do things without Hurd and his close friends noticing.

That's what annoyed me

(Artwork: something in Eastville. Not by Banksy, before anyone asks)