Steve Loughran: 2012-12

2012-12-21

Death by Snow

The NY Times has one of the most beautiful HTML5 web articles to date, Snow Fall.

Beyond the shine, the story is about a group of skiers getting avalanched in a gully on the back side of the Stevens Pass ski resort in the Cascades. The Cascades are famed and treasured for the large volumes of (heavy) snow that can get dumped on them in a 24-hour period -a metre a day for a number of days in a row sometimes.

In the story, a group of skiers went down "Tunnel Creek" after fresh snow had fallen on top of the snow from two weeks previously.

By morning, there would be 32 inches of fresh snow at Stevens Pass, 21 of them in a 24-hour period of Saturday and Saturday night.
That was cause for celebration. It had been more than two weeks since the last decent snowfall. Finally, the tired layer of hard, crusty snow was gone, buried deep under powder.

Given that you know this is a news article where the outcome is not good, you can look at that and say "about a metre of fresh snow on a layer which would have frozen together on the surface over the previous two weeks", and immediately conclude what's going to happen: that fresh snow isn't going to bond to the previous layer, creating a shear point that's just waiting to trigger.

That doesn't make the rest of the story any better -it's a brutal documentary of what happens when snow does what it often does after a big snowfall: slides down the mountain.

Off piste skiing isn't skiing, it's winter/spring mountaineering with skis on. Skis that give you speed, but also bias you towards going on the snowy areas, not the rocky bits. Usually it can be great fun -but it puts you right where avalanches happen.

This article is awful for anyone to read -but if you've been into winter and/or ski mountaineering it's worse: it's a documentary of what's happened to friends of yours, and what could happen to you.

[photo, ski randonnée work in Belledonne Range, French Alps, 1994? Skis: Volkl. Camera: Canon. Film and Paper: Ilford]

2012-12-17

Sorry: I ignore LinkedIn requests from people I don't know

This is an update of my existing policy: I tend to ignore LinkedIn requests from people I don't know.

If you have been sent a link to this page after you extended an invitation to connect to me on LI, then sorry, it appears you've fallen into this category. This may be because:

1. I don't know you. As I use LinkedIn primarily as an email address book, adding your email address to it only creates confusion for me later on.
2. You are an HR recruiting person who hasn't read my critiques of Hadoop recruiting strategies. LI is not the place to find me; trying to connect to me on LI without even paying for a premium account doesn't make you look serious about recruiting -and doesn't benefit the Hadoop ecosystem. And I'm having fun at Hortonworks, so approaching me is a waste of time unless you want your plans made public.
3. We have met; I have just completely forgotten about it, as I am better at remembering email addresses than names or faces.

If it's option #3: please retry with some better context than the stock "I'd like to add you to my professional network on LinkedIn".

If it's options 1 or 2, LinkedIn is not the way to approach me. I am not trying to build up a vast network -I primarily use it as my address book for people I've worked with on Apache projects, or other people I've worked with. Not as a way of keeping a list of people I don't know.

As I've stated before, LinkedIn actually measures the accept:reject ratio of invitation requests. If I accept invitations from people I don't know, that devalues all my other links and does their graph no benefit at all.

Sorry.

[photo: xmas 2012 graffiti off stokes croft]

2012-12-10

why you should vote for "Hadoop: Embracing Future Hardware"

At some point in the next 10-15 years, the last "rotating iron" hard disk will be made.

That's a profound thought. Admittedly, I may get the date wrong, but the point remains. Just as the CRT, the floppy drive and the CD have gone away, hard disks will become a rarity.

Who cares? Those of us building the future Hadoop platforms do.

GFS & MapReduce, Hadoop HDFS and its MR Engine, are all designed to take advantage of "commodity hardware". That means rather than pay for top of the line Itanium, PowerPC or SPARC servers running a SysV-derived Unix, they use servers built from x86 parts running Linux. This is not because of any ideological support of the x86 architecture: nobody who has ever written x86 assembler or debugged win32 C++ apps at that level will be fond of the x86. No, x86 parts were chosen as they were the servers with the most cost effective performance, a manageable power budget (compared to Itanium) and because people made servers with them on board.

And why are x86 parts so cost effective -even though they have so many millions of transistors? Because Intel have managed to turn the revenue from each generation of parts into funding for the R&D work and new fabs needed for the next generation of CPU parts and the processes to manufacture them.

It is the mass consumer and corporate demand for PC desktops that has given us affordable high-performance x86 parts.

Even if the Xeon stuff doesn't work in the desktop, the fabs and the core design are shared -the volumes kept the cost down.

With the emergence of phones and tablets as the new consumer internet access point, sales of PC parts are flatlining, and may decrease in future. Our home PC is used as a store for photographs and a device for a ten year old to play Minecraft -or to watch YouTube videos of Minecraft. He isn't committed to Intel parts, and as for the photographs, well, 1TB of cloud storage isn't affordable -yet- but that may change. And when your phone can upload directly to Facebook, why faff around downloading things to a local PC?

Even enterprise PCs are changing: they are called "laptops", and SSD storage is moving down from the "ultrabook" class of devices to the mainstream -at a guess, within 3-5 years it'll be SSD everywhere.

The world of end user devices is changing -which is going to have implications for servers. We need to look at those trends and start planning ahead, not just to handle the "what happens when HDDs go away" problem, but "how can we make best use of these new parts in 18-24 months?"

Which brings me round to the whole point of this article: my other talk is Hadoop: Embracing Future Hardware.

Vote for it. If not, you'll be taken by surprise when the future happens around you while you weren't looking.

[Photo: something from the harbourfest, 2008]

2012-12-07

Why "Taking Hadoop to the Clouds" is the talk to vote for

The Hadoop summit vote list is up, and I have two proposals -currently undervoted. Even though I'm on the review committee for the futures strand, not even I could push through a talk which had zero votes on it -ideally I'd like my talks to get in through popular acclaim. I could just create 400 fake email addresses and vote-stuff that way, but I'm lazy.

For that reason, I'm going to talk in detail about why my talks will be so excellent that to even think about having them left out could be detrimental to the entire conference.

One of my talks is "Taking Hadoop to the Clouds".

There are two competitors:

Deploying Hadoop in the Cloud, which looks at options, details and best practices. I don't see anything particularly compelling in the abstract -I assume it's got more votes as it's the one that comes up first. Or they are trying the many-email-address-vote-stuffing technique(*).
How to Deploy Hadoop Applications on Any Cloud & Optimize Price Performance. This could be interesting, as it covers how CliQr deploys Hadoop on different infrastructures. It sounds like a rackable-style orchestration layer above infrastructures; for Hadoop it may have similarities with MastodonC's Kixi work.

Why then, should people vote for mine?

I'm giving the talk.

This is not me being egocentrically smug about the quality of my presentations, but because I'm reasonably confident I know a lot about the area.

My last time at HP Labs was spent on the implementation of the "Cells" virtual infrastructure: declarative configuration of the entire cluster design. The details were presented at the 5th IEEE/ACM conference on Utility and Cloud Computing, and will no doubt be in the ACM library. This means I know about IaaS implementation details; the problems of placement, why networking behaves the way it does, image management, what UIs could look like, what the APIs could be, etc.
I've spent a lot of time publicly making Hadoop cloud-friendly. I presume that MS Azure and AWS ElasticMR have put in more hours, but unless they're going to talk about their work, Tom White and I are the next choices. Jun Ping and VMware colleagues have done a lot too -and put big patches into the codebase, but I don't see any submissions from them.
I have opinions on the matter. They aren't clear cut "cloud good/physical bad" or "physical good/cloud bad". There are arguments either way; it depends on what you want to do, what your data volume is, and where it lives.
I'm still working in the area, in Hadoop itself and the code nearby.

Recent cloud-related activities include:

HADOOP-8545: a Swift Filesystem driver for OpenStack. This is something everyone running Hadoop on Rackspace or other OpenStack clusters will want. This week two different implementations have surfaced; getting them merged together is going to be the next activity.
WHIRR-667: Add whirr support for HDP-1 installation
Ambari with Whirr. Proof of concept more than anything else.
jclouds and Rackspace UK throttling. Adrian Cole managed to reduce the impact of issue-549, which is good as I don't really want to get sucked into a different OSS codebase.
Other things that I'm not going to talk about -yet.

That's why people should vote for me. The other talks will be about "how we got Hadoop to work in a virtual world" -mine will be about how we improved Hadoop to work in a virtual world.

(*) ps, for anyone planning the many-email-accounts approach, remember that the email addresses are something we reviewers can look at, and many sequential accounts all doing three votes to a single talk will show up as "statistically significant". Russ has the data, he likes his analyses. He may even have the IP addresses.
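To give a feel for how little effort spotting this takes, here is a minimal sketch. It assumes the vote data is available as (email, talk) pairs, which is a guess at the format; the function name `suspicious_runs` and the "three or more consecutive numbers" threshold are made up for illustration and say nothing about what the reviewers actually run.

```python
import re
from collections import defaultdict


def suspicious_runs(votes, min_run=3):
    """Flag talks voted for by runs of sequentially numbered addresses.

    votes: iterable of (email, talk) pairs (an assumed format).
    Returns {talk: [(prefix, domain, first_number, last_number), ...]}.
    """
    # Group the numeric suffixes by talk, address prefix and domain.
    numbered = defaultdict(set)
    for email, talk in votes:
        local, _, domain = email.partition("@")
        match = re.fullmatch(r"(.*?)(\d+)", local)
        if match:
            numbered[(talk, match.group(1), domain)].add(int(match.group(2)))

    flagged = defaultdict(list)
    for (talk, prefix, domain), numbers in numbered.items():
        ordered = sorted(numbers)
        run = [ordered[0]]
        # Append a sentinel so the final run is also checked.
        for n in ordered[1:] + [None]:
            if n is not None and n == run[-1] + 1:
                run.append(n)
                continue
            if len(run) >= min_run:
                flagged[talk].append((prefix, domain, run[0], run[-1]))
            run = [n]
    return dict(flagged)


if __name__ == "__main__":
    sample = [
        ("voter1@example.com", "Talk A"),
        ("voter2@example.com", "Talk A"),
        ("voter3@example.com", "Talk A"),
        ("alice@example.org", "Talk A"),
    ]
    print(suspicious_runs(sample))
```

A real analysis would look at timestamps and IP addresses too; this only catches the laziest form of stuffing.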

[Photo: an interview with Page 6 Guy at ApacheCon]

2012-12-05

An Intro to Contributing to Hadoop

Together the ants shall conquer the elephant

Jeff Bean of Cloudera has stuck up a video on contributing to Hadoop, which is a reasonable introduction to JIRA-centric development.

Process-wise, there are a few things I'd add:

Search for the issue or feature before you file a new bug. The first line of a stack trace is a great search term, though it's a bit depressing to find the only other person to find it was yourself 18 months earlier, and you never fixed it then either.
It's harder to get committer rights on Hadoop than most other projects, because the bar for effort and competence is high. You pretty much have to work full time on the project. Posting four JIRAs and then asking to get committer access is unrealistic. And it doesn't bring much to the table except bragging rights.
The bit at 16:20 where Jeff said "email other contributors to get eyes" was in fact an error. He meant to say "email wittenauer to get constructive feedback on your ideas" -nobody else welcomes such emails, and actually talking on the -dev list is better.
I'd also emphasise the "watch issue" button. If there is something you care about, hit the watch button to get emails whenever it is updated.
When you file a bug, include stack traces, kill -QUIT thread dumps, netstat and lsof details for the process in question; anything else. NOT: JPG screen shots of your DOS console. That flags up that you are probably out of your depth when it comes to getting JAVA_HOME set, let alone discussing the impact of VM clock drift on consensus protocol-based distributed journalling systems.
When you file your bug, your rating (critical, major, etc.) differs from everyone else's. Mine are normally minor or trivial. If they only affect you: minor. Easy to fix: trivial.
Don't file bugs about "I couldn't get Hadoop to install". Those bugs will be closed as invalid; posts on it to the -dev lists silently ignored. Go to the user lists.
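For the "what to attach" point above, a minimal sketch of the commands worth running against the misbehaving process before filing. It assumes a Unix host with `lsof` and `netstat` on the path and a HotSpot-style JVM (which prints a thread dump to its stdout or log on SIGQUIT); the helper names `diagnostic_commands` and `as_shell` are hypothetical, not part of any Hadoop tooling.

```python
import shlex


def diagnostic_commands(pid):
    """Return the commands to run against a JVM process before filing a bug.

    pid: process ID of the daemon in question. Illustrative helper only.
    """
    pid = str(int(pid))  # reject anything that is not a plain integer
    return [
        ["kill", "-QUIT", pid],  # ask the JVM for a thread dump
        ["lsof", "-p", pid],     # open files and sockets of the process
        ["netstat", "-an"],      # network connections on the host
    ]


def as_shell(commands):
    """Render the commands as text, one per line, ready to paste into a shell."""
    return "\n".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in commands
    )


if __name__ == "__main__":
    # 1234 is a placeholder PID.
    print(as_shell(diagnostic_commands(1234)))
```

Attach the output as text to the JIRA, along with the thread dump from the daemon's log.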

I was a bit disappointed by the claim that "the apache artifacts aren't stable, you need CDH" and the message that there is "the community" and "cloudera engineers", the latter being the only people who make Hadoop enterprise-ready. As well as Hortonworks, there are companies like IBM, Microsoft and VMware working on making sure their customers' needs are met -and testing the Apache releases to make sure they're up to a state where you can use them in production.(*)

This "we are the engineers" story falls over at 07:00 when, in the walkthrough of the (epic) HA NN work, my colleagues Sanjay, Suresh and Jitendra all get a mention. That is because Hadoop is a community project -one that involves multiple companies working together on Hadoop -as well as individuals and small teams. The strength of the Hadoop codebase comes from the combined contributions from everyone. Furthermore, having a no-single-vendor open source project, with public artifacts you can pick up and use, adds a strategic advantage to that codebase. Hadoop is not MySQL or OpenJDK -open source with secret bits that the single vendor can charge for. There's a cost to that -more need to develop a consensus, which is why I encourage people using Hadoop in production systems to get on the -dev lists, regardless of how Hadoop gets to your servers. Participation in those discussions gives you a direct say in the future direction of the project.

Overall though, not a bad intro to how to get started in the development. It makes me think I should do a video of my intro to hadoop-dev slides, which look less at JIRA and more at why the development process is as it is, and how we could improve it. Someone else can do the "why Maven is considered a good tool for releasing Hadoop" talk -all I know is that I have to do a "mvn install -DskipTests" every morning to stop maven trying to go to the apache snapshot repo to download other people's artifacts, instead of the ones I built the day before.

(*) Yes, I know that Hadoop 1.1.1 is being replaced with a 1.1.2 to backport a deadlock show-stopper, but that's a very rare case -and shows that we do react to any problem in the stable branch that is considered serious.

[Photo, "together the ants shall conquer the elephant", alongside the M32 in Easton].