2013-06-30

Hoya: HBase on YARN

I didn't go to the Hadoop Summit, though it sounds really fun. I am having lots of fun at a Big Data in Science workshop at Imperial College instead, looking at problems like "will the code to process my data still work in 50 years?", as well as the problems that the Square Kilometre Array will have (10x the physics dataset, sources spread across a desert, datacentre in the desert too).

What did make it over to the Summit is some of my code, the latest of which is Hoya, HBase on YARN. I have been busy coding this for the last four weeks:
Outside office

Having the weather nice enough to work outside is lovely. Sadly, the wifi signal there is awful, which doesn't matter until I need to do maven things, at which point I have to run inside and hold the laptop vertically beneath the base station two floors above.

Coding Hoya at the office

It's not that readable, but up on my display is the flexing code in Hoya: the bit in the AM that handles a request from the client to add or remove nodes. It's wonderfully minimal code: all it does is compare the (possibly changed) value of worker nodes wanted with the current value, and decide whether to ask the RM for some more nodes (using the predefined memory requirements of a Region Server), or to release nodes -in which case the RM will kill the RS, leaving the HBase master to notice this and handle the loss.
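For a flavour of what that looks like against the YARN client API, here's a minimal sketch -not the actual Hoya source; the field names, priority and the 1GB memory figure are made up for illustration:

  import java.util.List;

  import org.apache.hadoop.yarn.api.records.ContainerId;
  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

  /** Sketch only: how an AM can flex a worker pool up or down. */
  class FlexSketch {
    private static final int WORKER_MEMORY_MB = 1024;        // illustrative RS memory ask
    private static final Priority WORKER_PRIORITY = Priority.newInstance(1);

    private final AMRMClient<ContainerRequest> amRMClient;   // assumed already init'd and started
    private final List<ContainerId> liveWorkers;              // containers currently running an RS

    FlexSketch(AMRMClient<ContainerRequest> amRMClient, List<ContainerId> liveWorkers) {
      this.amRMClient = amRMClient;
      this.liveWorkers = liveWorkers;
    }

    /** Compare the desired worker count with the current one; request or release containers. */
    void flex(int desiredWorkers) {
      int delta = desiredWorkers - liveWorkers.size();
      if (delta > 0) {
        // ask the RM for more containers, each sized for a Region Server
        Resource capability = Resource.newInstance(WORKER_MEMORY_MB, 1);
        for (int i = 0; i < delta; i++) {
          amRMClient.addContainerRequest(
              new ContainerRequest(capability, null, null, WORKER_PRIORITY));
        }
      } else if (delta < 0) {
        // release surplus containers: the RM kills the RS process, and the
        // HBase master notices the loss and reassigns its regions
        for (int i = 0; i < -delta; i++) {
          ContainerId surplus = liveWorkers.remove(liveWorkers.size() - 1);
          amRMClient.releaseAssignedContainer(surplus);
        }
      }
      // delta == 0: nothing to do
    }
  }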

Asking for more nodes leaves the YARN RM to satisfy the request; it then calls back to the AM saying "here they are". At which point Hoya sets up a launch request containing references to all the config files and binaries that need to go to the target machine, and a command line that is the hbase command line. There is no need for a Hoya-specific piece of code running on every worker node; YARN does all the work there.
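Again a rough sketch rather than the real code -the paths, resource names and the exact hbase command line here are placeholders- but the shape of a container launch in the YARN client API is something like this:

  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.yarn.api.records.Container;
  import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
  import org.apache.hadoop.yarn.api.records.LocalResource;
  import org.apache.hadoop.yarn.api.records.LocalResourceType;
  import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
  import org.apache.hadoop.yarn.client.api.NMClient;
  import org.apache.hadoop.yarn.util.ConverterUtils;
  import org.apache.hadoop.yarn.util.Records;

  /** Sketch only: launch a Region Server in a container the RM has just handed back. */
  class LaunchSketch {

    void launchRegionServer(NMClient nmClient, Container container,
        FileSystem fs, Path confArchive, Path hbaseArchive) throws Exception {
      // tell YARN which files to localise onto the target machine
      Map<String, LocalResource> resources = new HashMap<String, LocalResource>();
      resources.put("conf", resource(fs, confArchive, LocalResourceType.ARCHIVE));
      resources.put("hbase", resource(fs, hbaseArchive, LocalResourceType.ARCHIVE));

      ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
      ctx.setLocalResources(resources);
      // the command is just the hbase script; YARN does the rest on the worker node
      ctx.setCommands(Arrays.asList(
          "hbase/bin/hbase --config conf regionserver start"
              + " 1><LOG_DIR>/out 2><LOG_DIR>/err"));

      nmClient.startContainer(container, ctx);
    }

    private LocalResource resource(FileSystem fs, Path path, LocalResourceType type)
        throws Exception {
      FileStatus status = fs.getFileStatus(path);
      LocalResource res = Records.newRecord(LocalResource.class);
      res.setResource(ConverterUtils.getYarnUrlFromPath(path));
      res.setSize(status.getLen());
      res.setTimestamp(status.getModificationTime());
      res.setType(type);
      res.setVisibility(LocalResourceVisibility.APPLICATION);
      return res;
    }
  }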

Some other aspects of Hoya for the curious
  • Hoya can take a reference to a pre-installed HBase instance, one installed by management tools such as Ambari, or kickstart installed into all the hosts. Hoya will ignore any template configuration file there, pushing out its own conf/ dir under the transient YARN-managed directories, pointing HBase at it.
  • Although HBase supports multiple masters, Hoya just creates a single master, exec'd off the Hoya AM. In a multi-master setup, all but the live HBase master are simply waiting for ZK to give them a chance to go live -they're there for failure recovery. It's not clear we need that, not if YARN restarts the AM for us.
  • Hoya remembers its cluster details in a ~/.hoya/clusters/${clustername} directory, including the HBase data, a snapshot of the configuration, and the JSON file used to specify the cluster.  You can machine-generate the cluster spec if you want.
  • The getClusterStatus() AM API call returns a JSON description of the live cluster, in the same JSON format. It just adds details about every live node in the cluster. It turns out that classic Hadoop RPC has a max string size of <32K, so I'll need to rework that for larger clusters, or switch to protobuf, but the idea is simple: the same JSON structure is used for both the abstract specification of the cluster, and the description of the instantiated cluster. Some former colleagues will be noting that's been done before, to which the answer is "yes, but this is simpler and with a more structured format, as well as no cross-references".
  • I've been evolving the YARN-679 "generic service entry point" for starting both the client and services. This instantiates the service named on the command line, hooking up signal handling to stop it. It then invokes -if present- an interface method, int runService(), to run the service, exiting with the given error code. Oh, and it passes down the command line args (after extracting and applying conf file references and in-line definitions from them), before Service.init(Config) is called. This entry point is designed to eliminate all the service-specific entry points, but also to provide some in-code access points too -letting you use it to create and run a service from your own code, passing in the command line args as a list/varargs. I used that a lot in my tests, but I'm not yet sure the design is right. Evolution and peer-review will fix that.
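The shape of that entry point, heavily simplified -the RunnableService interface name here is invented for illustration rather than taken from the YARN-679 patch- is something like:

  import java.util.Arrays;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.service.Service;
  import org.apache.hadoop.util.ShutdownHookManager;

  /** Sketch only: a generic entry point that can start any YARN-style service. */
  public class ServiceLauncherSketch {

    /** Hypothetical interface for services that want to run and return an exit code. */
    public interface RunnableService extends Service {
      int runService() throws Throwable;
    }

    public static void main(String[] args) throws Throwable {
      // argv[0] names the service class; everything else is handed to the service
      String classname = args[0];
      List<String> serviceArgs = Arrays.asList(args).subList(1, args.length);

      Configuration conf = new Configuration();
      // ...the real launcher extracts conf file references and in-line definitions
      // from serviceArgs and applies them to conf before init() is called...

      final Service service = (Service) Class.forName(classname).newInstance();

      // hook up signal handling so that ctrl-C/SIGTERM stops the service cleanly
      ShutdownHookManager.get().addShutdownHook(new Runnable() {
        public void run() {
          service.stop();
        }
      }, 30);

      service.init(conf);
      service.start();

      int exitCode = 0;
      if (service instanceof RunnableService) {
        // services with a runService() method get run and supply the exit code
        exitCode = ((RunnableService) service).runService();
      } else {
        // otherwise just block until the service stops itself
        service.waitForServiceToStop(0);
      }
      System.exit(exitCode);
    }
  }

The idea being that the Hoya client and the AM can both implement runService() and share the one main() method, while tests can invoke the same launcher in-process, passing the arguments as a list.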

Developing against YARN in its last few weeks of pre-beta stabilisation was entertaining -there was a lot of change. A big piece of it -YARN-117- was my work; getting it in meant that I could switch from the fork I was using to that branch, after which I was updating hadoop/branch-2, patching my code to fix any compile issues, and retesting every morning. Usually: seamless. One day it took me until mid-afternoon for everything to work, and an auth-related patch on Saturday stopped test clusters working until Monday. Vinod was wonderfully helpful here, as was Devaraj with testing 50+ node clusters. Finally, on the Tuesday, the groovyc support in Maven stopped working for all of us in the EU who caught an incompatible dependency upgrade first. To their credit the groovy dev team responded fast, not only with a full fix out by the end of the day, but with some rapid suggestions on how to get back to a working build. It's just that when you are trying to get something out for a public event, these things always hit your schedule: plan for them.

Also: Hoya is written in a mix of Java (some of the foundational stuff) and Groovy -all the tests and the AM & client themselves. This was my second attempt at a Groovy YARN app, "Grumpy" being my first pass back in Spring 2012, during my break between HPLabs and Hortonworks. That code was never finished and is too out of date to bother with; I started with the current DistributedShell example and built on that -while tracking the changes made to it during the pre-beta phase and pulling them over. The good news: a big goal of Hadoop 2.1 is stable protobuf-based YARN protocols, and stable classes to help.

Anyway, Hoya works as a PoC; we should be letting it out for people to play with soon. As Devaraj has noted: we aren't committed to sticking with Groovy. While some features were useful -lists, maps, closures, and @CompileStatic, which finds problems fast as well as speeding up code- it was a bit quirky and I'm not sure it was worth the hassle. For other people about to write YARN apps, have a look at Continuuity Weave and see if that simplifies things.


P.S: we are hiring.

2013-05-17

Tilehurst? Where is Tilehurst and why does google maps care about it?

Google are being asked hard questions in Parliament about their UK tax setup.

I think the politicians are missing an opportunity to ask them the question that I'm always wondering: where is Tilehurst, and why does google maps think it is so special?

Here is a google maps view of the UK

Google mapview UK
It has Bristol on it, but not Portsmouth or Cardiff. It's always a mystery in Bristol why Pompey gets a dot on the BBC weather map, as does BRS's nearby rival, Cardiff. In the google map, Edinburgh and Manchester are the ones being left out.

But that is nothing compared to the Tilehurst question. Specifically : why?

Look what happens when you click to zoom in one notch.
Tilehurst? Where is tilehurst?
Edinburgh exists, along with pretty much everything north of there excluding Mallaig, which is something all visitors to Scotland should do when laying out an itinerary.

And what is there between Bristol and London? One town merits a mention. Tilehurst.

Apart from this mention of Tilehurst, I have no data on whether or not this town actually exists. It's not on any motorway exits on the M4, no train stations, no buses from Bristol. I have never heard it mentioned in any conversation whatsoever.

Why then does Google Maps think that it is more important than, say, Reading, which meets all of the above criteria (admittedly, never in conversations that speak positively of it), or Oxford, which people outside the UK have heard of?

No, Tilehurst it is.

It could be some bizarre quirk of the layout algorithm that picks a random place, ignoring things like nearby population numbers, M-way exit signs, mentions in pagerank or knowledge of public transport.

I think it could just be some spoof town made up to catch out people who have been copying map data from google maps without attribution. If some map or tourist guide mentions Tilehurst, the google maps team will know that they are using Google map data and immediately demand some financial recompense, routed through the Ireland subsidiary.

There's only one way to be sure: using the map at this resolution as the cue, drive there and see what it is.

2013-05-06

Strava bringeth bad news

I'm in the bay area right now, and the new owner of a Google Nexus phone, which is very good at integrating with Google apps, including calendar, contacts and mail. It also runs Strava, the de facto standard app for logging your cycling, then uploading the results to compare with others. I'm assuming it's running Hadoop at the back end, given their Platform Product Manager is one of the ex-Yahoo! Hadoop team.

If this is the case, Hadoop is indirectly bringing me bad news.

Yesterday I went out on the folding bike and climbed the Santa Cruz mountains, west of the Bay Area flatlands.
Old la honda map segment
It's a great steep climb from behind Palo Alto up to the skyline ridge, narrow and free from almost all traffic bar the many other locals who felt that Saturday morning was too nice to waste. For me: 26 minutes climbing, 40s of rest -fast enough to come in the top 50% of the day of everyone else running the app, and hence 2233 of 5810 of everyone who has ever done it. Not bad work.

Good news from strava: 62/145 on Old La Honda
If there's a warning sign, it is that people faster than me have a quoted average power output less than my 257W -and as they all took less time, that means that their total exertion is less than my 412kJ. Why do those ahead of me come in lower? It means they are carrying less excess weight.
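A rough sanity check, assuming Strava's energy figure is just average power multiplied by the elapsed time of roughly 26 minutes 40 seconds:

  E = P \times t \approx 257\,\mathrm{W} \times 1600\,\mathrm{s} \approx 411\,\mathrm{kJ}

So anyone who finished faster on a lower average power has, by the same arithmetic, put out less energy.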

If I'd stopped there and descended (carefully, that bike's 20" rims overheat), I could have got back and felt relatively smug.

Only this time I descended the far side and climbed back up "Alpine West" -first time ever. And it destroyed me.
Bad news from strava: 28/29 on west alpine
It was long. I stopped for lunch partway up, but needed that rest, and continued, expecting the overall climb to be on a par with the earlier one. It wasn't. In fact the total climb was double: 600m. Which I was not ready for. Unlike the morning, where I'd got to pass lots of people, here the road was near empty -and the people I did meet were going past me. The rate of ascent, 562m/hour, is less than the 600m/hour rate we used to plan for when crossing the alps -a rate sustained over a week or more, carrying panniers. Not today.

The message from Strava then, which means something Hadoop worked out, is that I am overweight and completely lacking in endurance. It doesn't quite spell it out, but the graphs bring the message home.

This is datamining at work.

2013-04-09

Software updates are the bane of VMs -and Flash is its prophet

It's the first Tuesday of the month, so it's Flash update time. Three critical patches, where "critical" means "if you don't update it your computer will belong to someone else".

Adobe Flash Install Screen

These flash updates are the bane of my life. I have to update the three physical machines in the house, and with two of them used by family members, I can't ignore updating any of them.

The workflow for flash updates is
  1. Open Settings manager, find the flash panel, start that, get it to check for an update.
  2. If there is one, it brings up the "a new update is available, would you like to install it"? dialog.
  3. Flash Control Panel opens up firefox with a download page: start that download.
  4. Close down Firefox
  5. Close down Chrome
  6. Close down the Flash Control Panel (if still present)
  7. Close down the settings manager
  8. Find the flash .dmg file in ~/Downloads
  9. open it
  10. click on the installer
  11. follow its dialog
  12. eject the mounted .dmg image
  13. restart your browsers. This is always a good time to look for Firefox updates too, then check if it recommends any other browser updates.
  14. For all gmail logins, redo the two-level auth.
That's a repeat 3x operation, with the extra homework that on a multi-login machine, I have to "sudo killall firefox && sudo killall chrome" the other user's browser instances to make sure that the update has propagated (the installer doesn't block if these are running, as it doesn't look for them).

Then come the VMs. Two windows boxes stripped down to the minimum: no flash, no MSOffice or Firefox, but Chrome and IE. IE set up to only trust adobe.com, microsoft.com and the windows update site, where trust is "allow installed AX controls".

Manual updates there too, with the MS patch also potentially forcing restarts.

This shows the price of VMs: every VM needs to be kept up to date. The no. of VMs I have to update is not O(PCs), it is O(PCs)*(1+O(VMs/PC)).

Most of the VMs are on my machine: one Linux VM for native code builds, other VMs for openstack, more for a local LinuxHA cluster.

There it is simpler: "yum -y update && shutdown -h 0", or "apt-get update && apt-get -y upgrade" for the Debian-based ones.

Which shows why Linux makes the best OS for my VMs. It's not so much the cost, or the experience, but the near-zero-effort update-everything operation. 

It also shows where Apple's "app-store" mindset is limited. Because new App-store apps must be sandboxed and save all state to their (mediocre) Cloud, there's no way for the Appstore to update browsers or the plugins integrated with them. Which leaves two outcomes
  1. Someone needs to go to each mac and go through steps 1-14 above.
  2. They don't get updated, and end up being 0wned.
It's easy to fault Apple here, but it really reflects a world view that we have for software in general, "out of band security updates are so unlikely we don't need to make it easy". Once we switch to assuming that there may be an emergency patch any day of the week, we start thinking "how would I do this as a background task" -which is something all of us need to consider.

2013-03-25

Today's Smart TVs: AOL for the living room

I was pleased to hear that Palm had been sold by HP to someone who may care about it: it will have a life beyond the grave. It's not that bad a platform: Linux underneath, HTML + JavaScript on top, with things like Node.js for the threading library. It shows that you don't need a new programming paradigm (iOS, Android) to write applications for mobile devices, just HTML + JS + device service access.

This is effectively what Chromebooks are trying to provide, as well as some of the features of HTML5. It'll be interesting to see how much resistance that gets from the phone manufacturers. I expect google to be happy, Apple: more reluctant.

And the TV vendors? They've clearly decided that now that television screen sizes have reached their sensible limits for most households, and recognised that 3D as a feature has died, they need a new way to convince everyone to renew their televisions on a three year cycle, and to charge a premium for those new televisions.

Well, a 3-year cycle is the replacement cycle for desktops and games consoles, probably longer than the lifespan of today's phones and tablets, which are on a faster evolutionary curve. The TV vendors must look at the lifespan of tablets, and think "we'd like that".

The challenge, then, is simple: convincing customers that their existing television is already obsolete, and that they need a new television (that will also be rapidly obsolete, though that isn't made clear). They also want to charge margins above a basic "monitor".

The SmartTV, then, represents their strategy. Rather than let the games consoles evolve to be general purpose Entertainment Consoles, the TV vendors want that money. They're probably realistic enough to recognise that they can't get into the existing console business model -a few big games a year- but they will look at phone/tablet app store purchases and think "$10 per app works out if the # of apps increases". Which is of course something that the games console vendors have noticed and are trying to adapt to.

The TV vendors are also unwilling to let anyone else get a toehold. Google TV never took off, and they would run from Microsoft making a similar offering. That's a short term strategy which would be killed if Apple were to produce a TV that took the premium market. They must hear the rumours and think "we need a story of our own" -the LG purchase of Palm represents that.

I can see their thinking, but think they will have to change what they deliver to customers in terms of UX to stand a chance against Apple.

This is an opinion based on having owned an LG "Smart TV" since January. It is not a Smart TV: it is a monitor with aspirations to be AOL.

Actual use case

The driver for retiring our nearly-ten-year-old CRT television was the sprog's acquisition of a PS3 for his birthday: finally we had HD content to display. Getting a new television was now justifiable.

My requirements were: an LED TV good for DVD, Blu-ray and games; Freeview HD; the right size for a large, high-ceilinged room without dominating it; lots of HDMI ports; RGB in. 3D was something that games could take advantage of, so that was on the list if it didn't add too much money. "Smart TV" wasn't something I cared about, as the PS3 was where iPlayer and Netflix would run. I made sure we picked up a "PS3 slim" not the more recent "super slim", for a better Blu-ray loading experience.

The day after Xmas, then, I walked down to Richer Sounds to get a TV to match my requirements, having already sized things up (a very nice Panasonic iPad app there to simulate a TV on the wall) and explored the options. The TV we got was a 47" LG LED panel, lots of HDMI ports, and at a price point which I was prepared to pay for a TV that I expect to retain for another 6-10 years.

The fact that it was an internet ready SmartTV was a non-issue; I hadn't even intended to wire that bit up to Ether.

We ended up upgrading the AV receiver to one with HDMI switching (the old one will move into my office for its sound system) -the new receiver had Airplay over Ether, so the TV zone ended up getting a 4x1GbE ether switch hooked up to the Ether-over-Power backplane I've been running for a while.

As a result, I can now experience the SmartTV in all its glory.

Like I said, it reminds me of AOL. And perhaps a Windows 98 PC in 1997, when all the dotcom startups were paying the home PC vendors $20 just for an icon on the desktop or a bookmark in IE4:

AOL-class UI

The left third shows live input (top half) and some notification about new content and a product advert (bottom half). That's an advert on a television I paid for, one I can't disable. Using my internet.

That's the AOL feature.

The central third is the "premium" services, which means "all possible premium services", not "only the ones you are signed up to". It has the three we use: iPlayer (free playback of most BBC TV and radio content from the previous 7 days), Netflix and youtube. The others: I'm not going to sign up for them, yet they are permanently there, taking up space and delivering no value to me.

I suspect that the vendors may give LG a kickback if someone signs up through the TV.

Moving right, there's some other pane, and more off to the right, none of which anyone can be bothered to explore.

What I do see right at the end is the option to create my own "my apps" pane. I was glad to find this, confident I could now set up the TV with the things I wanted, rather than have the services I wanted hidden in the clutter.

Limited customisation

Except: you can't add "premium" services to "my card". They aren't on the list of selectable services.

There must be some separate array of "premium services" from "standard services", with only the standard services being configurable. Two separate arrays, two ways to keep them up to date. Separate tests.

Having to make do with my not-quite-my-card, I can now move it onto the main screen and get some of the clutter out of the way
no, you can't turn the adverts off

Though again, there's no ability to move it left of the premium card. That's fixed, with a message at the bottom: "cannot move live card and premium card". Someone has gone to the effort of fixing the minimum position of all customisable cards to panels[x] where x>=2, written the tests for it, and i18n'd the "cannot move" message.

There we have it then: a UI that takes up 1/6 of the screen space with adverts, clutters up the main screen with that and a pool of premium services that nobody would have more than half of, and which doesn't let me clean it up.

In comparison, Apple's "we control your tablet" philosophy is a bubble of flexibility, as I can choose whatever is on the start screen and on the app bar at the bottom. Not in LG "SmartTV" land.
Graham Norton on iPlayer

As for the applications: they work. iPlayer will happily stream Graham Norton down in HD, which is something I personally consider a defect. You can also mark it as a favourite, which I consider a defect in an individual.

Even so, the viewer isn't as good as the PS3 options. iPlayer's scroll forwards/backwards is very crude, accurate to about 5 minutes, rather than the 30s or so that the PS3 version offers. It's got pretty bad latency in some of the navigation features, implying there's not much caching going on -memory limited?

As for Netflix: you can't add ratings to get better recommendations, you don't get the ability to see the "similar to" recommendations on any film. It's a worse UI than on an iPad.

Which raises a key issue which LG and all the SmartTV vendors have: convincing anyone to code for their devices.

This is the problem that phones have had, which Apple solved by "having massive market share in markets they effectively created, and providing a good user experience for their users, especially if they have a laptop, tablet and phone all from Apple". Google have allowed the other phone and PC vendors to play catch-up through Android.

Phone and tablet developers, then, have a small set of options.
  1. Apple. Essential if you do tablet work, important if you do phones. Their own programming language and tooling, oppressive qualification process -offering users trustable apps when they've finished. What's nice about Apple: a minimal number of platforms to test on, and with new OS releases backported, no reason not to adopt the latest features.
  2. Android. The other app platform: Java language and compatible runtime; open to all vendors, though customers get different backport experiences based on phone vendors. For developers: a lot more testing, and you have to worry about which OS versions are in use in the field. Support calls are probably worse. In favour though: one core codebase for all Android phones.
  3. HTML5. Viable if you are targeting an online-only world, though phone support here has been weak (cite: facebook's move from HTML5 to apps).
  4. There's also Windows Mobile, which may be too late as an app platform, and will have to focus on delivering an excellent HTML5 experience.
How is any one SmartTV vendor going to play here if these platforms move into the TV world? Either they talk to MS or Google and say "we can't do platforms, help us", or they say "HTML5 is all we need" and work to deliver a really good HTML5 experience (the DRM in HTML5 may help or hinder here. Help: let Netflix and those traitors to openness at the BBC deliver apps; hinder: if they can't actually get the closed codec/auth modules).

I don't see LG's acquisition of Palm being sufficient to stop them being forced to copy the phone/tablet strategies.
  • If Apple comes to play, they can take advantage of their tablets, phones and iPod touches, make these the personal GUI for the TV, recognise that multiple people in front of the TV will have them, and provide an app platform that lets developers write apps that not only can work on tablets, phones and TVs -but can even work between them. Netflix does some of that already -their tablet/phone apps can tell the PS3 and the TV to play content, which helps compensate for some of the limitations of the TV app.
  • Google can come to the other vendors and say "here's a way out". Samsung are already making Android phones and tablets -I'd expect them to go with Google. Sony have android phones too, but they have the PS4 to work on too -and presumably see that as more strategic than smart TVs.
  • LG? Palm? They don't have the market share. Unless they can get together with the other TV vendors and say "here's an independent strategy" -and have them listen.
Irrespective of what strategy today's TV vendors take, one thing they have to recognise is that their AOL-class GUI isn't going to cut it. If in-TV applications take off, the quality of the UX is going to matter, and right now, they are 15 years behind what PCs, Phones and tablets are offering.

2013-03-18

Defeated by iPad synchronization options

Last January I got an iPad mini as a travel accessory to the laptop: music, eBooks, PDF formatted papers, online and offline maps, etc.

CCTV on Gloucester Road

It's also intended to be the holder of travel paperwork: the schedule, logistics notes, eTickets, hotel details. All mostly PDF, though my KLM check-in has just emailed a GIF QR barcode which apparently will get me through security (outbound I'm testing w/ a backup paper one; return: commit to GIF).
CCTV on Gloucester Road
A major use case of mine then is: get PDFs off my laptop and into the iPad so that I can bring them up and view them.

Which is where it all seems to go horribly wrong.

I can see four different synchronization options.

iTunes
Copy into the books section of iTunes and let them trickle over via USB or wifi.
This works, provided the devices can see each other in the same wifi subnet.

It is a bit clunky as I have to drag and drop content from my folder of travel bits (e.g. 2013-03-AMS) and store them in a flat pool of documents, where they end up mixed in next to things like Grinstead and Snell's Introduction to Probability, papers on things like Chubby, and copies of Singletrack Magazine. This isn't ideal for navigating at the airport security gate.

But again: it works, and I know how to verify that the stuff has trickled over -you look at the sync status page.

Workflow
  1. Clean up the last trip's documents.
  2. Copy in the new files
  3. Force a sync to make sure it is over.
  4. Updates: steps 2 and 3.
Gloucester Road Art

Apple iCloud

This is meant to be the future. Instead of saving to the filesystem, you save it to "the cloud" where it will magically make its way over to your other devices.

Except I put PDFs in there and there doesn't seem to be any obvious way to actually see that they have made it over, let alone open them.

This is not a Cloud, it is /dev/null with unrealistic promises. I could say that if I copy a file to /dev/null then all my other devices will get the same view of the copied documents -but if they aren't there, it's not a very good view.

The workflow for getting documents over via iCloud is therefore
  1. Save the files into iCloud.
  2. Pick one of the other synchronization options to get your content over.

45 rpm on Gloucester Road

Dropbox

The folder metaphor: I can drag and drop anything into it on my desktop, and it trickles over across all my desktops, OS/X and Linux.

What it doesn't do is automatically trickle the files over to the iPad. It copies the directory metadata over, but I seem to have to tap every file -by hand- before it decides to download each artifact in the filesystem.

For anyone with a default 2 GB Dropbox account, you could copy everything over while on wifi and not use up any device space, even for customers like me who went for the low-end 16GB model because they felt the cost/GB of extra SSD in an iPad was utterly excessive.

The workflow to sync is therefore
  1. Save all the content into a dropbox managed folder 
  2. go to the tablet and find that folder
  3. go to every file in it, and manually hit the download button, wait for it to D/L. Repeat for all files in a process that is O(files)*O(filesize).
  4. The update process is steps 1-3, repeated.
Warnings

Box
I also have a Box account, and an iPad app for that. This has some flags about auto-syncing on wifi only, which I'm happy with, not having a device with a modem in (tethering & wifi usually suffice).

It also has -and this got me excited- the ability to mark folders as "favorite", where it is claimed that content will auto-sync to the pad. I was hopeful here, marked my travel folder as a fave and then put stuff into trip-specific subdirs underneath, for this week's trip, next month's US trip, and others.

I go over to the 'pad, expecting the files to be there.

Only they aren't, because the favoured bit is not recursive.

Once you know that, Box sync becomes manageable
  1. Create a folder for each planned trip.
  2. Copy travel docs in there.
  3. Go to your tablet, and mark that folder as favourite, even if the parent dir is already marked as such.
  4. Update the travel folder on the laptop -things will now trickle over.
Remember step #3 and it does work.
CCTV on Gloucester Road

Email
Just mail them to yourself the day before you travel, download the mail and make sure the files are there.

This appears to work, though there are probably limits on the size of files that are auto-D/L'd, and the files go away at the history rate specified in the mail app, which is no good for a long trip.

Workflow
  1. Make sure the mail app is set to cache data for >= the length of your trip.
  2. Email the PDFs to yourself.
  3. Verify in the mail app that they have all arrived.
The nice feature about this is that it works from everywhere, whether or not Box is installed. You can also get other people to contribute to the document pool by having them email you direct.

There you have it: ways I've tried to sync documents.

CCTV on Gloucester Road

If iCloud actually seemed to do what is promised -"share your content across devices via Apple's cloud"- then it might work, even if its metaphor, "not a filesystem, but a place where every artefact is permanently bonded to whichever application put it there, even if there is >1 text or PDF viewer on a device", is so dumbed down it represents a step back to Mac 1.0.

Unfortunately the behaviour I see -"the same consistency and durability model as copying files to /dev/null"- means that there is no way I would trust it with anything. I actually hope there is something obvious I'm missing here, as I can't understand how something so dire would spring into existence, and I don't believe the business plan "make money from premium users who want to store more stuff" stands a chance against tools that actually work.

Instead I've settled on Box, making sure that things go over (opening a non-random sample of them -I should toss a coin over each file for better randomness).

Oh, and print out my boarding card and the map from the train station to the hotel.

[Photos: some of the CCTVs I saw on a single walk down Gloucester Road. I'm not quite sure what problems this high street had that needed near-ubiquitous CCTV coverage, but there's enough cameras to have fixed it. I like the one pointed straight at the ATM the best]

2013-03-10

Enterprise Hadoop: yes, but how are you going to fix it?

EMC's Pivotal HD has started a lot of debate as to whether building on top of Hadoop can be considered being part of the Hadoop ecosystem, or whether it's an attempt to co-opt it: to do something and claim that it is part of a bigger system.

Can you say you are "part of the Hadoop stack" when all you are doing is adding a closed source layer on top? I think that's quite nuanced, and depends on what you do -and how it's interpreted.


see no evil 2012


  1. The Apache License grants everyone the freedom to take the source away and do anything they want with it
  2. There is no requirement for you to contribute a single line of code back -or even a single bug report.
This is a difference between the ASF license and GPL-licensed software which you redistribute: with GPL code the changes must (somehow) be published. 

Other aspects of the ASF license:
  1. You can't abuse ASF brand names, which in the Apache Hadoop world means you can't use Apache Hadoop, Apache HBase, Apache Mahout, Giraph, Apache Pig, Apache Hive, etc in your product names. There are some excellent guidelines on this in the wiki page Defining Hadoop -and if you want actual feedback, email the trademarks@ list. It may seem that doing so removes the secrecy/surprise factor of your product announcement, but it's better that than a hurried renaming of all your product and documentation.
  2. If you sue other users of the product over patents of yours that you believe apply to the technology, you revoke your own right to the software. I haven't known that to happen with Apache products -though the Oracle/Google lawsuit did cover copyright of APIs and reimplementations thereof. If APIs ever become copyrightable, then decades of progress in the computing industry will grind to a halt.
People are also free to look at Apache APIs and clean-room re-implement them; you just can't use the Apache product names at that point. Asserting compatibility becomes indefensible: if you look at the ASF JIRAs, even 100% compatibility across versions is hard to achieve -and that's with the same source tree. It's not the binary signature that is (usually) the problem, it's what happens afterwards that's the trouble. Little things like whether renaming a file is atomic, or what happens when you ask for the block locations of a directory.
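As an illustration -a hypothetical probe of my own, not part of any real test suite- both of those behaviours are easy to poke at, and different Hadoop-compatible filesystems will give you different answers:

  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  /** Sketch only: probe two corners of FileSystem semantics. */
  public class SemanticsProbe {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path dir = new Path("/tmp/probe-dir");
      Path dest = new Path("/tmp/probe-dest");
      fs.mkdirs(dir);
      fs.mkdirs(dest);

      // rename onto an existing directory: the boolean result -and whether the
      // operation is atomic- is where implementations quietly differ
      boolean renamed = fs.rename(dir, dest);
      System.out.println("rename returned " + renamed);

      // block locations of a directory: implementations differ on whether you
      // get an empty array, a null, or an exception
      FileStatus status = fs.getFileStatus(dest);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, 1);
      System.out.println("block locations: " + Arrays.toString(blocks));
    }
  }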

Now, what about introducing a closed source product on top of Hadoop and saying you are part of the hadoop ecosystem, that you have x-hundred people working on Hadoop?

This is where it gets tricky.

Some people say "it's like building on Linux" -and there are some very big closed applications that run on Linux. A big one that springs to mind is the Oracle RDBMS.

Are the thousands of people who work on Oracle-on-Linux "working on Linux"? Are they working on "Oracle on Linux", or are they working "on Oracle", on Linux?

Whatever way you look at it, those people aren't working on the Linux OS, just on something that runs on top of it. Would you call it part of the Linux "stack", the way MySQL and Apache HTTPD are?

Personally: I have no idea.

see no evil 2012


What probably doesn't happen from Oracle's work is any direct feedback from their application into the OS. [Correction: it does, thx @tlipcon]. I also doubt that RedHat, Novell and others regression test Oracle RDBMS on their latest builds of Linux. By their very nature, closed-source applications fall outside the normal OSS regression and release test processes, which rely not only on the open source trees, but on the open test suites. This is also why Oracle's actions in not releasing all tests for MySQL seem so short-sighted: it may hurt MariaDB, but it also hinders Linux regression testing.


Breaking that link between the OS and the application means that Oracle have not been in a position to rapidly adapt to problems in the OS and filesystem, because there's no way to push their issues back upstream, to get changes in, to get new releases out in a hurry to fix a problem with their application or hardware. Instead the onus falls on the application to deal with the problem itself.

How have Oracle handled this? Eventually, by getting into the Linux distribution business themselves, with Oracle Unbreakable Linux. By releasing a complete OS build, they can coordinate OS and application releases; they can fix their version of the OS to handle problems that surface in Oracle's applications -on a timetable that works for them. They also get to handle Oracle hardware support in a timely manner, and charge support revenue from users.
 
That works -at a cost. By forking RedHat Linux, Oracle have taken on all the maintenance and testing costs themselves.

The amount that Oracle charges has to cover those costs, or the quality of the Oracle fork of Linux degrades relative to the reference points of RHEL and Debian.

For Oracle, the combined OS+11g+exadata deal has enough margin in the database that they can come up with a price that is less than ({HP | Dell}-RHEL-Oracle11g), and so presumably those costs can be covered. What's not clear is this: did Oracle get into the business of selling a supported Linux because they saw money in it, or because they concluded that their hardware and database products effectively mandated it?


Other companies getting into the business of redistributing Hadoop-derived products to customers who are paying those companies in the expectation of support are going to have to start thinking about this.

If you have just sold something that has some Hadoop JARs in it -code that the customer depends on- and they have a problem, how are you going to fix it?

Here are some strategies:
  1. Hope it won't be a problem. Take the Apache artifacts, ship as is. It is, in the opinion of myself and my Hortonworks colleagues, production ready. Push customers with problems to issues.apache.org, or forward their issues yourself. You could do the same with CDH, which, in the opinions of my friends at Cloudera, is also production ready. Do that, though, and issues on Apache JIRA will be ignored unless you can replicate them on the ASF artefacts.
  2. Build your own expertise: this takes time, and while that happens you aren't in a position to field support calls. If you make your own releases, you end up needing your own test infrastructure, QA'ing it, and tracking the changes in hadoop trunk and branch-1.
  3. Partner with the experts: work with people who have an in-depth understanding of the code, its history, why decisions were made, and experience in cutting production-scale releases suitable for use in web companies and enterprises. That means Hortonworks and Cloudera. Many of the enterprise vendors do this, because they've realised it was the best option.
The web companies, the early adopters, went for #1 and ended up with #2: build your own expertise. This is effectively what I did in my HPLabs work on dynamic in-cloud Hadoop. You can see my journeys through the source -while working on big things, little things crop up, especially problems related to networking in a virtual world, configuration in a dynamically configured space, and recovery/sync problems that my service model discovered. I still only know my way through a fraction of the code, but every project I work on builds up my understanding, and contributes stuff back to the core, including things like better specifications of the filesystem API's semantics, and the tests to go with them.

That trail of JIRAs related to my work shows up something else: if you are delving deep into Hadoop, your reading of the code alone should be enough to get you filing bugs against minor issues, niggles, potential synchronization, cleanup or robustness problems. If you are pushing the envelope in what Hadoop can do: bigger issues.

We are starting to see some involvement in hadoop-core from Intel, though apart from the encryption contribs, it still appears to be at an initial stage -though Andrew Purtell has long been busy in HBase. We do see a lot of activity from Junping Du of VMWare -not just the topology work, but other big virtualisation features, and the day-to-day niggles and test problems you get working with trunk. Conclusion: at least one person in VMWare is full time on Hadoop. Which is great: the more bugs that get reported, the more patches, the better Hadoop becomes. Participating in the core code development project develops your expertise while ensuring that the Apache (hence Hortonworks and Cloudera) artifacts meet your needs.

Are there other contributors from EMC? Intel? I have no idea. You can't tell from gmail & ymail addresses alone; you'd have to deanonymize them by going via LinkedIn. That's not just name match; you can use the LI "find your contacts" scanner to go through those people's email addresses and reverse lookup their names. Same for twitter. I may just do that for a nice little article on "practical deanonymization".

In the meantime, whenever someone comes to you with a product containing the Apache Hadoop stack, say "if there is a problem in the Hadoop JARs - how are you going to fix it?"



[Artwork: See no evil by Inkie, co-organiser of the See No Evil event. Clearly painted with the aid of a cherry picker]