2012-11-14

A Hadoop Standards Body? It's called the Apache Software Foundation

I am writing this on the ICE502 train from Mannheim to Frankfurt. To my left, my friend Paolo Castagna pages through the emails from Cloudera HQ that are slowly trickling into his phone; I'm out of network range, so I can't go over to the small-kids (kleinerkinder) compartment and Skype in to a Hortonworks team meeting.

We are on our way back from ApacheCon EU.
Zooming in

Over the last week, the topics of the talks I've attended have included (omitting my own): Cassandra development, RDF processing in Apache Hadoop (ask Paolo there), Logging futures, post-Apache Maven build tools, Apache OpenOffice cloud integration, CloudStack, Apache HBase status quo (Lars showed how all the HDFS work we've been doing is really going to benefit Apache HBase there), NoSQL ORM, Apache Mahout, and many others. A large proportion of the Apache Hadoop Datacentre Stack is there -and we can sit down and discuss issues. It may be an internal issue, such as how to move away from commons-logging; it may be something cross-project, such as how HDFS could let HBase explicitly request a block placement policy for each region server that keeps all replicas on the same rack; or it could be something indirectly relevant, like Apache OpenOffice slideshow improvements.

We've been treated to slides from Steve Watt of HP showing their prototype Arm-64 server systems, which will offer tens of servers in a 2U unit -a profound achievement. We've been treated to some excellent beer at the Adobe reception, which went from 18:00 until we were evicted at 21:00.

I met lots of people: some I knew, some I'd never met face to face before, some who were complete strangers until this week. We've been in the same talks, eaten at the same tables, drunk beer in the two restaurants and the cafe in this town, discussing everything from OSGi classloading in Apache Karaf to jumbo Ethernet frames and what to do when the remains of a decomposing whale end up in your datacentre. The people I was in the cafes with included Lars George (Cloudera), Steve Watt (HP), Isabel Drost (Nokia), and three people who had a whale-related incident in their facility.
A whale? a whale?

Not once did anyone say: "Let's give some standards body the Apache Hadoop trademark and the right to define our APIs as well as the exact semantics of the implementation!"

Nobody said that. Not even whispered it.

Because from the open source perspective, it makes no sense whatsoever. The subject that did come up was "Jackson versioning grief" -which relates to an open JIRA.

I gave a talk saying there is lots of work, pointing people at svn.apache.org and issues.apache.org, saying "get involved" -and discussing how to do so.

Key things to do
  • gain trust by getting on the lists and being visible (and competent, obviously)
  • help review other people's patches, rather than just your own
  • don't try and do big things in Apache HDFS (risk of data loss) or Apache MapReduce (performance and scale risks).
What I did emphasise is that we do want more people helping -and that we need to improve how this is done. I did not suggest that we could do this through "under an industry forum—either an established group or one that is specifically focused on big data.".

What I suggested was -and these are entirely personal opinions -
  1. some mechanism for mentoring in external development projects, so that they don't fail, get neglected, or appear without any warning -creating integration problems.
  2. better distributed development, so that those of us outside the Bay Area can be involved in the development. Google+ events, more pure-online meetings in various timezones. The YARN event that Arun organised is something I want to praise here: we remote attendees got WebEx audio and remote slideshare. Even so, it was very late in the EU evening, and there's always an imbalance between people in the room -the visible, vocal audience- and people down the speaker phone.
  3. better patch integration through Git and Gerrit. Even if svn is the normative repo, we should be able to accept patches as pull requests that go through Gerrit review; people can update their patches trivially through merging trunk with their branch and pushing out their branch to a public repo.
I also mentioned tests. Not just tests of new features -where we are obsessive about "no features without tests"- but tests that improve the coverage of the system and formalise its semantics.

If there is ambiguity in the behaviour of bits of Apache Hadoop, tests added to the Apache source repository, svn.apache.org, define that behaviour. Regression testing the entire stack finds problems, which is why we love to do that -especially things like testing whether repeated runs of Apache HBase's functional test suites succeed while our test infrastructure triggers NameNode failover, or how the deployment of Yahoo!'s existing applications on the new MRv2 engine in YARN improves the performance of those applications -while finding any regressions in MRv2 from the MRv1 runtime.
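To make "tests define the behaviour" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ToyFileSystem` is a hypothetical in-memory stand-in, not Hadoop's FileSystem API, and the rename rule it pins down (refuse to overwrite an existing destination, and report that by returning False) is chosen only to show the idea, not to describe what HDFS actually does. The point is that once a test like this is committed, the ambiguous case has exactly one accepted answer.

```python
import unittest


class ToyFileSystem:
    """Hypothetical in-memory filesystem; NOT Hadoop's FileSystem API."""

    def __init__(self):
        self._files = {}  # maps path -> file contents (bytes)

    def create(self, path, data):
        self._files[path] = data

    def exists(self, path):
        return path in self._files

    def read(self, path):
        return self._files[path]

    def rename(self, src, dst):
        # Assumed semantics for this sketch only: a missing source or an
        # existing destination makes rename fail by returning False,
        # leaving both paths untouched.
        if src not in self._files or dst in self._files:
            return False
        self._files[dst] = self._files.pop(src)
        return True


class RenameSemanticsTest(unittest.TestCase):
    """Each test turns one ambiguous case into a checked statement."""

    def test_rename_does_not_overwrite_existing_destination(self):
        fs = ToyFileSystem()
        fs.create("/a", b"source")
        fs.create("/b", b"destination")
        self.assertFalse(fs.rename("/a", "/b"))
        self.assertEqual(fs.read("/b"), b"destination")
        self.assertTrue(fs.exists("/a"))

    def test_rename_of_missing_source_fails(self):
        fs = ToyFileSystem()
        self.assertFalse(fs.rename("/missing", "/b"))
        self.assertFalse(fs.exists("/b"))
```

Saved as a module, this can be run with `python -m unittest`. Any alternative implementation that wanted to claim the same behaviour would have to pass the same tests, which is the whole argument in miniature.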

Testing against Apache Hadoop is the way to guarantee compatibility with Apache Hadoop -because the Apache Hadoop code is Hadoop.

At the root of the svn.apache.org/hadoop source tree, in the Apache tarballs and RPMs, and in those products that include the ASF artifacts or forks thereof is a file: LICENSE.TXT
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
What does that mean? It means:

Anyone is free to write whatever distributed filesystem they want, and implement whatever distributed computing platforms on top of that they choose -but they cannot call it Hadoop.

There's a nice simple metric here:

If you can't file bug reports against something in issues.apache.org, it's not an Apache product, and hence not Apache Hadoop.

For that reason: I'm not convinced that the Hadoop stack needs to care about the compatibility concerns of people trying to produce alternative platforms, any more than Microsoft needs to care about the work in Linux to run Windows device drivers.
