2012-04-05

Joining Hortonworks to evolve #hadoop

Mountain Biking the Black Mountains

I've left HP. I did that on Monday, enjoying a final beer at lunchtime with my soon-to-be-ex colleagues, then heading home for a few weeks of parental responsibilities during the Easter break.

Later this month, I will start work at Hortonworks, pushing the Hadoop stack forwards. I am really excited about this -I know a lot of people in the company already, and it's going to be great working with them!

Although the phrase "Big Data" is getting overused, it's obvious to me that there is a real coming together of different trends to make the whole Hadoop-based ecosystem as transformational as web servers were.
  1. There are so many devices in the modern world acting as data sources -physical devices such as mobile phones and jet engines, services such as web applications, people making use of devices and services. 
  2. In the past less data was generated -and it was normally thrown away. Too expensive to store, no perceived value.
  3. The cost per TB of HDD has fallen such that you can now afford to keep that data for later analysis.
  4. You can't analyse it on single servers as the bandwidth of HDDs hasn't increased at the same rate as the storage capacity.
  5. The performance of a single CPU has effectively topped out too. All that is coming is more cores, more operations/joule (hopefully), different forms of parallel computation. The free speedups that the CPU vendors used to dish out are over. It's either single-machine parallelism or multi-machine. Oh, and either way: heterogeneity of some form or other.
  6. That means everyone is going to have to embrace parallel computing, on the single machine or in the rack -and with the right algorithms, that rack can be made to deliver linear and sometimes superlinear speedup.
  7. If you want to work with the big datasets that you can collect today, you are going to need a rack of servers and a framework to let you process the data.
  8. The Hadoop platform provides the framework to store the data across those hard disks, and to distribute the work across them. It is becoming the single open-source alternative to Google's internal platform.
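
Points 4 to 7 are really an arithmetic argument, and a rough sketch makes it concrete. The capacity and bandwidth figures below are illustrative assumptions rather than measurements, and `scan_hours` is a made-up helper that ignores seek time, network cost and scheduling overhead:

```python
def scan_hours(capacity_tb, bandwidth_mb_s, disks=1):
    """Hours needed to read capacity_tb of data spread evenly over
    `disks` drives that are all read in parallel."""
    total_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / (bandwidth_mb_s * disks)
    return seconds / 3600


# Assumed figures: 2 TB of data, 100 MB/s sustained read per drive.
single = scan_hours(2, 100)           # everything on one drive in one server
rack = scan_hours(2, 100, disks=40)   # same data striped over 40 drives

print(f"one drive:  {single:.2f} hours")
print(f"40 drives:  {rack * 60:.1f} minutes")
```

Under those assumptions a single drive needs hours to scan what a rack of drives reads in minutes, which is the gap that spreading both the data and the work across many disks is meant to close.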

Where the future gets really interesting is that the Hadoop ecosystem provides the core services of a distributed computing platform: bulk storage (HDFS), scheduling (MRv2), distributed state (ZooKeeper), integration with existing infrastructure (Flume, Sqoop, Hive). These services can be used to build applications in and above Hadoop -HBase and Giraph are key examples; Cassandra a welcome friend. Big Data is the immediate reason to move into this world, but ultimately it's Big Datacentre -not things like Java EE7 that just seem, well, so very last-century.

That's why I'm joining Hortonworks -to go full time on building the future platform for server-side computing.

[photo: preparing to descend into Crickhowell, Wales, 2011]

3 comments:

  1. Nice one Sir. I'm also learning to understand Hadoop from Tom White's Hadoop book. I graduated a year ago as a Software Engineer, worked as an Oracle DBA for over a year and have now decided to go into the Hadoop domain.

    Harmeet
    twitter.com/oraa1

  2. Are you aware of work around swapping out HDFS for something else, like Swift...

  3. Michael: I'm actually the Hortonworks part of the team doing the Hadoop Swift Integration, https://issues.apache.org/jira/browse/HADOOP-8545.

    Swift is a blobstore, not a filesystem, and has different semantics from what Hadoop apps such as MapReduce and HBase expect. Also, it's slower.