Microsoft and Hadoop: interesting

donkeys like music too

I am probably expected to say negative things about the recent announcement of Microsoft supporting Apache Hadoop(tm) on their Azure Cloud Platform, but I won't. I am impressed and think that it is a good thing.
  1. It shows that the Hadoop ecosystem is becoming more ubiquitous. Yes, it has flaws, which I know as one of my IntelliJ IDEA instances has the 0.24 trunk open in a window and I am staring at the monitoring2 code thinking "why didn't they use ByteBuffer here instead of hacking together XDR records by hand" (FWIW, Todd "ganglia" Lipcon says it ain't his code). Despite these flaws, it is becoming widespread.
  2. That helps the layers on top, the tools that work to the APIs, the applications.
  3. This gives the Hadoop ecosystem more momentum and stops alternatives getting a foothold. In particular the LexisNexis stuff -from a company that talk about "Killing Hadoop". Got some bad news there...
  4. Microsoft have promised to contribute stuff back. This is more than Amazon have ever done -yet AWS must have found truckloads of problems. Everyone else does: and we file bugs, then try to fix them. I could pick any mildly-complex source file in the tree, do a line-by-line code review and find something to fix, even if it's just better logging or error handling. (Don't dismiss error handling, BTW.)
  5. If Amazon have forked, they get to keep that fork up to date.
  6. If MS do contribute stuff back, it will make Hadoop work properly under Windows. For now you have to install Cygwin because Hadoop calls out to various unix commands a lot of the time. A windows-specific library for these operations will make Hadoop not only more useful in Windows clusters, it will make it better for developers.
  7. MS will test at scale on Windows, which will find new bugs, bugs that they and Hortonworks will fix. Ideally they will add more functional tests too.
  8. I get to say to @savasp that my code is running in their datacentre. Savas: mine is the networking stuff to get it to degrade better on a badly configured home network.  Your ops team should not encounter this.
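For the curious, the ByteBuffer grumble in point 1 can be sketched roughly as follows. This is not the Hadoop code, and not the real Ganglia packet layout: the class name, the method names, the 1500-byte buffer and the "string name followed by int value" record are all made up for illustration. It only shows that java.nio.ByteBuffer, which is big-endian by default, already does most of the work of writing XDR ints and padded strings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: XDR-style encoding with ByteBuffer instead of
// shifting bytes into an array by hand. Not the actual Hadoop code.
public class XdrSketch {

    /** XDR int: four bytes, big-endian, which is ByteBuffer's default order. */
    static void putInt(ByteBuffer buf, int value) {
        buf.putInt(value);
    }

    /** XDR string: a four-byte length, the bytes, then zero padding to a multiple of four. */
    static void putString(ByteBuffer buf, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        buf.putInt(bytes.length);
        buf.put(bytes);
        int pad = (4 - bytes.length % 4) % 4;
        for (int i = 0; i < pad; i++) {
            buf.put((byte) 0);
        }
    }

    /** Encode an invented (name, value) record; 1500 bytes is an assumed upper bound. */
    static byte[] encode(String name, int value) {
        ByteBuffer buf = ByteBuffer.allocate(1500);
        putString(buf, name);
        putInt(buf, value);
        buf.flip();                       // switch from writing to reading
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] record = encode("cpu", 42);
        System.out.println(record.length + " bytes");
    }
}
```

A name too long for the buffer raises BufferOverflowException rather than silently scribbling past the end of an array, which is an argument for this approach all by itself.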
It's interesting that Microsoft have done this. Why?
  • It could be indicative of a low takeup of Azure outside the MS enterprise community. I used to do a lot of win32 programming (in fact I once coded on windows/386 2.04); I don't miss it, even though Visual Studio 6 used to be a really good C++ IDE. It is nicer to live in the Unix land that Kernighan and Ritchie created.(*)
  • Any data mining tooling encourages you to keep data, which earns money for all cloud service providers.
  • The layers on top are becoming interesting. That's the extra code layers, the GUI integration, etc.
  • There's no reason why enterprise customers can't also run Hadoop on Windows Server within their own organisations, and so integrate with the rest of their world. (I'm ignoring cost of the OS here, because if you pay for RHEL6 and CDH then the OS costs become noise).
  • If you are trying to run windows code as part of your MR or Pig jobs, you now can. 
  • If you are trying to share the cluster with "legacy" windows code, you can.
Do I have any concerns?
  • Somehow I doubt MSFT will be eating their own dogfood here; this may reduce the rate they find problems, leaving it to the end users. Unless they have a fast upgrade rate it may take a while for those changes to roll out. (Look at AWS's update rate: sluggish. Maybe because they've forked)
  • To date, Hadoop is optimised for Linux; things like the way it execs() are part of this. There is a risk that changes for Windows performance will conflict with Linux performance. What happens then?
  • I foresee a growth in the out-of-depth people trying to use Hadoop and asking newbie questions now related to Windows. Though as we get them already, there may be no change.
  • I really wish Windows Server had SSH built in rather than telnet. Telnet is dead: move on. We want SSH and SFTP filesystems, and an SFTP filesystem client in both Windows and OS X. It's the only way to be secure.
  • I hope we don't end up in an argument over which underlying OS is best. The answer is: the one you are happy with.

(*) At least for developers. I changed the sprog's password last week as we were unhappy with him, and when I passwd'd it back he asked me "why do I use the terminal?". One day he'll learn. I'll know that day as he'll have changed my password.

[Artwork: unknown, Moon Street, Stokes Croft]

1 comment:

  1. One interesting thing to note is that Hadoop was used at Powerset, which was acquired by Microsoft and is now part of Bing search. I read that Hadoop was later replaced with something else, so I am not sure of Microsoft's commitment to Hadoop. FYI, HBase was started by Powerset.

    I am not an expert in the performance of cross-platform applications, but it might be a challenge to have Hadoop run efficiently on both Linux and Windows without affecting the performance of the other.

    The good thing about the whole thing is that Hadoop will run on more platforms and will be useful to 'I am new to Linux and how do I get the list of files' type of guys to get around Hadoop more easily.

