Visiting the France Hadoop Users Group

At the invitation of Cedric Carbone from Talend, I went over to Paris to join in the third France HUG event, giving my talk and listening to the others.

It was really good to meet a group of people who are all involved in this stuff. These were Hadoop users, not "curious about all the hype" users -and I enjoyed not just talking about Hortonworks, Hadoop and HDP-1, but listening to what they are up to and the issues they have. I also got some lessons in technical French -all the Hadoop product names are designed for the language.

Now my HDP-1 slides:
I gave a quick demo at the end of the stuff I'm doing on availability -not the bit where vSphere recognises and kills a VM hosting a failed service, restarting it on the same physical host or, if that host itself is in trouble, elsewhere. Instead I showed how the JT can be set up not only to reload its queue of jobs, but also to go into "safe mode", either at the request of a remote administrator or when it detects that the filesystem is offline.

Being able to put the JT into safe mode is beneficial not just for dealing with unplanned availability issues, but also for planned DFS maintenance. When you flip the switch on the JT it doesn't kill tasks in flight, but it doesn't worry if they fail; it doesn't blacklist tasktrackers or consider the job as failing. You can't schedule new jobs in safe mode either -while that's implicit in that HDFS isn't there to save your JARs, this just makes it more formal. When the JT comes out of safe mode, it reschedules the queued work (the whole job), and new requests can be added.

The UI can show safe mode too -though I've realised that before the code is frozen, the JT setSafeMode() call should take a string explaining why the system has entered this state. It would be set automatically on DFS failure, while a manual request would pick up whatever explanation you asked for, e.g. "Emergency Switch maintenance". Anyone who goes to the JT status page would see this message.

DFS could have the same -in fact maybe the JT ought just to support a message of the day feature. That's feature creep: knowing why things are down is better. Indeed, I could imagine it being part of the payload of exceptions clients get: "W. says no jobs today".
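The safe mode behaviour described above can be sketched as a toy model. This is not the JobTracker code or its API: the class and method names below (ToyJobTracker, enter_safe_mode, and so on) are hypothetical, chosen only to show the rules in one place: task failures are ignored while in safe mode, new submissions are rejected with the reason attached, and affected jobs are rescheduled whole on exit.

```python
class SafeModeError(Exception):
    """Raised when a job is submitted while the toy tracker is in safe mode."""


class ToyJobTracker:
    """Hypothetical model of the safe mode rules; not Hadoop's JobTracker."""

    def __init__(self):
        self.safe_mode_reason = None   # None means normal operation
        self.jobs = []                 # every job accepted so far
        self.to_rerun = []             # jobs to reschedule on leaving safe mode
        self.tracker_failures = {}     # tasktracker name -> counted failures

    def enter_safe_mode(self, reason):
        # The reason is what a status page or a client exception would show.
        self.safe_mode_reason = reason

    def on_filesystem_offline(self):
        # Automatic entry: the explanation is filled in for the administrator.
        self.enter_safe_mode("DFS offline")

    def submit(self, job):
        if self.safe_mode_reason is not None:
            raise SafeModeError("JobTracker in safe mode: " + self.safe_mode_reason)
        self.jobs.append(job)

    def task_failed(self, job, tracker):
        if self.safe_mode_reason is not None:
            # Not held against the tasktracker or the job; just rerun later.
            if job not in self.to_rerun:
                self.to_rerun.append(job)
            return
        self.tracker_failures[tracker] = self.tracker_failures.get(tracker, 0) + 1

    def leave_safe_mode(self):
        self.safe_mode_reason = None
        rescheduled, self.to_rerun = self.to_rerun, []
        return rescheduled             # whole jobs, not individual tasks
```

A manual request would call enter_safe_mode("Emergency Switch maintenance"); any client that then calls submit() gets that text back in the exception message.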

The other talks gave me a chance to revise some French, with HCatalog and an attendee of the Hadoop Summit giving their summary.

Afterwards: baguette, fromage, Kronenbourg 1664 -stuff worth travelling over for.

One interesting discussion I had was on the topic of ECC DRAM in servers. Having done all the reading on availability last year, I not only think that ECC is essential in servers, I also think the time when it should be in desktop systems is drawing close. When machines are shipping with 4+GB of DRAM in even a laptop, P(single-bit-error) is getting high, as Nightingale's "Cycles, Cells and Platters" paper showed.

Yet as the attendees pointed out, today you are being given a choice of five ECC'd servers vs 15 non-ECC, and the tangible benefits of more servers are greater than the probabilistic benefits of ECC'd servers.

Why is ECC so expensive? It's not the RAM, which adds about lg(data bits + ecc bits) check bits of DRAM per word: 6 bits for a 32 bit word, 7 bits for a 64 bit line. The percentage of extra ECC bits decreases as the line gets wider, so as we move to wider memory buses, the incremental $ and W cost of ECC should decrease. Why then the premium? Chipsets and motherboards. ECC is viewed as a premium luxury that isn't even an option in the consumer chipsets, Atom parts in particular. That's just an attempt to gain a price premium on server designs -not in the chipset and mainboard, but on the CPUs themselves.
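The check-bit figures above follow from the single-error-correcting Hamming bound: r check bits can cover m data bits when 2^r >= m + r + 1. A small sketch of that arithmetic (the function name is mine, and the widths beyond 64 bits are only there to show the trend):

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    for width in (32, 64, 128, 256):
        r = hamming_check_bits(width)
        # Overhead shrinks as the protected word gets wider.
        print(f"{width:4d} data bits: {r} check bits ({100 * r / width:.1f}% extra)")
```

Detecting double-bit errors as well (SECDED) costs one more bit per word, which is why ECC DIMMs are 72 bits wide: 8 check bits for every 64 data bits.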

This led into a discussion on what the ideal Hadoop worker node would be. Not an enterprise-priced hot-swap-everything box -that makes sense for the master nodes, but not for the many workers. There are the 'web scale' server boxes that the PC manufacturers will sell you if you have a large enough order. These are the many-disk, nearly-no-end-user-maintenance designs that you can buy in bulk -but it has to be bulk. They can contain assumptions that they are the sole inhabitants of a rack (things like front-panel networking needing a compatible ToR switch), and they are for organisations whose clusters are so big that disk failures are treated as an ongoing operations task, which can be addressed by decommissioning that node and then replacing the disks at leisure. At scale this is the right tactic -as the decommissioning load is spread across the entire cluster's network bandwidth, it's not an expensive operation LAN-wise.

In smaller clusters you not only lack the spare disk capacity to handle a 24 TB server outage, but the impact on your bandwidth is also higher. Provided you can power off the server, swap over the disk, and have the machine booted, the disk formatted and mounted and the Datanode live within 15 minutes, you can do a warm swap today. If servers were made so that swapping disks in and out was easier than with the current web-scale servers, this could/should be possible.
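To put rough numbers on why cluster size matters here, a back-of-envelope sketch. It assumes the re-replication work is shared evenly by the surviving nodes and that each can spare a fixed bandwidth for it; both are simplifications, and the 100 MB/s figure in the usage line is purely illustrative, not a measurement.

```python
def rereplication_hours(lost_tb, surviving_nodes, mb_per_sec_per_node):
    """Crude estimate of the time to re-replicate the data of one lost server.

    Assumes perfectly even spreading across the surviving nodes and a fixed
    spare bandwidth per node; real clusters will be slower.
    """
    lost_bytes = lost_tb * 1e12
    cluster_bytes_per_sec = surviving_nodes * mb_per_sec_per_node * 1e6
    return lost_bytes / cluster_bytes_per_sec / 3600


if __name__ == "__main__":
    for nodes in (10, 100, 1000):
        hours = rereplication_hours(24, nodes, 100)  # 100 MB/s is an assumption
        print(f"{nodes:5d} surviving nodes: {hours:.2f} hours")
```

Under those assumptions the same 24 TB loss that a large cluster absorbs in minutes keeps a ten-node cluster busy for hours, which is the case for making the warm swap quick.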

What do we want then?

  • Easy to warm swap disks in and out
  • No mandate of a single model per rack -so network ports, PSUs and racks can't require this.
  • ECC memory
  • Not require the latest and best CPUs and chipsets. A bit behind the model curve makes a big difference in ASP.
  • Low peak power budget. I know the i7 parts can power off idle cores, and even raise the voltage/clock speed of the active cores in such a situation to finish their work faster in the same power envelope. But in a rack you need to consider the whole rack power budget: how to stay not only within the rack's power budget, but also within the billing range of the colocation site -some of whom bill by peak possible power budget, not actual use.
  • Topology information. Some of the HP racks now do this for the admin tools -be nice to see how to get that into Hadoop.
  • Support 2x1 Gb LAN, or a LAN with enough throughput to meet the peak loads of re-replicating an entire lost rack.
  • Option to add a lot more RAM, without you having to pull out and throw away the RAM it initially came with.
  • Blinking lights on the front that you can control in the user-level software. You know you need it. This may seem facetious but it's how you direct people in the datacentre to the box they need to look at, it's how you qualify the network topology (light up all boxes in row of all racks -boxes that don't light are offline or mis-connected). Plus Thinking Machines shows that Blinky Lights look good at scale.
  • Option to buy compatible CPUs to fill in the spare socket in twelve months' time. That means the parts should still be on the price list.
  • Option to buy compatible servers 12 months down the line, where compatible means "same OS", even if storage and CPU may have incremented.
  • CPU parts with excellent Hadoop/Java performance, good native compression and checksumming.
  • Linux support for everything on the motherboard.
  • Options for: GPU, SSD
  • Maybe: USB BIOS updates. On a 50 node rack this is just about manageable and means you can skimp on the ILO management board and matching network.
  • Ability to get system state (temp, PSU happiness) into OS and then into the management tools of the cluster -even the Hadoop cluster.
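On the topology item: Hadoop can already take rack information from an external script, named in the topology.script.file.name property in Hadoop 1.x, which is handed host names or IP addresses as arguments and prints one rack path per host. What is missing is the hardware telling you the answer. A minimal sketch of such a script, where the address-to-rack table is a hypothetical stand-in for whatever the rack's admin tools could report:

```python
import sys

# Hypothetical mapping; in the ideal server this would come from the
# rack or chassis management tools rather than a hand-maintained table.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.11": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts, table=RACKS):
    """Return one rack path per host, in the same order as the arguments."""
    return [table.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts and reads the rack paths from stdout.
    print("\n".join(resolve_racks(sys.argv[1:])))
```

Hosts that are not in the table fall back to a default rack, so a new or unrecognised machine is still usable while the mapping catches up.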

No need for:

  • RAID controllers.
  • Hot swap PSUs, though ability to share peer PSUs is handy. A cold-swap PSU and an on-site spare should suffice, given the rate of PSU failures ought to be much, much less than those of disks.
  • Top of the line CPU parts whose cost ramps up much more than performance versus the previous model.
  • Dedicated management LAN and ILO cards. Nice and makes managing large clusters much easier, but they add cost that small clusters can't justify.
  • Over-expensive interconnect (did anyone say Infiniband?)

I don't know who is doing this yet -if you look at the prepackaged Hadoop stacks they are all existing hardware. Things will no doubt change -and once that happens, once people start optimising hardware for Hadoop, we will all get excellent value for money. And Blinking Lights on the front of our boxes that MR jobs can control.
