2012-03-16

Safety in High Energy Physics and US Visitor paperwork

The NYT article on US immigration and ESTA makes me think that I should publish the safety lecture from the November 2010 Bristol Hadoop workshop, which was hosted by Bristol University Physics Dept. All the signs in the slides are from their physics building, except for the one on the last slide.




The last form is genuine: it's hosted on the State Department site as form 1405-0134, and I've had to fill it in a couple of times. Things I like about it:
  • I can cross more countries on a one-day alpine bike ride than there is room for in the "countries you have visited" section. 
  • Giving to a charity is clearly something they don't expect people to do much of -again, space for two or three, and no time limit on how far back you must list your donations.
  • In places like Boise, Idaho, firearms training is something they ought to give visitors, not ask whether they have it.
Because the form explicitly says "nuclear experience", the HEP folk can get away with saying no. Saying you work with antimatter or neutron beams is not the thing you want to do at an immigration border.

While taking the on-site photos I got cornered by the university site security people for taking pictures with an SLR, as that is "what the police warned them about". I didn't point out to them that if I wanted to take photos discreetly I'd use a camera phone, or the HD-resolution cycle helmet cam on the bike helmet I'd carry nonchalantly under one arm -as that would only make them think I was planning something. Better to stick to the idea that enemies of the state use SLRs -so anyone with an SLR is potentially an enemy. Anyway, I didn't argue, just sat there and let them look at the photos while they verified that I was visiting the physics dept. At some point the chief minion started talking about deleting the photo of a paper sign listing how many gigabecquerels they had. Ignoring the fact that such information is available online, having that photo deleted would only have wasted 15 minutes of my life once I got back home and undeleted it. At least they didn't ask to look at the laptop, as that would have created conflict.

Returning to ESTA, here's something funny about it. There is no online way to see when yours expires. Apparently you get a renewal email, but I've never seen one, and last November I tried to find out whether mine was still valid before I flew to the US. I didn't get a reply until after I'd flown out:


Dear Stephen,
I am sorry we were not able to respond to your question sooner. Hopefully you did not have any problems traveling to the US, but please write back if you still need help.

That is -we hope it hadn't expired yet, because you'd have been stuffed if it had.

2012-03-14

Hadoop in Cloud Infrastructures

Rainier descent

People ask, "should you run Hadoop in the cloud?" I say, "it depends".

I think there is value in Hadoop-in-cloud; I talked about doing it at Berlin Buzzwords 2010, and since then I've had more experience with using Hadoop and implementing cloud infrastructures.
  1. If your data is stored in a cloud provider's storage infrastructure, doing the analysis in that same infrastructure is the only rational action. It's that "work near the data" philosophy.
  2. If you are only doing some computation -say nightly- then you can rent some cluster time. Even if compute performance is worse, you can just rent some more machines to compensate.
  3. You may be able to achieve better security through isolation of clusters (depends on your IaaS vendor's abilities).
  4. No upfront capex; fund from ongoing revenue.
  5. Easier to expand your cluster; no need to buy more racks, find more rack space.
  6. You don't need to care about the problems of networking.
  7. Less of a problem with heterogeneous clusters if you expand later.

Against that:

  1. Cost of data storage grows at a rate proportional to your ingress and retention rates.
  2. Cost of cluster time increases at a rate proportional to analysis performed. There is no "spare cluster time" for low priority work.
  3. Even if CPU time can scale up, IO rate of persistent data may not.
  4. Hadoop contains lots of assumptions about running in a static infrastructure; its scheduling and recovery algorithms assume this.
Some examples of where Hadoop's assumptions diverge from those of cloud infrastructures:

  • HDFS assumes failures are independent, and places data accordingly (Google's Availability in Globally Distributed Storage Systems paper shows this doesn't hold in physical infrastructures; my notion of failure topologies expands on that).
  • MR blacklists failing machines, rather than releasing them and requesting new ones.
  • Worker nodes handle failure of master nodes by spinning on the hostname, not by querying (dynamic) configuration data for new hostnames. Some of the HA HDFS work may address that; I'm not tracking it closely enough.
  • Topology scripts are static. I've been slowly tweaking the topology logic in 0.23+ but haven't put the dynamic behaviour in there yet (HDFS and MR cache the (name->rack) mappings on the assumption that the data comes from slow-to-execute scripts, not fast, refreshable in-VM data) -see the sketch after this list.
  • Schedulers assume the number of machines is static; they don't allocate and release compute nodes based on demand and with knowledge of the cost and quantum of CPU rentals. (I'm not sure "quantum" is the right term; I mean the fact that VMs may be rented by the hour, by 15 minutes, etc., so your scheduler should retain a node for 59 minutes after acquiring it -a second sketch further down shows the sort of calculation I mean.)
  • Scheduling doesn't bill different users for their cluster use in a way that is easily mapped to cluster time.
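
As an illustration of what a less static topology mapping might look like, here is a minimal sketch of a DNSToSwitchMapping implementation whose rack data could be refreshed from in-VM sources instead of a one-shot script. The class name and the lookupPlacement() hook are hypothetical -they stand in for whatever placement/metadata API your IaaS provider offers- and the exact interface and the configuration key for wiring it in vary between Hadoop versions.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.net.DNSToSwitchMapping;

// Sketch only: a rack mapper whose data could be refreshed from in-VM
// sources, rather than caching a topology script's output forever.
public class DynamicRackMapping implements DNSToSwitchMapping {

  private static final String DEFAULT_RACK = "/default-rack";
  private final Map<String, String> rackCache = new ConcurrentHashMap<String, String>();

  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String rack = rackCache.get(name);
      if (rack == null) {
        rack = lookupPlacement(name);
        rackCache.put(name, rack);
      }
      racks.add(rack);
    }
    return racks;
  }

  // Hypothetical hook: a real deployment would ask the cloud provider's
  // placement/metadata service where this VM currently lives, and would
  // invalidate cache entries when nodes come and go.
  private String lookupPlacement(String hostname) {
    return DEFAULT_RACK;
  }
}

Wiring something like this in is the easy part (the node-to-rack mapping implementation is set in core-site.xml; the key name differs between releases); deciding when the cached mappings are stale is where the real work lies.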

A lot of these are tractable; you just have to put in the effort. The Stratosphere team in Berlin are doing lots of excellent work here, including taking a higher-level query language and generating an execution plan that is IaaS-aware -you can optimise for speed (many machines) or lower cost (use fewer machines more efficiently).
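
To give a feel for the rental-quantum point above, here is a toy sketch of the decision a cloud-aware scheduler would make before releasing an idle node. The hourly quantum and the five-minute shutdown margin are invented numbers for illustration, not anything Hadoop does today.

// Toy example: only release an idle VM near the end of the period
// you have already paid for. The numbers are illustrative.
public final class RentalQuantum {

  private static final long QUANTUM_MS = 60 * 60 * 1000L;        // assume hourly billing
  private static final long SHUTDOWN_MARGIN_MS = 5 * 60 * 1000L; // time to decommission cleanly

  private RentalQuantum() {
  }

  // True if an idle node should be released now rather than kept around
  // for any low-priority work that may turn up before the hour is out.
  public static boolean shouldRelease(long acquiredAtMillis, long nowMillis) {
    long age = nowMillis - acquiredAtMillis;
    long usedInCurrentQuantum = age % QUANTUM_MS;
    return usedInCurrentQuantum >= QUANTUM_MS - SHUTDOWN_MARGIN_MS;
  }
}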

In comparison, a physical cluster:
  • Offers a lower cost/TB than any corporate filestore to date other than people's desktop computers (which have a high TCO and power cost that is generally ignored), so it enables you to store lots of stuff you would otherwise discard.
  • Lets you choose the hardware optimised for your current and predicted workloads.
  • Has free time for the low priority background work as well as the quicker queries that near-real-time UIs like.
  • May be directly accessible from desktops in the organisation (depends on security model of cluster).
  • Is easily hooked up to Jenkins infrastructure for execution of work as CI jobs.
  • Lets you do fancy tricks like striping different MR versions across the racks for in-rack locality, running different sets of task trackers for foreground vs background work, and running different JTs (which reduces memory use, the cost of failure, etc).
  • Is way, way easier to hook up to internal databases and log feeds. To do ETL into your corporate Oracle servers from a cloud cluster, you will need to run something behind the firewall to fetch the data off the IaaS storage layer, rather than have your reducers push it to the RDBMS itself (there's a rough sketch of the push-to-database approach below).
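
For contrast, here is roughly what "reducers push it to the RDBMS itself" looks like with the stock DBOutputFormat. It's a sketch: the JDBC driver, URL, credentials, table and column names are all made up, and the job's output value class would have to implement DBWritable.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

// Sketch: point a job's output straight at a JDBC database.
public class PushToDatabase {

  public static void configureOutput(Job job) throws IOException {
    Configuration conf = job.getConfiguration();
    // Driver, URL and credentials are placeholders for your own database.
    DBConfiguration.configureDB(conf,
        "oracle.jdbc.OracleDriver",
        "jdbc:oracle:thin:@dbhost:1521:REPORTS",
        "etl_user", "etl_password");
    // Table and column names are invented; the job's output value class
    // must implement DBWritable to map onto them.
    DBOutputFormat.setOutput(job, "daily_summary", "day", "metric", "total");
    job.setOutputFormatClass(DBOutputFormat.class);
  }
}
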
If you are generating data in-house, in-house clusters make a lot of sense.
This is why I say "it depends" -it depends on where you collect your data and what you plan to do with it.

As for the ways Hadoop doesn't currently work so well in such infrastructures, well, the code is there for people to fix. It's also a lot easier to test in-cloud behaviour, including resilience to failure, than it is with physical clusters.

[Photo: Descending Mt Rainier, 2000]