Television Viewing & the Deanonymization of Large Sparse Datasets.

[preamble: this is not me arguing against collecting data and analysing user behaviour, including TV viewing actions. I cherish the fact that Netflix recommends different things to different family members, and I'm happy for the iPlayer team to get some generic usage data and recognise that nobody actually wants to watch Graham Norton purely from the way that all viewers stop watching before the introductory credits are over. What is important here is that I get things in exchange: suggestions, content. What appears to be going on here is that a device I bought is sending details of TV watching activity so as to better place adverts on a bit of the screen I paid for, possibly in future even interstitially during the startup of a service like Netflix or iPlayer. I don't appear to have got anything in exchange, and nobody asked me if I wanted the adverts, let alone the collection of the details of myself and my family, including an 11 year old child.]

Graham Norton on iPlayer

Just after Christmas I wandered down to Richer Sounds and bought a new TV, the first in a decade, probably the second TV we've owned since the late 1980s. My goal was a large monitor with support for free-to-air DTV and HD DTV, along with the HDMI and RGB ports to plug in useful things, including a (new) PS3 which would run iPlayer and Netflix. I ended up getting a deeply discounted LG Smart TV, as the "smart" bits came with the monitor that I wanted.

I covered the experience back in March, where I stated that I felt the smart bit was AOL-like in its collection of icons of things I didn't want and couldn't delete, its dumbed-down versions of Netflix and iPlayer, and its unwanted adverts in the corner. But that's it: the Netflix tablet/TV integration compensates for the weak TV interface, and avoids the problem of PS3 access-time limits on school nights, as the PS3 can stay hidden until weekends.


Last week I finally acceded to the TV's "new update available" popups, after which came the "reboot your TV" message. Which I did, only to be told that I had to accept an updated privacy policy. I started to look at this, but gave up after screen 4 of 20+, mentioning it briefly on that social networking stuff (whose operators give me things like Elephant-Bird in exchange for logging my volunteered access -access where I turn off location notification on all devices).

I did later regret not capturing that entire privacy policy by camera, and tried to see if I could find it online, but at the time the search term "LG SmartTV privacy policy" returned next to nothing apart from a really good policy for the LG UK web site, which even goes into the detail of identifying each cookie and its role. I couldn't see the policy after a quick perusal of the TV menus, so that was it.

Only a few days later, Libby Miller pointed me at an article by DoctorBeet, who'd spun Wireshark up to listen to what the TV was saying, showing how his LG TV does an HTTP form POST to a remote site on every channel change, along with details of filenames on USB sticks.

This is a pretty serious change from what a normal television does. DoctorBeet went further and looked at why. Primarily it appears to be for advert placement, either in that corner of the "smart" portal, or at startup after you select "premium" content like iPlayer or Netflix. I haven't seen the latter, which is good -an extra 1.5MB download for an advert I'd have to sit through is not something I'd have been happy with.

Anyway, go look at his article, or even a captured request.

I'm thinking of setting up Wireshark to do the same for an evening. I made an attempt yesterday, but as the TV is on CAT-5 to a 1Gb/s hub, then an Ethernet-over-power bridge to get to the base station, it's harder than I'd thought. My entire wired network is on switched ports so I can't packet sniff, and the 100Mb/s hub I dredged up from the loft turned out to be switched too. That means I'd have to do something innovative like use the WEP-only 802.11b Ethernet-to-WiFi bridge I also found in that box, hooked up to an open WiFi base station plugged into the real router. Maybe at the weekend. A couple of days' logs would actually be an interesting dataset, even if it just logs PS3 activity hours as time-on-HDMI-port-1.
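Once a capture point exists, filtering the TV's chatter out of the traffic needs very little code. Here's a minimal Python sketch of the payload-parsing side; the request path and field names below are invented for illustration, not taken from DoctorBeet's capture:

```python
from urllib.parse import parse_qs

def extract_post_fields(payload: bytes):
    """Return the form fields of an HTTP form POST payload, or None
    if the payload isn't one. Good enough for eyeballing a pcap dump."""
    text = payload.decode("ascii", errors="replace")
    head, sep, body = text.partition("\r\n\r\n")
    if not sep or not head.startswith("POST "):
        return None
    if "application/x-www-form-urlencoded" not in head:
        return None
    return {k: v[0] for k, v in parse_qs(body).items()}

# A synthetic channel-change event in the general shape described
# (hypothetical path, host and fields):
sample = (b"POST /watch/event HTTP/1.1\r\n"
          b"Host: ads.example.invalid\r\n"
          b"Content-Type: application/x-www-form-urlencoded\r\n"
          b"\r\n"
          b"chan_name=BBC+ONE&source=dtv")
print(extract_post_fields(sample))  # {'chan_name': 'BBC ONE', 'source': 'dtv'}
```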

What I did do is go to the "opt out of adverts" settings page DoctorBeet had found, scrolled down and eventually followed some legal-info link to get back to the privacy settings. Which I did photograph this time, and which are now up on Flickr.

Some key points of this policy:

Information considered to be non personally identifiable includes MAC addresses and "information about the live content you are watching".

LG Smart TV Privacy Policy

That's an interesting concept, which I will get back to. For now, note that that specific phrase is not indexed anywhere in BigTable, implying it is not published anywhere that Google can index.
Phrase not found: "information about the live content you are watching"

Or "until you sit through every page with a camera this policy doesn't get out much"

If you have issues, don't use the television

LG Smart TV Privacy Policy

That's at least consistent with customer support.

Anyway, there are a lot more slides. One of them gives a contact, who, when you look him up on LinkedIn, turns out not only to be the head of legal at LGE UK, but one hop away from me: datamining in action.

Now, returning to a key point: Is TV channel data Non-personal information?

Alternatively: If I had the TV viewing data of a large proportion of a country, how would I deanonymize it?

The answer there is straightforward: I'd use the work of Arvind Narayanan and Vitaly Shmatikov in their 2008 paper, Robust De-anonymization of Large Sparse Datasets.

In that seminal paper, Narayanan and Shmatikov took the anonymized Netflix dataset of (viewer->(movie, rating)+) records and deanonymized it by comparing film reviews on Netflix with IMDb reviews, looking for reviews that appeared on IMDb shortly after a Netflix review, with ratings matching or close to those of the Netflix review. They then took the sequence of a viewer's watched movies and looked to see if a large set of their Netflix reviews met those match criteria. At the end of which they managed to deanonymize some Netflix viewers -correlating them with an IMDb reviewer many standard deviations out from any other candidate. They could then use this match to identify those movies which the viewer had seen yet not reviewed on IMDb.
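The core of their technique is an "eccentricity" test: declare a match only when the best-scoring candidate stands far outside the distribution of the others. A toy sketch, with invented scoring weights and thresholds (the paper's actual scoring is more careful):

```python
import statistics

def similarity(record, history):
    """Score how well a reviewer's {movie: (rating, day)} history matches
    an anonymized {movie: (rating, day)} viewing record. Ratings are
    assumed to be on a 1-5 scale; 'day' is days since some epoch."""
    score = 0.0
    for movie, (rating, day) in record.items():
        if movie in history:
            r2, d2 = history[movie]
            score += max(0.0, 1 - abs(rating - r2) / 4)   # close rating
            score += max(0.0, 1 - abs(day - d2) / 14)     # reviewed soon after
    return score

def deanonymize(record, reviewers, threshold=1.5):
    """Name the best-matching reviewer only if they are an outlier:
    best score many standard deviations beyond the runner-up."""
    scores = {name: similarity(record, h) for name, h in reviewers.items()}
    ranked = sorted(scores.values(), reverse=True)
    sigma = statistics.pstdev(ranked[1:]) or 1.0
    if (ranked[0] - ranked[1]) / sigma < threshold:
        return None   # not eccentric enough for a confident match
    return max(scores, key=scores.get)
```

With synthetic data, a reviewer whose ratings and review dates closely track the anonymized record gets picked out; when several candidates score similarly, the function returns None rather than guess.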

The authors had some advantages: both Netflix and IMDb had reviews, albeit on different scales. The TV data doesn't, so the process would be more ad hoc:

  1. Discard all events that aren't movies
  2. Assume that anything where the user comes in later than some threshold isn't a significant "watch event" and discard it.
  3. Assume that anything where the user watches all the way to the end is a significant "watch event" and may be reviewed later.
  4. Treat events where the viewer changes channel some distance into a movie -say 20 minutes- as significant "watch failure" events, which may be reviewed negatively.
  5. Consider watch events where the user was on the same channel for some time before the movie began as less significant than when they tuned in early.
  6. If information is collected when a user explicitly records a movie, a "recording event", treat that as even more significant.
  7. Go through the IMDb data looking for any reviews appearing a short time after a significant set of watch events, expecting higher ratings from significant watch events and recording events, and potentially low ratings from a significant watch failure.
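Those heuristics are simple enough to sketch in a few lines of Python. The field names and thresholds below are invented; heuristic 5 (down-weighting watches where the set was already on the channel) would become a confidence weight rather than a category, and is left out here:

```python
def classify_event(event):
    """Classify one viewing-log entry into None (discard), "watch",
    "watch_failure" or "recording", per the heuristics above.
    `event` fields (all invented for this sketch): is_movie (bool),
    recorded (bool), runtime_min, joined_at_min, left_at_min."""
    LATE_JOIN_MIN = 10      # 2. joining later than this: not significant
    FAILURE_POINT_MIN = 20  # 4. bailing after this long: a watch failure

    if not event["is_movie"]:
        return None                              # 1. only movies matter
    if event["recorded"]:
        return "recording"                       # 6. strongest signal
    if event["joined_at_min"] > LATE_JOIN_MIN:
        return None                              # 2. came in too late
    if event["left_at_min"] >= event["runtime_min"]:
        return "watch"                           # 3. watched to the end
    if event["left_at_min"] >= FAILURE_POINT_MIN:
        return "watch_failure"                   # 4. bailed mid-film
    return None
```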

I don't know how many matches you'd get here -as the paper shows, it's the real outliers you find, especially the watchers of obscure content.

Even so, the fact that it would be possible to identify at least one viewer this way shows that TV watching data is personal information. And I'm confident that it can be done, based on the maths and the specific example in the Robust De-anonymization of Large Sparse Datasets paper.

Conclusion: irrespective of the cookie debate, TV watching data may be personal -so the entire dataset of individual users must be treated this way, with all the restrictions on EU use of personal data, and the rights of those of us with a television.


Foreign News

The cracks all the way to the top of the small feudal island-state of Great Britain became visible this week, as a show trial and revelations about police and state security activities exposed the means by which the regime retains power.

Stokes Croft Royal Wedding Day

For centuries Britain has endured a caste system, where those at the bottom had little education or career prospects, while those in the ruling "upper class" lived an entirely separate life -a life that began with a segregated education from their school, "eton", to their universities, oxford and cambridge, and then employment in "the city" or political power in "parliament". As with the French Polytechniques system, while this guarantees uniformity and consistency amongst the hereditary rulers, the lack of diversity reduces adaptability. Thus the elite of this island have had trouble leading it out of the crises that have befallen it since 2008 -when it became clear that its offshore tax-haven financial system had outgrown the rest of the country. The emergency measures taken after the near-collapse of the country's economy have worsened the lives of all outside a small elite -exacerbating the risks of instability.

This month some of the curtains on the inner dealings of that ruling oligarchy were lifted, giving the rest of the country a glimpse into the corrupt life of the few. A show trial of the editors of a newspaper showed how the media channels -owned by a few offshore corporations- were granted free rein by the rulers, in exchange for providing the politicians with their support and the repetition of a message that placed the blame for the economic woes on the previous administration and outgroups such as asylum seekers and "welfare scroungers".

A disclosure of how the media were creating stories based on intercepting the voicemail messages of anyone of interest forced the government to hand a few of the guilty to the legal system -while hoping that the intimate relationship between these newspaper editors and those in government does not get emphasised. Even so, this scandal has already forced the government to postpone approving a transaction that would give a single foreign oligarch, Murdoch, near absolute control of television and the press. Public clamour for some form of regulation of the press has also forced the regime to -reluctantly- add some statutory limitations to their actions. It remains to be seen what effect this has -and whether the press will exact their revenge on the country's rulers.

A few miles away, in the country's "parliament", the MPs exercised some of their few remaining privileges of oversight. The "plebgate" affair represented a case in which the feared police, "the Met", were grilled over their actions. Normally the Met is given a free hand to suppress dissent and ensure stability across the lower castes, but in "plebgate" the police were caught on CCTV and audio recordings making false accusations about one of the rulers. The thought that "the Met" could turn on their masters clearly terrifies them: the grilling of the police chiefs represents the public part of a power struggle to define who exactly is in charge.

Alongside this, the heads of the state security apparatus were interviewed over the increasingly embarrassing revelations that they had been intercepting the electronic communications of the populace of the country, "the subjects" as they are known. This comes as no surprise to the rulers, who recognise that, with the mainstream media being part of the oligarchy, any form of organised dissent will be online. Monitoring of Facebook and Google is part of this -during the 2011 civil unrest, calls were even made by the press and politicians to disable some of these communications channels. Again, the rulers have to walk a fine line between appearing concerned about these revelations, while avoiding worsening those relationships which are critical for keeping the small hereditary elite in power.

Given the interdependencies between the rulers, the press and the state security forces, no doubt these cracks will soon be painted over. Even so, irrespective of the public facade, it may be a while before the different parts of what is termed "the establishment" trust each other again.


Mavericks and Applications


One action this week was a full OS/X roll on three boxes; fairly event-free. The main feature for me is "better multiscreen support". There's also now an OS/X version of Linux's Powertop; as with powertop, it's more of a developer "your app is killing the battery" tool than something end users can actually do anything with -other than complain.

The other big change is to Safari, but as I don't use that, it's moot.

The fact that it's a free upgrade is interesting -and with Safari being a centrepiece of that upgrade, maybe the goal is to accelerate adoption of the latest Safari and stop people using Firefox & Chrome. The more market share in browsers you have, the more web sites work in your browser -and as Safari is only used on Macs, it can't have more desktop browser market share than the market share Apple have in the PC business itself. A better Safari could maximise that market share -while its emphasis on integration with iPads and iPhones rewards people who live in the single-vendor-device space, leaving us owners of Android phones feeling left out.

One offering that did get headlines was "Free iWork", but that turns out to be "Free on new systems"; if you have an existing OS/X box, you get to pay $20 or so per app -same as before.

Except, if you go to the Apple app store, the reviews of the new suite from existing users are pretty negative: dumbed down to the point where users with existing spreadsheets, documents and presentations are finding things missing -in Keynote, a lot of the fancy "make your presentations look impressive" features are gone.

They're not going to come back, now that iWork is a freebie.

The NRE costs of maintaining iWork are now part of the cost of the Mac -and OS upgrades are going the same way. Even if Apple maintain market share and ASP margins over Windows PCs, the software stack costs have just gone up.

Which means those applications have gone from "premium applications with revenue through the app store", with a business plan of "be compelling enough to sell new copies as well as regular upgrades from our existing customer base", to "bundled stuff to justify the premium cost of our machines".

That's a big difference, and I don't see it being a driver for the iWork suite being enhanced with features more compelling to the experts.

Where Apple are likely to go is cross-device and Apple cloud integration, again to reward the faithful single-vendor customers. Indeed, you do get the free Apple iCloud versions of the iWork apps, which look nice in Safari -obviously. Apple's business model there -upselling storage- does depend on storage demand, but the harsh truth is that it takes a lot of documents to use up the 4GB of free storage. Photographs, now, they do take up space, which clearly explains why the new iPhoto has put work into iPhoto-to-iCloud sharing. Yet it does still retain Flickr sharing, which, with 1TB of storage, must be a competitor to iCloud for public photos, while Facebook remains a destination for private pics.

I wonder whether that Flickr uploader will still be there the next time Apple push out a free update to the OS and applications.
[photo: a line of Tuk Tuks, Tanzania/Kenya Border]


Hadoop 2: shipping!

Deamz & Soker: Asterix

The week-long vote is in- Hadoop 2 is now officially released by Apache!

Anyone who wants to use this release should download Hadoop via the Apache Mirrors.

Maven and Ivy users: the version you want to refer to is 2.2.0
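For a Maven build that means a dependency block along these lines (hadoop-client being the usual aggregate artifact for MapReduce clients):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>
```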

The artifacts haven't trickled through to the public repos as of 2013-10-16 11:18 GMT -but they are on the ASF staging repo, and I've been using them happily all week.

This release marks the end of an epic of development, with YARN being a fundamental rethink of what you can run in a Hadoop cluster: anything you can get to run in a distributed cluster where failures will happen, where the Hadoop FileSystem is the API for filesystem access -be it from Java or a native client- and where data is measured by the petabyte.

YARN is going to get a lot of press for the way it transforms what you can do in the cluster, but HDFS itself has changed a lot. This is the first ASF release with active/passive HA -which is why ZooKeeper is now on the classpath. CDH 4.x shipped with an earlier version of this, and as we haven't heard of any dramatic data loss events, consider it well tested in the field. Admittedly, if you misconfigure things failover may not happen, but that's something you can qualify with a kill -9 of the active namenode service. Do remember to have >1 ZooKeeper instance before you try this -testing ZK failure should also be part of the qualification process. I think 5 is a number considered safer than 3, though I've heard of one cluster running with 9. Nobody has admitted going up to 11.

This release also adds to HDFS:
  1. NFS support: you can mount the FS as an NFSv3 filesystem. This doesn't give you the ability to write anywhere other than the tail of a file -HDFS is still not POSIX. But then neither is NFS: its caching means that there are a few seconds' worth of eventual consistency across nodes (\cite{Distributed Systems, Coulouris, Dollimore & Kindberg, p331}).
  2. Snapshots: you can snapshot part of a filesystem and roll back to it later. Judging by the JIRAs, quotas get quite complex there. What it does mean is that it is harder to lose data through accidental rm -rf operations.
  3. HDFS federation: datanodes can store data for different HDFS namenodes -block storage is now a service- while clients can mount different HDFS filesystems to get access to the data. This is primarily of relevance to people working at Yahoo! and Facebook scale -everyone else can just get more RAM for their NN and tune the GC options to not lock the server too much.
What Hadoop 2 also adds is extensive testing all the way up the stack. In particular, there's a new HBase release coming out soon -hopefully HBase 0.96 will be out in days. Lots of other things have been tested against it -which has helped to identify any incompatibilities between the Hadoop 1.x MapReduce API (MRv1) and Hadoop 2's MRv2, while also getting patches into the rest of the stack where appropriate. As new releases trickle out, everything will end up being built and qualified on Hadoop 2.

Which is why, when you look at the features in Hadoop 2.x, as well as the headline items "YARN, HDFS Snapshots, ...", you should also consider the testing and QA that went into it -this is the first stable Hadoop 2 release, the first one extensively tested all the way up the stack. Which is why everyone doing that QA deserves credit: my colleagues, Matt Foley's QA team, the Bigtop developers, and anyone else working with Hadoop who built and tested their code against the Hadoop 2.1 beta and later RCs -and reported bugs.

QA teams: your work is appreciated! Take the rest of the week off!

[photo: Deamz and Soker at St Philips, near Old Market]


Hadoop: going way beyond Google


Lars George has some nice slides up on Hadoop and its future, Hadoop is dead, long live Hadoop! -but they include a slide that I've seen before, and it concerns me.

It's on p39, where there's a list of papers published by Google, tying them in to Hadoop projects and implying that all Hadoop is is a rewrite of their work. While I love Google Research papers -they make great reads- we need to move beyond Google's published work, because that strategy has a number of flaws.

Time: with the lag from Google using something to writing about it being 2+ years, plus the time to read and redo the work, that's a 4 year gap. Hardware has moved on; the world may be different. Google's assumptions and requirements need to be reassessed before redoing their work. That's if it works -the original MR paper elided some details needed to get it to actually work [1].

Relevance outside Google. Google operate -and will continue to operate- at a scale beyond everyone else. They are doing things at the engineering level -extra checksums on all IPCs- because at their scale the probability of bit errors sneaking past the low-level checksums is tangible. Their Spanner system implements cross-datacentre transactions through the use of GPS to bound time. The only other people doing that in public are telcos trying to choreograph time over their networks.

It's good that Google are far ahead -that does help deal with the time lag of implementing variants of their work. When the first GFS and MR papers came out, the RDBMS vendors may have looked and said "not relevant"; now they will have meetings and presentations on "what to do about Hadoop". No, what I'm more worried about is whether there's a risk that things are diverging -and it's cloud/VM hosting that's a key one. The public cloud infrastructures -AWS, Azure, Rackspace- show a different model of working, and Netflix have shown how it changes how you view applications. That cloud is a different world view: storage, compute and network are all billable resources.

Hadoop in production works best on physical hardware, to keep the costs of storage and networking low -and because, if you try hard, you can keep that cluster busy, especially with YARN to run interesting applications. Even so, we all use cloud infrastructures too, because they are convenient. I have a little one-node Hadoop 2.1 cluster on a Linux VM so I can do small-scale Hoya functional tests even when offline. I have an intermittent VM on Rackspace so I can test the Swift FileSystem code over there. And if you look at what Netflix open source, they've embraced that VM-on-demand architecture to scale their clusters up and down on demand.

It's the same in enterprises: as well as the embedded VMware installed base, OpenStack is getting some traction, and then there is EC2, where dev teams can implement applications under the radar of ops and IT. Naughty, but as convenient as getting a desktop PC was in the 1980s. What does it mean? It means that Hadoop has to work well in such environments, even if virtual disk IO suffers and storage gets more complex.

Other work. Lots of other people have done interesting work. If you look at Tez, it clearly draws on Dryad from MS Research. But there are also opportunities to learn from the Stratosphere project, which assumes a VM infrastructure from the beginning -and builds its query plans around that.

Google don't talk about cluster operations.

This is really important. All the stuff about running Google's clusters is barely hinted at most of the time. To be fair, neither does anyone else talk about it; it's "the first rule of the petabyte club".

[Boukharos08] GCL Viewer - A study in improving the understanding of GCL programs. This is a slightly blacked-out paper discussing Google's General Config Language, a language with some similarities to SmartFrog: templates, cross-references, enough operations to make it hard to statically resolve, debugging pain. That paper looks at the language -what's not covered is the runtime. What takes those definitions and converts them into deployed applications, applications that may now span datacentres?

[Schwarzkopf13]: Omega: flexible, scalable schedulers for large compute clusters. This discusses SLA-driven scheduling in large clusters, and comes with some nice slides.

The Omega paper looks at the challenge of scheduling mixed workloads in the same cluster: short-lived analytics queries, longer processing operations -PageRank &c- and latency-sensitive services. Which of course is exactly where we are going with YARN -indeed, Schwarzkopf et al [2] cover YARN, noting that it will need to support more complex resource allocations than just RAM. Which is, of course, exactly where we are going. But what the Omega paper doesn't do is provide any easy answers -I don't think there are any, and if I'm mistaken then none of the paper's authors have mentioned it [3].

Comparing YARN to Omega, yes, Omega is ahead. But being ahead is subtly different from being radically new: the challenge of scheduling mixed workloads is the new problem for us all -which is why I'm not only excited to see a YARN paper accepted in a conference, I'm delighted to see mentions of Hoya in it. Because I want the next Google papers on cluster scheduling to talk about Hoya [4].

Even so, scheduling is only part of the problem: management of a set of applications with O(1) ops scaling is the other [5]. That is a secret that doesn't get covered, and while it isn't as exciting as new programming paradigms for distributed programming [6], it is as utterly critical to datacentre-scale systems as those programming paradigms and the execution and storage systems that run them [7].

Where else can the Hadoop stack innovate? I think the top of the stack -driven by application need- is key, with the new layers driven by the new applications: the new real-world data sources, the new web applications, the new uses of data. There's also the tools and applications for making use of all this data that's being collected and analysed. That's where a lot of innovation is taking place -but outside of Twitter, LinkedIn and Netflix, there's not much in the way of public discussion of them or sharing of the source code. I think companies need to recognise the benefits of opening up their application infrastructure (albeit not the algorithms, datasets or applications), and get it out before other people open up competitive alternatives that reduce the relevance of their own projects.

[1] I only wrote that to make use of the word elision.
[2] The elision of the other authors to "et al" is why you need 3+ people on a paper if you are the first author.
[3] This is not because I'm utterly unknown to all of them and my emails are being filtered out as if they were 419 scams. I know John from our days at HP Labs.
[4] Even if to call me out for being mistaken and vow never to speak to me.
[5] Where the 1 in the O(1) has a name like Wittenauer.
[6] Or just getting SQL to work across more boxes.
[7] Note that Hortonworks is hiring, we'd love people to come and join us on these problems in the context of Hadoop -and unlike Google you get to share both your work and your code.

[Photo: Sepr on Lower Cheltenham Place, Montpelier]


I for one will not defend our nation's critical infrastructure home wifi base stations

Apparently the UK government is planning to spend lots of money building a "Cyber National Guard", which means that a "reserve guard" of civilians will be available on call when the military needs them to organise "cyber strikes" against enemies or defend the national infrastructure.

Or as the Daily Mail (remember, it's not a real paper and has a history of supporting fascism) says:

A new ‘Cyber National Guard’ of part-time reservists will be open to computer whizzkids who cannot pass the current Territorial Army fitness tests, on the basis that press-ups do not aid computer skills. ‘A TA for computer geniuses’, as Mr Hammond called it.

He poured scorn on ‘crude and bonkers attacks by armchair generals’ who have criticised him for cutting the number of soldiers – and made it clear conventional forces faced more cuts in the switch.

Moon Street

In theory I meet some of the criteria, even if I am fitter than the sneering prejudices of the Daily Mail would think.

However, this does not make me suitable for some national-cyber-guard-thingy, because if you look at the one documented instance of a nation state committing a (peacetime) over-the-net attack on another nation state, Olympic Games, it's clear that this project took person-years of effort to come up with a virus so subtle it could use multiple 0-day exploits to get into Windows, then trickle over to the SCADA-managed industrial machinery by way of USB sticks that were neither near empty nor near full (to make the loss in capacity less obvious). Once there, it would recognise the characteristics of an Iranian enrichment centrifuge and change the spin rates to destroy it -all the while reporting valid parameters to the ops team. That's not an activity of some weekend developers: that is the R&D spend of a government, the integration of gained knowledge of the Iranian enrichment process, the ability to write code to destroy it, the testing of that on real hardware, and the transport mechanism using 0-day exploits and a forged signing certificate. That is what the future of inter-nation-state conflict over the net looks like, and it doesn't depend on script-kiddies running metasploit.

The classic reserve forces model trains people, part time, to be soldiers, a process which roughly consists of
  1. learning how to avoid getting killed.
  2. learning how to shoot at things and people
  3. learning how to follow orders even when the outcome involves a non-zero probability of becoming a KSI statistic.
  4. learning how to use skills 1 & 2 to achieve goals the person shouting the orders wants to achieve.

That training and learning scales well, as shown in the twentieth century by two global conflicts and the Korean war. Teaching people how to code malware to infiltrate and damage other governments' national and commercial infrastructure does not. What does that leave? Botnets are designed to be O(1) scaling, so you don't need regiments of engineers there. Unless it is just script-kiddie work rummaging around opposing computing facilities -but that's something best done in peacetime, to a relaxed schedule (which is presumably why the Chinese govt. do appear to have associates doing that).

As for myself, unless I am needed to write JUnit tests to break the North Korean missile program, well, I'm not useful.

Maybe, therefore, it's not military attack the army wants, it's defending the nation's critical infrastructure.

Which is what exactly?

Because if it's the network, then what you need is not skills in things like configuring telco-scale switches; it's having that network set up with as much intrusion and anomaly detection as possible, with the people on call to handle the support calls. You aren't going to be able to keep part-time computer people around to field calls on that if all they know about network security is that turning off UPnP on the home router is a good idea.

No, network management at the government scale is a skill for the few, not the many. That doesn't mean that those people who do have to look after corporate and national intranets shouldn't be up to date with current tools and thinking w.r.t. defending critical network security from nation states, of which a key one is don't buy routers from a company owned by the Chinese Army.

Beyond the network, well, there's the many desktops out there. I can just about defend my home set of machines from Romanian botnet gangs, primarily by disabling most browser plugins, updating Flash weekly and not running code I don't trust. I also have to defend my home passwords from an 11 year old. Neither skill will give me a chance to defend my machines against a nation state -not unless the attack by the state in question involves a small boy looking over your shoulder as you type in the password to give him extra time on the machine. In that case -and only in that case- would my advice -use some uppercase characters and learn to touch type- be of use.

But going round locking down PCs by deleting security risks like Acroread? Enforcing password policies? These are IT dept things, not something for a team of reservist-cyber-warriors. Even there, though, there's the threat posed by foreign governments sending spear-phished PPT documents containing ActiveX controls with 0-day exploits in them (ActiveX is derived from OLE Control Extensions, which is built on top of Object Linking and Embedding, all atop COM, the Component Object Model). Once the unsuspecting slideware gets opened, whatever payload it goes on to download gets to run.

Where does that leave us? If the outer network is the netops' job, and the desktop the PC IT dept's, what's left? The applications.

It means designing and building applications that don't let you steal millions if you attach a KVM switch to an employee's desktop.

It means designing database apps so that SQL injection attacks never work, irrespective of how the input comes in, and validating that data at entry time, in flight, and when it is stored in the database -so that corruption is detected.
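As a minimal sketch of that entry-time validation idea -the field name and rule here are purely illustrative, not anything from a real app- whitelist what you accept rather than trying to escape what you reject; the SQL layer should still use parameterized statements (PreparedStatement) so that even a missed check can't become an injection:

```java
import java.util.regex.Pattern;

/** Hypothetical entry-time validator: reject anything outside a strict whitelist. */
public class InputValidator {
  // Illustrative rule: account IDs are 8-12 digits, nothing else.
  private static final Pattern ACCOUNT_ID = Pattern.compile("^[0-9]{8,12}$");

  public static boolean isValidAccountId(String input) {
    // null-safe: a missing value is as invalid as a hostile one
    return input != null && ACCOUNT_ID.matcher(input).matches();
  }
}
```

The same check can run again in flight and on read-back from the database, which is what turns validation into corruption detection.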

It means having some way of backing up application state securely, so that if damage is done to it, then you can recover.

It means thinking about security in a more complex way than a generic "user" -with different levels of access to different things, and it means having a defensible audit trail, so that if someone were to download large quantities of an organisation's files and stick them up on wikileaks or hand them to a paper -at least you know what's been taken.

From that perspective, people like myself are more generally useful if we do actually make the things we code as secure as possible, and put in audit trails on the off chance that we can't. And it implies that the government would be better off spending money teaching us to understand Kerberos as well as other aspects of secure computing, rather than having some pool of script-kiddies whose skill stops at metasploit and wireshark, and whose networking understanding stops at home routers. Of course, there lies the problem that some of the course material needed to keep those foreign nation states at bay -"don't trust HTTPS or SSH with 1024 bit keys"- may start the audience asking questions you don't want asked.

Overall, these proposals -as presented in the newspapers- appear to be naively unrealistic and based on a complete misunderstanding of how you can both attack and defend core computing infrastructures -as well as who is capable of, and responsible for, doing so.

Why then do they appear?

I can see a number of possibilities:
  1. Null hypothesis: the politician is clueless, and said things he doesn't understand;
  2. The politician reported something realistic, but the reporter was clueless and reached a conclusion that makes no sense;
  3. Both politician and reporter were clueless, leading to a complete misunderstanding, where the reporter didn't realise the politician was talking bollocks and then exaggerated even that;
  4. The politician knew that what he was talking about was utterly unrealistic, but did it in an attempt to distract the audience and justify other actions.

Given the structure of the talk -to offset cuts in the core "shoot things, achieve the goals ordered, come back alive" bit of the army and reserve forces- he's probably doing #4: divert and justify those actions. But the scary thing is: he may actually believe what he's been saying.


How to update an Ubuntu 12.x box to protoc 2.5.0 and so build Apache Hadoop 2.x+

Deamz @ Mina Road

Here are some working notes on how to push an Ubuntu 12.x machine up to a version of protoc that builds Hadoop. It is a shame that protobuf.jar doesn't include a protoc compiler.
Actually, it's also a shame that protobuf.jar isn't backwards compatible across versions. I'm
now looking at Apache Thrift and thinking "maybe I should adopt that in my own code".

Run these as root:
cd ~
apt-get -y remove protobuf-compiler
curl -# -O https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr
make
make install
protoc --version

Then log in as yourself and again run

protoc --version

The output of the --version commands must be "libprotoc 2.5.0". Any error, or any different version, and your build of Hadoop won't work.


Hoya: coding and testing

Mission District Streets

I've been back in the UK this month, and am busy working on Hoya. The key activity right now is adding support for Apache Accumulo, because it forces me to make the changes needed to go multi-application sooner rather than later -to make Hoya more generic.

This means I've had to go from a simple model of (master, worker+) to a multi-role world one of (master, tablets+, [monitor], [gc]).

A two-role model is straightforward: the AM forks off the master, while all the other containers are automatically workers. Compare the number of containers you have to the desired number, and request/release containers. When you get given a container, or one fails, you know it's always a worker, so update the numbers. Flexing is achieved by having the client specify a new count of workers. Simple.
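The flexing arithmetic of that two-role model can be sketched in a few lines -names here are illustrative, not the Hoya API:

```java
/** Sketch of two-role flexing: how many containers to ask the RM for. */
public class WorkerFlex {
  /**
   * Positive result: containers to request from the RM;
   * negative: surplus workers to release; zero: at desired state.
   * Outstanding (not-yet-granted) requests count against the deficit,
   * otherwise every heartbeat would re-request the same containers.
   */
  public static int containersToRequest(int desiredWorkers,
                                        int liveWorkers,
                                        int outstandingRequests) {
    return desiredWorkers - (liveWorkers + outstandingRequests);
  }
}
```

A failure notification just decrements the live count, after which the same calculation asks for a replacement.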

Multi-role means that not only do you have to have more complex data structures, you need to map containers and container requests to specific roles. That means that when the RM notifies the AM of some new container assignments, the AM has to decide what they are for. The trick there -as suggested by one of my colleagues, who I shall not name- is to use the priority:int field in the request, mapping it to role numbers. The AM then needs to spin up the relevant launcher for that role, and record details about the container so that on any failure notification it knows which role's node count to decrease. The AM then needs to scan the whole role set to decide whether the current state is good, and how to react if not. There's also the detail of configuration: you have to move from simple options, say --workers 4 --workerheap 512M, to per-role numbers and options: --role worker 4 --roleopt worker jvm.heapsize 512M --role monitor 1 --roleopt monitor jvm.heapsize 1G
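The priority-to-role trick boils down to a lookup table. A minimal sketch, with made-up role numbers and names (not the real Hoya/Accumulo constants): each container request carries its role number in the priority field, so when the RM hands back a container, its priority tells the AM which launcher to spin up and which role's count to adjust on failure.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical role table: priority:int in the container request encodes the role. */
public class RoleMap {
  // Illustrative role numbers; the master runs in the AM itself,
  // so only the containerized roles need requests (and hence priorities).
  public static final int ROLE_TABLET = 1;
  public static final int ROLE_MONITOR = 2;
  public static final int ROLE_GC = 3;

  private final Map<Integer, String> roleNames = new HashMap<>();

  public RoleMap() {
    roleNames.put(ROLE_TABLET, "tablet");
    roleNames.put(ROLE_MONITOR, "monitor");
    roleNames.put(ROLE_GC, "gc");
  }

  /** On container assignment or failure: decode the role from the priority. */
  public String roleOfPriority(int priority) {
    String role = roleNames.get(priority);
    if (role == null) {
      throw new IllegalArgumentException("Unknown role priority: " + priority);
    }
    return role;
  }
}
```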

This gets turned into a set of role option maps, which have to be used to configure the specific roles.

I've also had to change how the client flexes the cluster size -from a simple int value to a whole new JSON cluster specification. Which leads to an interesting trick: not only can you tune the size of a cluster by role -you can change parameters like heapsize as you go.

Multi-role then: lots more complexity in managing things.

The other bit of complexity is of course deploying different application types. Here I've introduced the notion of a Provider; something that provides client-side setup services: build initial json spec, preflight checking of specifications, helping to set up the AM with extra resources, and patching configuration directories.

Server side, these providers now need to help manage the roles, with in-VM startup of the master process, and launcher-thread startup of all the remote processes. This is still ongoing work. I did have Accumulo execing, though its startup script needs the env variables HADOOP_PREFIX and ZOOKEEPER_PREFIX as a preamble to its own classloading, which is something I may have to bypass before long on the basis that it is trying to be too clever. But bring up the Accumulo master and the first thing it says is "init me". I may patch Accumulo to add a special "--init" option here, but to keep the number of changes down, and support similar use cases, I've instead gone from a single executable to the ability to run a series of programs in the AM's container. And how have I ended up implementing that? With YARN services.

I've added a new YARN service, ForkedProcessService, which executes a native process when started, stopping itself when that process finishes -and converting a non-zero exit code into service failure with an ExitException with the relevant exit code. Any YARN service that can host children -of which the Hoya AM is one- can create one of these services, add it as a child, configure it and start it running. Registering as a service lifecycle listener lets this parent service get told when the child has finished, and can react to it failing and succeeding.

That works for HBase, with one process "bin/hbase master", but not for Accumulo's potential workflow of [accumulo init, accumulo master]. For that I've had to do another service, SequenceService, which runs its children in sequential order, only starting one when the previous one finishes successfully. Then I have my Provider Services extend that, to get the ability to exec a sequence of commands.
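The SequenceService semantics -run children in order, only starting one when the previous finished successfully- can be modelled in miniature. This is a toy sketch only: the real service works on the YARN service lifecycle with start/stop and listeners, not Callables.

```java
import java.util.List;
import java.util.concurrent.Callable;

/** Toy model of sequence semantics: [accumulo init, accumulo master] style workflows. */
public class Sequence {
  /**
   * Run each child in order; stop at the first failure.
   * @return index of the first failing child, or -1 if all succeeded.
   */
  public static int runInOrder(List<Callable<Boolean>> children) {
    for (int i = 0; i < children.size(); i++) {
      try {
        if (!children.get(i).call()) {
          return i; // child reported failure: don't start the next one
        }
      } catch (Exception e) {
        return i;   // child threw: also a failure
      }
    }
    return -1;      // whole sequence succeeded
  }
}
```

In the real thing, "failure" is a non-zero exit code from the forked process, surfaced as a service failure that the parent observes through the lifecycle listener.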

List<String> commands =
  buildProcessCommand(cd, confDir, env, masterCommand);

ForkedProcessService masterProcess = buildProcess(getName(), env, commands);
CompoundService compound = new CompoundService(getName());
compound.addService(new EventNotifyingService(execInProgress, ...));

The AM can now deploy a provider's in-AM role by telling the provider service to start() and letting it sort the details out for itself. Which is pretty slick: it's services through Hoya, out-of-VM processes for the real work. To achieve this I've had to add two new container services. SequenceService runs a list of services in order, failing when any of the child services fail, and stopping when they are all done. CompoundService runs a set of services in parallel, again propagating failures; once all children have stopped, this parent service does too.

Comparing it to SmartFrog's workflow components, there are obvious similarities, though without the issues those components always had: because child services were potentially remote and only accessible via RMI, you couldn't interact with a stopped service, while the deployment model mostly mandated that everything was done declaratively at deploy time. In Hoya all the wiring up of services is done in code; there's no attempt to be more declarative, and with every service running in-process, keeping an eye on them is much easier. That said, I'm having to do things like pass in callback interfaces -the execInProgress interface is implemented by the AM to let it know that the master process is going live. In SF I'd have stuck a reference to an exported RMI interface into the distributed component description graph, and let this workflow service pull it down. Still, here we avoid RMI -which is not a bad thing.

The other difference is that I am not trying to run any of these service components on any node in the cluster other than the single Hoya Application Master. The goal is to run on each allocated container only the final program Hoya is managing, not any Hoya code at all. YARN does the work of getting the binaries over, starting the process and reporting failure. That assumption of a failure-aware execution layer atop a shared filesystem, with Zookeeper to provide consistent distributed state, means that I can delegate all those details to Hadoop.

Furthermore, I'm only trying to deploy applications that live in a Hadoop cluster -use HDFS for persistent state over a local filesystem, dynamically locate other parts of the application via ZK, and don't treat unannounced termination of a process or loss of a node as a disaster, more a mild inconvenience. This is precisely what today's classic Enterprise Java & Spring Server apps don't do: they use databases, message queues and often build-time URLs -if not build time, then certainly static for the lifespan of the program.

Handling those applications is a pain, because your deployment framework has to choreograph startup, wiring together distributed components of the system from dynamically built-up hostnames and URLs of other component locations, try to execute the (expected) shutdown routines rather than just kill the processes, and, hardest of all, deal with the persistent data problem. Usually that's addressed by having a shared database, with the MQ to route stuff to services.

Testing YARN applications

One thing I should discuss is the challenge of testing YARN applications, which I've now documented a bit on the wiki
  1. MiniYARNCluster is your friend
  2. MiniHDFS cluster loses logs when torn down, but is handy to keep you honest -make sure some tests use it for YARN deployment
  3. Linux VMs are more realistic than an OS X desktop, though there is the overhead of getting protoc updated. Once you've done it -snapshot it.
I now have a VM running in Rackspace Cloud that exists to do the Linux-side D/L and test of Hoya and other Hadoop things -I just push stuff to github when I need to, and (currently manually) trigger a git pull and rebuild. The thought of automating all of that and going to Jenkins has occurred to me -it's the hassle of setting up the network proxies that puts me off, and the challenge of getting all that log data that doesn't get merged in with the Log4J output.

I have thought of how to merge in the log output of the Hoya JVMs running in YARN containers. These JVMs generate data using the commons-logging API (Hadoop-internal) and SLF4J (Hoya), both of which are backed by Log4J. All I need to do is get that Log4J data merged with the output of the Junit JVM.

Merge post-test-run

Output all Log4J records in the JSON format of our little JSON log emitter, then merge the VM results back in with those of the JUnit JVM. The clocks on all the logs will be consistent (it's the same host/VM), so they can be strictly ordered (I often end up using timestamps to compare what's going on in the RM & test runner with the AM). The output would then need to be reworked so that the log entries are presented in a human-readable format, instead of one JSON record per (long) line, with a process identifier included on each line.
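The core of that post-test-run merge is trivial once every record carries a timestamp and a process identifier -a sketch, with field names that are illustrative rather than taken from any real emitter:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of merging per-process JSON log records into one time-ordered stream. */
public class LogMerger {
  public static class Entry {
    public final long timestamp;  // millis; same host clock for all processes
    public final String process;  // e.g. "AM", "RM", "test-runner" (illustrative)
    public final String message;
    public Entry(long timestamp, String process, String message) {
      this.timestamp = timestamp;
      this.process = process;
      this.message = message;
    }
  }

  /** Concatenate all per-process logs, then sort by the shared clock. */
  public static List<Entry> merge(List<List<Entry>> perProcessLogs) {
    List<Entry> all = new ArrayList<>();
    for (List<Entry> log : perProcessLogs) {
      all.addAll(log);
    }
    all.sort(Comparator.comparingLong(e -> e.timestamp));
    return all;
  }
}
```

The human-readable pass then just prints each entry with its process identifier prefixed, rather than one JSON record per line.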

Stream live to the test runner
Use a Log4J back end that forwards log events to a remote process, perhaps the test runner JVM itself. The collector would then output the Log events intermixed with the local events. Again, we need to differentiate the processes.

Back when I wrote the SmartFrog distributed xUnit test runner (JUnit3, SFUnit, ...), we did forward test results over the wire to a collector service, which could then aggregate test results from processes -similar to the work in [Duarte06]. In this design we were running tests against multiple remote implementations, showing the results per test, rather than per implementation, so highlighting where things behaved differently. We also streamed out the results as XHTML, emitting the summary record at the end, rather than trying to buffer everything in the JVM until the end of the run. That in-JVM buffering is needed because the classic Ant XML format sticks the summary in the attributes of the XML document. As a result, if the JVM dies: no output. If the test run generates too much log data: OOM and no output. Streaming XHTML means that you don't OOM from lots of log data, and if the process dies, your browser can still make sense of the details streamed so far. Oh, and with the right style sheet the results were directly viewable in the browser without the overhead of the XSL transform (though that always helps with summary data).

What, then, would I like -but don't have time to implement?
  1. Test runner to collect log events from multiple processes
  2. Test runner to stream these events out to a file, preserving log level and adding host information
  3. Test runner to print to stdout some feedback on what is going on so that someone sitting in front of a screen can see what is happening. This should be in a format for humans
  4. Something to take the streamed output and generate the traditional XML format for the benefit of Jenkins and other CI tools
  5. Something to take that streamed output and generate reports that can intermix the log events from different processes, and, by virtue of the retained log level data, let me show/hide different log levels, so that I can say "show the warnings", leaving info and debug hidden.
This would be quite an undertaking: I actually wonder if you could have some fun with the Kafka back end for Log4J and doing the analysis in Samza. That would be a bit of overhead on a single desktop though -but it could scale well to production tests. The alternative is to have the test runner open some network port (RESTy POST or Hadoop RPC), with Log4J posting events in directly; the runner outputs them to file as JSON records and pushes them to screen using human-log4j, abusing the thread-ID field to identify remote processes. Then add some post-test-run code for the generation of classic XML output, as well as something designed to display multi-process, multi-host test results better. Ideally, have Jenkins display that too.

Like I said: no time. But maybe next year I'll mentor a Google Summer of Code project to do this -it's a great cross-Apache bit of work, with Jenkins and possibly Junit & TestNG in the mix. Target Hadoop, Bigtop and other bits of the Hadoop stack. Yes, a nice project: volunteers with free time invited.

[Photo: somewhere in SF's Mission District: like Stokes Croft except with high-end furniture shops and the graffiti is mostly on the side streets, rather than integrated with the shops]


HDP-2.0 Beta

Hot on the heels of the Apache Hadoop 2.1 beta comes the beta of Hortonworks Data Platform 2.0, which uses those Hadoop 2.1 binaries as the foundation of the platform.


HBase 0.96 is in there, which is the protobuf 2.5-compatible version, as is Hadoop 2.1-beta. It won't be appreciated by the outside world how traumatic the upgrade of protocol buffers from 2.4 to 2.5 was, but it was, believe me. The wire format may be designed to be forward- and backward-compatible, but the protobuf.jar files, and the Java code generated by protoc, are neither. Everything had to be moved forwards to protobuf 2.5 in one nearly-simultaneous move, which, for those of us coding against hadoop branch-2.1 and HBase 0.96, meant a few days of suffering. Or more to the point: a few days holding our versions locked down. On slide 7 of my HBase HUG Hoya talk I give a single line to build HBase for Hoya "without any changes".

As the audience may recall, I diverged into the fact that this only worked on specific versions of Hadoop 2.1, and, if you had Maven 3.1 installed, only by explicitly skipping any site targets when building the tarballs. Even then, I didn't go near how to upgrade protoc on Ubuntu 12.x either.

Anyway, it's done, we can now all use protobuf 2.5 and pretend it wasn't an exercise in shared suffering that went all the way up the stack, all a week before the 2.1 beta came out.

As I was in the US at the time all this took place, I really got to appreciate how hard it is to get a completely tested Hadoop stack out the door -identifying integration problems between versions and addressing them, testing for scale and performance on our test clusters. Everyone -QA, development and project management- was working really hard to get this all out the door, and they deserve recognition as well as a couple of days' rest. It also gives me a hint of how hard it must be for a Linux vendor to get that Linux stack out the door -as things are even more diverse there. It shows why RHEL own the enterprise Linux business -and why they don't rush to push out leading-edge releases.

Returning to the HDP-2 beta, there's a lot of stuff in the product: now that YARN is transforming a Hadoop cluster into a data-centric platform for executing a whole set of applications, things are getting even more exciting: download HDP-2.0 beta and see.

As for me, I note a mention of Hoya in the blog posting, as well as the way HBase appears above and not alongside YARN, which is a bit of a premature layout -either way it means I am going to be busy.

Elephants have right of way


Hadoop 2.1-beta: the elephants have come out to play

After many, many months, Hadoop 2.1 Beta is ready to play with


From Arun's announcement:  Users are encouraged to immediately move to hadoop-2.1.0-beta

Some other aspects of the announcement (with my comments in italic)
  • API & protocol stabilization for both HDFS & YARN:
    Protobuf-format payloads with an API designed to be forward compatible; SASL auth.
  • Binary Compatibility for MapReduce applications built on hadoop-1.x
    This is considered critical -test now and complain if there are problems.
  • Support for running Hadoop on Microsoft Windows
    Irrespective of your view of what server OS/cloud platform to run on, this means all Windows desktops can now talk HDFS & be used for Hadoop-dev.
  • HDFS Snapshots
    Lots of work by my colleagues here, especially Nicholas Tsz Wo: you can take a snapshot of directories and recover from your mistakes later. (JIRA, details).
  • Substantial amount of integration testing with rest of projects in the ecosystem
    That includes the trauma of switching from protobuf 2.4 to 2.5 at the behest of the HBase team, something that led to seven days of trauma earlier this month as we had to do a lock-step migration of the entire cross-project codebase. Credit to all here.
What big things of mine are in there?

Primarily YARN-117: hardening the YARN service model for better resilience to failure and reliable subclassing. There's certainly more than a hint of my old HADOOP-3628 work in there, which is itself somewhat related to SmartFrog -but the YARN service model itself has some interesting aspects that I can't take credit for. I should write it all up. Now that we have a common service API, we could take away the many service entry points and write a single entry point that takes the name of a class, walks it through its lifecycle and runs it. This is precisely what YARN-679 proposes -which is in turn exactly what the Hoya entry point is. I'm using it there in both the entry point and testing, so that I can evolve it based on my experience of using it in real apps.

The HADOOP-8545 openstack module isn't in there -I'd have liked it but at the same time didn't want to cause extra trouble by getting it in. FWIW the patch applies as is, it works properly -anyone can add it to their own build.

Minor tweaks whose implications are profound but nobody has noticed yet
  • HADOOP-9432 Add support for markdown .md files in site documentation. This gives you an enhanced text format for docs that has good editor support, and renders directly in github.
  • HADOOP-9833 move slf4j to version 1.7.5. This is initially for downstream apps that share the classpath, but it adds an option to Hadoop itself: the ability of Hadoop modules to switch from the commons-logging API to the SLF4J one: varargs with level-specific execution of an efficient unformatted printf output. This makes debug statements that much cleaner -and with print statements throughout the codebase, helps it overall.
  • More FS tests and the patches to S3 and S3n to fix some aspects of their behaviour which could lead to loss of data if you got your rename destination wrong.
  • All those patches I've done to trunk since 0.21. Because this release incorporates them: it is the first big ASF release of all the stuff that hasn't been backported. I'd recommend upgrading for my network diagnostics alone.
Even so, these are noise compared to the big pieces of work, of which the key ones are HDFS enhancements and the YARN execution engine, YARN being the most profound.

I was one of the people who +1'd this release, here is what I did to validate the build. Notice that my process involved grabbing the binaries by way of the ASF M2 staging repo: I need to validate the downstream build more than just the tarball.

# symlink /usr/local/bin/protoc to the homebrew installed 2.5.0 version

# delete all 2.1.0-beta artifacts in the mvn repo:

  find ~/.m2 -name 2.1.0-beta -print | xargs rm -rf

# checkout hbase source: from Apache: branch-0.95 (commit # b58d596 )

# switch to ASF repo (arun's private repo is in the POM, with JARs with the same sha1 sum, I'm just being rigorous)
 <id>ASF Staging</id>

# clean build of hbase tar against the beta artifacts

mvn clean install assembly:single -DskipTests -Dmaven.javadoc.skip=true -Dhadoop.profile=2.0 -Dhadoop-two.version=2.1.0-beta

# Observe DL taking place

[INFO] --- maven-assembly-plugin:2.4:single (default-cli) @ hbase ---
[INFO] Assemblies have been skipped per configuration of the skipAssembly parameter.
[INFO] ------------------------------------------------------------------------
[INFO] Building HBase - Common 0.95.3-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: https://repository.apache.org/content/groups/staging/org/apache/hadoop/hadoop-annotations/2.1.0-beta/hadoop-annotations-2.1.0-beta.pom
Downloaded: https://repository.apache.org/content/groups/staging/org/apache/hadoop/hadoop-annotations/2.1.0-beta/hadoop-annotations-2.1.0-beta.pom (2 KB at 3.3 KB/sec)
Downloading: https://repository.apache.org/content/groups/staging/org/apache/hadoop/hadoop-project/2.1.0-beta/hadoop-project-2.1.0-beta.pom


# get sha1 sum of hadoop-common-2.1.0-beta artifact in https://repository.apache.org/content/groups/staging/:
# verify version of artifact in local m2 repo
    $ sha1sum ~/.m2/repository/org/apache/hadoop/hadoop-common/2.1.0-beta/hadoop-common-2.1.0-beta.jar

# in hbase/hbase-assembly/target , gunzip then untar the hbase-0.95.3-SNAPSHOT-bin.tar file

# Patch the Hoya POM to use 2.1.0-beta instead of a local 2.1.1-SNAPSHOT

# run some of the hbase cluster deploy & flexing tests

mvn clean test  -Pstaging

 (all tests pass after 20 min)

Functional tests

# build the Hoya JAR with classpath pulled in

mvn package -Pstaging

# D/L the binary .tar.gz file, and scp to an ubuntu VM with the hadoop conf properties for net-accessible HDFS & YARN services & no memory limits on containers


# stop the running hadoop-2.1.1-snapshot cluster

# start the new cluster services

hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

# Verify in dfshealth.jsp that NN is up, version is built ( 2013-08-15T20:48Z by hortonmu from branch-2.1.0-beta ). (the NN came up in safe mode as a couple of in-flight blocks were missing; NN recovered from this happily after 20s)

# copy the hbase-0.95.3 tarball to HDFS

hdfs dfs -copyFromLocal hbase-0.95.3-SNAPSHOT-bin.tar hdfs://ubuntu:9000/hbase.tar

# unfreeze an hbase cluster that was running on hbase-0.95.2 & Hadoop-2.1.1 (protobuf 2.4) versions & set up to use hdfs://ubuntu:9000/hbase.tar as the hbase image to install

java -jar target/hoya-0.3-SNAPSHOT.jar \
 org.apache.hadoop.hoya.Hoya thaw cl1  --manager ubuntu:8032 \
 --filesystem hdfs://ubuntu:9000

(notice how I have to spec the classname -this is the service launcher set up as the main class for the Hoya JAR; it can run any YARN service)

# verify cluster is running according to Hoya CLI and YARN web GUI

# point browser at HDFS master: http://ubuntu:8080/master-status , verify it is happy

# fault injection: ssh-in and kill -9 all HRegionServers

# verify that within 60s the number of region servers is as desired; YARN notified Hoya of container loss; replacement containers requested and region servers deployed.

# Freeze the cluster:
java -jar target/hoya-0.3-SNAPSHOT.jar org.apache.hadoop.hoya.Hoya\
 freeze cl1 \
 --manager ubuntu:8032 --filesystem hdfs://ubuntu:9000

# verify that hoya list command and YARN GUI show app as finished; HBase cluster is no longer present

This validation shows what YARN can bring to the table: you can not only run stuff in the same cluster, near the data; YARN works with the App Master to notify it of things failing, leaving it to the AM to handle them as it chooses. For Hoya, the action is: except when shutting down, ask for the number of containers needed to keep the cluster at its desired state. This is how we can run HBase clusters inside a YARN cluster. And, if you look at the source code, more is on the way...

Returning to Arun's announcement: download and play with this, especially if you've been using Hadoop 2.0.5 or another pre-2.1 YARN platform. For anyone using 0.23.x in production, it's time to start regression testing during this beta phase, and coming up with an update strategy. For anyone on branch-1 based products, the upgrade is probably more significant -the HDFS improvements justify it irrespective of the features of YARN that matter most in large clusters and heterogeneous workloads. Again: download, start making sure that your code works with it -because these are the last few weeks to find critical bugs in that code before everything gets locked down until the successor release.

[photo: elephants in the Tarangire National Park, Tanzania. Probably the wildest night camping of the trip. Children can't run in the campsite in case they get mistaken for prey, and the driver couldn't get at the truck all night due to the 6 lions sleeping by it].


Publishers: I'm not going to write or review a book on Hadoop for you

This is another of my stock responses, to go with those on LI connection requests and recruiters. This one is for publishers, with a special callout to Packt Publishing:
  1. I agree, I have the knowledge and skills to write an excellent book on Hadoop.
  2. I have no motivation to do so.
  3. I wouldn't publish it through your tier-2 publishing house anyway.
  4. I will not review any of your books, as it is not worth my time.
Therapy By Inkie

Amazon's Print on Demand (HP WebPress!) services appear to have brought some new players to the publishing market, ones with no upfront investment in paying for an initial print run. This lets them get authors to write books and, after a bit of reviewing, stick them up for sale -with no risk. As a result, they have clearly commission-based "author executives" who waste their lives scanning LI for people with Hadoop in their resume, then effectively spamming them.

Every so often, an email hits one of my inboxes noting that I work on Hadoop and wondering if I am interested in writing a book for their -invariably unknown- publishing house.

From: Parita Khedekar <paritak@packtpub.com>
Date: 22 January 2013 12:02
Subject: [Steve], Author a "Big Data Analytics with R and Hadoop" book for Packt.
To: stevel
Hi Steve,

My name is Parita Khedekar and I am an Author Relationship Executive for Packt Publishing. We specialize in publishing IT related books, e-books, and articles that have been written by experts in the field.

We are currently looking out for prospective authors to write our book related to Big Data:

Big Data Analytics with R and Hadoop  aimed at Data Analysts and Scientists using Hadoop who need to take advantage of the R integration in their projects.
Scaling Big Data with Hadoop and Solr aimed at Solr developers who want to know how to leverage the flexible search functionality of Apache Solr and the Big Data processing of Apache Hadoop, to create the indexes for both general search and augmented data analytics.

Given your experience with this technology, I was wondering if you would be interested in authoring either of this titles.

Looking forward to hear from you and do let me know if you have any queries or doubts.


Parita Khedekar
Author Relationship Executive
PACKT Publishing

The answer to this is, "no", for the following reason: I have already written a successful book on an open source project.

I know precisely how much effort it takes to update a good book: a lot. A book on software is effectively a software artefact, code that is required to work, along with consistent documentation. The more ambitious the first edition, the higher the maintenance costs- unless it has somehow been designed for maintenance up front.

Writing a paper of 10 pages can be done in under a week, revisions included. A chapter of 20 pages doesn't take two weeks, it takes 3-4. A book of 10 chapters doesn't take 10 * 3 weeks, it takes months more. Books, like software, are not O(n) products.

I know precisely how rapidly a book on a popular open source project goes out of date: at the release rate of the software. Whereas a book on a closed source product, say MS Office 2010, is entirely in sync with the product for its entire lifespan, a book on an OSS app ages visibly with every release. This increases pressure for timely updates.

I know the ROI issues with writing a book, namely how many hours it takes, and how much you get in return: not much per hour of work. I also know that publishers don't have to care about how many hours the authors put in, as long as they meet their deadlines. Their ROI equations are based on the cost of post-authoring actions: review management, typesetting and printing. With print on demand, printing risk is eliminated, leaving only typesetting and reviewing. Skip quality typesetting and get reviewers to work for free and you have no costs other than some minion to track the authoring process.

I also know -and this is relevant to anyone going "hey, how would you like to write a book"- people at multiple publishers: Manning Press, as well as people at O'Reilly. Both are brands with reputations for quality books.

So, for anyone inviting me to take up the opportunity to write a book on Hadoop for them:
 go away.

Regarding reviewing, it adds more work to my life for no benefit. Usually the reward is a PDF or hardcopy of the book. But consider this: a PDF costs the publishers $0, so you are being paid $0/hour. Even if you get the hardcopy, a $30 book would cost $15 to print -for a 2h review you are being paid $7.50/hour. I consider my time more valuable than this.

There is one exception: if the book is by someone I know or work with, I may be able to put aside the time to do the reviewing -such people can contact me directly.

Therapy By Inkie

Finally, the author talent acquisitions team should take the same advice as recruiters: do your research.

Take this approach:

From: Anish Sukumaran <anishs@packtpub.com>
Date: 15 July 2013 12:32
Subject: [Steve], Author a 110 page book 'Apache ZooKeeper Administrator's Guide ' for Packt Publishing
To: stevel@hortonworks.com

Hello Steve,

My name is Anish and I am an Author Acquisition Executive at Packt Publishing. Packt is a rapidly growing, dedicated IT book Publishing firm and has rolled out more than thousand books on various titles till date.

Packt is now planning to publish a book titled as 'Apache ZooKeeper Administrator's Guide ' which would be a 110 page micro book and in the process of seeking potential authors to work on it , I also came across your Presentations on Slideshare. it is evident that you have a commendable experience and knowledge in this area.

It would be my pleasure to invite you to write this book for us.

Do let me know your decision and also if you have any queries, I will be happy to answer them.

Looking forward to hear from you.

Kind regards,
Anish Sukumaran
Author Acquisition Executive
PACKT Publishing
MSN: anishs@packtpub.com

This "Author Acquisition Executive" has been delving through Slideshare, looking at the presentations to see if they could identify possible authors, before approaching them.

Yet clearly, they have failed to do two things:

1. Look up some internal spreadsheet of people who have already turned down a publishing opportunity with the publisher.

2. Go to Amazon and enter my name -because if they had done so they'd know I was already an author, perfectly capable of getting a book published by a quality publisher if I chose to sit down and write one. Therefore either I was already doing it for some other publishing house, or I was not in the mood to destroy all my free time for the next 12 months by writing one.

Henceforth anyone spamming me with a publishing opportunity from a near-unknown print-on-demand shop will not only get sent this link; their approach will be added here and ridiculed.

(photos: Therapy by Inkie, commissioned piece for the hairdressers by the Highbury Vaults, captured at sunset)


Hoya: HBase on YARN

I didn't go to the Hadoop Summit, though it sounds really fun. I am having lots of fun at a Big Data in Science workshop at Imperial College instead, looking at problems like "will the code to process my data still work in 50y?", as well as the problems that the Square Kilometre Array will have (10x the physics dataset, sources across a desert, datacentre in the desert too).

What did make it over to the Summit is some of my code, the latest of which is Hoya, HBase on YARN. I have been busy coding this for the last four weeks:
Outside office

Having the weather nice enough to work outside is lovely. Sadly, the wifi signal there is awful, which doesn't matter until I need to do maven things, when I have to run inside and hold the laptop vertically beneath the base station two floors above.

Coding Hoya at the office

It's not that readable, but up on my display is the flexing code in Hoya: the bit in the AM that handles a request from the client to add or remove nodes. It's wonderfully minimal code: all it does is compare the (possibly changed) number of worker nodes wanted with the current value, and decide whether to ask the RM for some more nodes (using the predefined memory requirements of a Region Server), or to release nodes -in which case the RM will kill the RSs, leaving the HBase master to notice this and handle the loss.
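That decision can be sketched in a few lines of Java -hypothetical names and a made-up memory constant, not the actual Hoya source:

```java
// A minimal sketch of the Hoya AM flexing decision: compare the desired
// worker count with the current one and decide what to ask of the RM.
// Names and the memory figure are illustrative, not the real Hoya code.
public class FlexSketch {
    /** Predefined memory to request per Region Server container (illustrative). */
    static final int REGION_SERVER_MEMORY_MB = 1024;

    /** @return positive: containers to request; negative: containers to release. */
    static int flex(int desiredWorkers, int currentWorkers) {
        return desiredWorkers - currentWorkers;
    }

    public static void main(String[] args) {
        int delta = flex(5, 3);
        if (delta > 0) {
            System.out.println("request " + delta + " containers @ "
                    + REGION_SERVER_MEMORY_MB + "MB each");
        } else if (delta < 0) {
            // the RM kills the released RSs; the HBase master handles the loss
            System.out.println("release " + (-delta) + " containers");
        } else {
            System.out.println("no change");
        }
    }
}
```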

Asking for more nodes leaves the YARN RM to satisfy the request, after which it calls back to the AM saying "here they are". At which point Hoya sets up a launch request containing references to all the config files and binaries that need to go to the target machine, and a command line that is the hbase command line. There is no need for a Hoya-specific piece of code running on every worker node; YARN does all the work there.
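The essence of that launch request -files to localise plus a command line, nothing Hoya-specific on the worker- can be sketched like this (purely illustrative names and file lists, not the YARN or Hoya API):

```java
import java.util.*;

// Hypothetical sketch of what goes into a container launch request:
// a list of resources YARN should copy to the worker, and the command
// to exec there. Illustrative only -the real YARN API differs.
public class LaunchSketch {
    static Map<String, Object> buildLaunchContext(List<String> localResources,
                                                  String command) {
        Map<String, Object> ctx = new LinkedHashMap<>();
        ctx.put("resources", localResources); // config files + binaries to ship
        ctx.put("command", command);          // plain hbase command line
        return ctx;
    }

    public static void main(String[] args) {
        Map<String, Object> ctx = buildLaunchContext(
                Arrays.asList("conf/hbase-site.xml", "hbase.tar.gz"),
                "hbase regionserver start");
        System.out.println(ctx);
    }
}
```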

Some other aspects of Hoya for the curious
  • Hoya can take a reference to a pre-installed HBase instance, one installed by management tools such as Ambari, or kickstart installed into all the hosts. Hoya will ignore any template configuration file there, pushing out its own conf/ dir under the transient YARN-managed directories, pointing HBase at it.
  • Although HBase supports multiple masters, Hoya just creates a single master, exec'd off the Hoya AM. Any extra masters would simply be waiting for ZK to give them a chance to go live -they're there for failure recovery. It's not clear we need that, not if YARN restarts the AM for us.
  • Hoya remembers its cluster details in a ~/.hoya/clusters/${clustername} directory, including the HBase data, a snapshot of the configuration, and the JSON file used to specify the cluster.  You can machine-generate the cluster spec if you want.
  • The getClusterStatus() AM API call returns a JSON description of the live cluster, in the same JSON format. It just adds details about every live node in the cluster. It turns out that classic Hadoop RPC has a max string size of <32K, so I'll need to rework that for larger clusters, or switch to protobuf, but the idea is simple: the same JSON structure is used for both the abstract specification of the cluster, and the description of the instantiated cluster. Some former colleagues will be noting that's been done before, to which the answer is "yes, but this is simpler and with a more structured format, as well as no cross-references".
  • I've been evolving the YARN-679 "generic service entry point" for starting both the client and services. This instantiates the service named on the command line, hooking up signal handling to stop it. It then invokes -if present- an interface method, int runService(), to run the service, exiting with the given error code. Oh, and it passes down the command line args (after extracting and applying conf file references and in-line definitions from them), before Service.init(Config) is called. This entry point is designed to eliminate all the service-specific entry points, but also to provide some in-code access points -letting you use it to create and run a service from your own code, passing in the command line args as a list/varargs. I used that a lot in my tests, but I'm not yet sure the design is right. Evolution and peer-review will fix that.
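The entry point idea can be sketched roughly like so -a hypothetical simplification with made-up names, not the YARN-679 patch itself:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification of a generic service entry point in the
// YARN-679 style; names and signatures are illustrative, not the real patch.
interface LaunchableService {
    void init(Map<String, String> conf);
    void start();
    /** Run the service; the return value becomes the process exit code. */
    int runService() throws Exception;
}

// Trivial example service to show the lifecycle.
class EchoService implements LaunchableService {
    public void init(Map<String, String> conf) { }
    public void start() { }
    public int runService() { return 0; }
}

public class ServiceLauncher {
    /** Instantiate the service named on the command line, init, start, run. */
    static int launch(String classname, Map<String, String> conf)
            throws Exception {
        LaunchableService service = (LaunchableService)
                Class.forName(classname).getDeclaredConstructor().newInstance();
        service.init(conf);          // conf applied before anything runs
        service.start();
        return service.runService(); // exit code comes from the service
    }

    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "EchoService";
        System.exit(launch(name, new HashMap<String, String>()));
    }
}
```

The in-code access point is just `launch()` itself: tests can call it directly with a class name and a config map instead of going through `main`.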

Developing against YARN in its last few weeks of pre-beta stabilisation was entertaining -there was a lot of change. A big piece of it -YARN-117- was my work; getting it in meant that I could switch from the fork I was using to that branch, after which I was updating hadoop/branch-2, patching my code to fix any compile issues, and retesting every morning. Usually: seamless; one day it took me until mid-afternoon for it all to work, and an auth-related patch on Saturday stopped test clusters working until Monday. Vinod was wonderfully helpful here, as was Devaraj with testing 50+ node clusters. Finally, on the Tuesday, the groovyc support in Maven stopped working for all of us in the EU who caught an incompatible dependency upgrade first. To their credit the groovy dev team responded fast, not only with a full fix out by the end of the day, but with some rapid suggestions on how to get back to a working build. It's just when you are trying to get something out for a public event that these things hit your schedule: plan for them.

Also: Hoya is written in a mix of Java (some of the foundational stuff) and Groovy -all the tests and the AM & client themselves. This was my second attempt at a Groovy YARN app, "Grumpy" being my first pass back in spring 2012, during my break between HP Labs and Hortonworks. That code was never finished and too out of date to bother with; I started with the current DistributedShell example and built on that -while tracking the changes made to it during the pre-beta phase and pulling them over. The good news: a big goal of Hadoop 2.1 is stable protobuf-based YARN protocols, with stable classes to help.

Anyway, Hoya works as a PoC; we should be letting it out for people to play with soon. As Devaraj has noted: we aren't committed to sticking with Groovy. While some features were useful -lists, maps, closures, and @CompileStatic, which finds problems fast as well as speeding up code- it was a bit quirky and I'm not sure it was worth the hassle. For other people about to write YARN apps, have a look at Continuuity Weave and see if that simplifies things.

P.S: we are hiring.


Tilehurst? Where is Tilehurst and why does google maps care about it?

Google are being asked hard questions in Parliament about their UK tax setup.

I think the politicians are missing an opportunity to ask them the question that I'm always wondering: where is Tilehurst, and why does google maps think it is so special?

Here is a google maps view of the UK

Google mapview UK
It has Bristol on it, but not Portsmouth or Cardiff. It's always a mystery in Bristol why Pompey gets a dot on the BBC weather map, as does BRS's nearby rival, Cardiff. In the google map, Edinburgh and Manchester are the ones being left out.

But that is nothing compared to the Tilehurst question. Specifically : why?

Look what happens when you click to zoom in one notch.
Tilehurst? Where is tilehurst?
Edinburgh exists, along with pretty much everything north of there excluding Mallaig -which is somewhere all visitors to Scotland should include when laying out an itinerary.

And what is there between Bristol and London? One town merits a mention. Tilehurst.

Apart from this mention of Tilehurst, I have no data on whether or not this town actually exists. It's not on any motorway exit signs on the M4; there are no train stations, no buses from Bristol. I have never heard it mentioned in any conversation whatsoever.

Why then does Google Maps think that it is more important than, say, Reading, which meets all of the above criteria (admittedly, never in conversations that speak positively of it), or Oxford, which people outside the UK have actually heard of?

No, Tilehurst it is.

It could be some bizarre quirk of the layout algorithm that picks a random place, ignoring things like nearby population numbers, M-way exit signs, mentions in PageRank, or knowledge of public transport.

I think it could just be some spoof town made up to catch out people who have been copying map data from google maps without attribution. If some map or tourist guide mentions Tilehurst, the google maps team will know they are using Google map data and immediately demand some financial recompense, routed through the Ireland subsidiary.

There's only one way to be sure: using this resolution map as the cue, drive there and see what it is.


Strava bringeth bad news

I'm in the bay area right now, and the new owner of a Google Nexus phone, which is very good at integrating with Google apps, including calendar, contacts and mail. It also runs Strava, the de facto standard app for logging your cycling, then uploading the results to compare with others. I'm assuming it's running Hadoop at the back end, given their Platform Product Manager is one of the ex-Yahoo! Hadoop team.

If this is the case, Hadoop is indirectly bringing me bad news.

Yesterday I went out on the folding bike and climbed the Santa Cruz mountains, west of the Bay Area flatlands.
Old la honda map segment
It's a great steep climb from behind Palo Alto up to the skyline ridge, narrow and free from almost all traffic bar the many other locals who felt that Saturday morning was too nice to waste. For me: 26 minutes climbing, 40s of rest -fast enough to come in the top 50% of the day of everyone else running the app, and hence 2233 of 5810 of everyone who has ever done it. Not bad work.

Good news from strava: 62/145 on Old La Honda
If there's a warning sign, it is that people faster than me have a quoted average power output less than my 257W -and as they all took less time, their total exertion is less than my 412kJ. Why do those ahead of me come in lower despite putting out less power? Because they are carrying less excess weight.
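The arithmetic there is just energy = average power × time; a quick check of the quoted figures (assuming roughly 26:40 of elapsed time):

```java
// Work done on a climb is average power x elapsed time.
// 257W held for roughly 26min40s comes to about 411kJ, matching the
// ~412kJ Strava reports once you allow for rounding of the averages.
public class ClimbEnergy {
    static double kilojoules(double watts, double seconds) {
        return watts * seconds / 1000.0;
    }

    public static void main(String[] args) {
        double seconds = 26 * 60 + 40;   // 26:40 elapsed
        System.out.printf("total work: %.0f kJ%n", kilojoules(257, seconds));
    }
}
```

Anyone finishing faster at lower average watts has done strictly less work over the same climb -and since the height gained is fixed, the difference has to be mass.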

If I'd stopped there and descended (carefully, that bike's 20" rims overheat), I could have got back and felt relatively smug.

Only this time I descended the far side and climbed back up "Alpine West" -first time ever. And it destroyed me.
Bad news from strava: 28/29 on west alpine
It was long. I stopped for lunch partway up, but needed that rest, and continued, expecting the overall climb to be on a par with the earlier one. It wasn't. In fact the total climb was double: 600m. Which I was not ready for. Unlike the morning, where I'd got to pass lots of people, here the road was near empty -and those people I did meet were going past me. The rate of ascent, 562m/hour, is less than the 600m/hour rate we used to plan for when crossing the Alps -a rate sustained over a week or more, carrying panniers. Not today.

The message from Strava then, which means something Hadoop worked out, is that I am overweight and completely lacking in endurance. It doesn't quite spell it out, but the graphs carry the message.

This is datamining at work.


Software updates are the bane of VMs -and Flash is its prophet

It's the first Tuesday of the month, so it's Flash update time. Three critical patches, where "critical" means "if you don't update, your computer will belong to someone else".

Adobe Flash Install Screen

These Flash updates are the bane of my life. I have to update the three physical machines in the house, and with two of them used by family members, I can't ignore updating any of them.

The workflow for Flash updates is:
  1. Open Settings manager, find the flash panel, start that, get it to check for an update.
  2. If there is one, it brings up the "a new update is available, would you like to install it"? dialog.
  3. Flash Control Panel opens up firefox with a download page: start that download.
  4. Close down Firefox
  5. Close down Chrome
  6. Close down the Flash Control Panel (if still present)
  7. Close down the settings manager
  8. Find the flash .dmg file in ~/Downloads
  9. open it
  10. click on the installer
  11. follow its dialog
  12. eject the mounted .dmg image
  13. restart your browsers. This is always a good time to look for Firefox updates too, then check if it recommends any other browser updates.
  14. For all gmail logins, redo the two-level auth.
That's a repeat 3x operation, with the extra homework that on a multi-login machine, I have to "sudo killall firefox && sudo killall chrome" the other user's browser instances to make sure that the update has propagated (the installer doesn't block if these are running, as it doesn't look for them).

Then come the VMs. Two Windows boxes stripped down to the minimum: no Flash, no MS Office, no Firefox, but Chrome and IE. IE set up to only trust adobe.com, microsoft.com and Windows Update, where trust is "allow installed AX controls".

Manual updates there too, with the MS patch also potentially forcing restarts.

This shows the price of VMs: every VM needs to be kept up to date. The number of machines I have to update is not O(PCs), it is O(PCs × (1 + VMs/PC)).

Most of the VMs are on my machine: one Linux VM for native code builds, other VMs for OpenStack, more for a local Linux-HA cluster.

There it is simpler: "yum -y update && shutdown -h now", or the equivalent "apt-get update && apt-get -y upgrade".

Which shows why Linux makes the best OS for my VMs. It's not so much the cost, or the experience, but the near-zero-effort update-everything operation. 

It also shows where Apple's "app-store" world view is limited. Because new App Store apps must be sandboxed and save all state to their (mediocre) cloud, there's no way for the App Store to update browsers or the plugins integrated with them. Which leaves two outcomes:
  1. Someone needs to go to each mac and go through steps 1-14 above.
  2. They don't get updated, and end up being 0wned.
It's easy to fault Apple here, but it really reflects a world view we have for software in general: "out of band security updates are so unlikely we don't need to make it easy". Once we switch to assuming that there may be an emergency patch any day of the week, we start thinking "how would I do this as a background task?" -which is something all of us need to consider.