2018-10-05

Java's use of Checked Exceptions cripples lambda-expressions



I like lambda-expressions. They have an elegance to them which, when I put into my code along with comments using the term "iff", probably marks me out as a Computer Scientist; the way people who studied Latin drop random phrases into sentences to communicate more precisely with others who did the same. Here, rather than use phrases like "sui generis", I can drop in obscure references to Church's work, allude to "The Halting Problem" and say "tuple" whenever "pair" wouldn't be elitist enough.

Jamaica street, September 2018
Really though, lambda-expressions are nice because they are a way to pass around snippets of code to use elsewhere

I've mostly used this in tests, with LambdaTestUtils.intercept() being the code we've built up to use them, something clearly based on ScalaTest's work of the same name.

protected void verifyRestrictedPermissions(final S3AFileSystem delegatedFS)
    throws Exception {
  intercept(AccessDeniedException.class, 
      () -> readLandsat(delegatedFS));
}
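
For anyone who hasn't met the pattern, here is a minimal sketch of what an intercept() helper can look like. It is not the Hadoop LambdaTestUtils source: the class name InterceptSketch and the exact signature are assumptions made for illustration. The trick is that Callable.call() is declared as throwing Exception, so the lambda is free to raise checked exceptions.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

/** Illustrative only: not the Hadoop LambdaTestUtils implementation. */
final class InterceptSketch {

  private InterceptSketch() {
  }

  /**
   * Evaluate the closure and expect it to throw an exception of the given
   * class. The caught exception is returned so the test can inspect it.
   */
  static <T, E extends Throwable> E intercept(Class<E> clazz, Callable<T> eval)
      throws Exception {
    T result;
    try {
      result = eval.call();
    } catch (Throwable t) {
      if (clazz.isInstance(t)) {
        return clazz.cast(t);
      }
      // Not the expected exception: rethrow it unchanged where possible.
      if (t instanceof Exception) {
        throw (Exception) t;
      }
      if (t instanceof Error) {
        throw (Error) t;
      }
      throw new RuntimeException(t);
    }
    // The closure returned normally, which is a test failure.
    throw new AssertionError(
        "Expected " + clazz.getName() + " but got result: " + result);
  }

  public static void main(String[] args) throws Exception {
    FileNotFoundException e = intercept(FileNotFoundException.class,
        () -> { throw new FileNotFoundException("missing"); });
    System.out.println("caught: " + e.getMessage());
  }
}
```

Returning the exception rather than swallowing it lets a test go on to assert on the message or the cause.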

I'm also working on wiring up the UserGroupInformation.doAs() call to l-expressions, so we don't have to faff around creating over-complex PrivilegedAction subclasses, instead go bobUser.do(() -> fs.getUsername()). I've not done that yet, but have the stuff in my tests to explore it: doAs(bobUser, () -> fs.getUsername()).

Java-8 has embraced this, with its streams API, Optional class, etc. I should be able to do the same elegant code in Java 8 that you can do in Scala, such as on an Optional<UserGroupInformation> instance: no more need to worry about null pointers!

Optional<Credentials> maybeCreds = maybeBobUser.map((b) -> b.getCredentials());

And I can do the same on those credentials:

List<TokenIdentifier> ids = maybeCreds
    .map(Credentials::getAllTokens)
    .orElse(new LinkedList<>()).stream()
    .map(Token::decodeTokenIdentifier)
    .collect(Collectors.toList());

Except, well, I can't. Because of checked exceptions. That Token::decodeTokenIdentifier method can raise IOException instances whenever there's a problem decoding the byte array which contains the token identifier (it can also return null for other issues; see HADOOP-15808).

All Hadoop API calls which do some kind of network or IO operation declare they throw an IOException when things fail. It's consistent, it works fairly well. Sometimes, when interacting with underlying libraries (AWS SDK, Azure SDK), we catch & map their exceptions, but we also do other error translation there too, then feed that into retry logic and things even out. When you call getFileStatus() against s3a:// or abfs:// you can be confident that if it's not there you'll get a FileNotFoundException; if there was some connectivity issue our code will have retried, provided it wasn't something unrecoverable like a DNS/Routing problem, where you'll get a NoRouteToHostException in your stack traces.

Checked exceptions are everywhere in the Hadoop code.

And the Java Streams API can't work with that. All the operations on a stream don't declare that they raise exceptions, so none of the lambda-expressions you can call on them may either. I could jump through hoops and catch & convert them into some RuntimeException, but then what? All the code which is calling mine expects failures to come as IOExceptions, expects those FileNotFoundExceptions, etc. We cannot make serious use of the new APIs in our codebase.
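
To make the hoops concrete, here is a sketch of the wrap-then-unwrap dance using the JDK's java.io.UncheckedIOException. The class WrapAndUnwrap and its decode() method are hypothetical stand-ins for an IO-bound call, not Hadoop code. Callers still get their IOException, but only because every stream pipeline is wrapped in boilerplate like this.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Sketch of the wrap/unwrap hoops; all names here are hypothetical. */
final class WrapAndUnwrap {

  private WrapAndUnwrap() {
  }

  /** Stand-in for an IO-bound call such as decoding a token. */
  static String decode(String raw) throws IOException {
    if (raw.isEmpty()) {
      throw new FileNotFoundException("nothing to decode");
    }
    return raw.toUpperCase(Locale.ROOT);
  }

  /** Decode every entry, surfacing failures as the original IOException. */
  static List<String> decodeAll(List<String> raw) throws IOException {
    try {
      return raw.stream()
          .map(s -> {
            try {
              return decode(s);
            } catch (IOException e) {
              // Stream.map() cannot throw checked exceptions: wrap it.
              throw new UncheckedIOException(e);
            }
          })
          .collect(Collectors.toList());
    } catch (UncheckedIOException e) {
      // Unwrap so that FileNotFoundException etc. reach the caller intact.
      throw e.getCause();
    }
  }
}
```

It works, but the try/catch inside the lambda is longer than the logic it protects, and every pipeline needs its own copy.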

Now, the Oracle team could just have declared that the new map() method raised Exception or similar, but then it'd have been unusable in those methods which don't declare that they throw exceptions, or those which, say, throw IOExceptions.

There's no obvious solution to this with those standard Java classes, leaving me the options of (a) not using them or (b) writing my own, which is something I've been doing in places. I shouldn't have to do that; all it does is create maintenance pain, and it doesn't glue together with those standard libraries.
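
As a rough illustration of option (b), a home-grown functional interface might look like the sketch below. The names FunctionRaisingIOE and IOLists are assumptions for this example, not classes taken from the Hadoop source.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical home-grown interface: like Function, but IO-aware. */
@FunctionalInterface
interface FunctionRaisingIOE<T, R> {
  R apply(T t) throws IOException;
}

/** Hypothetical helper built on that interface. */
final class IOLists {

  private IOLists() {
  }

  /** A map() over a list which lets IOExceptions propagate unchanged. */
  static <T, R> List<R> mapAll(
      List<T> source,
      FunctionRaisingIOE<? super T, ? extends R> fn) throws IOException {
    List<R> result = new ArrayList<>(source.size());
    for (T t : source) {
      result.add(fn.apply(t));
    }
    return result;
  }
}
```

The lambda can now throw IOException directly, but nothing in java.util.stream accepts a FunctionRaisingIOE, so this only helps inside code written against it.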

I don't have a choice. And neither does anyone else using Java. Scala doesn't have this problem as exceptions aren't checked. Groovy doesn't have this problem as exceptions aren't checked. C# doesn't have this problem as exceptions aren't checked. Java, however, is now trapped by some design decisions made twenty+ years ago which seemed a good idea at the time.

Is there anything Oracle can do now? I don't know. You could change the compiler to say "all exceptions are unchecked" and see what happens. I suspect a lot of code will break. And because it'll be on the failure paths where problems surface, it'd be hard to get that test coverage to be sure that failures are handled properly. Even so, I can imagine that happening, otherwise, even as the language tries to "stay modern", it's crippled.

2018-04-02

Computer Architecture and Software Security


Gobi's End
There's a new paper covering another speculative execution-based attack on system secrets, BranchScope.

This one relies on the fact that for branch prediction to be effective, two bits are generally allocated to each branch, giving four states: strongly taken, weakly taken, weakly not taken and strongly not taken. The prediction state of a branch is the value in BranchHistoryTable[hash(address)], and it is used to choose the speculation; if the prediction was wrong, a strong state moves to weak, and a weak state moves to the opposite direction. Similarly, in weakly taken/not taken, if the prediction was right, then it moves to strong.

Why so complex? Because we loop all the time
for (int i = 0; i < 1000; i++) {
  doSomething(i);
}

Which probably gets translated into some assembly code (random CPU language I just made up)

    MOV  r1, 0
L1: CMP r1, 999
    JGT end
    JSR DoSomething
    ADD r1, 1
    JMP  L1
end: ... continue

For 1000 times in that loop, the branch is taken, then once, at the end of the loop, it's not taken. The first time it's encountered, the CPU won't know what to do; it will just guess one of them and have a 50% chance of being wrong (see below). After that first iteration though it'll guess right, until the final test fails and the loop is exited. If that loop is itself called repeatedly, the fact that the final iteration was mispredicted shouldn't lose the fact that the rest of the loop was predicted correctly. Hence, two bits.
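
A toy simulation makes the "hence, two bits" point concrete. This is a sketch of a textbook two-bit saturating counter, not a model of any real CPU; the class name and the choice of "weakly not taken" as the starting state are assumptions.

```java
/** Toy two-bit saturating counter: states 0..3, where 2 and 3 predict taken. */
final class TwoBitPredictor {
  static final int STRONG_NOT_TAKEN = 0;
  static final int WEAK_NOT_TAKEN = 1;
  static final int WEAK_TAKEN = 2;
  static final int STRONG_TAKEN = 3;

  private int state;

  TwoBitPredictor(int initialState) {
    this.state = initialState;
  }

  boolean predictTaken() {
    return state >= WEAK_TAKEN;
  }

  int state() {
    return state;
  }

  /** Record the real outcome; returns true if the prediction was wrong. */
  boolean update(boolean taken) {
    boolean mispredicted = predictTaken() != taken;
    if (taken) {
      state = Math.min(STRONG_TAKEN, state + 1);
    } else {
      state = Math.max(STRONG_NOT_TAKEN, state - 1);
    }
    return mispredicted;
  }

  /** A loop branch taken `iterations` times, then not taken once on exit. */
  static int mispredictionsForLoop(TwoBitPredictor p, int iterations) {
    int misses = 0;
    for (int i = 0; i < iterations; i++) {
      if (p.update(true)) {
        misses++;
      }
    }
    if (p.update(false)) {
      misses++;
    }
    return misses;
  }

  public static void main(String[] args) {
    TwoBitPredictor p = new TwoBitPredictor(WEAK_NOT_TAKEN);
    System.out.println("first run:  " + mispredictionsForLoop(p, 1000));
    System.out.println("second run: " + mispredictionsForLoop(p, 1000));
  }
}
```

Starting from "weakly not taken", the first pass through the loop costs two mispredictions (the initial guess and the exit). The exit only drops the counter to "weakly taken", so every later pass costs just the one miss on exit.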

As Hennessy and Patterson write in Computer Architecture: A Quantitative Approach (v4, p89), "the importance of branch prediction has increased". With deeper pipelines and the mismatch of CPU speed and memory, guessing right matters.

There isn't enough space in the Branch History Table to store 2 bits of history for every single branch in a system, so instead there'll be some smaller table and some function to take the full address and map it to an offset in that table. According to [Pan92], 4096 to 8192 entries is not that far off "an infinite" table. All that's left is the transform from program counter to BHT entry, which for 32 bit aligned opcodes could be something as simple as (PC >> 4) & 8191.

But the table is not infinite, there will be clashes: if something else is using the same entry in the BHT, then your branch may be predicted according to its history.
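
Here is a small sketch of that clash, using the (PC >> 4) & 8191 mapping from the text. The table layout and class name are assumptions for illustration; real hardware hashes are more involved and mostly undocumented.

```java
/** Toy branch history table indexed as in the text: (PC >> 4) & 8191. */
final class BranchHistoryTableSketch {
  static final int ENTRIES = 8192;

  /** One two-bit counter (0..3) per entry; 2 and 3 predict taken. */
  private final byte[] counters = new byte[ENTRIES];

  /** Map a program counter value to a table slot. */
  static int index(long pc) {
    return (int) ((pc >> 4) & (ENTRIES - 1));
  }

  boolean predictTaken(long pc) {
    return counters[index(pc)] >= 2;
  }

  /** Update the saturating counter for whichever slot this PC maps to. */
  void record(long pc, boolean taken) {
    int i = index(pc);
    int next = counters[i] + (taken ? 1 : -1);
    counters[i] = (byte) Math.max(0, Math.min(3, next));
  }

  public static void main(String[] args) {
    long victim = 0x1000L;
    long attacker = victim + ((long) ENTRIES << 4); // same slot, other address
    BranchHistoryTableSketch bht = new BranchHistoryTableSketch();
    bht.record(victim, true);
    bht.record(victim, true);
    System.out.println(index(victim) == index(attacker));
    System.out.println(bht.predictTaken(attacker));
  }
}
```

Two branches whose addresses differ by 8192 * 16 bytes share a slot, so training one changes the prediction for the other.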

The new attack then simply works out the taken/not taken state of the target branch by seeing how your own code, whose addresses are designed to conflict, is predicted. That's all. And given that ability to observe branch direction, it uses it to reach conclusions about the state of the system.

Along with caching, branch prediction is the key way in which modern CPUs speed things up. And it does. But it's the clash between your entries in the cache and BHT and that of the target routine which is leaking information: how long it takes to read things, whether a branch is predicted or not. The very act of speeding up code is what leaks secrets.

"Modern" CPU Microarchitecture is in trouble here. We've put decades of work into caching, speculation, branch prediction, and now they all turn out to expose information. We built for speed, at what turns out to be the cost of secrecy. And in cloud environments where you cannot stop malicious code running on the same CPU, that means your secrets are not safe.

What can we do?

Maybe another microcode patch is possible: when switching from usermode to OS mode then the BHT is flushed. But that will cripple performance in any loop which invokes system code in it. Or you somehow isolate BHT entries for different virtual memory spaces. Probably the best long term, but I'll leave it to others to work out how to implement.

What's worrying is the fact that new exploits are appearing so soon after Meltdown and Spectre. Security experts are now looking at all of the speculative execution bits of modern CPUs and thinking "that's interesting..."; more exploits are inevitable. And again, systems, especially cloud infrastructures, will be left struggling to catch up.

Cloud infrastructures are probably going to have to pin every VM to a dedicated CPU, with the hypervisor on its own part. That will limit secret exfiltration to the VM OS and anything else running on the core (the paper looks at the intel SGX "secure" zone and showed how it can be targeted). It'll be the smaller VMs at risk here, and potentially containerized stuff: you'd want all containers on a single core to be "yours".

What about single-core systems running a mix of trusted and untrusted code (your phone, your web browser)? That's going to be hard. You can't dedicate one x86 core per browser tab.

Longer term: we're going to have to go through every bit of modern CPU architecture from a security perspective and say "is this safe?" And no doubt conclude, any speedup mechanism which relies on the history of previous work is insecure, if that history includes the actions taken (or speculatively taken) by sensitive applications.

Which is bad news for the majority of today's high end CPUs, especially those ones trying to keep the x86 instruction set alive. Those are the parts which have had so much effort invested into getting fractional improvements in caching, branch prediction, speculation and pipeline efficiency, and so have gotten incredibly complex. That's where the big vulnerabilities live.

This may push us back towards "underperformant but highly parallel" massively multicore systems. Little/no speculation, isolating user space code into their own processes.

The most recent example of this is/was the Sun Niagara CPU line, which started off with a pool of early-90s era SPARC CPUs without fancy branch prediction... instead they had 4 sets of state to cover the entire execution state of four different threads, scheduling work between them. Memory access? Stall that thread, schedule another. Branch? Don't predict, just wait and see, and add other thread opcodes to the pipeline.

There's still going to be security issues there (cache shared across the many cores, the actions of one thread can be implicitly observed by others in their execution times). And it seemingly does speculate memory loads if there was no other work to schedule.

What's equally interesting is that the system is so power efficient. Speculative execution and branch prediction (a) require lots of gates, renamed registers, branch history tables and the like, and (b) waste energy on every missed prediction or branch. Compare that to an Itanium part, where you almost need to phone up your electricity supplier for permission to power one up.

The Niagara 2 part pushed it ahead further to a level that is impressive to read. At the same time, you can see a great engineering team struggling with a fab process behind what Intel could do, Sun trying to fight the x86 line, and, well, losing.

Where are the parts now? Oracle's M8 CPU PDF touts its Out Of Order execution (speculative execution), and data/instruction prefetch. I fear it's now got the same weaknesses as everything else. Apparently the Java 8 streams API gets bonus speedup, which reminds me to post something criticising Java checked exceptions for making that API unusable for the throws IOException Hadoop codebase. As for the virtualization support, again, you'd need to think about pinning to a CPU. There's also that $L1-$L3 cache hit/miss problem: something speculating in one CPU could evict cached data observable to others, unless speculative memory fetches weren't a feature of the part.

They look nice-but-pricey servers; if you are paying the Oracle RDBMS tax the all-in-one price might mitigate that. Overall though, a focus on many fast-but-dim parts, rather than the "throw Si at maximum single thread" architecture of recent x86 designs, may provide opportunities for future designs to be more resistant to attacks related to speculative execution. I also think I'd like to see their performance numbers running Apache Spark 2.3 with one executor per thread and lots of RAM.

Update April 3 2018: I see within hours of this posting rumours start that Apple is looking at ARM parts for MacBooks in 2020+. Not a coincidence! Actually it is, but because the ARM parts are simpler, they may be less exposed to specex-based attacks, even though Meltdown did affect those implementations which did speculative memory fetches. I think the Niagara architecture has more potential, but it probably works best in massively-multithreaded server side systems, not laptops where latency is the performance metric, not throughput.

[Photo: my 2008 Fizik Gobi saddle snapped one of its Titanium rails last week. Made it home in the rain, but a sign that after a decade, parts just wear out.]

2018-01-29

Advanced Deanonymization through Strava

Slow Except for Strava

Strava is getting some bad press, because its heatmap can be used to infer the existence and floorplan of various US/UK military and govt sites.

I covered this briefly in my Berlin Buzzwords 2016 Household INFOSEC talk, though not in that much detail about what's leaked, or what a Garmin GPS tracker is vulnerable to (Not: classic XML entity/XInclude attacks, but a malicious site could serve up a subverted GPS map that told me the local motorway was safe to cycle on).

Here are some things Strava may reveal

  • Whether you run, swim, ski or cycle.
  • If you tell it, what bicycles you have.
  • Who you go out on a run or ride with
  • When you are away from your house
  • Where you commute to, and when
  • Your fitness, and whether it is getting better or worse.
  • When you travel, what TZ, etc.

How to lock down your account?

I only try to defend against drive-by attacks, not nation states or indeed, anyone competent who knows who I am. For Strava then, my goal is: do not share information about where my bicycles are kept, nor those of my friends. I also like to not share too much about the bikes themselves. This all comes secondary to making sure that nobody follows me back from a ride over the Clifton Suspension Bridge (standard attack: wait at the suspension bridge, cycle after them. Standard defence: go through the clifton backroads, see if you are being followed). And I try to make sure all our bikes are locked up, doors locked etc. The last time one got stolen was due to a process failure there (unlocked door) and so the defences fell to some random drug addict rather than anyone using Strava. There's a moral there, but it's still good to lock down your data against tomorrow's Strava attacks, not just today's. My goal: keep my data secure enough to be safe from myself.
  1. I don't use my real name. You can use a single letter as a surname, an "!", or an emoji.
  2. And I've made sure that none of the people I ride with regularly do so either.
  3. I have a private area around my house, and those of friends.
  4. All my bikes have boring names like "The Commuter", not something declaring value.
  5. I have managed fairly successfully to stay off the KoM charts, apart from this climb which I regret doing on so many levels.
For a long time I didn't actually turn the bike computer on until I'd got to the end of the road. I've got complacent there. Even though Strava strips the traces from the private zone when publishing, it does appear to declare the ride distance as the full distance. Given enough rides of mine, you can infer the radius of that privacy zone (fix? Have two overlapping circles?), and the distance on road from the cutoff points to where my house is (overlapping circles won't fix that). You'd need to strip out the start/stop points before uploading to Strava (hard) or go back to only switching on recording once you were a randomish distance from your setoff point.
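
The arithmetic behind that leak is simple enough to sketch. This assumes the declared distance really is the full ride and the published trace is a list of lat/long points; the class name and any numbers fed to it are made up for illustration.

```java
import java.util.List;

/** Worked arithmetic only: declared distance minus the visible trace. */
final class HiddenDistanceSketch {
  static final double EARTH_RADIUS_M = 6_371_000.0;

  private HiddenDistanceSketch() {
  }

  /** Great-circle distance in metres between two points (haversine). */
  static double haversine(double lat1, double lon1, double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
        * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
  }

  /** Length in metres of the published trace; each point is {lat, lon}. */
  static double visibleLength(List<double[]> trace) {
    double total = 0;
    for (int i = 1; i < trace.size(); i++) {
      double[] p = trace.get(i - 1);
      double[] q = trace.get(i);
      total += haversine(p[0], p[1], q[0], q[1]);
    }
    return total;
  }

  /** Distance ridden out of sight: declared total minus the visible part. */
  static double hiddenDistance(double declaredMetres, List<double[]> trace) {
    return declaredMetres - visibleLength(trace);
  }
}
```

The result covers the hidden stretches at both the start and the end of the ride, which is why starting the recording a randomish distance from home matters more than the shape of the privacy zone.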

I haven't opted out of the Strava Heatmap, as I don't go anywhere that people care about. That said, there are always concerns in our local MTB groups that Strava leaks the non-official trails to those people who view stopping MTB riders as their duty. A source of controversy.

Now, how would I go for someone else's Strava secrets?

You can assume that anyone who scores high in a mountain bike trail is going to have an MTB worth stealing, same for long road climbs.
  1. Ride IDs appear sequential, so you could harvest a day's worth and search.
  2. Join the same cycling club as my targets, see if they publish rides. Fixes: don't join open clubs, ever, and verify members of closed clubs.
  3. Strava KoM chart leakage. Even if you make your rides private, if you get on top 10 riders for that day or whatever, you become visible.
The fact that you can infer nation-state secrets is an interesting escalation. Currently it's the heatmap which is getting the bad press, which is part of the dataset that Strava offer commercially to councils. FWIW, the selection bias on Strava data (male roadies or mountain bikers) means that it's not that good. If someone bought our local data, they'd infer that muddy wood trails with trees and rocks are what the city needs. Which is true, but it doesn't address the lack of any safe way to cross the city.

What is interesting about the heat map, and not picked up on yet, is that you can potentially deanonymize people from it.

First, find somewhere sensitive, like say, The UK Faslane Nuclear Submarine Base. Observe the hot areas, like how people run in rectangles in the middle.
Faslane Heat Map
Now, go to MapMyRide and log in. Then head over to create a new route using the satellite data
Created Route from the hotspot map

Download the GPX file. This contains the Lat/Long values of the route.

If you try to upload it to Strava, it'll reject it as there's no timing data. So add it, using some from any real GPX trace as a reference point. Doesn't have to be valid time, just make it slow enough that Strava doesn't think you are driving and tell you off for cheating.

Upload the file as a run, creating a new activity
Faked run uploaded

The next step is the devious one. "Create a segment", and turn part/all of the route into a Strava segment.
Creating a segment from the trace


Once Strava has gone through its records, you'll be able to see the overall top 10 runners per gender/age group, when they ran, and who they ran with. And, if their profile isn't locked down enough: which other military bases they've been for runs on.

And now we wait to see who else did it


I have no idea who has done this run; whether there'll be any segment matches at all. If not, maybe the trace isn't close enough to the real world traces, everyone runs round clockwise, or, hopefully, people are smart enough to mark the area as private. I'll leave Strava up overnight to see what it shows, then delete the segment and run.

Finally, Berlin Buzzwords CFP is open, still offering to help with draft proposals. We can now say it's the place where Strava-based infosec issues were covered 2 years before it was more widely known.

Update 2018-01-29T21:13. I've removed the segment.

Removing Segment
Some people's names were appearing there, showing that, yes, you can bootstrap from a heatmap to identification of individual people who have run the same route.
Segment top 17, as discovered
There's no need to blame the people here, so I've pulled the segment to preserve their anonymity. But as anyone else can do it, they should still mark all govt. locations where they train as private areas, so getting excluded from the heatmap and Strava segments.

I don't know what Strava will do long term, but to stop it reoccurring, they'll need to have a way to mark an area as "private area for all users". Doable. Then go to various governments and say "Give us a list of secret sites you don't want us to cover". Which, unless the governments include random areas like mountain ranges in mid Wales, is an interesting list of its own.

Update 2018-01-30T16:01 to clarify on marking routes private

  1. All ride/runs marked as "private" don't appear in the leader boards
  2. All ride/runs marked as "don't show in leader boards" don't appear
  3. Nor do any activities in a privacy zone make it onto a segment which starts/ends there
  4. But: "enhanced privacy mode" activities do. That is: even if you can't see an individual's activities off their profile, you can see the rides off the leaderboard.
Update 2018-01-31T00:30 Hacker News coverage

I have made Hacker News. Achievement Unlocked!

Apparently
This is neither advanced nor denanonymization (sic).
They basically pluck an interesting route from the hotmap (as per other people's recent discovery), pretend that they have also run/biked this route and Strava will show them names of others who run/biked the same way. That's clever, but that's not "advanced" by any means.
It's also not a deanonymization as there's really no option in Strava for public _anonymous_ sharing to begin with.

1. Thanks for pointing out the typo. Fixed.

2. It's bootstrapping from nominally anon heatmap data to identifying the participants of the route. And unless people use nicknames (only 2 of the 16 in the segment above did), they reveal their real name. And as it shows the entire route when you click through the timestamp, you get to see where they started/finished, who if anyone they went with, etc, etc. You may not know their real name, but you know a lot about them.

3. "It's not advanced". Actually, what Strava do behind the scenes is pretty advanced :). They determine which of all recorded routes match that segment, within 24h. Presumably they have a record of the bounds of every ride, so first select all rides whose bounds completely enclose the segment. Then they have to go through all of them to see if there is a fraction of their trail which matches. I presume you'd go with the start co-ord and scan the trace to see if any of the waypoints *or inferred bits of the line between two recorded waypoints* is in the range of that start marker. If so, carry on along the trace looking for the next waypoint of the segment, giving up if the distance travelled is much greater than the expected distance. And they do that for all recorded events in past history.
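
That guess at the matching process can be sketched in a few lines. To be clear, this is only the presumed approach described above, not Strava's actual algorithm: it works on flat x/y coordinates in metres rather than lat/long, ignores the inferred points between waypoints, and every name and parameter in it is hypothetical.

```java
import java.util.List;

/** Rough sketch of segment matching on flat x/y coordinates in metres. */
final class SegmentMatchSketch {

  static final class Point {
    final double x;
    final double y;

    Point(double x, double y) {
      this.x = x;
      this.y = y;
    }

    double distanceTo(Point o) {
      return Math.hypot(x - o.x, y - o.y);
    }
  }

  private SegmentMatchSketch() {
  }

  /** Returns {minX, minY, maxX, maxY} for a trace. */
  static double[] bounds(List<Point> pts) {
    double minX = Double.POSITIVE_INFINITY;
    double minY = Double.POSITIVE_INFINITY;
    double maxX = Double.NEGATIVE_INFINITY;
    double maxY = Double.NEGATIVE_INFINITY;
    for (Point p : pts) {
      minX = Math.min(minX, p.x);
      minY = Math.min(minY, p.y);
      maxX = Math.max(maxX, p.x);
      maxY = Math.max(maxY, p.y);
    }
    return new double[] {minX, minY, maxX, maxY};
  }

  /** Cheap first filter: do the ride's bounds (plus tolerance) enclose the segment? */
  static boolean boundsEnclose(List<Point> ride, List<Point> segment, double tol) {
    double[] r = bounds(ride);
    double[] s = bounds(segment);
    return r[0] - tol <= s[0] && r[1] - tol <= s[1]
        && r[2] + tol >= s[2] && r[3] + tol >= s[3];
  }

  /** Total length of a trace in metres. */
  static double length(List<Point> pts) {
    double total = 0;
    for (int i = 1; i < pts.size(); i++) {
      total += pts.get(i - 1).distanceTo(pts.get(i));
    }
    return total;
  }

  /**
   * Walk the ride looking for each segment waypoint in order, within
   * `tolerance` metres. Give up on an attempt once the distance covered
   * since the start marker exceeds `slack` times the segment length.
   */
  static boolean matches(List<Point> ride, List<Point> segment,
      double tolerance, double slack) {
    if (segment.isEmpty() || ride.isEmpty()
        || !boundsEnclose(ride, segment, tolerance)) {
      return false;
    }
    double limit = slack * length(segment) + tolerance;
    int next = 0;          // index of the segment waypoint wanted next
    double travelled = 0;  // distance covered since the start marker
    Point previous = null;
    for (Point p : ride) {
      if (next > 0) {
        travelled += previous.distanceTo(p);
        if (travelled > limit) {
          next = 0;        // wandered off: look for the start again
          travelled = 0;
        }
      }
      while (next < segment.size()
          && p.distanceTo(segment.get(next)) <= tolerance) {
        next++;
      }
      if (next == segment.size()) {
        return true;
      }
      previous = p;
    }
    return false;
  }
}
```

Because the waypoints have to be found in order, a ride round the same loop in the opposite direction doesn't match, which fits the "everyone runs round clockwise" failure mode.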

All I did was play around with a web UI showing photographs from orbiting cameras, adjacent to a map of the world with humanity's activities inferred by published traces of how phones, watches and bike computers calculated their positions from a set of atomic clocks, uploaded over the internet to a queue in Apache Kafka, processed for storage in AWS S3, whose consistency and throttling is the bane of my life, and rendered via Apache Kafka, as covered in Strava Engineering. That is impressive work. Some of their analysis code is probably running through lines of code which I authored, and I'm glad to have contributed to something which is so useful, and, for the heatmap, beautiful to look at.

So no, I wasn't the one doing the advanced engineering, but I respect those who did, and am pleased to see the work of people I know being used in the app.

2018-01-10

Berlin Buzzwords: CFP with an offer of abstract review

Berlin Buzzwords CFP is open, which, along with Dataworks Summit in April, is going to make Berlin the place for technical conferences in 2018.
Berlin
As with last year, I'm offering to review people's abstracts before they're submitted; help edit them to get the text to be more in the style that reviewers tend to go for.

When we review the talks, we look for interesting things in the themes of the conference, try and balance topics, pick the fun stuff. And we measure that (interesting, fun) on the prose of the submissions, knowing that they get turned into the program for the attendees: we want the text to be compelling for the audience.

The target audiences for submissions then are twofold. The ultimate audience is the attendees. The reviewers? We're the filter in the way.

But alongside that content, we want a diverse set of speakers, including people who have never spoken before. Otherwise it gets a bit repetitive (oh, no, stevel will talk on something random, again), and that's no good for the audience. But how do we regulars get in, given that the submission process is anonymous?

We do it by writing abstracts which we know the reviewers are looking for.

The review process, then, is a barrier to getting new speakers into the conference, which is dangerous: we all miss out on the insights from other people. And for the possible speakers, they miss out on the fun you have being a speaker at a conf, trying to get your slides together, discovering an hour in advance that you only have 20 min and not 30 for your talk and picking 1/3 of the slides to hide. Or on a trip to say, Boston, having your laptop have a hardware fault and you being grateful you snapshotted it onto a USB stick before you set off. Those are the bad points. The good bits? People coming up to you afterwards and getting into discussion about how they worked on similar stuff but came up with a better answer, how you learn back from the audience about related issues, how you can spend time in Berlin in cafes and wandering round, enjoying the city in early summer, sitting outside at restaurants with other developers from around Europe and the rest of the world, sharing pizza and beer in the evening. Berlin is a fun place for conferences.

Which is why people should submit a talk, even if they've never presented before. And to help them, feel free to stick a draft up on google docs & then share with edit rights to my gmail address, steve.loughran@ ;  send me a note and I'll look through.

Yes, I'm one of the reviewers, but in my reviews I call out that I helped with the submission: fairness is everything.

Last year only one person, Raam Rosh Hai, took this offer up. And he got in, with his talk How to build a recommendation system overnight! This means that so far, all drafts which have been through this pre-review of submissions process have a 100% success rate. And, if you look at the video, you'll see it's a good talk: he deserved that place.


Anyway, Submission deadline: Feb 14. Conference June 10-12.  Happy to help with reviewing draft abstracts.

2018-01-08

Trying to Meltdown in Java -failing. Probably

Meltdown has made for an "interesting" week in computing, as everyone is learning about/revising their knowledge of Speculative Execution. FWIW, I'd recommend the latest version of Patterson and Hennessy, Computer Architecture: A Quantitative Approach. Not just for its details on speculative execution, but because it is the best book on microprocessor architecture and design that anyone has ever written, and lovely to read. I could read it repeatedly and not get bored. (And I see I need to get the 6th edition!)

Stokes Croft drugs find

This weekend, rather than read Patterson and Hennessy(*) I had a go to see if you could implement the Meltdown attack in Java, hence in MapReduce, Spark, or other non-native JAR.

My initial attempt failed provided the part only speculates one branch in.

More specifically "the range checking Java does on all array accesses blocks the standard exploit given steve's assumptions". You can speculatively execute the out of bounds query, but you can't read the second array at an offset which will trigger $L1 cache loading.

If there's a way to do a reference to two separate memory locations which doesn't trigger branching range checks, then you stand a chance of pulling it off. I tried that using the ? : operator pair, something like

String ref = data ? refA : refB;

which I hoped might compile down to something like


mov ref, refB
cmp data, 0
cmovnz ref, refA

This would do the move of the reference in the ongoing speculative branch, so, if "ref" was referenced in any way, trigger the resolution.

In my experiment (2009 macbook pro with OSX Yosemite + latest java 8 early access release), a branch was generated ... but there are some refs in the OpenJDK JIRA to using CMOV, including the fact that the hotspot compiler may be generating it if it thinks the probability of the move taking place is high enough.

Accordingly, I can't say "the hotspot compiler doesn't generate exploitable codepaths", only "in this experiment, the hotspot compiler didn't appear to generate an exploitable codepath".

Now the code is done, I might try on a Linux VM with Java 9 to see what is emitted
  1. If you can get the exploit in, then you'd have access to other bits of the memory space of the same JVM, irrespective of what the OS does. That means one thread with a set of Kerberos tickets could perhaps grab the secrets of another. It'd be pretty hard, given the way the JVM manages objects on the heap: I wouldn't know where to begin, but it would become hypothetically possible.
  2. If you can get native code which you don't trust loaded into the JVM, then it can do whatever it wants. The original meltdown exploit is there. But native code running in JVM is going to have unrestricted access to the entire address space of the JVM -you don't need to use meltdown to grab secrets from the heap. All meltdown would do here is offer the possibility of grabbing kernel space data, which is what the OS patch blocks.

Anyway, I believe my first attempts failed within the context of this experiment.

Code-wise, this kept me busy on Sunday afternoon. I managed to twist my ankle quite badly on a broken paving stone on the way to patisserie on Saturday, so sat around for an hour drinking coffee in Stokes Croft, then limped home, with all forms of exercise crossed off the TODO list for the w/e. Time for a bit of Java coding instead, as a break for what I'd been doing over the holiday (C coding a version of Ping which outputs CSV data and a LaTeX paper on the S3A committers)

It took as much time trying to get hold of the OS/X disassembler for generated code as it did coding the exploit. Why so? Oracle have replaced all links in Java.sun.net which would point to the reference dynamic library with a 302 to the base Java page telling you how lucky you are that Java is embedded in cars. Or you see a ref to on-stack-replacement on a page in Project Kenai, under a URL which starts with https://kenai.com/, point your browser there and end up on http://www.oracle.com/splash/kenai.com/decommissioning/index.html and the message "We're sorry the kenai.com site has closed."

All the history and knowledge on JVM internals and how to work there is gone. You can find the blog posts from four years ago on the topic, but the links to the tools are dead.

This is truly awful. It's the best argument I've seen for publishing this info as PDF files with DOI references, where you can move the artifact around, but citeseer will always find it. If the information doesn't last five years, then

The irony is, it means that because Oracle have killed all those inbound links to Java tools, they're telling the kind of developer who wants to know these things to go away. That's strategically short-sighted. I can understand why you'd want to keep the cost of site maintenance down, but really, breaking every single link? It's a major loss to the Java platform —especially as I couldn't even find a replacement.

I did manage to find a copy of the openjdk tarball people said you could D/L and run make on, but it was on a FreeBSD site, and even after a ./Configure && make, it broke trying to create a bsd dynlib. Then I checked out the full openjdk source tree, branch -8, installed the various tools and tried to build there. Again, some error. I ended up finding a copy of the needed hsdis-amd64.dylib library on Github, but I had to then spend some time looking at evolvedmicrobe's work &c to see if I could trust this to "probably" not be malware itself. I've replicated the JAR in the speculate module, BTW.

Anyway, once the disassembler was done and the other aspects of hotspot JIT compilation clear (if you can't see the method you wrote, run the loop a few thousand more times), I got to see some well annotated x86-64 assembler. Leaving me with a new problem: x86-64 assembler. It's a lot cleaner than classic 32 bit x86: having more registers does that, especially as it gives lots of scope for improving how function parameters and return values are managed.
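The "run the loop a few thousand more times" trick can be sketched as below. This is a minimal example, not the code from the speculate module: the class name, the method and the iteration count are all made up for illustration, and the count is a guess at "enough to cross the JIT compile thresholds" rather than a documented number.

```java
class JitWarmup {

    // Small, hot method we would like HotSpot to compile to native code.
    public static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Call the method many times so that it becomes hot enough to be compiled.
        // If it does not show up in the disassembly, raise this count.
        for (int i = 0; i < 20_000; i++) {
            sink += sumOfSquares(1_000);
        }
        // Print the result so the work cannot be discarded as dead code.
        System.out.println(sink);
    }
}
```

Run it with java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly JitWarmup, with the hsdis library somewhere the JVM can load it, and look for sumOfSquares in the output.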

What next? This is only a spare time bit of work, and now I'm back from my EU-length xmas break, I'm doing other things. Maybe next weekend I'll do some more. At least now I know that exploiting meltdown from the JVM is not going to be straightforward.

Also I found it quite interesting playing with this, to see when the JVM kicks out native code, what it looks like. We code so far from the native hardware these days, it's too "low level". But the new speculation-side-channel attacks have shown that you'd better understand modern CPU architectures, including how your high-level code gets compiled down.

I think I should submit a berlin buzzwords talk on this topic.

(*) It is traditional to swap the names of the author on every use. If you are a purist you have to remember the last order you used.

2018-01-04

Speculation


Speculative execution has been Intel's strategy for keeping the x86 architecture alive since the P6/Pentium Pro part shipped in '95.

I remember coding explicitly for the P6 in a project in 1997; HPLabs was working with HP's IC Division to build their first CMOS-camera IC, which was an interesting problem. Suddenly your IC design needs to worry about light, aligning the optical colour filter with the sensors, making sure it all worked.

Eyeris

I ended up writing the code to capture the raw data at full frame rate, streaming to HDD, with an option to alternatively render it with/without the colour filtering (algorithms from another bit of the HPL team). Which means I get to nod knowingly when people complain about "raw" data. Yes, it's different for every device precisely because it's raw.

The data rates of the VGA-resolution sensor via the PCI boards used to pull this off meant that both cores of a multiprocessor P6 box were needed. It was the first time I'd ever had a dual socket system, but both sockets were full with the 150MHz parts and with careful work we could get away with the "full link rate" data capture which was a core part of the qualification process. It's not enough to self test the chips any more, see: you need to look at the pictures.

Without too many crises, everything came together, which is why I have a framed but slightly skewed IC part to hand. And it's why I have memories of writing multithreaded Windows C++ code with some of the core logic in x86 assembler. I also have memories of ripping out that ASM code as it turned out that it was broken, doing it as C pointer code and having it be just as fast. That's because: C code compiled to x86 by a good compiler, executed on a great CPU, is at least as performant as hand-written x86 code by someone who isn't any good at assembler, and can be made to be correct more easily by the selfsame developer.

150 MHz may be a number people laugh at today, but the CPU:RAM clock ratios weren't as bad as they are today: cache misses were less expensive in terms of pipeline stalls, and those parts were fast. Why? Speculative and out of order execution, amongst other things:
  1. The P6 could provisionally guess which way a branch was going to go, speculatively executing that path until it became clear whether or not the guess was correct -and then commit/abort that speculative code path.
  2. It uses a branch predictor to make that guess on the direction a branch was taken, based on the history of previous attempts, and a default option (FWIW, this is why I tend to place the most likely outcome first in my if() statements; tradition and superstition).
  3. It could execute operations out of order. That is, its predecessor, the P5, was the last time mainstream Intel desktop/server parts executed x86 code in the order the compiler generated them, or the human wrote them.
  4. Register renaming meant that even though the parts had a limited set of registers, those OOO operations could reuse the same EAX, EBX, ECX registers without problems.
  5. It had caching to deal with the speed mismatch between that 150 MHz CPU & RAM.
  6. It supported dual CPU desktops, and I believe quad-CPU servers too. They'd be called "dual core" and "quad core" these days and looked down at.

Being the first multicore system I'd ever used, it was a learning experience. First was learning how too much Windows NT4 code was still not stable in such a world. NTFS crashes with all volumes corrupted? check. GDI rendering triggering kernel crash? check. And on a 4-core system I got hold of, everything crashed more often. Lesson: if you want a thread safe OS, give your kernel developers as many cores as you can.

OOO forced me to learn about the x86 memory model itself: barrier opcodes, when things could get reordered and when they wouldn't. Summary: don't try and be clever about synchronization, as your assumptions are invalid.

Speculation is always an unsatisfactory solution though. Every mis-speculation is lost cycles. And on a phone or laptop, that's wasted energy as much as time. And failed reads could fill up the cache with things you didn't want. I've tried to remember if I ever tried to use speculation to preload stuff if present, but doubt it. The CMOV command was a non-branching conditional assignment which was better, even if you had to hand code it. The PIII/SSE added the PREFETCH opcode so you could issue a non-faulting hinted prefetch which you could stick into your non-branching code, but that was a niche opcode for people writing games/media codecs &c. And as Linus points out, what was clever for one CPU model turns out to be a stupid idea a generation later. (Arguably, that applies to Itanium/IA-64, though as it didn't speculate, it doesn't suffer from the Spectre & Meltdown attacks.)

Speculation, then: a wonderful use of transistors to compensate for how we developers write so many if() statements in our code. Wonderful, it kept the x86 line alive and so helped Intel deliver shareholder value and keep the RISC CPU out of the desktop, workstation and server businesses. Terrible because "transistors" is another word for "CPU die area" with its yield equations and opportunity cost, and also for "wasted energy on failed speculations". If we wrote code which had fewer branches in it, and that got compiled down to CMOV opcodes, life would be better. But we have so many layers of indirection these days; so many indirect references to resolve before those memory accesses. Things are probably getting worse now, not better.
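As a rough illustration of "fewer branches", here is a sketch in Java of the same loop written with and without a data-dependent if(). The class and method names are invented for this example, and nothing here guarantees what the JIT emits: HotSpot may turn the branching version into a CMOV by itself, or it may not. The mask trick only removes the conditional from the source.

```java
class BranchlessSum {

    // Branching version: one data-dependent conditional per element,
    // which the CPU's branch predictor has to guess.
    public static long sumAboveBranching(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            if (v >= threshold) {
                sum += v;
            }
        }
        return sum;
    }

    // Branch-free version: derive an all-ones or all-zeros mask from the
    // sign bit of (v - threshold) and use it to keep or discard v.
    public static long sumAboveBranchless(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            // Computed as a long so the subtraction cannot overflow.
            long diff = (long) v - threshold;
            // diff >> 63 is -1 when v < threshold and 0 otherwise;
            // inverting it gives -1 (keep) when v >= threshold, 0 (drop) otherwise.
            long mask = ~(diff >> 63);
            sum += v & mask;
        }
        return sum;
    }
}
```

Both methods return the same value for the same input; the second just trades the if() for a subtract, a shift and an AND.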

This week's speculation-side-channel attacks are fascinating then. These are very much architectural issues about speculation and branch prediction in general, rather than implementation details. Any CPU manufacturer whose parts do speculative execution has to be worried here, even if there's no evidence that your shipping parts are vulnerable to the current set of attacks. The whole point about speculation is to speed up operation based on the state of data held in registers or memory, so the time-to-execute is always going to be a side-channel providing information about the data used to make a branch.


The fact that you can get at kernel memory, even from code running under a hypervisor, means, well, a lot. It means that VMs running in cloud infrastructure could get at the data of the host OS and/or those of other VMs running on the same host (those S3 SSE-C keys you passed up to your VM? 0wned, along with your current set of IAM role credentials). It potentially means that someone else's code could be playing games with branch prediction to determine what codepaths your code is taking. Which, in public cloud infrastructure is pretty serious, as the only way to stop people running their code alongside yours is currently to pay for the top of the line VMs and hope they get a dedicated part. I'm not even sure that dedicated cores in a multicore CPU are sufficient isolation, not for anything related to cache-side-channel attacks (they should be good for branch prediction, I think, if the attacker can't manipulate the branch predictor of the other cores).

I can imagine the emails between cloud providers and CPU vendors being fairly strained, with the OEM/ODM teams on the CC: list. Even if the patches being rolled out mitigate things, if the slowdown on switching to kernelspace is as expensive as hinted, then that slows down applications, which means that the cost of running the same job in-cloud just got more expensive. Big cloud customers will be talking to their infrastructure suppliers on this, and then negotiating discounts for the extra CPU hours, which is a discount the cloud providers will expect to recover when they next buy servers. I feel as sorry for the cloud CPU account teams as I do for the x86 architecture group.

Meanwhile, there's an interesting set of interview questions you could ask developers on this topic.
  1. What does the generated java assembly for the Ival++ on a java long look like?
  2. What if the long is marked as volatile?
  3. What does the generated x86 assembler for a Java Optional<AtomicLong> opt.map(a -> a.addAndGet(1)) look like?
  4. What guarantees do you get about reordering?
  5. How would you write code which attempted to defend against speculation timing attacks?

I don't have the confidence to answer 1-4 myself, but I could at least go into detail about what I believed to be the case for 1-3; for #4 I should do some revision.
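For anyone wanting to try questions 1-3 against a disassembler, this is a minimal sketch of the three variants. The class, field and method names are invented for the example. Question 3 is written with a lambda, since a method reference cannot bind the argument 1. The sketch shows what to feed the JIT, not what assembler comes out.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

class IncrementVariants {

    private long plain;
    private volatile long vol;

    // Question 1: increment of a plain long field. A read-modify-write
    // with no visibility or atomicity guarantees across threads.
    public long incrementPlain() {
        return ++plain;
    }

    // Question 2: the same on a volatile long. Reads and writes are now
    // visible to other threads, but the increment as a whole is still
    // not atomic.
    public long incrementVolatile() {
        return ++vol;
    }

    // Question 3: atomic increment, applied only if a value is present.
    public static Optional<Long> incrementIfPresent(Optional<AtomicLong> opt) {
        return opt.map(a -> a.addAndGet(1));
    }
}
```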

As for #5, defending. I would love to see what others suggest. Conditional CMOV ops could help against branch-prediction attacks, by eliminating the branches. However, searching for references to CMOV and the JDK turns up some issues which imply that branch prediction can sometimes be faster, including "JDK-8039104. Don't use Math.min/max intrinsic on x86". It may be that even CMOV gets speculated on, with the CPU prefetching what is moved and keeping the write uncommitted until the state of the condition is known.

I suspect that the next edition of Hennessy and Patterson, "Computer Architecture: A Quantitative Approach" will be covering this topic. I shall look forward to it with even greater anticipation than I have had for all the previous, beloved, versions.

As for all those people out there panicking about this, worrying if their nearly-new laptop is utterly exposed? You are running with Flash enabled on a laptop you use in cafe wifis without a VPN and with the same password, "k1tten", you use for gmail and paypal. You have other issues.

2017-11-23

How to play with the new S3A committers


Following up from yesterday's post on the S3A committers, here's what you need for picking up the committers.
  1. Apache Hadoop trunk; builds to 3.1.0-SNAPSHOT.
  2. The documentation on use.
  3. An AWS keypair; try not to commit it to git. Tip for the Uber team: git-secrets is something you can add as a checkin hook. Do as I do: keep it elsewhere.
  4. If you want to use the magic committer, turn S3Guard on. Initially I'd use the staging committer, specifically the "directory" one.
  5. Switch s3a:// to use that committer: fs.s3a.committer.name = partitioned
  6. Run your MR queries.
  7. Look in _SUCCESS for committer info. 0-bytes long: classic FileOutputCommitter. Bit of JSON naming committer, files committed and some metrics (SuccessData) and you are using an S3 committer.
If you do that: I'd like to see the numbers comparing FileOutputCommitter (which must have S3Guard) and the new committers. For benchmark consistency, leave S3Guard on.
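The _SUCCESS check in step 7 can be sketched as below. This assumes the marker file has already been copied to the local filesystem (for instance with hadoop fs -get). The class, enum and method names are hypothetical, and the "starts with {" test is only a crude stand-in for actually parsing the JSON.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SuccessMarkerCheck {

    /** What the marker file suggests about the committer which wrote it. */
    public enum Kind { CLASSIC_FILE_OUTPUT_COMMITTER, S3A_COMMITTER, UNKNOWN }

    /** Classify the raw bytes of a _SUCCESS file. */
    public static Kind classify(byte[] marker) {
        if (marker.length == 0) {
            // The classic FileOutputCommitter writes an empty marker.
            return Kind.CLASSIC_FILE_OUTPUT_COMMITTER;
        }
        // Skip leading whitespace, then look for the start of a JSON object.
        int i = 0;
        while (i < marker.length && Character.isWhitespace((char) marker[i])) {
            i++;
        }
        if (i < marker.length && marker[i] == '{') {
            return Kind.S3A_COMMITTER;
        }
        return Kind.UNKNOWN;
    }

    /** Classify a local copy of a _SUCCESS file. */
    public static Kind classifyFile(Path successFile) throws IOException {
        return classify(Files.readAllBytes(successFile));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SuccessMarkerCheck /path/to/local/_SUCCESS
        System.out.println(classifyFile(Paths.get(args[0])));
    }
}
```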

If you can't get things to work because the docs are wrong: file a JIRA with a patch. If the code is wrong: submit a patch with the fix & tests.

Spark?
  1. Spark Master has a couple of patches to deal with integration issues (FNFE on magic output paths, Parquet being over-fussy about committers). I think the committer binding has enough workarounds for these to work with Spark 2.2 though.
  2. Check out my cloud-integration for Apache Spark repo, and its production-time redistributable, spark-cloud-integration.
  3. Read its docs and use it.
  4. If you want to use Parquet over other formats, use this committer.  
  5. Again, check _SUCCESS to see what's going on.
  6. There's a test module with various (scaleable) tests as well as a copy and paste of some of the Spark SQL tests.
  7. Spark can work with the Partitioned committer. This is a staging committer which only worries about file conflicts in the final partitions. This lets you do in-situ updates of existing datasets, adding new partitions or overwriting existing ones, while leaving the rest alone. Hence: no need to move the output of a job into the reference datasets.
  8. Problems? File an issue. I've just seen Ewan has a couple of PRs I'd better look at, actually.
Committer-wise, that spark-cloud-integration module is ultimately transient. I think we can identify those remaining issues with committer setup in spark core, after which a hadoop 3.0+ specific module should be able to work out of the box with the new committers.

There's still other things there, like
  • Cloud store optimised file input stream source
  • ParallizedWithLocalityRDD: an RDD which lets you provide custom functions to declare locality on a row-by-row basis. Used in my demo of implementing DistCp in Spark. Every row is a filename, which gets pushed out to a worker close to the data, where it does the upload. This is very much a subset of distCP, but it shows this: you can have fun with RDDs and cloud storage.
  • + all the tests
I think maybe Apache Bahir would be the ultimate home for this. For now, a bit too unstable.

(photo: spices on sale in a Mombasa market)