2018-04-02

Computer Architecture and Software Security


Gobi's End
There's a new paper covering another speculative-execution-based attack on system secrets: BranchScope.

This one relies on the fact that, for branch prediction to be effective, two bits are generally allocated to each branch, giving four states: strongly taken, weakly taken, weakly not taken and strongly not taken. The prediction for a branch is looked up in BranchHistoryTable[hash(address)] and used to choose the speculated path; if the prediction turns out to be wrong, the state moves from strong to weak, or from weak to the opposite direction. Conversely, if a weak prediction turns out to be right, it is promoted to strong.
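
As a concrete illustration, here is that state machine as a minimal C sketch (the state names and encoding are mine, purely illustrative; real predictors differ in the details):

    /* 2-bit saturating counter, as described above. */
    enum prediction_state {
        STRONGLY_NOT_TAKEN = 0,
        WEAKLY_NOT_TAKEN   = 1,
        WEAKLY_TAKEN       = 2,
        STRONGLY_TAKEN     = 3
    };

    /* Predict "taken" in either of the two taken states. */
    int predict_taken(enum prediction_state s) {
        return s >= WEAKLY_TAKEN;
    }

    /* Update on the real outcome: step one state towards what actually
       happened, saturating at the ends. A single misprediction weakens a
       strong state but doesn't flip the predicted direction. */
    enum prediction_state update(enum prediction_state s, int taken) {
        if (taken && s < STRONGLY_TAKEN)
            return (enum prediction_state)(s + 1);
        if (!taken && s > STRONGLY_NOT_TAKEN)
            return (enum prediction_state)(s - 1);
        return s;
    }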

Why so complex? Because we loop all the time:

    for (int i = 0; i < 1000; i++) {
      doSomething(i);
    }

Which probably gets translated into some assembly code (random CPU language I just made up)

    MOV r1, 0            ; i = 0
L1: CMP r1, 999
    JGT end              ; conditional branch: exit the loop once r1 > 999
    JSR DoSomething
    ADD r1, 1            ; i++
    JMP L1               ; unconditional jump back to the test
end:
    ... continue

For 1000 iterations of that loop the conditional branch goes the same way; then, on the final test, it goes the other way once and the loop is exited. The first time the branch is encountered the CPU won't know what to do: it just guesses one direction and has a 50% chance of being wrong (see below). After that first iteration it will guess right, until the final test exits the loop. And if the loop is itself called repeatedly, the one mispredicted final iteration shouldn't wipe out the fact that the rest of the loop was predicted correctly every time. Hence, two bits.
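
To see why one bit isn't enough, here is a small self-contained simulation (illustrative only) of that branch pattern: the same way 1000 times, then the other way once, with the loop invoked ten times in a row. A one-bit "remember the last outcome" predictor mispredicts twice per invocation (the exit, and then the first iteration of the next call); the two-bit counter settles down to one misprediction per invocation.

    #include <stdio.h>

    /* Compare 1-bit and 2-bit prediction on the loop branch above:
       taken 1000 times, then not taken once, repeated for 10 calls. */
    int main(void) {
        int one_bit = 0;          /* last outcome only: 0 = not taken, 1 = taken */
        int two_bit = 1;          /* saturating counter 0..3; predict taken if >= 2 */
        int miss1 = 0, miss2 = 0;

        for (int call = 0; call < 10; call++) {
            for (int i = 0; i <= 1000; i++) {
                int taken = (i < 1000);              /* the loop's branch direction */

                if (one_bit != taken) miss1++;
                one_bit = taken;                     /* 1-bit: remember last outcome */

                if ((two_bit >= 2) != taken) miss2++;
                if (taken  && two_bit < 3) two_bit++;   /* 2-bit: step towards outcome, */
                if (!taken && two_bit > 0) two_bit--;   /* saturating at the ends */
            }
        }
        printf("1-bit mispredictions: %d\n", miss1);    /* 2 per call */
        printf("2-bit mispredictions: %d\n", miss2);    /* ~1 per call */
        return 0;
    }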

As Hennessy and Patterson write in Computer Architecture: A Quantitative Approach (4th edition, p89), "the importance of branch prediction has increased". With deeper pipelines and the growing mismatch between CPU and memory speeds, guessing right matters.

There isn't enough space in the Branch History Table to store 2 bits of history for every single branch in a system, so instead there'll be some smaller table and some function to map the full address to an offset in that table. According to [Pan92], 4096 to 8192 entries is not that far off "an infinite" table. All that's left is the transform from program counter to BHT entry, which for 32-bit (4-byte-aligned) opcodes could be something as simple as (PC >> 2) & 8191.

But the table is not infinite, so there will be clashes: if something else is using the same entry in the BHT, your branch may be predicted according to that other branch's history.
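
As a toy illustration of such a clash, assuming the (PC >> 2) & 8191 transform above (the real hash on any given part is undocumented and more involved), two branches whose addresses differ by a multiple of 32KB land on the same entry:

    #include <stdio.h>
    #include <stdint.h>

    /* Toy model of the PC -> BHT index transform from the text. */
    static unsigned bht_index(uintptr_t pc) {
        return (pc >> 2) & 8191;     /* 8192 entries, 4-byte-aligned opcodes */
    }

    int main(void) {
        uintptr_t victim_branch   = 0x00401184;                /* hypothetical victim PC */
        uintptr_t attacker_branch = victim_branch + 8192 * 4;  /* 32KB away: same index */

        printf("victim index:   %u\n", bht_index(victim_branch));
        printf("attacker index: %u\n", bht_index(attacker_branch));
        /* Both print 1121: the two branches share a single 2-bit counter, so
           one branch's history steers the other's prediction. */
        return 0;
    }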

The new attack then simply works out the taken/not-taken state of the target branch by seeing how your own code, whose branch addresses are designed to conflict with it, is predicted. That's all. And given the ability to infer which way a victim branch went, you can start drawing conclusions about the state of the system.
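
Conceptually, the measurement side needs nothing more exotic than timing your own branch: a mispredicted branch costs a pipeline flush, so it is measurably slower. Something like the sketch below, though a real attack needs careful address-space layout to force the collision, fences around the timers, and many repetitions to beat the noise; this is only the idea, not a working implementation.

    #include <stdint.h>
    #include <x86intrin.h>    /* __rdtsc(), _mm_lfence() with GCC/Clang on x86 */

    /* Time one execution of a probe branch that has been arranged (by address
       placement, as above) to share a BHT entry with the victim's branch. */
    static uint64_t time_probe_branch(volatile int *condition) {
        _mm_lfence();
        uint64_t start = __rdtsc();
        if (*condition) {            /* the probe branch, aliased with the victim's */
            __asm__ volatile("");    /* keep the compiler from removing the branch */
        }
        _mm_lfence();
        return __rdtsc() - start;    /* long = mispredicted = the shared counter was
                                        trained the other way by the victim */
    }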

Along with caching, branch prediction is the key way in which modern CPUs speed things up, and it works. But it's the clash between your entries in the cache and the BHT and those of the target routine which leaks information: how long it takes to read things, whether a branch was predicted correctly or not. The very act of speeding up code is what leaks secrets.

"Modern" CPU Microarchitecture is in trouble here. We've put decades of work into caching, speculation, branch prediction, and now they all turn out to expose information. We built for speed, at what turns out to be the cost of secrecy. And in cloud environments where you cannot stop malicious code running on the same CPU, that means your secrets are not safe.

What can we do?

Maybe another microcode patch is possible: flush the BHT when switching from user mode to kernel mode. But that will cripple performance in any loop which makes system calls. Or you somehow isolate BHT entries for different virtual memory spaces; probably the best option long term, but I'll leave it to others to work out how to implement it.
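
One way to picture that isolation option: store an address-space identifier as a tag next to each counter and only trust a prediction when the tag matches. A hand-wavy sketch, purely illustrative; I have no idea how, or whether, a real part would implement it:

    #include <stdint.h>

    /* Tag each BHT entry with the address space (ASID) that trained it, so a
       branch in one process can't be predicted from another process's history. */
    struct bht_entry {
        uint16_t asid;       /* address space that last trained this entry */
        uint8_t  counter;    /* the 2-bit saturating counter, 0..3 */
    };

    static struct bht_entry bht[8192];

    static int bht_predict_taken(uintptr_t pc, uint16_t asid) {
        struct bht_entry *e = &bht[(pc >> 2) & 8191];
        if (e->asid != asid)
            return 0;        /* trained by someone else: ignore it and fall back
                                to a static "not taken" prediction */
        return e->counter >= 2;
    }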

What's worrying is the fact that new exploits are appearing so soon after Meltdown and Spectre. Security experts are now looking at all of the speculative execution bits of modern CPUs and thinking "that's interesting..."; more exploits are inevitable. And again, systems, especially cloud infrastructures, will be left struggling to catch up.

Cloud infrastructures are probably going to have to pin every VM to a dedicated CPU core, with the hypervisor on its own core. That will limit secret exfiltration to the VM OS and anything else running on the same core (the paper looks at the Intel SGX "secure" zone and shows how it can be targeted). It'll be the smaller VMs at risk here, and potentially containerized stuff: you'd want all containers on a single core to be "yours".

What about single-core systems running a mix of trusted and untrusted code (your phone, your web browser)? That's going to be hard. You can't dedicate one x86 core per browser tab.

Longer term: we're going to have to go through every bit of modern CPU architecture from a security perspective and ask "is this safe?", and no doubt conclude that any speedup mechanism which relies on the history of previous work is insecure if that history includes the actions taken (or speculatively taken) by sensitive applications.

Which is bad news for the majority of today's high-end CPUs, especially those trying to keep the x86 instruction set alive. Those are the parts which have had so much effort invested in getting fractional improvements in caching, branch prediction, speculation and pipeline efficiency, and so have become incredibly complex. That's where the big vulnerabilities live.

This may push us back towards "underperformant but highly parallel" massively multicore systems: little or no speculation, with user-space code isolated into its own processes.

The most recent example of this is/was the Sun Niagara CPU line, which started off with a pool of early-90s-era SPARC cores without fancy branch prediction... instead each core held four sets of state covering the entire execution state of four different threads, and scheduled work between them. Memory access? Stall that thread, schedule another. Branch? Don't predict, just wait and see, and feed another thread's opcodes into the pipeline.

There are still going to be security issues there (the cache is shared across the many cores, and the actions of one thread can be implicitly observed by others through their execution times). And it seemingly does speculate memory loads if there is no other work to schedule.

What's equally interesting is that the system is so power efficient. Speculative execution and branch prediction require lots of gates, renamed registers, branch history tables and the like, and every mispredicted branch is energy wasted. Compare that to an Itanium part, where you almost need to phone up your electricity supplier for permission to power one up.

The Niagara 2 part pushed the design further, to a level that is impressive to read about. At the same time, you can see a great engineering team struggling with a fab process behind what Intel could do, Sun trying to fight the x86 line, and, well, losing.

Where are the parts now? Oracle's M8 CPU PDF touts its out-of-order execution (that is, speculative execution) and data/instruction prefetch. I fear it's now got the same weaknesses as everything else. Apparently the Java 8 streams API gets a bonus speedup, which reminds me to post something criticising Java checked exceptions for making that API unusable for the throws-IOException Hadoop codebase. As for the virtualization support, again, you'd need to think about pinning to a CPU. There's also the L1-L3 cache hit/miss problem: something speculating on one core could evict cached data in a way observable to others, unless speculative memory fetches aren't a feature of the part.

They look like nice-but-pricey servers; if you are already paying the Oracle RDBMS tax, the all-in-one price might mitigate that. Overall though, a focus on many fast-but-dim cores, rather than the "throw silicon at maximum single-thread performance" architecture of recent x86 designs, may give future designs a chance to be more resistant to attacks based on speculative execution. I also think I'd like to see their performance numbers running Apache Spark 2.3 with one executor per thread and lots of RAM.

Update, April 3 2018: within hours of this going up, I see rumours starting that Apple is looking at ARM parts for MacBooks in 2020+. Not a coincidence! Actually it is, but because the ARM parts are simpler they may be less exposed to speculation-based attacks, even though Meltdown did affect those implementations which speculated memory fetches. I think the Niagara architecture has more potential, but it probably works best in massively-multithreaded server-side systems, not laptops, where latency is the performance metric, not throughput.

[Photo: my 2008 Fizik Gobi saddle snapped one of its Titanium rails last week. Made it home in the rain, but a sign that after a decade, parts just wear out.]