
2018-04-02

Computer Architecture and Software Security


Gobi's End
There's a new paper covering another speculative execution-based attack on system secrets: BranchScope.

This one relies on the fact that, for branch prediction to be effective, two bits of state are generally allocated per branch: strongly taken, weakly taken, weakly not-taken and strongly not-taken. The prediction state of a branch is looked up in BranchHistoryTable[hash(address)] and used to choose the speculation; if the prediction was wrong, the state moves from strongly to weakly, or from weakly to the opposite direction. Similarly, if a weakly taken/not-taken prediction turns out to be right, it moves to strongly.
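A minimal sketch of that two-bit state machine, in Java; the encoding, names and the little demo loop are mine, not anything from the paper or a real part:

    // Two-bit saturating counter, as described above.
    // 0 = strongly not-taken, 1 = weakly not-taken, 2 = weakly taken, 3 = strongly taken.
    public class TwoBitPredictor {
      private int state = 1;                 // arbitrary starting guess

      boolean predictTaken() {
        return state >= 2;
      }

      void update(boolean actuallyTaken) {
        if (actuallyTaken) {
          state = Math.min(3, state + 1);    // strengthen towards "taken"
        } else {
          state = Math.max(0, state - 1);    // weaken, then flip direction
        }
      }

      public static void main(String[] args) {
        TwoBitPredictor p = new TwoBitPredictor();
        // a 4-iteration loop run twice: the loop branch is taken, taken, taken, then not taken at exit
        boolean[] outcomes = {true, true, true, false, true, true, true, false};
        for (boolean taken : outcomes) {
          System.out.println("predict=" + p.predictTaken() + " actual=" + taken);
          p.update(taken);
        }
      }
    }

Note how the single "not taken" at loop exit only drops the counter from strongly to weakly taken, so the next pass through the loop is still predicted correctly.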

Why so complex? Because we loop all the time:
for (int i = 0; i < 1000; i++) {
  doSomething(i);
}

Which probably gets translated into some assembly code (random CPU language I just made up)

    MOV  r1, 0
L1: CMP r1, 999
    JGT end
    JSR DoSomething
    ADD r1, 1
    JMP  L1
    ... continue

For 1000 iterations of that loop the branch is taken; then once, at the end of the loop, it's not taken. The first time it's encountered, the CPU won't know what to do: it will just guess one of the two outcomes and have a 50% chance of being wrong (see below). After that first iteration it'll guess right, until the final test fails and the loop is exited. If that loop is itself called repeatedly, the misprediction of that final iteration shouldn't throw away the fact that the rest of the loop was predicted correctly. Hence, two bits.

As Hennessy and Patterson write in Computer Architecture: A Quantitative Approach (4th edition, p89), "the importance of branch prediction has increased". With deeper pipelines and the growing mismatch between CPU and memory speeds, guessing right matters.

There isn't enough space in the Branch History Table to store two bits of history for every single branch in a system, so instead there'll be some smaller table and a function to map the full address to an offset in that table. According to [Pan92], 4096 to 8192 entries is not that far off "an infinite" table. All that's left is the transform from program counter to BHT entry, which for 32-bit aligned opcodes could be something as simple as (PC >> 2) & 8191.

But the table is not infinite, so there will be clashes: if something else hashes to the same entry in the BHT, your branch may be predicted according to that other branch's history.
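To make the aliasing concrete, here's a sketch in Java of that kind of program-counter-to-entry transform, using the example numbers above; the addresses are made up:

    // Toy transform from program counter to BHT index: 4-byte-aligned opcodes,
    // 8192-entry table. Illustrative only; real parts use less trivial hashes.
    public class BhtIndex {
      static final int ENTRIES = 8192;

      static int index(long pc) {
        return (int) ((pc >> 2) & (ENTRIES - 1));
      }

      public static void main(String[] args) {
        long victimBranch = 0x7f3a_0001_2340L;                // some branch in the target code
        long attackerBranch = victimBranch + (ENTRIES << 2);  // 8192 opcodes further on
        // Both land on the same entry, so they share prediction history.
        System.out.println(index(victimBranch));
        System.out.println(index(attackerBranch));
      }
    }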

The new attack then simply works out the taken/not-taken state of the target branch by seeing how your own code, whose branch addresses are chosen to collide with it, gets predicted. That's all. And given the ability to observe branch direction, you can use it to reach conclusions about the state of the system.

Along with caching, branch prediction is the key way in which modern CPUs speed things up. And it does work. But it's the clash between your entries in the cache and the BHT and those of the target routine which leaks information: how long it takes to read things, whether a branch is predicted taken or not. The very act of speeding up code is what leaks secrets.

"Modern" CPU Microarchitecture is in trouble here. We've put decades of work into caching, speculation, branch prediction, and now they all turn out to expose information. We built for speed, at what turns out to be the cost of secrecy. And in cloud environments where you cannot stop malicious code running on the same CPU, that means your secrets are not safe.

What can we do?

Maybe another microcode patch is possible: flush the BHT when switching from user mode to kernel mode. But that will cripple performance in any loop which invokes system code. Or you somehow isolate BHT entries for different virtual memory spaces. Probably the best long-term fix, but I'll leave it to others to work out how to implement.

What's worrying is the fact that new exploits are appearing so soon after Meltdown and Spectre. Security experts are now looking at all of the speculative execution bits of modern CPUs and thinking "that's interesting..."; more exploits are inevitable. And again, systems, especially cloud infrastructures, will be left struggling to catch up.

Cloud infrastructures are probably going to have to pin every VM to a dedicated CPU, with the hypervisor on its own part. That will limit secret exfiltration to the VM OS and anything else running on the core (the paper looks at the Intel SGX "secure" zone and shows how it can be targeted). It'll be the smaller VMs at risk here, and potentially containerized stuff: you'd want all containers on a single core to be "yours".

What about single-core systems running a mix of trusted and untrusted code (your phone, your web browser)? That's going to be hard. You can't dedicate one x86 core per browser tab.

Longer term: we're going to have to go through every bit of modern CPU architecture from a security perspective and ask "is this safe?", and no doubt conclude that any speedup mechanism which relies on the history of previous work is insecure if that history includes the actions taken (or speculatively taken) by sensitive applications.

Which is bad news for the majority of today's high-end CPUs, especially those trying to keep the x86 instruction set alive. Those are the parts which have had so much effort invested into getting fractional improvements in caching, branch prediction, speculation and pipeline efficiency, and so have become incredibly complex. That's where the big vulnerabilities live.

This may push us back towards "underperformant but highly parallel" massively multicore systems: little or no speculation, user-space code isolated into its own processes.

The most recent example of this is/was the Sun Niagara CPU line, which started off with a pool of early-90s-era SPARC cores without fancy branch prediction... instead each had four sets of state covering the entire execution context of four different threads, and scheduled work between them. Memory access? Stall that thread, schedule another. Branch? Don't predict, just wait and see, and feed another thread's opcodes into the pipeline.
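As an aside, here's a toy model of that scheduling style in Java; the structure (four contexts, round-robin issue, skip anything stalled on memory) is the idea described above, and everything else is invented for illustration:

    // Simplistic model of fine-grained multithreading: four thread contexts,
    // issue from the next ready one each cycle, no prediction or speculation.
    public class BarrelModel {
      static final int THREADS = 4;
      static int[] stallCycles = new int[THREADS];   // >0 while waiting on memory

      public static void main(String[] args) {
        int next = 0;
        for (int cycle = 0; cycle < 16; cycle++) {
          for (int t = 0; t < THREADS; t++) {        // age any outstanding memory stalls
            if (stallCycles[t] > 0) stallCycles[t]--;
          }
          for (int i = 0; i < THREADS; i++) {        // pick the next ready context
            int t = (next + i) % THREADS;
            if (stallCycles[t] == 0) {
              System.out.println("cycle " + cycle + ": issue from thread " + t);
              if ((cycle + t) % 5 == 0) {            // pretend this op missed the cache
                stallCycles[t] = 3;                  // stall it; the others keep the pipeline busy
              }
              next = (t + 1) % THREADS;
              break;
            }
          }
        }
      }
    }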

There are still going to be security issues there (the cache is shared across the many cores, so the actions of one thread can be implicitly observed by others through their execution times). And it seemingly does speculate memory loads if there is no other work to schedule.

What's equally interesting is that the design is so power efficient. Speculative execution and branch prediction require lots of gates, renamed registers, branch history tables and the like —and every misprediction or wrongly speculated branch is energy wasted. Compare that to an Itanium part, where you almost need to phone up your electricity supplier for permission to power one up.

The Niagara 2 part pushed this further, to a level that is impressive to read about. At the same time, you can see a great engineering team struggling with a fab process behind what Intel could do: Sun trying to fight the x86 line, and, well, losing.

Where are the parts now? Oracle's M8 CPU PDF touts its out-of-order execution —speculative execution— and data/instruction prefetch. I fear it now has the same weaknesses as everything else. Apparently the Java 8 streams API gets a bonus speedup, which reminds me to post something criticising Java checked exceptions for making that API unusable for the throws-IOException Hadoop codebase. As for the virtualization support, again, you'd need to think about pinning to a CPU. There's also that $L1-$L3 cache hit/miss problem: something speculating in one CPU could evict cached data observable to others, unless speculative memory fetches aren't a feature of the part.
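Since I mention the checked-exception problem, here's a minimal illustration of why the streams API fights a throws-IOException codebase; open() below is a stand-in I've invented, not the actual Hadoop FileSystem call:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class CheckedStreams {
      // stand-in for the kind of Hadoop method that declares "throws IOException"
      static String open(String path) throws IOException {
        return "stream:" + path;
      }

      public static void main(String[] args) {
        List<String> paths = Arrays.asList("/data/a", "/data/b");

        // Won't compile: java.util.function.Function declares no checked exceptions.
        // List<String> opened = paths.stream().map(CheckedStreams::open).collect(Collectors.toList());

        // The workaround is wrapping every call and losing the checked exception:
        List<String> opened = paths.stream()
            .map(p -> {
              try {
                return open(p);
              } catch (IOException e) {
                throw new UncheckedIOException(e);
              }
            })
            .collect(Collectors.toList());
        System.out.println(opened);
      }
    }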

They look like nice-but-pricey servers; if you are paying the Oracle RDBMS tax, the all-in-one price might mitigate that. Overall though, the focus on many fast-but-dim cores, rather than the smaller number of "throw silicon at maximum single-thread performance" cores of recent x86 designs, may give future parts a chance to be more resistant to attacks based on speculative execution. I'd also like to see their performance numbers running Apache Spark 2.3 with one executor per thread and lots of RAM.

Update, April 3 2018: within hours of this posting I see rumours starting that Apple is looking at ARM parts for MacBooks in 2020+. Not a coincidence! Actually it is, but because the ARM parts are simpler they may be less exposed to speculative-execution attacks, even though Meltdown did affect those ARM implementations which speculate memory fetches. I think the Niagara architecture has more potential, but it probably works best in massively multithreaded server-side systems, not laptops, where latency is the performance metric, not throughput.

[Photo: my 2008 Fizik Gobi saddle snapped one of its Titanium rails last week. Made it home in the rain, but a sign that after a decade, parts just wear out.]

2018-01-08

Trying to Meltdown in Java -failing. Probably

Meltdown has made for an "interesting" week in computing, as everyone learns about or revises their knowledge of speculative execution. FWIW, I'd recommend the latest version of Patterson and Hennessy, Computer Architecture: A Quantitative Approach. Not just for its details on speculative execution, but because it is the best book on microprocessor architecture and design that anyone has ever written, and lovely to read. I could read it repeatedly and not get bored. (And I see I need to get the 6th edition!)

Stokes Croft drugs find

This weekend, rather than read Patterson and Hennessy(*), I had a go at seeing whether you could implement the Meltdown attack in Java, and hence in MapReduce, Spark, or any other non-native JAR.

My initial attempt failed, provided the part only speculates one branch deep.

More specifically "the range checking Java does on all array accesses blocks the standard exploit given steve's assumptions". You can speculatively execute the out of bounds query, but you can't read the second array at an offset which will trigger $L1 cache loading.

If there's a way to dereference one of two separate memory locations without triggering the branching range checks, then you stand a chance of pulling it off. I tried that using the ? : operator, something like

String ref = data ? refA : refB;

which I hoped might compile down to something like


mov ref, refB
cmp data, 0
cmovnz ref, refA

This would do the move of the reference within the ongoing speculative path, so that if "ref" was then referenced in any way, it would trigger the resolution.
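Something like the following is the shape of the experiment; this is a reconstruction, not the actual test code, and the class and field names are invented:

    // Run the probe enough times for HotSpot to compile it, then inspect the
    // generated code with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
    // (which needs the hsdis disassembler library discussed further down).
    public class CmovProbe {
      static String refA = "A";
      static String refB = "B";

      static int probe(boolean data) {
        String ref = data ? refA : refB;   // hoped-for CMOV
        return ref.length();               // touch the reference so the JIT can't drop it
      }

      public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 100_000; i++) {   // enough iterations to trigger JIT compilation
          sum += probe((i & 1) == 0);
        }
        System.out.println(sum);
      }
    }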

In my experiment (2009 MacBook Pro with OS X Yosemite + the latest Java 8 early access release), a branch was generated... but there are some references in the OpenJDK JIRA to using CMOV, including the fact that the HotSpot compiler may generate it if it thinks the probability of the move taking place is high enough.

Accordingly, I can't say "the HotSpot compiler doesn't generate exploitable codepaths", only "in this experiment, the HotSpot compiler didn't appear to generate an exploitable codepath".

Now the code is done, I might try it on a Linux VM with Java 9 to see what is emitted. Two observations on what a working exploit would mean:
  1. If you can get the exploit in, then you'd have access to other bits of the memory space of the same JVM, irrespective of what the OS does. That means one thread with a set of Kerberos tickets could perhaps grab the secrets of another. It'd be pretty hard, given the way the JVM manages objects on the heap: I wouldn't know where to begin, but it would become hypothetically possible.
  2. If you can get native code which you don't trust loaded into the JVM, then it can do whatever it wants. The original Meltdown exploit is there. But native code running in the JVM already has unrestricted access to the entire address space of the JVM -you don't need Meltdown to grab secrets from the heap. All Meltdown would add here is the possibility of grabbing kernel-space data —which is what the OS patch blocks.

Anyway, I believe my first attempts failed within the context of this experiment.

Code-wise, this kept me busy on Sunday afternoon. I managed to twist my ankle quite badly on a broken paving stone on the way to the patisserie on Saturday, so I sat around for an hour drinking coffee in Stokes Croft, then limped home, with all forms of exercise crossed off the TODO list for the w/e. Time for a bit of Java coding instead, as a break from what I'd been doing over the holiday (C coding a version of ping which outputs CSV data, and a LaTeX paper on the S3A committers).

It took as much time to get hold of the OS X disassembler for the generated code as it did to code the exploit. Why so? Oracle have replaced all the links in Java.sun.net which would point to the reference dynamic library with a 302 to the base Java page telling you how lucky you are that Java is embedded in cars. Or you see a reference to on-stack replacement on a page in Project Kenai, under a URL which starts with https://kenai.com/, point your browser there, and end up on http://www.oracle.com/splash/kenai.com/decommissioning/index.html with the message "We're sorry the kenai.com site has closed."

All the history and knowledge of JVM internals and how to work with them is gone. You can find the blog posts from four years ago on the topic, but the links to the tools are dead.

This is truly awful. It's the best argument I've seen for publishing this information as PDF files with DOI references, where you can move the artifact around but CiteSeer will always find it. If the information doesn't last five years, then what use is it?

The irony is that, because Oracle have killed all those inbound links to Java tools, they're telling the kind of developer who wants to know these things to go away. That's strategically short-sighted. I can understand wanting to keep the cost of site maintenance down, but really, breaking every single link? It's a major loss to the Java platform —especially as I couldn't even find a replacement.

I did manage to find a copy of the OpenJDK tarball people said you could download and run make on, but it was on a FreeBSD site, and even after a ./configure && make it broke trying to create a BSD dynlib. Then I checked out the full OpenJDK source tree, branch -8, installed the various tools and tried to build there. Again, some error. I ended up finding a copy of the needed hsdis-amd64.dylib library on GitHub, but then had to spend some time looking at evolvedmicrobe's work &c to see if I could trust it to "probably" not be malware itself. I've replicated the JAR in the speculate module, BTW.

Anyway, once the disassembler was working and the other aspects of HotSpot JIT compilation were clear (if you can't see the method you wrote, run the loop a few thousand more times), I got to see some well-annotated x86-64 assembler. Leaving me with a new problem: x86-64 assembler. It's a lot cleaner than classic 32-bit x86: having more registers does that, especially as it gives lots of scope for improving how function parameters and return values are managed.

What next? This is only a spare-time bit of work, and now I'm back from my EU-length Christmas break I'm doing other things. Maybe next weekend I'll do some more. At least now I know that exploiting Meltdown from the JVM is not going to be straightforward.

I also found it quite interesting to play with this and see when the JVM kicks out native code and what it looks like. We code so far from the native hardware these days; it's all considered too "low level". But the new speculation side-channel attacks have shown that you'd better understand modern CPU architectures, including how your high-level code gets compiled down.

I think I should submit a Berlin Buzzwords talk on this topic.

(*) It is traditional to swap the order of the authors' names on every use. If you are a purist, you have to remember which order you used last.

2011-10-10

Oracle and Hadoop part 2: Hardware -overkill?

Montpelier Street Scene

[This is the sequel to Part 1.]

I've been looking at what Oracle say you should run Hadoop on and thinking "why?"

I don't have the full specs, since all I've seen is slideware on The Register implying this is premium hardware, not just x86 boxes with as many HDDs as you can fit in a 1U, one or two multicore CPUs and an imperial truckload of DRAM. In particular, there's mention of InfiniBand in there.

InfiniBand? Why? Is it to spread the story that the classic location-aware schedulers aren't adequate on "commodity" 12-core, 64GB, 24TB boxes with a 10GbE interconnect? Or are there other plans afoot?

Well, one thing to consider is the lead time for new rack-scale products: some of this exascale stuff will predate the Oracle takeover, and the design goal of the time, "run databases fast", met Larry's needs more than the rest of the Sun product portfolio did -though he still seems to dream of making a dollar or two from every mobile phone on the planet.

The question for Oracle has been how to get from hardware prototype to shipping what they can say is the best Oracle server. Well, one tactic is to identify the hardware that runs Oracle better and stop supporting it. It's what they are doing against HP's servers, and they will no doubt try it against IBM when the opportunity arises. That avoids all debate about which hardware to run Oracle on: it's Oracle's. Next question? Support costs? Wait and see.

While all this hardware development was going on, the massive low-cost GFS/HDFS filesystem with integrated compute was sneaking up on the sides. Yes, it's easy to say -as Stonebraker did- that MapReduce is a step backwards. But it scales, not just technically but financially. Look at the spreadsheets here. Just as Larry and Hurd -who also seemed over-fond of the data warehouse story- are getting excited about Oracle on Oracle hardware, somewhere your data enters but never leaves(*), somebody has to break the bad news to them that people have discovered an alternative way to store and process data. One that doesn't need ultra-high-end single-server designs, one that doesn't need Oracle licenses, and one that doesn't need you to pay for storage at EMC's current rates. That must have upset Larry, and kept him and the team busy on a response.

What they have done is a defensive action: Hadoop as a way of storing low-value data near the Oracle RDBMS, for use as the Extract-Transform-Load part of the story. That is where it does fit in, as you can offload some of the grunge work to lower-end machines and the storage to SATA. It's no different from keeping log data in HDFS but the high-value data in HBase on HDFS, or -better yet, IMO- HP Vertica.

For that story to work best, you shouldn't overspec the hardware with things like InfiniBand. So why has that been done?

  • Hypothesis 1: margins are better, helps stop people going to vendors (HP, Rackable), that can sell the servers that work best in this new world.
  • Hypothesis 2: Oracle's plans in the NoSQL world depend on this interconnect.
  • Hypothesis 3: Hadoop MR can benefit from it.

Is Hypothesis 3 valid? Well, in Disk-Locality in Datacenter Computing Considered Irrelevant [Ananthanarayanan2011], Ganesh and colleagues argue that improvements in in-rack and backplane bandwidth will mean you won't care whether your code is running on the same server as your data, or even in the same rack. Instead you will worry about whether your data is in RAM or on HDD, as that is what slows you down the most. I agree that even today, on 10GbE, rack-local is as fast as server-local; but if we could take HDFS out of the loop for server-local filesystem access, that difference might reappear. And while replacing Ethernet's classic Spanning Tree forwarding algorithm with something like TRILL would be great, it's still going to cost a lot to get a few hundred terabits/s over the backplane, so in large clusters rack-local may still have an edge. If not, well, there's still the site-local issue that everyone's scared of dealing with today.

Of more immediate interest is Can High-Performance Interconnects Benefit Hadoop Distributed File System? [Sur2010]. This paper looks at what happens today if you hook up a Hadoop cluster over InfiniBand. They showed it could speed things up, even more so if you went to SSD; but go there and -today- you massively cut back on your storage capacity. It is an interesting read, though it irritates me that they fault HDFS for not using the allocateDirect feature of Java NIO, yet didn't file a bug or a fix for it. If you see a problem, don't just write a paper saying "our code is faster than theirs": fix the problem in both codebases and show the speedup is still there, as you've just removed one variable from the mix.
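For reference, the NIO feature in question is a one-line difference in how buffers are allocated; a minimal sketch:

    import java.nio.ByteBuffer;

    public class BufferDemo {
      public static void main(String[] args) {
        // Heap buffer: backed by a Java byte[], so handing it to native I/O
        // may involve an extra copy.
        ByteBuffer heap = ByteBuffer.allocate(64 * 1024);

        // Direct buffer: allocated outside the Java heap, so native I/O can
        // read and write it without that copy; this is the feature the paper
        // faults HDFS for not using.
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);

        System.out.println(heap.isDirect());    // false
        System.out.println(direct.isDirect());  // true
      }
    }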


Anyway, even with that paper, 10GbE looks pretty good: it'll be built in to the new servers, and if the NIO issue can be fixed, its performance may get even closer to InfiniBand's. You'd then have to move to alternative RPC mechanisms to get the latency improvements that InfiniBand promises.

Did the Oracle team have these papers in mind when they designed the hardware? Unlikely, but they may have felt that IB offers a differentiator over 10GbE. Which it does: in cost, in limits of scale, and in the complexity of bringing up a rack.

They'd better show some performance benefits for that -either in Hadoop or the NoSQL DB offering they are promising.

(*) People call this the "roach motel" model, but I prefer to refer to Northwick Park Hospital in NW London. Patients enter, but they never come out alive.

2011-10-08

Oracle and Hadoop part 1

Under the M32

I suppose I should add a disclaimer that I am obviously biased against anything Oracle does, but I will start by praising them for:
  1. Recognising that the Hadoop platform is both a threat and an opportunity. 
  2. Doing something about it.
  3. Not rewriting everything from scratch under the auspices of some JCP committee.

Point #3 is very different from how Sun would work: they'd see something nice and try to suck it into the "Java Community Process", which would either leave it bloated from the weight that standards bodies apply to technology (EJB3 vs Hibernate), stall it until it became irrelevant (most things), or ruin a good idea by proposing a complete rewrite (Restlet vs JAX-RS). Overspecify the API, underspecify the failure modes, and only provide access to the test suite if you agree to all of Sun's (and now Oracle's) demands.

No, they didn't go there. Which makes me worry: what is their plan? I don't see Larry taking to the idea of "putting all the data that belongs to companies into a filesystem and processing platform that I don't own". I wonder if the current Hadoop+R announcement is a plan to get into the game, but not all of it.

Or it could just be that they realised that if they invited Apache to join some committee on the topic they'd get laughed at. 

Sometime soon I will look at the H/W specs and wonder why that was so overspecified. Not today.

What I will say irritated me about Oracle World's announcements was not Larry Ellison, it was Mark Hurd saying snide things about HP and how Oracle's technologies were better. If that's the case, given the time to market of new systems, isn't that a critique of Hurd himself? That's Mark "money spent on R&D is money wasted" Hurd? That's the Mark whose focus on the next quarter meant that anything long-term wasn't anything he believed in? That is the Mark Hurd whose appointed CIO came from Walmart and viewed any "unofficial" IT expenditure as something to clamp down on? Those of us doing agile and advanced stuff on complex software projects ended up using tools that weren't approved: IntelliJ IDEA, JUnit tests, Linux desktops, Hudson running CI. The tooling we set up to get things done was the complete antithesis of the CIO's world view of one single locked-down Windows image for the entire company and a choice of two machines: "the approved laptop" and "the approved desktop". Any bit of the company that got something done had to go under the radar and do things without Hurd and his close friends noticing.

That's what annoyed me.

(Artwork: something in Eastville. Not by Banksy, before anyone asks)