Time on multi-core, multi-socket servers

Stokes Croft Graffiti, Sept 2015

In Distributed Computing the notion of "when-ness" is fundamental; Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" paper is considered one of the foundational pieces of work.

But what about locally?

In the Java APIs we have System.currentTimeMillis() and System.nanoTime() to return time.

We experienced developers "know" that currentTimeMillis() is on the "wall clock", so if things happen to that clock (manual or NTP clock shifts, VM migration), the time can suddenly jump to a new value. For that reason, nanoTime() is the one we should really be using to measure elapsed time, monotonically.
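To make the conventional wisdom concrete, here is a minimal sketch of timing the same piece of work with both clocks. The class and method names (ElapsedTimeDemo, wallClockElapsedMillis, nanoTimeElapsedMillis) are made up for illustration; only System.currentTimeMillis(), System.nanoTime() and TimeUnit are real JDK APIs.

```java
import java.util.concurrent.TimeUnit;

/** Sketch: timing one piece of work with both JVM clocks. All names are illustrative. */
final class ElapsedTimeDemo {

    /** Wall-clock elapsed time; may be wrong, even negative, if the clock is stepped meanwhile. */
    static long wallClockElapsedMillis(Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        return System.currentTimeMillis() - start;
    }

    /** nanoTime() elapsed time; only the difference of two readings means anything. */
    static long nanoTimeElapsedMillis(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        Runnable work = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("wall clock: " + wallClockElapsedMillis(work) + " ms");
        System.out.println("nanoTime:   " + nanoTimeElapsedMillis(work) + " ms");
    }
}
```

On a quiet machine the two numbers should agree closely; the interesting cases are the ones where they don't.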

Except I now no longer trust it. I've known for a long time that CPU frequency changes could alter the rate at which it ticks, but as of this week I've discovered that on a multi-socket (and older multi-core) system, the nanoTime() value may be one or more of:

  1. Inconsistent across cores, hence non-monotonic on reads, especially reads likely to trigger thread suspend/resume (anything with sleep(), wait(), IO, accessing synchronized data under load).
  2. Not actually monotonic.
  3. Consistent only because it queries heavyweight counters, with possibly longer function execution time and lower granularity than the wall clock.

That is: modern NUMA, multi-socket servers are essentially multiple computers wired together, and we have a term for that: distributed system.
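One way to see whether a given machine shows the first two symptoms is to have several threads compare their readings against a shared "latest value seen". What follows is a rough probe under my own assumptions, not a rigorous test: the class NanoTimeMonotonicityProbe and its method are hypothetical names, and a result of zero only means nothing was caught during this run. Each thread reads the shared value first and the clock second, so any published value was produced by a nanoTime() call that had already returned; a smaller reading is then a genuine backwards step rather than a race in the probe itself.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: look for nanoTime() going backwards when readings are shared across threads. */
final class NanoTimeMonotonicityProbe {

    /** Returns the number of readings that were earlier than a value already published. */
    static long countBackwardSteps(int threads, int readsPerThread) {
        AtomicLong latest = new AtomicLong(System.nanoTime());
        AtomicLong backwardSteps = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < readsPerThread; n++) {
                    long seen = latest.get();      // produced by a call that already returned
                    long now = System.nanoTime();  // read strictly after 'seen' was published
                    if (now - seen < 0) {          // compare by subtraction, as the Javadoc advises
                        backwardSteps.incrementAndGet();
                    }
                    // Publish the later of the two values, again comparing by subtraction.
                    latest.accumulateAndGet(now, (a, b) -> (b - a > 0) ? b : a);
                    if ((n & 0x3FF) == 0) {
                        Thread.yield();            // encourage rescheduling, possibly onto another core
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // give up waiting; the count may be partial
            }
        }
        return backwardSteps.get();
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println("backward steps seen: " + countBackwardSteps(threads, 1_000_000));
    }
}
```

The JVM does not pin threads to cores, so whether the threads actually land on different sockets is up to the OS scheduler.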

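The standard way to read nanotime on an x86 part is to read the TSC (Time Stamp Counter) via the RDTSC opcode. It is lightweight, though RDTSC is not itself a serializing instruction, so code that needs precisely ordered readings has to pair it with a fence or use RDTSCP.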

Except every core in a server may be running at a different speed, and so have a different value for that counter. When code runs across cores, different numbers can come back.

In Intel's Nehalem microarchitecture the TSC is shared across all cores on the same die, and clocked at a rate independent of the CPU: monotonic and consistent across the entire socket. Threads running on any core in the same die will get the same number from RDTSC, which is something that System.nanoTime() may use.

Fill in that second socket on your server, and you have lost that consistency, even if the parts and their TSC counters are running forwards at exactly the same rate. Any code you had which relied on TSC consistency is now going to break.

This is all ignoring virtualization: the RDTSC opcode may or may not be virtualized. If it is: you are on your own.

Operating systems are aware of this problem, so they may use alternative mechanisms to return a counter, which may be neither monotonic nor fast.
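If you want a feel for what your own OS and JVM are doing, you can measure how expensive the calls are and how fine the observed steps are. This is a crude sketch, not a proper benchmark (no warm-up control, no JMH); ClockCostProbe and its methods are names I have invented for the example.

```java
/** Sketch: rough cost and granularity of the JVM clocks on this machine. */
final class ClockCostProbe {

    /** Average nanoseconds per System.nanoTime() call, as measured by nanoTime() itself. */
    static double averageNanoTimeCallCost(int calls) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print(""); // use 'sink' so the loop is not optimized away
        return (double) elapsed / calls;
    }

    /** Average nanoseconds per System.currentTimeMillis() call. */
    static double averageCurrentTimeMillisCallCost(int calls) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.currentTimeMillis();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print("");
        return (double) elapsed / calls;
    }

    /** Smallest positive gap between consecutive nanoTime() reads; Long.MAX_VALUE if none seen. */
    static long smallestObservedNanoTimeStep(int samples) {
        long smallest = Long.MAX_VALUE;
        long previous = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            long now = System.nanoTime();
            long delta = now - previous;
            if (delta > 0 && delta < smallest) {
                smallest = delta;
            }
            previous = now;
        }
        return smallest;
    }

    public static void main(String[] args) {
        int calls = 5_000_000;
        System.out.printf("nanoTime():          %.1f ns/call%n", averageNanoTimeCallCost(calls));
        System.out.printf("currentTimeMillis(): %.1f ns/call%n", averageCurrentTimeMillisCallCost(calls));
        System.out.println("smallest nanoTime() step: " + smallestObservedNanoTimeStep(calls) + " ns");
    }
}
```

A per-call cost in the hundreds of nanoseconds or more would suggest the OS has fallen back to one of those heavier counters rather than the TSC.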

Here, then, is some reading on the topic

The conclusion I've reached is that, except for the special case of using nanoTime() in microbenchmarks, you may as well stick to currentTimeMillis(), knowing that it may sporadically jump forwards or backwards. If you switch to nanoTime(), you don't get any monotonicity guarantees, it doesn't relate to human time any more, and it may be more likely to lead you into writing code which assumes a fast call with consistent, monotonic results.
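If you do stick with the wall clock, the code has to be written expecting those jumps. A minimal sketch of what that might look like, assuming the only defence you want is never reporting a negative duration: JumpTolerantStopwatch is a hypothetical class of mine, and the injectable LongSupplier exists purely so the behaviour can be exercised with a fake clock.

```java
import java.util.function.LongSupplier;

/** Sketch: an elapsed-time helper that expects its clock to jump backwards occasionally. */
final class JumpTolerantStopwatch {
    private final LongSupplier clockMillis;
    private final long startMillis;

    /** The clock is injectable so tests can simulate a jump; production would pass the wall clock. */
    JumpTolerantStopwatch(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        this.startMillis = clockMillis.getAsLong();
    }

    static JumpTolerantStopwatch startOnWallClock() {
        return new JumpTolerantStopwatch(System::currentTimeMillis);
    }

    /** Elapsed milliseconds, clamped to zero if the clock has moved behind the start time. */
    long elapsedMillis() {
        return Math.max(0L, clockMillis.getAsLong() - startMillis);
    }

    public static void main(String[] args) {
        long[] fakeNow = {1_000L};
        JumpTolerantStopwatch stopwatch = new JumpTolerantStopwatch(() -> fakeNow[0]);
        fakeNow[0] = 1_250L;
        System.out.println(stopwatch.elapsedMillis()); // clock moved forward by 250 ms
        fakeNow[0] = 400L;                             // simulate the clock being stepped backwards
        System.out.println(stopwatch.elapsedMillis()); // clamped rather than negative
    }
}
```

Clamping hides backwards jumps but does nothing about forward ones: a clock stepped an hour ahead still looks like an hour of elapsed time.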