Software Performance: Cache Coherency And NUMA
Cache Coherency
Before we dive into examples of the sorts of things we can do to lower variance, let's take a brief detour into the internals of the modern CPUs we run our code on. One important aspect of modern CPUs is their caches. Each CPU core has private L1 and L2 caches that store both data and instructions. Because all CPU cores access the same physical address space, they must coordinate accesses to the same address to ensure a consistent view of any given memory location. Cache coherence protocols manage this consistent view (note that this consistency is at a lower level than the protection provided by software locks; cache coherence protocols do not obviate the need for locks and care to avoid race conditions).
Among the most foundational cache coherence protocols is MESI (Modified, Exclusive, Shared, Invalid). The MESI protocol (https://en.wikipedia.org/wiki/MESI_protocol) maintains a small state machine for each cache line in a cache. There are some basic rules about the combinations and transitions of states permitted for each cache line (see the sketch after the list below):
Modified (M): Indicates the cache line is dirty and must be written back to main memory.
Exclusive (E): Indicates this particular cache has exclusive access to the corresponding memory. Only one copy of the cache line can exist in the Exclusive state.
Shared (S): Indicates this cache line is neither dirty nor exclusively owned and is readable. Multiple copies of the same cache line can exist in the Shared state in different caches.
Invalid (I): Indicates another processor has written to this cache line and the local copy is out of date. A read or write to this cache line on this core must coordinate with the remote cache holding the same cache line in the Modified or Exclusive state.
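These rules can be summarized as a small state machine. Below is a toy sketch, purely illustrative, of how one cache's copy of a line transitions; a real coherence controller tracks much more (snoop filters, writebacks, and so on):

// Toy model of MESI bookkeeping for a single cache line (illustrative only).
enum class MesiState { Modified, Exclusive, Shared, Invalid };

enum class LocalOp { Read, Write };
enum class RemoteOp { RemoteRead, RemoteWrite };

// Transition taken when the local core accesses the line.
MesiState onLocalAccess(MesiState s, LocalOp op, bool otherCopiesExist) {
  if (op == LocalOp::Write) {
    // A write requires exclusive ownership (an RFO if we don't already
    // hold the line in E or M), after which the line is dirty.
    return MesiState::Modified;
  }
  // A read: if we had no valid copy we fetch one; whether we get E or S
  // depends on whether any other cache also holds the line.
  if (s == MesiState::Invalid) {
    return otherCopiesExist ? MesiState::Shared : MesiState::Exclusive;
  }
  return s;  // M, E, and S all satisfy a local read without a transition.
}

// Transition taken when another core's request for this line is observed.
MesiState onRemoteAccess(MesiState s, RemoteOp op) {
  if (op == RemoteOp::RemoteWrite) {
    // The other core acquired ownership; our copy is now stale.
    return MesiState::Invalid;
  }
  // A remote read forces a dirty or exclusive line down to Shared
  // (a Modified line is written back / forwarded first).
  if (s == MesiState::Modified || s == MesiState::Exclusive) {
    return MesiState::Shared;
  }
  return s;
}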
Given these state transitions, when a CPU core attempts to write to a memory location (Address A), it must first acquire the Exclusive state on the cache line containing that address. This typically involves sending a Request For Ownership (RFO) operation over the inter-core network/bus on the CPU itself. Once the writing CPU (call it Core W) has acquired the cache line in the Exclusive state, Core W can modify the cache line with the desired value. Since Core W now has exclusive ownership of the cache line, any copies of the cache line containing Address A in other cores R1, R2, ... must transition to the Invalid (I) state.
Core W then issues the write and the cache line in that CPU's cache moves to the Modified (M) state. When another core R1 then reads the cache line, or attempts a write, it must communicate with Core W to transition Core W's copy to the Shared state (for a read on Core R1) or the Invalid state (for a write on Core R1). These cross-core communication operations are not cheap. Benchmarks I've run show that migrating ownership of a cache line across cores can cost upwards of 100 clock cycles, around 50 nanoseconds on a typical 2GHz server processor. On modern Intel CPUs these coherence operations traverse an on-die mesh network connecting the cores.
While the cache coherence protocol is negotiating cache line states, the instruction that wants to access the cache line is stalled, waiting for the line to arrive or enter the correct state. This sort of cache line "ping ponging", as it is known, is incredibly expensive. From an Amdahl's Law perspective, it creates a serialization point that dramatically reduces the parallel speedup.
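To get a feel for the cost, here is a rough microbenchmark sketch, not a rigorous measurement (for meaningful numbers the two threads should be pinned to distinct physical cores): two threads take strict turns incrementing one shared counter, so every increment forces an ownership migration.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<uint64_t> sharedCounter{0};
constexpr uint64_t kIters = 1'000'000;

void pingPong(uint64_t parity) {
  for (uint64_t i = 0; i < kIters; ++i) {
    // Spin until it's our turn, then increment. Each successful increment
    // pulls the cache line holding sharedCounter away from the other core
    // (an RFO), so every iteration pays a cross-core coherence round trip.
    uint64_t v;
    do {
      v = sharedCounter.load(std::memory_order_relaxed);
    } while (v % 2 != parity ||
             !sharedCounter.compare_exchange_weak(v, v + 1,
                                                  std::memory_order_relaxed));
  }
}

int main() {
  auto start = std::chrono::steady_clock::now();
  std::thread t0(pingPong, 0), t1(pingPong, 1);
  t0.join();
  t1.join();
  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::printf("~%.1f ns per ownership handoff\n",
              static_cast<double>(ns) / (2.0 * kIters));
  return 0;
}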
The key takeaway is: keep operations on a single thread, and ideally a single physical core. For each data structure, only read and write that structure from a single core. The one exception is data that is read-only or very rarely modified. An example would be a configuration system that periodically swaps out a global pointer to shared state, as sketched below.
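A minimal sketch of that pattern, using the standard atomic shared_ptr free functions (the Config fields and names here are purely illustrative):

#include <cstdint>
#include <memory>
#include <string>

// Illustrative read-mostly config: many reader threads, one infrequent writer.
struct Config {
  std::string logLevel;
  uint32_t maxConnections = 0;
};

std::shared_ptr<const Config> gConfig = std::make_shared<Config>();

// Hot path: readers grab a snapshot. The global pointer itself is only ever
// read, so its cache line stays in the Shared state in every reader's cache.
// (The copy does bump the control block's reference count atomically, which
// is itself a shared write, so hot readers typically hold on to the snapshot
// for a while rather than calling this on every operation.)
std::shared_ptr<const Config> currentConfig() {
  return std::atomic_load(&gConfig);
}

// Cold path: a background thread occasionally publishes a new snapshot.
// This rare write is what invalidates the readers' copies.
void publishConfig(std::shared_ptr<const Config> next) {
  std::atomic_store(&gConfig, std::move(next));
}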
As a rule of thumb, achieving 100% linear scaling requires avoiding any cross-core sharing of data. In practice any meaningful algorithm requires some communication across cores, but it's vital that cross-core communication be kept to the absolute minimum. Any communication between CPU cores is very expensive and becomes serial work in Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl%27s_law).
NUMA
Things get even more complicated when there are two different processor packages on the same host. In that case, NUMA (Non-Uniform Memory Access) becomes a factor. While parts of the industry are moving away from multi-socket NUMA designs in favor of single-socket servers, the larger core counts in new CPUs lead to NUMA-like behavior within a single package (sometimes called NUCA, Non-Uniform Cache Access). In multi-socket NUMA systems, the penalties for communication between cores in different NUMA domains are much steeper. Any non-trivial frequency of cache line ping-ponging between caches on different sockets will completely destroy performance, as a far more expensive cross-package hop is required.
It can also create strange artifacts such as increased memory bandwidth consumption, since cache lines may be written back to memory before being read by the other socket.
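The practical defense is the same as for cache line ping-ponging, just more aggressive: keep a thread, the memory it allocates, and the work it performs inside one NUMA domain. A rough Linux-specific sketch (the core numbering and the choice of 8 workers are illustrative assumptions; real code should query the topology, e.g. via libnuma, rather than hard-coding it):

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin each worker to a fixed core so a thread (and the memory it first
// touches) never wanders across NUMA domains.
void pinCurrentThreadToCore(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

void workerMain(int core) {
  pinCurrentThreadToCore(core);
  // Allocate this thread's data structures *after* pinning so first-touch
  // page placement puts them on the local NUMA node.
  // ... run the worker loop ...
}

int main() {
  // Assume (for illustration) that cores 0-7 live on socket 0.
  std::vector<std::thread> workers;
  for (int core = 0; core < 8; ++core) {
    workers.emplace_back(workerMain, core);
  }
  for (auto& w : workers) w.join();
  return 0;
}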
False Sharing
I've outlined how CPU cores negotiate access to cache lines using coherence protocols. Next we'll look at a couple of common performance problems caused by this behavior. The first is called False Sharing. False sharing occurs when two different cores compete for access to the same cache line, but access different parts of that cache line (i.e. the "sharing" of data is false). An example is shown below:
struct SystemContext {
  // Accessed by thread one
  uint32_t count;
  // Accessed by thread two
  SpinLock lock;
} context;

void threadOne() {
  while (...) {
    // Acquires Exclusive access to write the cache line
    // which contains context.count
    context.count++;
  }
}

void threadTwo() {
  while (...) {
    // Acquires Exclusive access to perform a write
    // to the cache line which contains context.lock
    context.lock.acquire();
  }
}
Both threads are attempting to access and modify the same cache line, the one holding the SystemContext struct instance, but they read and modify disjoint portions of the struct and thus of the cache line. Performance in this case is incredibly bad: dozens if not hundreds of CPU cycles are wasted communicating across the inter-core network/bus to perform coherence protocol operations that move ownership of that cache line back and forth between cores. The solution for false sharing is simple: add padding or alignment so the portions of memory used by different CPU cores land on different cache lines:
struct SystemContext {
  // Each member now starts on its own cache line
  alignas(kCacheLineSize) uint32_t count;
  alignas(kCacheLineSize) SpinLock lock;
} context;
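These snippets assume a kCacheLineSize constant; 64 bytes is the common value on current x86 parts, and C++17 also offers a portable constant for it (toolchain support varies):

#include <cstddef>
#include <new>

// One way to define kCacheLineSize (requires C++17 library support);
// hard-coding 64 is a common fallback on x86.
constexpr std::size_t kCacheLineSize = std::hardware_destructive_interference_size;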
It’s easy to accidentally cause false sharing in more subtle cases. Consider the following example:
struct ThreadContext {
  uint32_t aggregate;
  bool done;
};

std::vector<ThreadContext> contexts;

// Each thread running this logic
void threadMain(uint16_t threadIdx) {
  auto& context = contexts[threadIdx];
  while (...) {
    context.aggregate = ...;
  }
}
The problem with the above example is that the contexts vector is laid out contiguously in memory, so ThreadContext structs associated with different threads overlap on the same cache lines. The solution is the same as before: add padding/alignment to keep each struct instance on different cache lines:
struct ThreadContext {
  uint32_t aggregate;
  bool done;
  uint8_t padding[kCacheLineSize];
};
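An equivalent approach, which I find harder to get wrong as fields are added, is to align the struct itself. Since sizeof is always rounded up to a multiple of the alignment, consecutive vector elements can then never share a cache line:

struct alignas(kCacheLineSize) ThreadContext {
  uint32_t aggregate;
  bool done;
  // No manual padding needed: the compiler pads sizeof(ThreadContext) up to
  // a multiple of kCacheLineSize, and C++17 guarantees the std::vector
  // allocation respects the over-alignment.
};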
True Sharing
True sharing is similar to false sharing, with the distinction that each core accesses the same part of the cache line (the sharing is real, not accidental as in false sharing). An example of true sharing would be something like a shared atomic counter:
uint64_t getNextId() {
  static std::atomic<uint64_t> nextId{0};
  return nextId++;
}
In the above example, each time a thread calls getNextId(), the core running that thread must acquire Exclusive ownership of the cache line holding the nextId static variable from whichever core performed the previous increment. A much more subtle variant of this involves std::shared_ptr reference counts:
std::shared_ptr<T> ptr;

// Executes on core 4
void threadOne() {
  while (...) {
    // Performs a copy of ptr, issues an atomic
    // increment of the reference count, taking
    // ownership of the cache line
    func1(ptr);
  }
}

// Executes on core 3
void threadTwo() {
  while (...) {
    // Runs next and also performs a copy of ptr.
    // Reference count increment stalls due to
    // core 4 previously acquiring Exclusive ->
    // Modified access
    func2(ptr);
  }
}
In the above example, both cores are fighting over the cache line containing the reference count for the std::shared_ptr<>. This isn't obvious from a high-level inspection of the code, but performance will suffer significantly. Normal CPU profiling will show a hotspot in the std::shared_ptr<> copy constructor performing the reference count modifications, but there won't be any obvious indication of why it is expensive.
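One common mitigation, sketched below against the example above, is to simply not copy the shared_ptr on the hot path:

// Passing by const reference (or passing T* / const T&) performs no
// reference count update, so the control block's cache line is only read
// and can stay in the Shared state in both cores' caches. Copy the
// shared_ptr only where the callee genuinely needs to extend the object's
// lifetime.
void func1(const std::shared_ptr<T>& ptr) {
  // ... use *ptr without copying the shared_ptr ...
}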
True sharing is why locks can be expensive even in the uncontended case. The last acquisition/release of a lock often took place on a different core, so the cache line holding the lock's state must first migrate to the current core's cache before the lock can be acquired. Under high contention on a spinlock, performance can absolutely collapse due to the cache line holding the lock bouncing across cores.
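When shared state genuinely must be updated from many threads, the usual mitigations are batching and sharding, so the shared cache line is written as rarely as possible. A rough sketch against the getNextId() example above (the batch size of 1024 is an arbitrary illustrative choice, and IDs are no longer globally ordered across threads):

#include <atomic>
#include <cstdint>

uint64_t getNextId() {
  static std::atomic<uint64_t> nextBlock{0};
  constexpr uint64_t kBatch = 1024;
  // Each thread hands out IDs from a privately reserved block, touching the
  // shared cache line only once every kBatch calls instead of on every call.
  thread_local uint64_t current = 0;
  thread_local uint64_t remaining = 0;
  if (remaining == 0) {
    current = nextBlock.fetch_add(kBatch);
    remaining = kBatch;
  }
  --remaining;
  return current++;
}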
Thread Affinity
TL;DR: sharing is not caring. Keep operations against a data structure limited to a single thread for its lifetime if possible. Data structures should ideally be exclusively created, accessed, modified, and freed from the same thread and core.
Storage software I've worked on is incredibly aggressive about this. Allocators are thread-aware. Many APIs require a thread ID indicating which data path thread is calling the API so that state can be correctly partitioned across threads and cross-thread access avoided. A given IO operation is assigned to a single data path thread for its entire lifecycle, and no other thread/core is ever allowed to read or write that struct. This has the beneficial side effect of side-stepping the need for any locking of state. Instead of moving the data to different cores to act on it, we keep/move the computation to the core that owns the data.
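A bare-bones sketch of that shape, with illustrative names, and a mutex-protected queue standing in for what would really be a lock-free SPSC/MPSC ring:

#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// Each data path thread owns a partition of state plus an inbound queue.
// Work targeting an object is routed to the owning thread instead of
// touching the object's state from another core.
struct DataPathThread {
  std::mutex mu;
  std::queue<std::function<void()>> inbound;

  void post(std::function<void()> work) {
    std::lock_guard<std::mutex> guard(mu);
    inbound.push(std::move(work));
  }

  // Called in a loop by the owning thread.
  void runOne() {
    std::function<void()> work;
    {
      std::lock_guard<std::mutex> guard(mu);
      if (inbound.empty()) return;
      work = std::move(inbound.front());
      inbound.pop();
    }
    work();  // Runs on the owning thread; the IO's state needs no locks.
  }
};

std::vector<DataPathThread> dataPathThreads(8);

// Each IO is assigned an owning thread when created and keeps it for life.
uint16_t ownerOf(uint64_t ioId) {
  return static_cast<uint16_t>(ioId % dataPathThreads.size());
}

void completeIo(uint64_t ioId) {
  // Ship the work to the owner rather than mutating the IO's state here.
  dataPathThreads[ownerOf(ioId)].post([ioId] {
    (void)ioId;  // ... finish the IO using state owned by this thread ...
  });
}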