Software Performance: System Optimizations
Introduction
In previous notes, we looked at a first principles approach to system performance, centered around tail latencies. Next, we’ll look at higher level performance considerations at the system/host level:
Context switches
System calls
Interrupts
TLB misses
Page faults
These factors are among the largest contributors to tail latency in practical systems. Context switches take a thread off a CPU core for many milliseconds at a time, interrupts can strike at any time in between any pair of instructions without warning, page faults lurk behind any memory access, and so on. Let’s walk through some of the common system level sources of variance, and ways to address them.
System Calls
System calls destroy application performance. Several reasons why:
The transition to kernel mode from user mode itself is expensive (100s of clock cycles)
TLB and cache thrashing
Many system calls block on external entities (locks, network/disk IO, so on) for an unbounded amount of time
At thousands or tens of thousands of operations per second, system calls are expensive, but not fatal. However, at the microsecond latencies needed for storage, and at high throughput (100k-1M operations per second), system calls quickly become too expensive.
Best practice is to avoid system calls in the hot data path, or at worst punt them to a separate thread which can afford to block. Otherwise, io_uring (https://kernel.dk/io_uring.pdf) offers an improved interface into the kernel that can submit IO requests to the kernel without performing a system call (in SQ poll mode). With io_uring the application puts the IO request into a ring buffer in memory, and the kernel pulls the operation off the queue. Completions are posted to a separate completion ring buffer which the application reads from. For networking specific applications, DPDK is another option which outright bypasses the kernel, while SPDK takes a similar approach for storage.
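For a concrete feel of the submit/complete ring flow, here is a minimal liburing sketch of a single read submitted in SQ poll mode. The file path, queue depth, and error handling are simplified placeholders; SQPOLL needs appropriate privileges and, on older kernels, registered files.

#include <liburing.h>
#include <fcntl.h>
#include <cerrno>
#include <cstdio>

int main() {
  // Ask the kernel to start an SQ polling thread: submissions are picked up
  // from the shared submission ring without a per-IO system call.
  struct io_uring_params params = {};
  params.flags = IORING_SETUP_SQPOLL;
  params.sq_thread_idle = 2000;  // ms of idle before the kernel poller sleeps

  struct io_uring ring;
  if (io_uring_queue_init_params(256, &ring, &params) < 0) {
    perror("io_uring_queue_init_params");
    return 1;
  }

  // Hypothetical input file; SQPOLL (on older kernels) requires the fd to be
  // registered with the ring.
  int fd = open("/tmp/data.bin", O_RDONLY);
  if (fd < 0 || io_uring_register_files(&ring, &fd, 1) < 0) {
    perror("setup");
    return 1;
  }

  char buf[4096];
  struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
  io_uring_prep_read(sqe, 0 /* index of the registered fd */, buf, sizeof(buf), 0);
  sqe->flags |= IOSQE_FIXED_FILE;

  // With SQPOLL this only writes to the ring (and wakes the poller if it went
  // idle); there is no per-IO syscall on the hot path.
  io_uring_submit(&ring);

  // Poll the completion ring from user space instead of blocking in the kernel.
  struct io_uring_cqe* cqe = nullptr;
  while (io_uring_peek_cqe(&ring, &cqe) == -EAGAIN) {
    // Busy-poll; a real data path would bound this or do other work meanwhile.
  }
  printf("read returned %d\n", cqe->res);
  io_uring_cqe_seen(&ring, cqe);

  io_uring_queue_exit(&ring);
  return 0;
}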
Hidden System Calls
One really nasty catch is that system calls are often hidden beneath abstraction layers. Consider the following (seemingly innocuous) code:
void processRequest(Request* request) {
  if (!validateRequest(request)) {
    LOG(ERROR) << "Request invalid";
    return;
  }
  handleRequest(request);
}
What’s wrong here? Well, what does LOG(ERROR) do? If you guessed executing system calls, you’d be correct! In fact, glog and other logging frameworks hide multiple memory allocations for the string buffers, string copies to construct the log message, and finally a write(2, ...) system call to write the buffer out to stderr. That’s bad enough, but it gets worse. Most of the time that write will be buffered in memory (say in a Unix pipe or such). However, there is no bound on how long the call can block. Some real-life examples I’ve seen in production include:
stderr being re-directed to a file on the boot drive → boot drive stalls → application stalls on logging call
stderr being read through a Unix pipe from another process → other process stalls/blocks → doesn’t read from pipe → pipe fills up → application writing to stderr stalls
An alternative logging approach is to log asynchronously using a background thread and an MPSC queue/ring buffer: the data path thread writes log messages into the queue, and a background thread pulls from it and performs the blocking write. The fundamental tradeoff is that log messages from immediately prior to a crash might be lost if they aren’t flushed to disk before the crash. One way around this is to log into IPC shared memory, so logs can at least survive process crashes for post-mortem debugging.
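A minimal sketch of that pattern, using a mutex-protected queue as a stand-in for a proper lock-free MPSC ring buffer (the class and names are made up for illustration; a production logger would also bound the queue and preallocate message buffers):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <unistd.h>

// Hypothetical async logger: data path threads enqueue, one background
// thread performs the blocking write(2) off the hot path.
class AsyncLogger {
 public:
  AsyncLogger() : flusher_([this] { flushLoop(); }) {}

  ~AsyncLogger() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_one();
    flusher_.join();  // drains whatever is still queued
  }

  // Called on the data path: no system calls, just an enqueue.
  void log(std::string msg) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(msg));
    }
    cv_.notify_one();
  }

 private:
  void flushLoop() {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
      while (!queue_.empty()) {
        std::string msg = std::move(queue_.front());
        queue_.pop_front();
        lock.unlock();
        msg.push_back('\n');
        // The only blocking system call, made off the data path.
        ssize_t n = ::write(STDERR_FILENO, msg.data(), msg.size());
        (void)n;  // a real logger would handle partial/failed writes
        lock.lock();
      }
      if (done_) return;
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::string> queue_;
  bool done_ = false;
  std::thread flusher_;
};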
Context Switches
Kernel threads (as opposed to user space threads, more commonly known as “green threads”, “coroutines”, or “fibers”) are given the illusion of continuous sequential execution on a CPU. However, in practice Linux and modern operating systems will multiplex M runnable threads on N CPU cores. If the number of runnable threads M exceeds the number of CPU cores N, the excess (M - N) threads will be periodically context switched out, leading to stalls lasting 100s of microseconds or even 10s of milliseconds. It’s very easy to accidentally get into this state, with threads being preempted unexpectedly.
A common pattern in applications is to have several different thread pools:
CPU heavy thread pool for request processing
IO thread pool spinning on epoll()
Background thread pools for blocking IO operations
Random thread pools from background libraries
The last two sets of thread pools are what can get us into trouble. Application owners typically spend the time to tune the first two thread pools correctly, ensuring the total number of threads in both is equal to the core count. However, those last two burn us for a couple of reasons. First, the background blocking IO thread pool is often sized fairly large to support sufficient throughput of blocking operations (remember, for blocking IO, each thread’s throughput is the reciprocal of its latency). Second, library-owned thread pools spin up threads the application owner never accounted for. There are a couple of fixes here:
Thread pinning
Pool resizing
Thread Pinning
Take the most latency sensitive threads and pin them to a subset of the CPU cores, and pin the remaining (non-latency-sensitive) threads to the remaining cores. Soft (and hard) realtime systems will divide their workload into data path/latency sensitive threads and control path/latency insensitive threads. Kernel boot parameters like isolcpus carve out a set of CPU cores for exclusive use; the application then pins data path threads 1:1 to the reserved CPU cores, while the remaining CPU cores run the control path threads. There must be a 1:1 mapping between data path threads and reserved CPU cores, while the control path threads can oversubscribe the available unreserved CPU cores (since they can tolerate context switching). Thread affinitization is a very common pattern in high performance systems; a minimal sketch of the mechanics follows below.
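On Linux the pinning itself is a few lines with pthread_setaffinity_np. A minimal sketch, assuming core 2 has been reserved on the kernel command line (e.g. isolcpus=2):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for CPU_ZERO/CPU_SET and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a single CPU core; returns false on failure.
bool pinToCore(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
  // Assume core 2 was reserved on the kernel command line (isolcpus=2).
  if (!pinToCore(2)) {
    fprintf(stderr, "failed to pin data path thread\n");
    return 1;
  }
  // ... run the latency sensitive data path loop on the reserved core ...
  return 0;
}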
Pool Resizing
Along with pinning threads, the thread pools should be tuned to only be as large as needed. What exactly this looks like will depend on the programming language in use, the common libraries which spin up their own threads, and how those libraries are configured.
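As a small illustration of right-sizing, the CPU-bound pool can be derived from the core count and the number of IO threads rather than hard-coded (a sketch; the pool implementation itself is whatever your codebase already uses):

#include <thread>

// Size the CPU-bound pool so that, together with the IO/epoll threads, the
// total number of runnable threads does not exceed the available core count.
unsigned cpuPoolSize(unsigned ioThreads) {
  unsigned cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;  // hardware_concurrency() may return 0 if unknown
  return cores > ioThreads + 1 ? cores - ioThreads : 1;
}

// e.g. with 2 epoll threads on a 16-core host: cpuPoolSize(2) == 14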
Interrupts
High performance NICs will spit out millions of interrupts per second to handle packet processing. Each interrupt momentarily steals the CPU core from the thread running on that core. The interrupted thread is stalled, and memory accesses on behalf of the interrupt handler evict L1/L2 cache lines which were in use by the application thread. When the application thread resumes running, it’ll suffer capacity misses due to the cached data the interrupt handler evicted before yielding the core back. Interrupts can also increase lock hold times when the interrupted thread is holding a lock needed by another thread.
To avoid these issues, interrupts should be routed to a subset of cores running non-latency-sensitive work (the cores running control path threads mentioned above). Also, pin background kernel threads (kworker, xfsaild, and others) to non-data-path cores to improve performance and lower latency on the data path cores. So far our optimizations in this area are limited to NIC IRQ affinitization. However, IRQs for other devices should also be affinitized away from latency sensitive threads/cores.
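As a sketch of IRQ affinitization, the affinity of an interrupt can be changed by writing a hex CPU mask to /proc/irq/<N>/smp_affinity. The IRQ number 45 and the mask below are made-up illustrative values, and this requires root; it is typically done from a provisioning script rather than application code.

#include <cstdio>

// Route an IRQ to a set of cores by writing a hex CPU mask to procfs.
// Returns false if the IRQ does not exist or we lack permission (needs root).
bool setIrqAffinity(int irq, unsigned cpuMask) {
  char path[64];
  snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
  FILE* f = fopen(path, "w");
  if (!f) return false;
  bool ok = fprintf(f, "%x\n", cpuMask) > 0;
  ok = (fclose(f) == 0) && ok;
  return ok;
}

int main() {
  // Hypothetical: steer IRQ 45 (say, a NIC RX queue) onto cores 6 and 7
  // (mask 0xC0), away from the reserved data path cores 0-5.
  if (!setIrqAffinity(45, 0xC0)) {
    fprintf(stderr, "failed to set IRQ affinity\n");
    return 1;
  }
  return 0;
}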
For an even better option, disable interrupts outright! To be more precise, use polling rather than interrupt-based IO. This is very common elsewhere in industry for high throughput, low latency IO. Some examples of this:
DPDK and SPDK (mentioned above) poll NIC and NVMe queues directly from user space rather than taking interrupts
io_uring also supports polling driven IO to eliminate system calls and interrupts
TLB Misses
The very act of loading a memory location can trigger a Translation Lookaside Buffer (https://en.wikipedia.org/wiki/Translation_lookaside_buffer) miss when performing the virtual → physical memory address translation. TLB misses trigger a page table walk using the hardware page table walker. This is not a cheap operation, easily 100+ clock cycles (depending on memory access times to read the page table entries).
Using hugepages can dramatically reduce how many TLB misses an application suffers. Typically memory mappings use 4k pages, so large memory ranges for tracing buffers and data transfer consume a large number of TLB entries. Hugepages (2MB or 1GB) replace many small pages with a single hugepage. Fewer pages reduce the # of TLB entries needed and thus lower the # of TLB misses.
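A minimal sketch of the two common ways to get hugepage-backed memory: explicit MAP_HUGETLB (which requires hugepages to have been reserved, e.g. via /proc/sys/vm/nr_hugepages), with a fallback to transparent hugepages via madvise(MADV_HUGEPAGE). The 1GB buffer size is illustrative.

#include <sys/mman.h>
#include <cstdio>

int main() {
  const size_t kSize = 1UL << 30;  // 1 GiB buffer, a multiple of the 2MB hugepage size

  // Option 1: explicit hugepages. Fails unless hugepages have been reserved.
  void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    // Option 2: fall back to a normal mapping and ask for transparent
    // hugepages; the kernel may back the range with 2MB pages.
    p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
      perror("mmap");
      return 1;
    }
    madvise(p, kSize, MADV_HUGEPAGE);
  }
  // ... use the buffer: far fewer TLB entries than 4k pages would need ...
  munmap(p, kSize);
  return 0;
}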
Page Faults
Each virtual memory page has a mapping to physical addresses in the page table. Occasionally that mapping is invalid. When a virtual address with an invalid page table entry is accessed, the CPU raises a page fault. This page fault must be handled by the operating system. There are two categories of page faults:
Soft faults
Memory is available, but the OS needs to allocate a physical page + set up a mapping
Hard faults
Backing data for the virtual page must be loaded from storage (typically from swap, or from a file on disk such as the pages of the executable itself)
Avoid soft faults by calling mlock() on the memory range to force the kernel to fault the pages into physical memory and set up the mappings in advance. Hard faults can be avoided by disabling swap and mlock()ing executable pages for the application’s hot path. Linux does impose limits on the amount of memory a process can lock (RLIMIT_MEMLOCK). However, these limits can be raised as needed.
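A minimal sketch of pre-faulting and pinning a hot path buffer with mlock(); the 64MB size is illustrative, and RLIMIT_MEMLOCK (ulimit -l) may need to be raised for it to succeed.

#include <sys/mman.h>
#include <cstdio>

int main() {
  const size_t kSize = 64UL << 20;  // 64 MiB hot-path buffer (illustrative)
  void* buf = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  // Fault every page in and pin it: no soft faults on first touch, and the
  // pages can never be reclaimed or swapped out from under the data path.
  if (mlock(buf, kSize) != 0) {
    perror("mlock (check RLIMIT_MEMLOCK / ulimit -l)");
    return 1;
  }

  // Alternatively, lock everything currently mapped plus future mappings,
  // including the executable's text pages:
  // mlockall(MCL_CURRENT | MCL_FUTURE);

  munmap(buf, kSize);
  return 0;
}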