Avoiding Unnecessary Work
We’ve covered low-level cache and hardware details that affect software performance. Let’s move to a higher level of abstraction now. There are three general ways to make a program faster:
1. Do less work
2. Do work faster
3. Do the work in parallel (Amdahl’s law)
In a throughput-bound system, option 3 isn’t viable because all CPU and other resources on the host are assumed to be fully utilized by other concurrent operations. That leaves doing work faster and skipping unneeded work. To that end, let’s cover a few simple optimizations I’ve found helpful.
Avoiding Copies
Simple: do not copy things that do not need to be copied. std::move is your friend here. Pass std::string_view instead of std::string. Newer C++ standards support std::span<T> as a view type atop contiguous container types like std::array<T, K> or std::vector<T>. Unless the concrete container type is needed by the implementation, generally prefer std::string_view and std::span<T> over the concrete type. If an object's ownership/lifetime does not need to be extended, do not pass std::shared_ptr<T>; instead pass const T& and dereference the std::shared_ptr<T> at API boundaries to avoid the shared_ptr<> copy (and reference count increment, see the earlier notes about cache coherence).
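A minimal sketch of both ideas, assuming C++20 for std::span (all names here are hypothetical, not from the original post): callees accept views/references, and the std::shared_ptr<T> is dereferenced once at the boundary so the reference count is never touched:
#include <cstddef>
#include <cstdint>
#include <memory>
#include <span>
#include <string>
#include <string_view>

struct Request { std::string path; };

// View parameter: accepts std::string, string literals, substrings, etc.
// without copying the underlying characters
bool startsWithSlash(std::string_view path) {
  return !path.empty() && path.front() == '/';
}

// View over any contiguous byte container (std::vector, std::array, C array)
std::size_t sumBytes(std::span<const std::uint8_t> bytes) {
  std::size_t total = 0;
  for (std::uint8_t b : bytes) total += b;
  return total;
}

// Reference parameter: no shared_ptr copy, no atomic ref count increment
bool isAbsolute(const Request& req) {
  return startsWithSlash(req.path);
}

void handle(const std::shared_ptr<Request>& owner) {
  // Dereference once at the API boundary; 'owner' guarantees lifetime
  bool abs = isAbsolute(*owner);
  (void)abs;
}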
The auto keyword and range for loops can also trigger copies:
// Performs a copy of each map element
for (auto [guid, path] : metadataMap) {
  // ...
}
auto foo = getValue(); // copies the returned value

// '&' binds a reference vs. a copy
for (const auto& [guid, path] : metadataMap) {
  // ...
}
const auto& foo = getValue(); // binds a reference, no copy
Whenever possible, use std::move() to move rather than copy. Do be mindful that std::move() isn’t a magic wand. It’s only helpful for types whose move constructor/move assignment operator is significantly faster than the copy constructor/copy assignment operator. Structs with many primitive/non-movable fields don’t appreciably benefit from move semantics.
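As a minimal sketch (the type and field names are hypothetical, not from the original post): moving a heap-backed container transfers its buffer in O(1), while a struct of plain ints gains nothing from std::move:
#include <string>
#include <utility>
#include <vector>

struct Config {
  std::vector<std::string> hosts; // heap-backed, cheap to move
};

Config makeConfig() {
  std::vector<std::string> hosts{"a.example.com", "b.example.com"};
  Config cfg;
  cfg.hosts = std::move(hosts); // steals the vector's buffer, no element copies
  return cfg;
}

struct Point { int x; int y; }; // all primitives: moving == copying, std::move gains nothing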
Trivially Copyable Types
When copies do need to occur, if the type being copied is trivially copyable (https://en.cppreference.com/w/cpp/named_req/TriviallyCopyable), the compiler/code author can replace a call to the copy constructor with a trivial memcpy(). Because the compiler knows the size of the memory being copied, it can unroll loops to avoid branches, and even automatically vectorize the copy using SSE/AVX instructions to execute it significantly faster. Keeping types trivially copyable where possible helps the compiler here. If a type must remain trivially copyable, static_assert(std::is_trivially_copyable_v<Type>) can be added to the code to ensure this invariant remains true.
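For example (PacketHeader is a hypothetical type), a struct of primitives stays trivially copyable, and the assertion catches a regression, say someone adding a std::string field, at compile time:
#include <cstdint>
#include <cstring>
#include <type_traits>

struct PacketHeader {
  uint32_t magic;
  uint16_t version;
  uint16_t flags;
  uint64_t length;
};

// Fails to compile if anyone adds a non-trivial field (e.g. std::string)
static_assert(std::is_trivially_copyable_v<PacketHeader>);

// Safe to copy with memcpy; the compiler emits the same for the
// implicit copy constructor and can vectorize larger copies
void copyHeader(PacketHeader& dst, const PacketHeader& src) {
  std::memcpy(&dst, &src, sizeof(PacketHeader));
}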
Strength Reduction
Common operations like integer division are significantly more expensive than would be anticipated. For example, an idiv instruction on Intel CPUs easily takes 25-90 CPU cycles (depending on the operand sizes): https://www.agner.org/optimize/instruction_tables.pdf. Pieces of code like the one below are much more expensive than expected:
int readRingBuffer(int* buf, size_t size, size_t index) {
return buf[index % size];
}
There are two problems here:
1. The ‘%’ turns into an idiv instruction with at least 25 cycle latency on modern Intel/AMD CPUs (or 40+ if sizeof(size_t) == 8 and it’s a 64-bit idiv).
2. The high-latency division is on the critical path of the function: the CPU cannot determine the index into the array until the division fully completes, stalling the CPU pipeline.
The solution is embedded in the fact that size is often a power of two. When the divisor is a power of 2, the modulus operation can be replaced with & (size - 1). An example:
int readRingBuffer(int* buf, size_t sizePow2, size_t index) {
return buf[index & (sizePow2 - 1)];
}
The subtraction and bitwise AND only require 1-2 cycles each. Similarly, division by a power of 2 is easily replaced with a right shift by the power of 2:
// Bad, 25+ clock cycles: runtime division
uint32_t lbasDiv(uint32_t size, uint32_t lbaSize) {
  return size / lbaSize;
}

// Good, ~1-2 clock cycles: shift right by log2(lbaSize)
uint32_t lbasShift(uint32_t size, uint32_t lbaSizePow2) {
  return size >> lbaSizePow2;
}
For divisors and moduli which are known a priori, but are not a power of 2, libdivide is an option as well: https://libdivide.com/. It is also useful for divisors which are not known at compilation time, but which are used repeatedly at runtime.
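A minimal sketch of the idea, assuming libdivide’s C++ header (libdivide.h) is on the include path; the function name is made up for illustration:
#include <cstdint>
#include <vector>
#include "libdivide.h"

// Precompute the multiply/shift "magic numbers" once, then reuse them
// across many divisions; each division then avoids idiv entirely
uint64_t sumQuotients(const std::vector<uint64_t>& values, uint64_t divisor) {
  libdivide::divider<uint64_t> fast(divisor); // expensive setup, done once (divisor must be non-zero)
  uint64_t sum = 0;
  for (uint64_t v : values) {
    sum += v / fast; // operator/ overload compiles to multiply + shift
  }
  return sum;
}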
std::shared_ptr<T> woes
This is somewhat related to the cache locality/true sharing issue discussed in a previous note, but it’s worth calling out separately here. Shared pointers are nearly always an anti-pattern, and a symptom that the object lifetime model of a class or system is broken or underspecified. In practice, shared_ptrs tend to be used as hacks around an underspecified object ownership/lifetime model:
folly::Future<T> Clazz::myAsyncAPI() {
  // Keep 'this' alive by capturing a shared_ptr in the continuation
  return client_->myRpc(createParams()).thenTry([self = shared_from_this()](auto res) {
    return self->doSomeProcessing(std::move(res));
  });
}
The real problem is that the lifetime of the Clazz instance is leaked outside of the owner’s control because of the continuation attached to the lambda. In application-level code, coroutines offer a solution. Because async calls are ‘blocked’ by a co_await, the myAsyncAPI() call doesn’t need to leak a shared_ptr:
folly::coro::Task<T> Clazz::myAsyncAPI() {
auto res = co_await client_->myRpc(createParams());
return doSomeProcessing(std::move(res));
}
A better way to handle this is to correctly set up the ownership model so that the async operation must complete before the top-level objects can be destroyed:
class Clazz {
 public:
  Clazz() : client_(makeClient()) {}

  void myAsyncAPI(Callback cb) {
    // Safe to capture 'this' because Clazz owns client_ and
    // client_ won't destruct until all pending requests complete
    client_.myRPC([this, cb = std::move(cb)](auto res) {
      cb(doSomeProcessing(std::move(res)));
    });
  }

 private:
  // Client destruction blocks until all pending requests drain
  AsyncClient client_;
};
In general, shared_ptr<> is performance poison and should be avoided at all costs; many posts, both internal and external, outline the issues with shared_ptr<>.
If std::shared_ptr<T> absolutely must be used:
- Do not copy them on the critical path.
- Especially, do not perform copies across different cores, since that bounces the cache line containing the control block across cores because of the reference count updates.
- Do not return std::shared_ptr<> to internal struct instances. Once a std::shared_ptr<> leaks outside a client-visible API, clients copy it at will, triggering ref count increments and potential memory leaks.
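As a hedged illustration of the first rule (type and function names are made up): pass const std::shared_ptr<T>& (or a raw reference) on hot paths so the atomic reference count is never touched per call:
#include <memory>

struct Record {
  long value;
};

// Bad: pass-by-value copies the shared_ptr, bumping the atomic
// ref count on entry and decrementing it on exit, on every call
long readBad(std::shared_ptr<Record> rec) { return rec->value; }

// Good: pass-by-reference leaves the ref count untouched
long readGood(const std::shared_ptr<Record>& rec) { return rec->value; }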
LIKELY/UNLIKELY
While modern CPUs have powerful branch predictors which achieve high prediction accuracy, developers can still provide assistance. Low-level system software makes aggressive use of the LIKELY()/UNLIKELY() macros to hint to the compiler which direction a branch is expected to be taken. The most practical impact of this hint is that the compiler can lay out the instructions in a more instruction-cache-friendly manner. For example:
void branchyFunction(uint32_t argument) {
  if (argument == 0) {
    // Error handling code
    return;
  }
  // Processing code
}
Naively, without any hints, the compiler will lay out the assembly code something like this:
cmp $0, %rax
jne processing_code
// Error handling assembly
ret
processing_code:
// Processing assembly
The problem here is the error handling instructions are rarely executed. The conditional jump must skip over the error handling instructions to the later processing_code label. Depending on the size of the error handling instructions, the instructions at that label could easily be on a different cache line, and trigger a cache miss. Instead, a hint about the branch’s likelihood is given to the compiler:
void branchyFunction(uint32_t argument) {
  if (UNLIKELY(argument == 0)) {
    // Error handling code
    return;
  }
  // Processing code
}
which (depending on compiler options and optimization levels) is compiled down to something like:
cmp $0, %rax
je error_handler
processing_code:
// Processing assembly
ret
error_handler:
// Error handling assembly
ret
The hot path is straight-line code located on the same cache line as the je instruction. Instruction cache misses are reduced and less pressure is placed on the instruction cache. High-performance software makes aggressive use of LIKELY/UNLIKELY; Linux, for example, makes extensive use of branch prediction hints: https://kernelnewbies.org/FAQ/LikelyUnlikely.
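For reference, a common definition of these macros on GCC/Clang (the exact spelling varies by codebase; this is a typical form, not necessarily the one the code above uses):
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
C++20 also standardizes this hint as the [[likely]]/[[unlikely]] attributes.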