Great stuff. I didn't know about the likely / __builtin_expect() gcc hint. I see how apt the publication name, Delayed Branch, is.
If the CPU encounters branchyFunction() more than a few times (> 1? 2, 3, or 4?), can its branch predictor essentially do the same optimization when it sees that cmp %rax, 0 instruction at that location the next time?
Hey VJ! Thank you, glad you're enjoying the posts.
Yes, the CPU branch predictor will "learn" patterns when a branch is executed repeatedly, or within a certain window of time (the branch predictor in modern CPUs is typically backed by some sort of history buffer that tracks the outcomes of recently executed branch instructions). However, whatever the branch predictor learns, it can't change the layout of instructions in memory. Only the compiler can place the taken branch target adjacent to the branch instruction and get the higher cache density that comes with it.
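To make that concrete, here's a minimal sketch of how the hint is usually wrapped. The macro names and the handle() function below are just illustrative, not anything from the post:

    #include <stddef.h>

    /* Common way the hint is wrapped; the Linux kernel and others use similar macros. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Hypothetical handler: the error path is annotated as unlikely, so gcc
     * can move it out of line and keep the hot path contiguous -- exactly the
     * layout decision the branch predictor can't make at runtime. */
    int handle(const char *buf, size_t len)
    {
        if (unlikely(buf == NULL || len == 0))
            return -1;      /* cold path: pushed away from the hot code */

        /* hot path: falls through with no taken branch */
        return (int)len;
    }

With the unlikely() annotation, gcc tends to emit the error-handling code out of the straight-line path, so the common case runs with no taken branch and better instruction-cache locality.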
likely()/__builtin_expect() is overkill for most applications; a tool like AutoFDO (https://research.google/pubs/pub45290/) is usually a better fit. AutoFDO will actually run the application in question (ideally with real inputs and workloads) and, with the help of instrumentation (for specific types of AutoFDO), measure the actual probabilities and patterns of taken/not-taken for branches. Essentially, AutoFDO figures out where to place (or not place) the likely hint automatically, for every branch in the program. There are of course some details I'm eliding, but that's the general mental model you can follow.
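For what it's worth, here's a rough sketch of the kind of branch that profile feedback discovers on its own, with no annotation anywhere. The flags and tool names in the comments depend on your gcc version and the autofdo tooling you have installed, so treat them as approximate:

    /* Toy program with a heavily biased branch that profile feedback would
     * find by itself.  With classic instrumented FDO the build is roughly:
     *   gcc -O2 -fprofile-generate biased.c && ./a.out && gcc -O2 -fprofile-use biased.c
     * AutoFDO instead samples a run of the real binary (e.g. with perf) and
     * feeds the converted profile back via -fauto-profile. */
    #include <stdio.h>

    int main(void)
    {
        long hits = 0;
        for (long i = 0; i < 10000000; i++) {
            if (i % 1000 == 0)   /* taken ~0.1% of the time */
                hits++;          /* cold side of the branch */
        }
        printf("%ld\n", hits);
        return 0;
    }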
Sweet, I made a note of that and the gcc wiki for AutoFDO. I've been out of the industry for a few years and unfortunately didn't spend time on lower-level optimization like my colleague did. But I'm learning now! I know a while back Intel heavily encouraged manual (pre-deployment) instrumented profiling optimization for longer-pipeline micro-architectures like the Pentium 4, and in-order RISC CPUs essentially required it (out of my scope at the time). That seemed focused on the pipeline-stall penalties unpredictable branching would incur. However, your point about cache density/TLB misses is a good one I hadn't thought about until now.