19 Comments

I think I found a minor bug in Jason Rahman's code for calculating round trip times between CPU cores. Lines 126 to 128 (inside the server function) are:

localTickReqReceived = rdtsc();

data.serverToClientChannel.store(((uint64_t)((uint32_t)(localTickRespSend = rdtsc())) << 32) | (uint64_t)(uint32_t)localTickReqReceived);

Suppose the 64-bit values returned from the two calls to the time stamp counter rdtsc() have the following form:

first rdtsc() call: [32-bit value]FFFFFFFX

second rdtsc() call: [incremented 32-bit value]0000000Y

X and Y are arbitrary 4-bit values.

The lower 32-bits from the two calls are extracted and stored like this: 0x0000000YFFFFFFFX

Lines 93 to 100 (inside the client function) are:

localTickRespReceive = rdtsc();

auto upperBits = (localTickRespReceive & 0xffffffff00000000);

remoteTickRespSend = ((resp & 0xffffffff00000000) >> 32) | upperBits;

remoteTickReqReceive = (resp & 0x00000000ffffffff) | upperBits;

auto offset = (((int64_t)remoteTickReqReceive - (int64_t)localTickReqSend) + ((int64_t)remoteTickRespSend - (int64_t)localTickRespReceive)) / 2;

auto rtt = (localTickRespReceive - localTickReqSend) - (remoteTickRespSend - remoteTickReqReceive);

resp is loaded with 0x0000000YFFFFFFFX in line 88.

(remoteTickRespSend - remoteTickReqReceive) will be the upper 32-bits of resp minus the lower 32-bits of resp, which will be a negative number. This is surely not what was intended. I don't understand the code well enough to suggest a good fix but one idea would be to test if (remoteTickRespSend - remoteTickReqReceive) is negative and add 0x100000000 if it is. This added constant is [incremented 32-bit value] minus [32-bit value] from the two rdtsc() calls mentioned earlier. An alternative is to subtract the two 64-bit values of rdtsc() in the server function and send the lower 32-bits of the difference to the client function. By the way, what is the offset variable and how is it used?
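To make the first idea concrete, here is a rough sketch of what the client-side reconstruction with that wrap correction might look like (my own code, not Jason's; the variable names simply follow the quoted snippet):

#include <cstdint>

// Rebuild the server's two timestamps from the packed 32-bit halves in resp,
// borrowing the upper 32 bits from the client's own rdtsc() value, then add
// 2^32 to the send timestamp if the server's low 32 bits wrapped between its
// two rdtsc() readings.
void reconstructRemoteTicks(uint64_t resp, uint64_t localTickRespReceive,
                            uint64_t& remoteTickReqReceive,
                            uint64_t& remoteTickRespSend) {
    const uint64_t upperBits = localTickRespReceive & 0xffffffff00000000ULL;
    remoteTickRespSend = ((resp & 0xffffffff00000000ULL) >> 32) | upperBits;
    remoteTickReqReceive = (resp & 0x00000000ffffffffULL) | upperBits;
    if ((int64_t)(remoteTickRespSend - remoteTickReqReceive) < 0) {
        // The server's TSC low word carried across the 32-bit boundary.
        remoteTickRespSend += 0x100000000ULL;
    }
}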

What is (remoteTickRespSend - remoteTickReqReceive) supposed to measure? It looks to me as though the variability in (remoteTickRespSend - remoteTickReqReceive) is just a result of differences in out-of-order execution of instructions within the server function rather than differences in round trip time. I wonder if it would be better to delete (remoteTickRespSend - remoteTickReqReceive) from the expression for rtt.

I don't think fixing this minor bug will change the results that much. At a core frequency of 2 GHz, there is a carry across the 32-bit boundary of the time stamp counter roughly once every 2 seconds. The chance of the two rdtsc() calls in the server function straddling that seems pretty small.

In both the client and server functions, it may be better to use the newer rdtscp instruction instead of rdtsc. rdtscp is supposed to reduce the variability of measurements on processors that do out-of-order execution. All Xeon processors and all recent consumer processors do out-of-order execution.
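For reference, reading the counter through the rdtscp intrinsic could look roughly like this (just a sketch on my part, not code from the repo):

#include <cstdint>
#include <x86intrin.h>

// rdtscp waits until all earlier instructions have executed before it samples
// the TSC, and also returns IA32_TSC_AUX (which Linux fills with the core and
// node numbers), so a thread that migrated between readings can be detected.
static inline uint64_t read_tsc_ordered(uint32_t& aux) {
    return __rdtscp(&aux);
}

It does not stop later instructions from starting early, so it is often paired with an lfence, but even on its own it should remove some of the out-of-order noise.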


Hi Jason, I am confused by the coloring of the heat map. From the single socket 8488 one, it seems to me that the upper-left-most small tile, 0-11 to 0-11, has the deepest color, meaning the largest latency. Is that right? It feels counterintuitive. Or am I reading it wrong? Thanks.


I think you are reading the diagram correctly. You make an interesting observation. If we ignore hyperthreaded siblings, we can just focus on the upper left 1/4th of this diagram. The upper left 1/4th of the diagram shows 16 squares that correspond to communication between the 4 chiplets in one socket. There are 4 yellow squares (95ns), 8 orange squares (105ns) and 4 red squares (115ns).

I expected the 4 yellow squares to be on the upper left to lower right diagonal of the 16 squares since yellow squares indicate the lowest latency and the diagonal is where the squares for intra-chiplet communication are located. The blue pixels (50ns), corresponding to communication between hyperthreads on the same physical core, are in the expected locations so it does not look like rows or columns of this diagram got swapped.

If the 4 columns of these 16 squares are labeled 0, 1, 2 and 3 for the 4 chiplets and the same for the rows, it looks like there are two slow chiplets (0 and 2) and two fast chiplets (1 and 3). Communication within or between slow chiplets is slow (red). Communication within or between fast chiplets is fast (yellow). Communication between a slow chiplet and a fast chiplet is an intermediate speed (orange). This theory that there are two slow chiplets and two fast chiplets in each processor socket could be tested by running the same benchmark on each chiplet.

One reason Intel might put two slow chiplets and two fast chiplets in one socket is to average out the power consumption of a socket. The fast chiplets would use more power than the slow chiplets.

One complication is that when communicating in one socket with two EMIB hops, three chiplets are involved. Maybe Jason Rahman's suggestion of a bypass around the mesh network in the middle chiplet means the speed of the middle chiplet doesn't matter so much, but I am just making guesses that I don't completely believe.

The axes on this diagram (Intel Xeon Platinum 8488C Single Socket Core-to-Core RTT map) seem to have a typo because each axis shows 48 + 96 = 144 virtual cores instead of 48 + 48 = 96 virtual cores.


Does this really produce comparable latency numbers across microarchitectures? It looks like you have an x86 PAUSE in the loop. The latency of PAUSE is not defined and varies wildly across implementations. Why not just spin?

Another aspect worth investigating would be the new WAITPKG ability to monitor an address from user space. Although the interface is a little clunky, if you are interested in how long it takes to wake another thread or core on a system, it hardly seems like you can ignore the fact that Sapphire Rapids is the first x86 server CPU to offer a brand new way to wake up another thread.
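Roughly, the user-space interface looks something like this (a sketch on my part, assuming a WAITPKG-capable CPU and compilation with -mwaitpkg; flag_addr is a made-up example variable):

#include <cstdint>
#include <x86intrin.h>

// Park the current thread until *flag_addr is written (or the TSC deadline
// passes). _umonitor arms address monitoring on the flag's cache line and
// _umwait waits for a write to that line, the deadline, an interrupt, or an
// OS-imposed limit, so the loop re-checks the flag after every wake-up.
bool wait_for_write(volatile uint32_t* flag_addr, uint64_t tsc_deadline) {
    while (*flag_addr == 0) {
        _umonitor((void*)flag_addr);
        if (*flag_addr != 0) break;      // close the race with the writer
        _umwait(0, tsc_deadline);        // control = 0 requests the deeper C0.2 state
        if (__rdtsc() >= tsc_deadline) return false;  // give up at the deadline
    }
    return true;
}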


Yes, you bring up a good point about the exact numbers being less than perfectly comparable across u-archs. I did update the GitHub repo with a version that can optionally spin using NOP rather than PAUSE instructions, so I can easily run that as well. Unfortunately these runs are pretty expensive on AWS, given that only one pair of cores is measured at a time and the total pair count is high. So we'll see when I can get around to running more of these types of tests with NOP. I do this on my personal time, so any hardware from #DayJob is not really in bounds here, which leaves AWS/GCP/Azure VMs, and those do add up for long multi-hour runs.

As an aside, I do recall the debacle when Skylake radically increased PAUSE latency. MySQL in particular was more strongly affected than many applications, and my day job at the time was running MySQL @ Scale, so that aspect of Skylake was particularly memorable (not in a good way).

As a tangent, the WAITPKG/UMONITOR stuff does look really fascinating, I've looked at it a bit before conceptually. One really fascinating potential use case would be watching a command queue in memory which an external system RDMAs a command buffer to. As soon as the head of the queue is updated by the remote write, the waiting thread could wake up and process the command + buffer. Would likely beat having to spin wait on reads from the queue, and could have better latency than 2 sided RDMA primitives. Unfortunately, I suspect UMONITOR may only work on Writeback Caching memory ranges, because I imagine it's probably implemented in terms of existing coherence mechanisms, and so would be bypassed by DMA/RDMA.


Replying to Stefan S., all the cores are hyperthreaded. My question was which virtual core numbers are in one processor socket and which virtual core numbers are in the other processor socket. If virtual core numbers 0 to 47 and 96 to 143 are in one socket, the diagram labeled "Intel Xeon Platinum 8488C Core-to-Core RTT map" indicates that the vast majority of the inter-socket Round Trip Time (RTT) is about 225ns. This suggests to me that the 3 nodes I described in my comment might be merged into one node centered at 225ns due to the EMIB crossing time being small or negligible.

The sizes of the rightmost 3 nodes in the RTT histogram do not seem consistent with these nodes being for UPI plus 1, 2 and 3 EMIB crossings because the middle node is not the biggest one, as would be expected. The table below shows the total number of EMIB crossings, which is the sum of the EMIB crossings in the first processor socket (horizontal) and second processor socket (vertical). There are two ways to get one EMIB crossing in a processor socket so both the horizontal and vertical axes have two 1s.

    |  0   1   1   2
----+---------------
  0 |  0   1   1   2
  1 |  1   2   2   3
  1 |  1   2   2   3
  2 |  2   3   3   4

The table above shows that when there is a UPI transfer (between processor sockets), 4/16 of the time there will be a total of 1 EMIB crossing, 6/16 of the time there will be a total of 2 EMIB crossings, and 4/16 of the time there will be a total of 3 EMIB crossings. Therefore, of the nodes for UPI, the node for 2 EMIB crossings (the middle node) should be the biggest one, that is, it should have the most counts. Looking at the 3 rightmost nodes in the RTT histogram, the leftmost of the 3 is the biggest one. This suggests the 3 rightmost nodes in the RTT histogram are not for UPI plus 1, 2 and 3 EMIB crossings.
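The counts in that table can be checked with a trivial enumeration (my own sketch, assuming the per-socket crossing counts of 0, 1, 1 and 2 used above):

#include <cstdio>

int main() {
    // EMIB crossings needed within one socket to reach the UPI chiplet.
    const int perSocket[4] = {0, 1, 1, 2};
    int counts[5] = {0, 0, 0, 0, 0};
    for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
            counts[perSocket[a] + perSocket[b]]++;
    for (int total = 0; total <= 4; ++total)
        printf("%d total EMIB crossings: %d/16\n", total, counts[total]);
    return 0;  // prints 1/16, 4/16, 6/16, 4/16 and 1/16
}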


Commented more below, but the topmost 3 modes in the distribution have been the most challenging to analyze and describe. Even the (simple) proposed model I've sketched out so far really doesn't fully explain those cases to my satisfaction.


Hey Keith, have been wrapped up with other commitments this past week, will take a closer look at your comments when some free time opens up. In the meantime, here is the /proc/cpuinfo output from the host I tested with: https://gist.github.com/jrahman/016e2620df5e7e23d071a463c975e021


Hi, is this something you could share instructions for running? I could provide you the output, as I have an SPR bare-metal host I could lend.


Jason Rahman's code is here:

github.com/jrahman/cpu_core_rtt


Thanks, sadly I'm not able to compile that. I have a 6421N I'd be willing to run some tests on (bare metal).


Ok, https://github.com/jrahman/cpu_core_rtt has been updated with a fixed-up version that works standalone.


Yes, that was an older version for local testing on an M1. Let me update Github with an updated version tonight.


Thanks for the writeup!

Like Keith P., I also would have appreciated a little bit more detail on which cores are hyper-threaded and which are not.


Hey Stefan, I just responded to Keith above, here is the /proc/cpuinfo output: https://gist.github.com/jrahman/016e2620df5e7e23d071a463c975e021.


Thank you for your very interesting and well written article. For the Amazon instance you used with 1 UPI link, when communicating between sockets, there are 0 EMIB hops 1/16th of the time, since only one of the 4 chiplets in each socket has an enabled UPI link. Similarly, 4 EMIB hops occur another 1/16th of the time. The most common cases for 1 UPI hop are 2 EMIB hops, which happens 6/16ths of the time, and 1 or 3 EMIB hops, which each happen 4/16ths of the time. I would therefore expect the histogram of round trip times to have a big node (for 1 UPI + 2 EMIB) and two slightly smaller, equal-size nodes on either side of it (for 1 UPI + 1 or 3 EMIB). The two slightly smaller nodes should each be an equal distance away from the one big node. If the EMIB cost is small or negligible compared to the variation in mesh cost, these 3 nodes would merge into one. This might possibly be the node centered at 225ns that extends from 150ns to 250ns, but that leaves unexplained the 3 rightmost nodes in your histogram of round trip times.

One way to figure out what is going on would be to use the CPUID instruction to determine which chiplet a CPU core is in and make separate histograms of round trip times for UPI plus 0, 1, 2, 3 and 4 EMIB hops and also for no UPI plus 0, 1 and 2 EMIB hops.

In the Platinum 8488C Core-to-Core RTT heat map, are the virtual (hyperthreaded) cores on one socket 0 to 47 and 96 to 143, with the other socket having the remaining virtual core numbers? If so, it looks like the vast majority of the inter-socket RTT is around 225ns, which does not agree with the proposed cost model.

You wrote "Next, looking at the 1 EMIB case: 50ns base cost + 25ns mesh cost (source die) + 25ns mesh cost (destination die) + N EMIB cost = 150ns → EMIB cost = 50ns." Where did you get this 150ns? The histogram of round trip times shows zero counts at 150ns.

The bottom 75% of the single socket round trip time CDF curve has no sharp rightward shifts. One interpretation is that the EMIB cost is negligible compared to the variation in mesh cost, otherwise when going from zero EMIB to one EMIB, the CDF curve would have shifted to the right. The rightward shift by about 50ns for the top 25% of the CDF curve can be interpreted as a 50ns penalty for 2 EMIB. If one EMIB has a negligible penalty and 2 EMIB has a 50ns penalty, the 50ns penalty could be for crossing the mesh in the middle chiplet for the 2 EMIB case.

What do you think is causing the occasional spikes in latency shown as red dots (450ns) in the light green areas (225ns) of your first plot? If it is hypervisor or operating system interference, I would expect the red dots to be scattered uniformly across the plot, instead of being concentrated in a few light green squares.

What did you use to make the graphs for your article?

Thank you again for your excellent article.


Thanks for digging into the data!

> One way to figure out what is going on would be to use the CPUID instruction to determine which chiplet a CPU core is in and make separate histograms of round trip times for UPI plus 0, 1, 2, 3 and 4 EMIB hops and also for no UPI plus 0, 1 and 2 EMIB hops.

This would actually be a good follow-up analysis on my end. Arguably the initial data collected was mainly intended for heatmap generation, but with the existing data points, it wouldn't be too hard to re-slice along known pairs of cores with those properties. From perusing Intel's documentation, it looks like CPUID with EAX = 0x1F will return tile information through the V2 extended topology leaf. I tried running that on an SPR host, but no die/tile level topology information was reported: https://gist.github.com/jrahman/e4d2ba83769dd5e7553c6934537ada8b. Only SMT + socket level information. I did update https://github.com/jrahman/cpu_core_rtt with the code, gated behind an #ifdef TOPOLOGY preprocessor directive.
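For anyone curious, the enumeration is roughly the following (my reading of the 0x1F leaf; the actual #ifdef TOPOLOGY code in the repo may differ slightly):

#include <cstdio>
#include <cpuid.h>

int main() {
    // Walk the V2 extended topology leaf (EAX = 0x1F), one sub-leaf per level.
    // Level types: 1 = SMT, 2 = Core, 3 = Module, 4 = Tile, 5 = Die.
    for (unsigned subleaf = 0; ; ++subleaf) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0x1f, subleaf, &eax, &ebx, &ecx, &edx)) break;
        unsigned levelType = (ecx >> 8) & 0xff;
        if (levelType == 0) break;  // no more levels reported
        printf("sub-leaf %u: level type %u, x2APIC shift %u, logical CPUs %u\n",
               subleaf, levelType, eax & 0x1f, ebx & 0xffff);
    }
    return 0;
}

On that host nothing at the Tile or Die level showed up, as noted above.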

> are the virtual (hyperthreaded) cores on one socket 0 to 47 and 96 to 143, with the other socket having the remaining virtual core numbers?

I did pull the /proc/cpuinfo output (here: https://gist.github.com/jrahman/016e2620df5e7e23d071a463c975e021). Based on the core id information, for socket 0, processors [0, 47] and [96, 143] are hyperthread siblings.

> One interpretation is that the EMIB cost is negligible compared to the variation in mesh cost, otherwise when going from zero EMIB to one EMIB, the CDF curve would have shifted to the right. The rightward shift by about 50ns for the top 25% of the CDF curve can be interpreted as a 50ns penalty for 2 EMIB. If one EMIB has a negligible penalty and 2 EMIB has a 50ns penalty, the 50ns penalty could be for crossing the mesh in the middle chiplet for the 2 EMIB case.

This is one theory I considered. I looked at whether I could fit a model where EMIB hops were on par or cheaper than the mesh, and only 2 EMIB hops incurred a significant cost. I couldn't quite get such a model to cleanly fit the data as observed, but that's potentially due to deficiencies in other aspects of the model I considered at the time. At a physical implementation level, I do struggle to imagine Intel would have routed 2xEMIB hops in that way, passing flits through the full mesh across intervening dies. I would expect some sort of bypass network to avoid adding congestion to the per-die mesh. But I could easily be wrong about that, just my intuition if I were implementing EMIB.

> What do you think is causing the occasional spikes in latency shown as red dots (450ns) in the light green areas (225ns) of your first plot?

That isn't totally clear to me either. The benchmark returns the last few samples from a ring buffer. If the system was perturbed in any way at that point in time (network interrupt, SMIs, etc.), that perturbation could show up in those last few samples.

For plotting, I'm running a Jupyter notebook locally within VS Code, and plotting with a combination of matplotlib and Seaborn. For the different graphs:

* heatmaps -> https://seaborn.pydata.org/generated/seaborn.heatmap.html

* kdeplot -> https://seaborn.pydata.org/generated/seaborn.kdeplot.html

* histplot -> https://seaborn.pydata.org/generated/seaborn.histplot.html


Thank you for the /proc/cpuinfo information and the other answers you provided. In the diagram labeled "Intel Xeon Platinum 8488C Core-to-Core RTT map", the squares with smaller numbers (blue) correspond to intra-socket communication and the squares with bigger numbers (yellow) correspond to inter-socket communication. This diagram clearly shows that the vast majority of pixels in the squares for inter-socket communication are centered around 225ns.

That leaves the mystery of what is causing the 3 rightmost nodes in the RTT histogram. The only explanation I can think of is hypervisor or operating system interference but this explanation has at least 3 problems. First, if it is really hypervisor or operating system interference, the RTT histogram shows an awful lot of interference. Operating system interrupts should only occur about 100 to 1000 times per second. You mentioned in your reply to Jeffrey Baker that the runs take multiple hours so maybe the large amount of interference is reasonable. Perhaps there is some way to exclude time spent in the hypervisor and operating system kernel from the round trip time measurement. Second, it is unclear why hypervisor/operating system interference would cause 3 distinct nodes in the RTT histogram. Are there exactly 3 different types of hypervisor/operating system interference? If so, it would be interesting to know what are these 3 different types of interference. Third, of the 34 red spots (RTT = 400ns to 450ns) visible in the RTT map, 30 of the red spots are concentrated in two areas. It would be interesting to know if the exact location of these red spots is repeatable if the test is run a second time. If the exact location of the red spots is repeatable, that would suggest a weird hardware problem.

> At a physical implementation level, I do struggle to imagine Intel would have routed 2xEMIB hops in that way, passing flits through the full mesh across intervening dies. I would expect some sort of bypass network to avoid adding congestion to the per-die mesh.

I agree there is probably a bypass around the mesh used by the rest of the middle chiplet for the 2 EMIB case. Maybe the bypass and/or something else related to the 2 EMIB case takes 40ns to 50ns but I don't know what or why. From the bottom 75% of the CDF curve, it looks safe to say going from zero EMIB to one EMIB added negligible delay.

Page 40 of 214 (also called page 1-22) of Intel's CPUID instruction documentation linked below indicates that CPUID can provide SMT, Core, Module, Tile and Die levels, but the documentation does not explain the difference between module, tile and die. Based on the order they are listed, I would guess "Die" means the whole contents of one processor socket. Tile probably means chiplet, that is, one of the 4 pieces of Sapphire Rapids connected by EMIBs. I have never used the CPUID instruction and I am having trouble understanding Intel's documentation.

intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

For Sapphire Rapids with 48 physical cores per socket, it might be possible to infer the tile/chiplet id from the core id like this:

core id 0 to 11 is tile/chiplet 0

core id 12 to 23 is tile/chiplet 1

core id 24 to 35 is tile/chiplet 2

core id 36 to 47 is tile/chiplet 3

To figure out which chiplet has the enabled UPI link, compare the latency from any core on the other socket to each of the 4 chiplets. The lowest latency will be for the chiplet with the enabled UPI link. It is probably safe to assume that going between chiplets 0 and 3 or between chiplets 1 and 2 are the cases with two EMIB hops.
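A hypothetical helper for this inference might look like the following (the names and the 12-cores-per-chiplet mapping are my guesses based on the /proc/cpuinfo layout described above, not anything from the repo):

#include <cstdio>
#include <cstdlib>

// Processors p and p+96 are hyperthread siblings; 0-47 and 96-143 are socket 0.
int socketOf(int cpu)  { return (cpu % 96) < 48 ? 0 : 1; }
int chipletOf(int cpu) { return ((cpu % 96) % 48) / 12; }  // guessed 12 cores per chiplet

int main(int argc, char** argv) {
    for (int i = 1; i < argc; ++i) {
        const int cpu = atoi(argv[i]);
        printf("cpu %d: socket %d, chiplet %d\n", cpu, socketOf(cpu), chipletOf(cpu));
    }
    return 0;
}

With a mapping like this, the existing RTT samples could be re-bucketed by chiplet pair and plotted as separate histograms.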

Two questions about your workflow:

1. Are you writing and running C++ from within a Jupyter notebook?

2. Do you make your plots interactively by hand in the Jupyter notebook or do you use Python to call your C++ function and have a Python script create your plots?


You are amazing 🤘🏽
