19 Comments

I think I found a minor bug in Jason Rahman's code for calculating round trip times between CPU cores. Lines 126 to 128 (inside the server function) are:

localTickReqReceived = rdtsc();

data.serverToClientChannel.store(((uint64_t)((uint32_t)(localTickRespSend = rdtsc())) << 32) | (uint64_t)(uint32_t)localTickReqReceived);

Suppose the 64-bit values returned from the two calls to the time stamp counter rdtsc() have the following form:

first rdtsc() call: [32-bit value]FFFFFFFX

second rdtsc() call: [incremented 32-bit value]0000000Y

X and Y are arbitrary 4-bit values.

The lower 32-bits from the two calls are extracted and stored like this: 0x0000000YFFFFFFFX

Lines 93 to 100 (inside the client function) are:

localTickRespReceive = rdtsc();

auto upperBits = (localTickRespReceive & 0xffffffff00000000);

remoteTickRespSend = ((resp & 0xffffffff00000000) >> 32) | upperBits;

remoteTickReqReceive = (resp & 0x00000000ffffffff) | upperBits;

auto offset = (((int64_t)remoteTickReqReceive - (int64_t)localTickReqSend) + ((int64_t)remoteTickRespSend - (int64_t)localTickRespReceive)) / 2;

auto rtt = (localTickRespReceive - localTickReqSend) - (remoteTickRespSend - remoteTickReqReceive);

resp is loaded with 0x0000000YFFFFFFFX in line 88.

(remoteTickRespSend - remoteTickReqReceive) will be the upper 32-bits of resp minus the lower 32-bits of resp, which will be a negative number. This is surely not what was intended. I don't understand the code well enough to suggest a good fix but one idea would be to test if (remoteTickRespSend - remoteTickReqReceive) is negative and add 0x100000000 if it is. This added constant is [incremented 32-bit value] minus [32-bit value] from the two rdtsc() calls mentioned earlier. An alternative is to subtract the two 64-bit values of rdtsc() in the server function and send the lower 32-bits of the difference to the client function. By the way, what is the offset variable and how is it used?
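Below is a minimal sketch of the first fix, reusing the variable names from the quoted client code (this is my guess at a correction, not the author's):

#include <cstdint>

// The reconstructed server-side timestamps each carry only the lower 32 bits of the
// server's TSC (patched with the client's upper 32 bits), so restore the carry
// if the server's counter wrapped between its two rdtsc() reads.
static inline int64_t correctedServerDelta(uint64_t remoteTickRespSend, uint64_t remoteTickReqReceive) {
    int64_t delta = (int64_t)remoteTickRespSend - (int64_t)remoteTickReqReceive;
    if (delta < 0)
        delta += 0x100000000LL;   // lower 32 bits wrapped between the two server reads
    return delta;
}

If the offset calculation is kept, remoteTickRespSend itself would probably need the same 0x100000000 correction so the two formulas stay consistent.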

What is (remoteTickRespSend - remoteTickReqReceive) supposed to measure? It looks to me as though variability in (remoteTickRespSend - remoteTickReqReceive) is just a result of differences in out-of-order execution of instructions within the server function rather than differences in round trip time. I wonder if it would be better to delete (remoteTickRespSend - remoteTickReqReceive) from the expression for rtt.

I don't think fixing this minor bug will change the results that much. At a core frequency of 2 GHz, there is a carry across the 32-bit boundary of the time stamp counter roughly once every 2 seconds. The chance of the two rdtsc() calls in the server function straddling that seems pretty small.
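As a back-of-envelope check (assuming a 2 GHz invariant TSC and the server's two rdtsc() calls being on the order of 100 cycles apart; both numbers are assumptions, not from the article):

#include <cstdio>

int main() {
    const double tsc_hz = 2.0e9;       // assumed TSC frequency
    const double gap_cycles = 100.0;   // assumed spacing between the server's two rdtsc() reads
    const double wrap = 4294967296.0;  // 2^32
    std::printf("lower 32 bits wrap every %.2f s\n", wrap / tsc_hz);
    std::printf("chance a given sample straddles a wrap: %.1e\n", gap_cycles / wrap);
    return 0;
}

That works out to a wrap roughly every 2.1 seconds and a straddling probability of about 2e-8 per sample.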

In both the client and server functions, it may be better to use the newer rdtscp instruction instead of rdtsc. rdtscp is supposed to reduce the variability of measurements on processors that do out-of-order execution. All Xeon processors and all recent consumer processors do out-of-order execution.
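For reference, a minimal sketch of what that substitution could look like with compiler intrinsics (GCC/Clang on x86-64); the trailing lfence is a common companion to rdtscp, not something taken from the article's code:

#include <x86intrin.h>
#include <cstdint>

static inline uint64_t read_tsc_serialized() {
    unsigned int aux;                 // receives IA32_TSC_AUX (can encode the core/socket id)
    uint64_t t = __rdtscp(&aux);      // rdtscp waits for prior instructions to complete before reading the TSC
    _mm_lfence();                     // keep later instructions from starting before the read finishes
    return t;
}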

Aug 18, 2023·edited Aug 18, 2023

Hi Jason, I am confused by the coloring of the heat map. In the single-socket 8488 one, it seems to me that the upper-leftmost small tile, 0-11 to 0-11, has the deepest color, meaning the largest latency. Is that right? It feels counterintuitive. Or am I reading it wrong? Thanks.


Does this really produce comparable latency numbers across microarchitectures? It looks like you have an x86 PAUSE in the loop. The latency of PAUSE is not defined and varies wildly across implementations. Why not just spin?
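For context, here is a sketch of the two spin-wait variants being compared (not the article's code). PAUSE latency is implementation defined, on the order of 10 cycles on pre-Skylake cores versus well over 100 on Skylake and later, so having it inside the timed loop can shift the measured numbers between microarchitectures:

#include <atomic>
#include <immintrin.h>
#include <cstdint>

void spin_until(std::atomic<uint64_t>& flag, uint64_t expected) {
    while (flag.load(std::memory_order_acquire) != expected) {
        // bare spin: per-iteration latency is just the load and the branch
    }
}

void spin_until_with_pause(std::atomic<uint64_t>& flag, uint64_t expected) {
    while (flag.load(std::memory_order_acquire) != expected) {
        _mm_pause();   // PAUSE hint: friendlier to the sibling hyperthread, but adds an
                       // implementation-defined delay per iteration
    }
}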

Another aspect worth investigating would be the new WAITPKG ability to monitor an address from user space. Although the interface is a little clunky, if you are interested in how long it takes to wake another thread or core on a system, it hardly seems like you can ignore the fact that Sapphire Rapids is the first x86 server CPU to offer a brand new way to wake up another thread.
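A rough sketch of that umonitor/umwait flow (compile with -mwaitpkg on GCC/Clang; the deadline value and the re-check pattern are illustrative, not a measured or recommended configuration):

#include <x86intrin.h>
#include <atomic>
#include <cstdint>

void wait_for_update(std::atomic<uint64_t>& cell, uint64_t old_value) {
    while (cell.load(std::memory_order_acquire) == old_value) {
        _umonitor((void*)&cell);                        // arm the monitor on the cache line holding 'cell'
        if (cell.load(std::memory_order_acquire) != old_value)
            break;                                      // re-check to close the race with the writer
        uint64_t deadline = __rdtsc() + 1000000;        // arbitrary TSC deadline as a safety net
        _umwait(0 /* bit 0 = 0 requests the deeper C0.2 state */, deadline);
    }
}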

Aug 12, 2023·edited Aug 14, 2023

Replying to Stefan S., all the cores are hyperthreaded. My question was which virtual core numbers are in one processor socket and which virtual core numbers are in the other processor socket. If virtual core numbers 0 to 47 and 96 to 143 are in one socket, the diagram labeled "Intel Xeon Platinum 8488C Core-to-Core RTT map" indicates that the vast majority of the inter-socket Round Trip Times (RTTs) are about 225ns. This suggests to me that the 3 nodes I described in my comment might be merged into one node centered at 225ns due to the EMIB crossing time being small or negligible.

The sizes of the rightmost 3 nodes in the RTT histogram do not seem consistent with these nodes being for UPI plus 1, 2 and 3 EMIB crossings because the middle node is not the biggest one, as would be expected. The table below shows the total number of EMIB crossings, which is the sum of the EMIB crossings in the first processor socket (horizontal) and second processor socket (vertical). There are two ways to get one EMIB crossing in a processor socket so both the horizontal and vertical axes have two 1s.

      0   1   1   2
   -----------------
0 |   0   1   1   2
1 |   1   2   2   3
1 |   1   2   2   3
2 |   2   3   3   4

The table above shows that when there is a UPI transfer (between processor sockets), 4/16 of the time there will be 1 EMIB crossing, 6/16 of the time there will be a total of 2 EMIB crossings, and 4/16 of the time there will be a total of 3 EMIB crossings. Therefore, of the nodes for UPI, the node for 2 EMIB crossings (the middle node) should be the biggest one, that is, it should have the most counts. Looking at the 3 rightmost nodes in the RTT histogram, the leftmost of the 3 is the biggest one. This suggests the 3 rightmost nodes in the RTT histogram are not for UPI plus 1, 2 and 3 EMIB crossings.
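A tiny enumeration reproduces the table, taking as given the assumption above that each socket has one UPI-attached chiplet in a 2x2 layout, so the per-socket EMIB hop count to reach it is 0, 1, 1 or 2:

#include <cstdio>

int main() {
    int hops[4] = {0, 1, 1, 2};              // EMIB hops from each chiplet to the UPI-attached chiplet
    int counts[5] = {0, 0, 0, 0, 0};
    for (int src = 0; src < 4; ++src)        // chiplet of the sending core's socket
        for (int dst = 0; dst < 4; ++dst)    // chiplet of the receiving core's socket
            ++counts[hops[src] + hops[dst]]; // total EMIB crossings for this pair
    for (int total = 0; total <= 4; ++total)
        std::printf("%d EMIB crossing(s): %d/16\n", total, counts[total]);
    return 0;
}

This prints 1/16, 4/16, 6/16, 4/16 and 1/16 for totals of 0 through 4 crossings, matching the fractions above.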


Hi, is this something you could share instructions on how to run? I could provide you the output, as I have an SPR bare-metal host I could lend.

Aug 12, 2023·edited Aug 12, 2023

Thanks for the writeup!

Like Keith P., I also would have appreciated a little bit more detail on which cores are hyper-threaded and which are not.


Thank you for your very interesting and well written article. For the Amazon instance you used with 1 UPI link, when communicating between sockets, there are 0 EMIB hops 1/16th of the time since only one of the 4 chiplets in each socket has an enabled UPI link. Similarly, 4 EMIB hops occur another 1/16th of the time. The most common cases for 1 UPI hop are with 2 EMIB hops, which happens 6/16th of the time, and 1 or 3 EMIB hops, which each happen 4/16th of the time. I would therefore expect the histogram of round trip times to have a big node (for 1 UPI + 2 EMIB) and two equal size nodes that are slightly smaller on each side of it (for 1 UPI + 1 or 3 EMIB). The two slightly smaller nodes should each be an equal distance away from the one big node. If the EMIB cost is small or negligible compared to the variation in mesh cost, these 3 nodes would merge into one. This might possibly be the node centered at 225ns that extends from 150ns to 250ns but that leaves unexplained the 3 rightmost nodes in your histogram of round trip times.

One way to figure out what is going on would be to use the CPUID instruction to determine which chiplet a CPU core is in and make separate histograms of round trip times for UPI plus 0, 1, 2, 3 and 4 EMIB hops and also for no UPI plus 0, 1 and 2 EMIB hops.
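A rough sketch of that CPUID approach (GCC/Clang on x86-64, run while pinned to the core of interest). It assumes the chiplet shows up as a "Die" level (type 5) in CPUID leaf 0x1F, which Sapphire Rapids, or the hypervisor on a cloud instance, may or may not actually expose, so this would need verifying:

#include <cpuid.h>

// Returns a value shared by all logical CPUs on the same die/chiplet, or -1 if unavailable.
long die_group_key() {
    unsigned eax, ebx, ecx, edx, below_die_shift = 0;
    for (unsigned sub = 0; ; ++sub) {
        if (!__get_cpuid_count(0x1F, sub, &eax, &ebx, &ecx, &edx))
            return -1;                              // leaf 0x1F not supported
        unsigned level_type = (ecx >> 8) & 0xFF;    // 1=SMT, 2=Core, ..., 5=Die
        if (level_type == 0)
            return -1;                              // end of enumeration, no Die level reported
        if (level_type == 5)
            return (long)(edx >> below_die_shift);  // x2APIC ID with the sub-die bits shifted off
        below_die_shift = eax & 0x1F;               // bits consumed by this level and everything below it
    }
}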

In the Platinum 8488C Core-to-Core RTT heat map, are the virtual (hyperthreaded) cores on one socket 0 to 47 and 96 to 143, with the other socket having the remaining virtual core numbers? If so, it looks like the vast majority of the inter-socket RTT is around 225ns, which does not agree with the proposed cost model.

You wrote "Next, looking at the 1 EMIB case: 50ns base cost + 25ns mesh cost (source die) + 25ns mesh cost (destination die) + N EMIB cost = 150ns → EMIB cost = 50ns." Where did you get this 150ns? The histogram of round trip times shows zero counts at 150ns.

The bottom 75% of the single socket round trip time CDF curve has no sharp rightward shifts. One interpretation is that the EMIB cost is negligible compared to the variation in mesh cost, otherwise when going from zero EMIB to one EMIB, the CDF curve would have shifted to the right. The rightward shift by about 50ns for the top 25% of the CDF curve can be interpreted as a 50ns penalty for 2 EMIB. If one EMIB has a negligible penalty and 2 EMIB has a 50ns penalty, the 50ns penalty could be for crossing the mesh in the middle chiplet for the 2 EMIB case.

What do you think is causing the occasional spikes in latency shown as red dots (450ns) in the light green areas (225ns) of your first plot? If it is hypervisor or operating system interference, I would expect the red dots to be scattered uniformly across the plot, instead of being concentrated in a few light green squares.

What did you use to make the graphs for your article?

Thank you again for your excellent article.


You are amazing 🤘🏽
