19 Comments

I think I found a minor bug in Jason Rahman's code for calculating round trip times between CPU cores. Lines 126 to 128 (inside the server function) are:

localTickReqReceived = rdtsc();

data.serverToClientChannel.store(((uint64_t)((uint32_t)(localTickRespSend = rdtsc())) << 32) | (uint64_t)(uint32_t)localTickReqReceived);

Suppose the 64-bit values returned from the two calls to the time stamp counter rdtsc() have the following form:

first rdtsc() call: [32-bit value]FFFFFFFX

second rdtsc() call: [incremented 32-bit value]0000000Y

X and Y are arbitrary 4-bit values.

The lower 32-bits from the two calls are extracted and stored like this: 0x0000000YFFFFFFFX

Lines 93 to 100 (inside the client function) are:

localTickRespReceive = rdtsc();

auto upperBits = (localTickRespReceive & 0xffffffff00000000);

remoteTickRespSend = ((resp & 0xffffffff00000000) >> 32) | upperBits;

remoteTickReqReceive = (resp & 0x00000000ffffffff) | upperBits;

auto offset = (((int64_t)remoteTickReqReceive - (int64_t)localTickReqSend) + ((int64_t)remoteTickRespSend - (int64_t)localTickRespReceive)) / 2;

auto rtt = (localTickRespReceive - localTickReqSend) - (remoteTickRespSend - remoteTickReqReceive);

resp is loaded with 0x0000000YFFFFFFFX in line 88.

(remoteTickRespSend - remoteTickReqReceive) will be the upper 32-bits of resp minus the lower 32-bits of resp, which will be a negative number. This is surely not what was intended. I don't understand the code well enough to suggest a good fix but one idea would be to test if (remoteTickRespSend - remoteTickReqReceive) is negative and add 0x100000000 if it is. This added constant is [incremented 32-bit value] minus [32-bit value] from the two rdtsc() calls mentioned earlier. An alternative is to subtract the two 64-bit values of rdtsc() in the server function and send the lower 32-bits of the difference to the client function. By the way, what is the offset variable and how is it used?
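Below is a minimal sketch of the first fix, reusing the variable names from the quoted client code (this is my guess at a correction, not the author's):

#include <cstdint>

// The reconstructed server-side timestamps each carry only the lower 32 bits of the
// server's TSC (patched with the client's upper 32 bits), so restore the carry
// if the server's counter wrapped between its two rdtsc() reads.
static inline int64_t correctedServerDelta(uint64_t remoteTickRespSend, uint64_t remoteTickReqReceive) {
    int64_t delta = (int64_t)remoteTickRespSend - (int64_t)remoteTickReqReceive;
    if (delta < 0)
        delta += 0x100000000LL;   // lower 32 bits wrapped between the two server reads
    return delta;
}

If the offset calculation is kept, remoteTickRespSend itself would probably need the same 0x100000000 correction so the two formulas stay consistent.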

What is (remoteTickRespSend - remoteTickReqReceive) supposed to measure? It looks to me as though variability in (remoteTickRespSend - remoteTickReqReceive) is just a result of differences in out-of-order execution of instructions within the server function rather than differences in round trip time. I wonder if it would be better to delete (remoteTickRespSend - remoteTickReqReceive) from the expression for rtt.

I don't think fixing this minor bug will change the results that much. At a core frequency of 2 GHz, there is a carry across the 32-bit boundary of the time stamp counter roughly once every 2 seconds. The chance of the two rdtsc() calls in the server function straddling that seems pretty small.
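As a back-of-envelope check (assuming a 2 GHz invariant TSC and the server's two rdtsc() calls being on the order of 100 cycles apart; both numbers are assumptions, not from the article):

#include <cstdio>

int main() {
    const double tsc_hz = 2.0e9;       // assumed TSC frequency
    const double gap_cycles = 100.0;   // assumed spacing between the server's two rdtsc() reads
    const double wrap = 4294967296.0;  // 2^32
    std::printf("lower 32 bits wrap every %.2f s\n", wrap / tsc_hz);
    std::printf("chance a given sample straddles a wrap: %.1e\n", gap_cycles / wrap);
    return 0;
}

That works out to a wrap roughly every 2.1 seconds and a straddling probability of about 2e-8 per sample.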

In both the client and server functions, it may be better to use the newer rdtscp instruction instead of rdtsc. rdtscp is supposed to reduce the variability of measurements on processors that do out-of-order execution. All Xeon processors and all recent consumer processors do out-of-order execution.
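For reference, a minimal sketch of what that substitution could look like with compiler intrinsics (GCC/Clang on x86-64); the trailing lfence is a common companion to rdtscp, not something taken from the article's code:

#include <x86intrin.h>
#include <cstdint>

static inline uint64_t read_tsc_serialized() {
    unsigned int aux;                 // receives IA32_TSC_AUX (can encode the core/socket id)
    uint64_t t = __rdtscp(&aux);      // rdtscp waits for prior instructions to complete before reading the TSC
    _mm_lfence();                     // keep later instructions from starting before the read finishes
    return t;
}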

Aug 18, 2023·edited Aug 18, 2023

Hi Jason, I am confused by the coloring of the heat map. In the single-socket 8488 one, it seems to me that the upper-leftmost small tile, 0-11 to 0-11, has the deepest color, meaning the largest latency. Is that right? It feels counterintuitive. Or am I reading it wrong? Thanks.


Does this really produce comparable latency numbers across microarchitectures? It looks like you have an x86 PAUSE in the loop. The latency of PAUSE is not defined and varies wildly across implementations. Why not just spin?
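For context, here is a sketch of the two spin-wait variants being compared (not the article's code). PAUSE latency is implementation defined, on the order of 10 cycles on pre-Skylake cores versus well over 100 on Skylake and later, so having it inside the timed loop can shift the measured numbers between microarchitectures:

#include <atomic>
#include <immintrin.h>
#include <cstdint>

void spin_until(std::atomic<uint64_t>& flag, uint64_t expected) {
    while (flag.load(std::memory_order_acquire) != expected) {
        // bare spin: per-iteration latency is just the load and the branch
    }
}

void spin_until_with_pause(std::atomic<uint64_t>& flag, uint64_t expected) {
    while (flag.load(std::memory_order_acquire) != expected) {
        _mm_pause();   // PAUSE hint: friendlier to the sibling hyperthread, but adds an
                       // implementation-defined delay per iteration
    }
}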

Another aspect worth investigating would be the new WAITPKG ability to monitor an address from user space. Although the interface is a little clunky, if you are interested in how long it takes to wake another thread or core on a system, it hardly seems like you can ignore the fact that Sapphire Rapids is the first x86 server CPU to offer a brand new way to wake up another thread.
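A rough sketch of that umonitor/umwait flow (compile with -mwaitpkg on GCC/Clang; the deadline value and the re-check pattern are illustrative, not a measured or recommended configuration):

#include <x86intrin.h>
#include <atomic>
#include <cstdint>

void wait_for_update(std::atomic<uint64_t>& cell, uint64_t old_value) {
    while (cell.load(std::memory_order_acquire) == old_value) {
        _umonitor((void*)&cell);                        // arm the monitor on the cache line holding 'cell'
        if (cell.load(std::memory_order_acquire) != old_value)
            break;                                      // re-check to close the race with the writer
        uint64_t deadline = __rdtsc() + 1000000;        // arbitrary TSC deadline as a safety net
        _umwait(0 /* bit 0 = 0 requests the deeper C0.2 state */, deadline);
    }
}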

Aug 12, 2023·edited Aug 14, 2023

Replying to Stefan S., all the cores are hyperthreaded. My question was which virtual core numbers are in one processor socket and which virtual core numbers are in the other processor socket. If virtual core numbers 0 to 47 and 96 to 143 are in one socket, the diagram labeled "Intel Xeon Platinum 8488C Core-to-Core RTT map" indicates that the vast majority of the inter-socket Round Trip Times (RTTs) are about 225ns. This suggests to me that the 3 nodes I described in my comment might be merged into one node centered at 225ns due to the EMIB crossing time being small or negligible.

The sizes of the rightmost 3 nodes in the RTT histogram do not seem consistent with these nodes being for UPI plus 1, 2 and 3 EMIB crossings because the middle node is not the biggest one, as would be expected. The table below shows the total number of EMIB crossings, which is the sum of the EMIB crossings in the first processor socket (horizontal) and second processor socket (vertical). There are two ways to get one EMIB crossing in a processor socket so both the horizontal and vertical axes have two 1s.

      0   1   1   2
   -----------------
0 |   0   1   1   2
1 |   1   2   2   3
1 |   1   2   2   3
2 |   2   3   3   4

The table above shows that when there is a UPI transfer (between processor sockets), 4/16 of the time there will be 1 EMIB crossing, 6/16 of the time there will be a total of 2 EMIB crossings, and 4/16 of the time there will be a total of 3 EMIB crossings. Therefore, of the nodes for UPI, the node for 2 EMIB crossings (the middle node) should be the biggest one, that is, it should have the most counts. Looking at the 3 rightmost nodes in the RTT histogram, the leftmost of the 3 is the biggest one. This suggests the 3 rightmost nodes in the RTT histogram are not for UPI plus 1, 2 and 3 EMIB crossings.
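A tiny enumeration reproduces the table, taking as given the assumption above that each socket has one UPI-attached chiplet in a 2x2 layout, so the per-socket EMIB hop count to reach it is 0, 1, 1 or 2:

#include <cstdio>

int main() {
    int hops[4] = {0, 1, 1, 2};              // EMIB hops from each chiplet to the UPI-attached chiplet
    int counts[5] = {0, 0, 0, 0, 0};
    for (int src = 0; src < 4; ++src)        // chiplet of the sending core's socket
        for (int dst = 0; dst < 4; ++dst)    // chiplet of the receiving core's socket
            ++counts[hops[src] + hops[dst]]; // total EMIB crossings for this pair
    for (int total = 0; total <= 4; ++total)
        std::printf("%d EMIB crossing(s): %d/16\n", total, counts[total]);
    return 0;
}

This prints 1/16, 4/16, 6/16, 4/16 and 1/16 for totals of 0 through 4 crossings, matching the fractions above.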


Hi, is this something you could share instructions on how to run? I could provide you the output, as I have an SPR bare-metal host I could lend.

Aug 12, 2023·edited Aug 12, 2023

Thanks for the writeup!

Like Keith P., I also would have appreciated a little bit more detail on which cores are hyper-threaded and which are not.


Thank you for your very interesting and well written article. For the Amazon instance you used with 1 UPI link, when communicating between sockets, there are 0 EMIB hops 1/16th of the time since only one of the 4 chiplets in each socket has an enabled UPI link. Similarly, 4 EMIB hops occur another 1/16th of the time. The most common cases for 1 UPI hop are with 2 EMIB hops, which happens 6/16th of the time, and 1 or 3 EMIB hops, which each happen 4/16th of the time. I would therefore expect the histogram of round trip times to have a big node (for 1 UPI + 2 EMIB) and two equal size nodes that are slightly smaller on each side of it (for 1 UPI + 1 or 3 EMIB). The two slightly smaller nodes should each be an equal distance away from the one big node. If the EMIB cost is small or negligible compared to the variation in mesh cost, these 3 nodes would merge into one. This might possibly be the node centered at 225ns that extends from 150ns to 250ns but that leaves unexplained the 3 rightmost nodes in your histogram of round trip times.

One way to figure out what is going on would be to use the CPUID instruction to determine which chiplet a CPU core is in and make separate histograms of round trip times for UPI plus 0, 1, 2, 3 and 4 EMIB hops and also for no UPI plus 0, 1 and 2 EMIB hops.
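A rough sketch of that CPUID approach (GCC/Clang on x86-64, run while pinned to the core of interest). It assumes the chiplet shows up as a "Die" level (type 5) in CPUID leaf 0x1F, which Sapphire Rapids, or the hypervisor on a cloud instance, may or may not actually expose, so this would need verifying:

#include <cpuid.h>

// Returns a value shared by all logical CPUs on the same die/chiplet, or -1 if unavailable.
long die_group_key() {
    unsigned eax, ebx, ecx, edx, below_die_shift = 0;
    for (unsigned sub = 0; ; ++sub) {
        if (!__get_cpuid_count(0x1F, sub, &eax, &ebx, &ecx, &edx))
            return -1;                              // leaf 0x1F not supported
        unsigned level_type = (ecx >> 8) & 0xFF;    // 1=SMT, 2=Core, ..., 5=Die
        if (level_type == 0)
            return -1;                              // end of enumeration, no Die level reported
        if (level_type == 5)
            return (long)(edx >> below_die_shift);  // x2APIC ID with the sub-die bits shifted off
        below_die_shift = eax & 0x1F;               // bits consumed by this level and everything below it
    }
}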

In the Platinum 8488C Core-to-Core RTT heat map, are the virtual (hyperthreaded) cores on one socket 0 to 47 and 96 to 143, with the other socket having the remaining virtual core numbers? If so, it looks like the vast majority of the inter-socket RTT is around 225ns, which does not agree with the proposed cost model.

You wrote "Next, looking at the 1 EMIB case: 50ns base cost + 25ns mesh cost (source die) + 25ns mesh cost (destination die) + N EMIB cost = 150ns → EMIB cost = 50ns." Where did you get this 150ns? The histogram of round trip times shows zero counts at 150ns.

The bottom 75% of the single socket round trip time CDF curve has no sharp rightward shifts. One interpretation is that the EMIB cost is negligible compared to the variation in mesh cost, otherwise when going from zero EMIB to one EMIB, the CDF curve would have shifted to the right. The rightward shift by about 50ns for the top 25% of the CDF curve can be interpreted as a 50ns penalty for 2 EMIB. If one EMIB has a negligible penalty and 2 EMIB has a 50ns penalty, the 50ns penalty could be for crossing the mesh in the middle chiplet for the 2 EMIB case.

What do you think is causing the occasional spikes in latency shown as red dots (450ns) in the light green areas (225ns) of your first plot? If it is hypervisor or operating system interference, I would expect the red dots to be scattered uniformly across the plot, instead of being concentrated in a few light green squares.

What did you use to make the graphs for your article?

Thank you again for your excellent article.


You are amazing 🤘🏽
