Discussion about this post

User's avatar
Keith P.'s avatar

I think I found a minor bug in Jason Rahman's code for calculating round trip times between CPU cores. Lines 126 to 128 (inside the server function) are:

localTickReqReceived = rdtsc();

data.serverToClientChannel.store(((uint64_t)((uint32_t)(localTickRespSend = rdtsc())) << 32) | (uint64_t)(uint32_t)localTickReqReceived);

Suppose the 64-bit values returned from the two calls to the time stamp counter rdtsc() have the following form:

first rdtsc() call: [32-bit value]FFFFFFFX

second rdtsc() call: [incremented 32-bit value]0000000Y

X and Y are arbitrary 4-bit values.

The lower 32-bits from the two calls are extracted and stored like this: 0x0000000YFFFFFFFX

Lines 93 to 100 (inside the client function) are:

localTickRespReceive = rdtsc();

auto upperBits = (localTickRespReceive & 0xffffffff00000000);

remoteTickRespSend = ((resp & 0xffffffff00000000) >> 32) | upperBits;

remoteTickReqReceive = (resp & 0x00000000ffffffff) | upperBits;

auto offset = (((int64_t)remoteTickReqReceive - (int64_t)localTickReqSend) + ((int64_t)remoteTickRespSend - (int64_t)localTickRespReceive)) / 2;

auto rtt = (localTickRespReceive - localTickReqSend) - (remoteTickRespSend - remoteTickReqReceive);

resp is loaded with 0x0000000YFFFFFFFX in line 88.

(remoteTickRespSend - remoteTickReqReceive) will be the upper 32-bits of resp minus the lower 32-bits of resp, which will be a negative number. This is surely not what was intended. I don't understand the code well enough to suggest a good fix but one idea would be to test if (remoteTickRespSend - remoteTickReqReceive) is negative and add 0x100000000 if it is. This added constant is [incremented 32-bit value] minus [32-bit value] from the two rdtsc() calls mentioned earlier. An alternative is to subtract the two 64-bit values of rdtsc() in the server function and send the lower 32-bits of the difference to the client function. By the way, what is the offset variable and how is it used?

What is (remoteTickRespSend - remoteTickReqReceive) suppose to measure? It looks to me that variability in (remoteTickRespSend - remoteTickReqReceive) is just a result of differences in out-of-order execution of instructions within the server function rather than differences in round trip time. I wonder if it would be better to delete (remoteTickRespSend - remoteTickReqReceive) from the expression for rtt.

I don't think fixing this minor bug will change the results that much. At a core frequency of 2 GHz, there is a carry across the 32-bit boundary of the time stamp counter roughly once every 2 seconds. The chance of the two rdtsc() calls in the server function straddling that seems pretty small.

In both the client and server functions, it may be better to use the newer rdtscp instruction instead of rdtsc. rdtscp is suppose to reduce the variability of measurements on processors that do out-of-order execution. All Xeon processors and all recent consumer processors do out-of-order execution.

Expand full comment
Linji's avatar

Hi Jason, I am confused by the coloring of the heat map. From the single socket 8488 one, it seems to me that the upper left most small tile, 0-11 to 0-11, seems to have the deepest color / meaning largest latency.. is that right? feels counter intuitive.. or am I reading it wrong? Thanks.

Expand full comment
17 more comments...

No posts