On 9/10/25 09:59, Daniel Wagner wrote:

> On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
>
>> On 9/9/25 21:34, Leon Woestenberg wrote:
>>
>>> Basically, you are reverse-tuning a multicore CPU system
>>> that is heavily optimized for time-sharing and overall
>>> system throughput, back to acting like bare-metal
>>> single-application. You can get close, but you will never
>>> reach the determinism of, say, a Cortex-R(ealtime) core.
>>
>> This ignores the fact that I've been testing 2 programs:
>>
>> - P1, toy code that just calls dec 3 times in a tight loop
>> - P2, a few memory accesses, everything fits in L1
>>
>> I then loop P1 & P2 2^16 times to compute their average run-time.
>>
>> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm).
>>
>> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm.
>
> Modern CPUs are highly complex systems with their micro-architecture.
> The result is non-deterministic runtime behavior. I can't tell whether
> what you are measuring is in the ballpark of non-deterministic runtime
> variance. If you don't want to read through the CPU manufacturer's
> documentation, you could check whether the Spectre and related
> side-channel attack papers give some clues on timing behavior?

I am well aware of the complexity of modern micro-architectures ;)
I actually reported a bug in the original K7 micro-architecture
25 years ago, and Agner Fog's microarchitecture.pdf is always on
my bedside table.

https://www.agner.org/optimize/microarchitecture.pdf

In an attempt to figure out the origin of the chaos, I broke out perf.
I would say I found many more questions than answers...

Running the 3.3 GHz Haswell at 1.5 GHz:

https://perfmon-events.intel.com/index.html?pltfrm=haswell_server.html
https://oprofile.sourceforge.io/docs/intel-haswell-events.php

GOOD RUN

D,C,Cb,F,T,N
1311453543 4317733068 4317732972 3292326 5002 262144

  7512849605  uops_executed.core
    31445738  uops_executed.stall_cycles
   164966498  uops_retired.stall_cycles
   574110144  topdown-fetch-bubbles
    25679080  cycle_activity.cycles_no_execute
  6808318537  inst_retired.any
     6268865  int_misc.recovery_cycles
     6268818  int_misc.recovery_cycles_any

1.313036051 seconds time elapsed

BAD RUN

D,C,Cb,F,T,N
1436503721 4729437143 4729436743 3292325 5479 262144

  7647384031  uops_executed.core
   113200734  uops_executed.stall_cycles
   343460280  uops_retired.stall_cycles
   787367726  topdown-fetch-bubbles
   106926806  cycle_activity.cycles_no_execute
  6808330091  inst_retired.any
    17369470  int_misc.recovery_cycles
    17369430  int_misc.recovery_cycles_any

1.438057158 seconds time elapsed
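(For reference, the raw counters above come from perf stat. The exact
invocation below is reconstructed for illustration: the pinned core and
the binary name are placeholders, and perf will multiplex if it runs
out of hardware counters, so split the event list across runs if that
matters.)

  taskset -c 2 perf stat \
      -e uops_executed.core,uops_executed.stall_cycles \
      -e uops_retired.stall_cycles,topdown-fetch-bubbles \
      -e cycle_activity.cycles_no_execute,inst_retired.any \
      -e int_misc.recovery_cycles,int_misc.recovery_cycles_any \
      ./p2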
Remarks (in no particular order):

- Run-time increased by 9.5% (125 ms / 1.311 s).

- inst_retired.any remains (nearly) constant, BUT uops_executed.core
  increased by 1.8%.
  => Where are those extra uops coming from???
     Speculatively executed uops discarded on some flush event?

- 343460280 - 164966498 = 178.5M more stall cycles at the retire stage.
  At 1.5 GHz, 178.5M cycles = 119 ms, so this alone could explain the
  degraded performance.

- 113200734 - 31445738 = 81.7M more stall cycles at the execute stage.
  Some of these are probably dupes of the retire-stage stalls.

- Not sure what the int_misc.recovery_* events measure??? They're
  documented as:

  "This event counts the number of cycles spent waiting for a recovery
  after an event such as a processor nuke, JEClear, assist, hle/rtm
  abort etc."

  and

  "Core cycles the allocator was stalled due to recovery from earlier
  clear event for any thread running on the physical core (e.g.
  misprediction or memory nuke)."

  => In my case, they're probably measuring the same thing.
     Weird that the two descriptions sound a bit different.

Need to read up on processor nuke, memory nuke, machine clear events,
JEClear, assist, HLE/RTM...

I'll throw MACHINE_CLEARS.CYCLES into the next run:

"Cycles there was a Nuke. Account for both thread-specific and All
Thread Nukes."

I would love for someone to chime in & cast some light on my utter
confusion :)

Regards
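P.S. For anyone who wants to poke at this: below is a minimal sketch of
the kind of harness I mean for P2. It is NOT my actual code (the rdtsc
bracketing and the loop body are illustrative), but the shape is the
same: a few L1-resident memory accesses, averaged over 2^16 iterations.
Build with something like "gcc -O2"; the volatile keeps the compiler
from optimizing the accesses away.

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>          /* __rdtsc() on gcc/clang */

  #define ITER (1 << 16)

  /* P2 stand-in: a few memory accesses, everything fits in L1
   * (64 lines x 64 bytes = 4 KiB, well within a 32 KiB L1D) */
  static volatile char buf[4096];

  static void p2(void)
  {
          for (int i = 0; i < 64; i++)
                  buf[i * 64]++;  /* touch one cache line per pass */
  }

  int main(void)
  {
          uint64_t t0 = __rdtsc();
          for (int i = 0; i < ITER; i++)
                  p2();
          uint64_t t1 = __rdtsc();

          printf("avg: %.2f TSC cycles per call\n",
                 (double)(t1 - t0) / ITER);
          return 0;
  }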