On 9/10/25 09:59, Daniel Wagner wrote:

> On Tue, Sep 09, 2025 at 10:30:49PM +0200, Marc Gonzalez wrote:
>
>> On 9/9/25 21:34, Leon Woestenberg wrote:
>>
>>> Basically, you are reverse-tuning a multicore CPU system
>>> that is heavily optimized for time-sharing and overall
>>> system throughput, back to acting like bare-metal
>>> single-application. You can get close, but you will never
>>> reach the determinism of, say, a Cortex-R(ealtime) core.
>>
>> This ignores the fact that I've been testing 2 programs:
>>
>> - P1, toy code that just calls dec 3 times in a tight loop
>> - P2, a few memory accesses, everything fits in L1
>>
>> I then loop P1 & P2 2^16 times to compute their average run-time.
>>
>> Over thousands of runs, RUNTIME_P1 is ALWAYS the same (within 30 ppm).
>>
>> Over thousands of runs, RUNTIME_P2 varies by up to 12 PERCENT (!!) = 122k ppm.
>
> Modern CPUs are highly complex systems with their micro-architecture.
> The result is non-deterministic runtime behavior. I can't tell whether
> what you are measuring is in the ballpark of non-deterministic runtime
> variance. If you don't want to read through the CPU manufacturer's
> documentation, you could check whether the Spectre and related
> side-channel attack papers give some clues on timing behavior?

I am well aware of the complexity of modern micro-architectures ;)
I actually reported a bug in the original K7 micro-architecture
25 years ago, and Agner Fog's microarchitecture.pdf is always on
my bedside table.

https://www.agner.org/optimize/microarchitecture.pdf

In an attempt to figure out the origin of the chaos, I broke out perf.
I would say I found many more questions than answers...

Running the 3.3 GHz Haswell at 1.5 GHz:

https://perfmon-events.intel.com/index.html?pltfrm=haswell_server.html
https://oprofile.sourceforge.io/docs/intel-haswell-events.php

GOOD RUN

D,C,Cb,F,T,N
1311453543 4317733068 4317732972 3292326 5002 262144

  7512849605  uops_executed.core
    31445738  uops_executed.stall_cycles
   164966498  uops_retired.stall_cycles
   574110144  topdown-fetch-bubbles
    25679080  cycle_activity.cycles_no_execute
  6808318537  inst_retired.any
     6268865  int_misc.recovery_cycles
     6268818  int_misc.recovery_cycles_any

1.313036051 seconds time elapsed

BAD RUN

D,C,Cb,F,T,N
1436503721 4729437143 4729436743 3292325 5479 262144

  7647384031  uops_executed.core
   113200734  uops_executed.stall_cycles
   343460280  uops_retired.stall_cycles
   787367726  topdown-fetch-bubbles
   106926806  cycle_activity.cycles_no_execute
  6808330091  inst_retired.any
    17369470  int_misc.recovery_cycles
    17369430  int_misc.recovery_cycles_any

1.438057158 seconds time elapsed
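(For reference, the raw counters above come from perf stat. The exact
invocation below is reconstructed for illustration: the pinned core and
the binary name are placeholders, and perf will multiplex if it runs
out of hardware counters, so split the event list across runs if that
matters.)

  taskset -c 2 perf stat \
      -e uops_executed.core,uops_executed.stall_cycles \
      -e uops_retired.stall_cycles,topdown-fetch-bubbles \
      -e cycle_activity.cycles_no_execute,inst_retired.any \
      -e int_misc.recovery_cycles,int_misc.recovery_cycles_any \
      ./p2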
Remarks (in no particular order):

- Run-time increased by 9.5% (125 ms / 1.311 s).

- inst_retired.any remains (nearly) constant, BUT uops_executed.core
  increased by 1.8%.
  => Where are those extra uops coming from???
     Speculatively executed uops discarded on some flush event?

- 343460280 - 164966498 = 178.5M more stall cycles at the retire stage.
  At 1.5 GHz, 178.5M cycles = 119 ms, so this alone could explain the
  degraded performance.

- 113200734 - 31445738 = 81.7M more stall cycles at the execute stage.
  Some of these are probably dupes of the retire-stage stalls.

- Not sure what the int_misc.recovery_* events measure??? They're
  documented as:

  "This event counts the number of cycles spent waiting for a recovery
  after an event such as a processor nuke, JEClear, assist, hle/rtm
  abort etc."

  and

  "Core cycles the allocator was stalled due to recovery from earlier
  clear event for any thread running on the physical core (e.g.
  misprediction or memory nuke)."

  => In my case, they're probably measuring the same thing.
     Weird that the two descriptions sound a bit different.

Need to read up on processor nuke, memory nuke, machine clear events,
JEClear, assist, HLE/RTM...

I'll throw MACHINE_CLEARS.CYCLES into the next run:

"Cycles there was a Nuke. Account for both thread-specific and All
Thread Nukes."

I would love for someone to chime in & cast some light on my utter
confusion :)

Regards
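P.S. For anyone who wants to poke at this: below is a minimal sketch of
the kind of harness I mean for P2. It is NOT my actual code (the rdtsc
bracketing and the loop body are illustrative), but the shape is the
same: a few L1-resident memory accesses, averaged over 2^16 iterations.
Build with something like "gcc -O2"; the volatile keeps the compiler
from optimizing the accesses away.

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>          /* __rdtsc() on gcc/clang */

  #define ITER (1 << 16)

  /* P2 stand-in: a few memory accesses, everything fits in L1
   * (64 lines x 64 bytes = 4 KiB, well within a 32 KiB L1D) */
  static volatile char buf[4096];

  static void p2(void)
  {
          for (int i = 0; i < 64; i++)
                  buf[i * 64]++;  /* touch one cache line per pass */
  }

  int main(void)
  {
          uint64_t t0 = __rdtsc();
          for (int i = 0; i < ITER; i++)
                  p2();
          uint64_t t1 = __rdtsc();

          printf("avg: %.2f TSC cycles per call\n",
                 (double)(t1 - t0) / ITER);
          return 0;
  }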