On 08/09/2025 20:40, Marc Gonzalez wrote: > On 08/09/2025 18:02, Daniel Wagner wrote: > >> On Mon, Sep 08, 2025 at 05:42:58PM +0200, Marc Gonzalez wrote: >> >>> I've run out of ideas to identify other sources of variance. >>> (I ran everything in single user mode with nothing else running.) >>> Perhaps with perf I could identify the source of stalls or bubbles? >>> >>> Hoping someone can point me in the right direction. >> >> There are few guides out there, explaining the common pitfalls. I >> suppose you have already found them. Just in case you missed Frederic's >> excellent blog on CPU isolation: >> >> https://www.suse.com/c/cpu-isolation-introduction-part-1/ > > Thanks for this great link! > > Is nohz_full=X a superset of isolcpus=X or are they orthogonal? I am now more confused than ever. Following Frederic's guide: https://www.suse.com/c/cpu-isolation-practical-example-part-5/ If I didn't mess up, CPU3 is now supposed to be fully isolated on my 4-core system: $ sudo dmesg | grep "Kernel command line" [ 0.038445] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-79-generic root=UUID=2aff6d63-a8c6-4ebb-af69-9d2df2852cb0 ro log_buf_len=16M nohz_full=3 isolcpus=3 quiet splash vt.handoff=7 I ran echo "0-2" > $IRQ/smp_affinity_list for every IRQ. IRQ=0 AFTER SMP_AFF=0-3 EFF_AFF=0 IRQ=1 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=2 AFTER SMP_AFF=0-3 EFF_AFF= IRQ=3 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=4 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=5 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=6 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=7 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=8 AFTER SMP_AFF=0-2 EFF_AFF=0 IRQ=9 AFTER SMP_AFF=0-2 EFF_AFF=1 IRQ=10 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=11 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=12 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=13 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=14 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=15 AFTER SMP_AFF=0-2 EFF_AFF=0-2 IRQ=16 AFTER SMP_AFF=0-2 EFF_AFF=1 IRQ=18 AFTER SMP_AFF=0-2 EFF_AFF=1 IRQ=23 AFTER SMP_AFF=0-2 EFF_AFF=2 IRQ=24 AFTER SMP_AFF=0-2 EFF_AFF=2 IRQ=25 AFTER SMP_AFF=0-2 EFF_AFF=3 IRQ=26 AFTER SMP_AFF=0-2 EFF_AFF=0 IRQ=27 AFTER SMP_AFF=0-2 EFF_AFF=0 IRQ=28 AFTER SMP_AFF=0-2 EFF_AFF=2 IRQ=29 AFTER SMP_AFF=0-2 EFF_AFF=2 IRQ=30 AFTER SMP_AFF=0-2 EFF_AFF=3 IRQ=31 AFTER SMP_AFF=0-2 EFF_AFF=0 IRQ=32 AFTER SMP_AFF=0-2 EFF_AFF=1 I note that IRQs 25 & 30 are still effectively pinned to CPU3, despite the smp_aff setting. $ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 27 0 0 0 IO-APIC 2-edge timer 8: 0 0 0 0 IO-APIC 8-edge rtc0 9: 0 4 0 0 IO-APIC 9-fasteoi acpi 16: 0 25 0 0 IO-APIC 16-fasteoi ehci_hcd:usb1 18: 0 5 0 0 IO-APIC 18-fasteoi i801_smbus 23: 0 0 29 0 IO-APIC 23-fasteoi ehci_hcd:usb2 24: 0 0 0 0 PCI-MSI-0000:00:1c.0 0-edge PCIe PME, pciehp 25: 0 0 0 0 PCI-MSI-0000:00:1c.3 0-edge PCIe PME 26: 23328 0 0 37250 PCI-MSI-0000:00:1f.2 0-edge ahci[0000:00:1f.2] 27: 86091 0 0 0 PCI-MSI-0000:00:14.0 0-edge xhci_hcd 28: 0 0 51308 0 PCI-MSIX-0000:02:00.0 0-edge enp2s0 29: 0 0 22 0 PCI-MSI-0000:00:16.0 0-edge mei_me 30: 0 0 0 604 PCI-MSI-0000:00:1b.0 0-edge snd_hda_intel:card0 31: 198664 0 0 0 PCI-MSI-0000:00:02.0 0-edge i915 32: 0 631 0 0 PCI-MSI-0000:00:03.0 0-edge snd_hda_intel:card1 My benchmark is a trivial piece of code: - no system calls, no library calls - just simple bit twiddling - tiny code, tiny data structures, everything fits in L1 $ size a.out text data bss dec hex filename 8549 632 1072 10253 280d a.out Undisturbed, it runs in 2500 nanoseconds (with all cores set to tick at 3 GHz). I run it 2^16 times, so the expected run-time is 2500*2^16 = 163840000 ns = 163.84 ms Yet, I've observed AVERAGE run-times of 2600, 2700, and even 2800 (!!) nanoseconds. I captured a full trace of this occurrence with echo *:* > $TRACING/set_event cat $TRACING/per_cpu/cpu$CPU/trace > trace.$CPU # tracer: nop # # entries-in-buffer/entries-written: 9743/9743 #P:4 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <idle>-0 [003] dN.1. 4606.627286: cpu_idle: state=4294967295 cpu_id=3 <idle>-0 [003] dN.1. 4606.627288: rcu_utilization: Start context switch <idle>-0 [003] dN.1. 4606.627288: rcu_utilization: End context switch <idle>-0 [003] d..2. 4606.627290: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=taskset next_pid=8129 next_prio=0 <idle>-0 [003] d..2. 4606.627291: tlb_flush: pages:-1 reason:flush on task switch (0) a.out-8129 [003] ...1. 4606.627304: mm_page_free: page=000000001fef04a8 pfn=0x16a586 order=1 ... a.out-8129 [003] d.... 4606.628019: user_enter: a.out-8129 [003] d.... 4606.811832: user_exit: Program starts at 4606.627304 Program setup lasts until 4606.628019 (~1500 recorded events) => roughly 700 microseconds spent before main() which I'm not taking into account in my measurements. Then program ends at 4606.811832 => 183.813 ms instead of the expected 163.84 ms Almost exactly 20 ms in excess. Could this be a hint? But there is absolutely NOTHING traced between 4606.628019 & 4606.811832. I guess either my time source is incorrect. (Next slide in Frederic's guide) OR there is something wonky going on inside CPU3. Looks like IRQs are not the issue here? No need to investigate why I can't pin IRQs 25 & 30 elsewhere than CPU3? Regards