Hi!

I've been testing some of the tc rule offloading features in the kernel and came across a weird issue with offloaded rules that use connection tracking offload.

The main idea was to implement a basic form of routing offload, akin to [1], with tc ct added to create the conntrack records, as in [2]. This was compared against an nftables flowtables setup [3] performing conntrack offloading in software. With the tc ct hardware offload, however, a drastic increase in CPU usage was measured: on average ~2.5 times higher, and growing linearly with the number of offloaded flows.

The test consisted of sending a large number of short-lived UDP and/or TCP connections at rates of 4.5, 9, and 13.5 kcps (thousands of new connections per second); the CPU usage increase was seen in all cases. Kernel 6.6.9-1.7 was used, with a ConnectX-6 Dx NIC in switchdev mode and the native mlx5 driver.

Profiling with perf, the following three symbols used up the most CPU time:

  9.54%  [kernel]  [k] native_queued_spin_lock_slowpath
  5.49%  [kernel]  [k] pv_native_safe_halt
  3.39%  [kernel]  [k] rhashtable_jhash2

with this call stack for native_queued_spin_lock_slowpath:

   0. ret_from_fork_asm
   1. ret_from_fork
   2. kthread
   3. worker_thread
   4. process_scheduled_works
   5. process_one_work
   6. nf_flow_offload_work_gc
   7. nf_flow_table_iterate
   8. nf_flow_offload_gc_step
   9. nf_flow_offload_stats
  10. flow_offload_queue_work
  11. queue_work_on
  12. __queue_work
  13. _raw_spin_lock
  14. native_queued_spin_lock_slowpath

This led me to discover that the garbage collector, in nf_flow_offload_gc_step, schedules a statistics update via nf_flow_offload_stats for every HW-offloaded conntrack record on each pass. Those updates appear to contend heavily on the workqueue spinlock taken in __queue_work(), so CPU usage grows with the number of HW-offloaded flows.

Has anyone encountered this issue before, and do you think it can somehow be mitigated?
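For reference, this is roughly the path involved as I read it in net/netfilter/nf_flow_table_core.c and nf_flow_table_offload.c. It is paraphrased and trimmed from the 6.6 sources as I remember them, so the helper names (e.g. nf_flow_offload_work_alloc()) and the exact checks may be slightly off:

/* Paraphrased/simplified, not the literal 6.6 source. */

/* Run once per GC pass for every entry in the flow table
 * (nf_flow_offload_work_gc() -> nf_flow_table_iterate()). */
static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
                                    struct flow_offload *flow, void *data)
{
        if (nf_flow_has_expired(flow) || nf_ct_is_dying(flow->ct))
                flow_offload_teardown(flow);

        if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
                /* teardown/removal path trimmed */
        } else if (test_bit(NF_FLOW_HW, &flow->flags)) {
                /* HW-offloaded and still alive: refresh the counters */
                nf_flow_offload_stats(flow_table, flow);
        }
}

void nf_flow_offload_stats(struct nf_flowtable *flowtable,
                           struct flow_offload *flow)
{
        struct flow_offload_work *offload;

        /* (an early return that skips entries whose remaining timeout is
         *  still close to full is trimmed here) */

        offload = nf_flow_offload_work_alloc(flowtable, flow, FLOW_CLS_STATS);
        if (!offload)
                return;

        /* flow_offload_queue_work() -> queue_work_on() -> __queue_work(),
         * which takes the workqueue pool spinlock showing up in the perf
         * call stack above */
        flow_offload_queue_work(offload);
}

So, as far as I can tell, every GC pass queues up to one work item per HW-offloaded flow on the same stats workqueue, and with a large number of offloaded flows those queue_work() calls all contend on the same pool lock.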
With regards,
Pavel

The tc rules used (they only route the 48.0.0.0/16 prefix back out on the same device, while creating and using the conntrack entry):

# Drop packets with TTL 1
tc filter add dev ${DEV} ingress \
    chain 0 pref 1 protocol 802.1Q \
    flower skip_sw vlan_ethtype ipv4 ip_ttl 1 \
    action drop

# Decapsulate VLAN
tc filter add dev ${DEV} ingress \
    chain 0 pref 1 protocol 802.1Q \
    flower skip_sw vlan_ethtype ipv4 vlan_id ${VLAN} \
    action vlan pop pipe action goto chain 1

# Track untracked connections
tc filter add dev ${DEV} ingress \
    chain 1 pref 1 protocol ip \
    flower ct_state -trk \
    action ct pipe action goto chain 2

# Send established connections to forwarding
tc filter add dev ${DEV} ingress \
    chain 2 pref 1 protocol ip \
    flower ct_state +trk+est-new \
    action goto chain 4

# Send newly tracked connections to routing
tc filter add dev ${DEV} ingress \
    chain 2 pref 1 protocol ip \
    flower ct_state +trk+new \
    action goto chain 3

# Tag connections in prefix 48.0.0.0/16
tc filter add dev ${DEV} ingress \
    chain 3 pref 1 protocol ip \
    flower dst_ip 48.0.0.0/16 \
    action ct commit mark 0x01 pipe action goto chain 4

# Forward connections with tag 0x01
tc filter add dev ${DEV} ingress \
    chain 4 pref 1 protocol ip \
    flower ct_mark 0x01 \
    action pedit ex munge eth dst set ${MAC_DST} munge eth src set ${MAC_SRC} munge ip ttl dec pipe \
    csum ip4h pipe \
    action vlan push id ${VLAN} pipe action mirred egress redirect dev ${DEV}

[1] https://archive.fosdem.org/2024/events/attachments/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/slides/22273/flower-routed-fosdem24_gUABpPa.pdf
[2] https://netdevconf.info/0x14/pub/slides/48/Netdev_0x14_CT_offload.pdf
[3] https://wiki.nftables.org/wiki-nftables/index.php/Flowtables