On 8/22/25 09:42, Yafang Shao wrote:
> During recent testing with the netem qdisc to inject delays into TCP
> traffic, we observed that our CLS BPF program failed to function correctly
> due to incorrect classid retrieval from task_get_classid(). The issue
> manifests in the following call stack:
>
>   bpf_get_cgroup_classid+5
>   cls_bpf_classify+507
>   __tcf_classify+90
>   tcf_classify+217
>   __dev_queue_xmit+798
>   bond_dev_queue_xmit+43
>   __bond_start_xmit+211
>   bond_start_xmit+70
>   dev_hard_start_xmit+142
>   sch_direct_xmit+161
>   __qdisc_run+102             <<<<< Issue location
>   __dev_xmit_skb+1015
>   __dev_queue_xmit+637
>   neigh_hh_output+159
>   ip_finish_output2+461
>   __ip_finish_output+183
>   ip_finish_output+41
>   ip_output+120
>   ip_local_out+94
>   __ip_queue_xmit+394
>   ip_queue_xmit+21
>   __tcp_transmit_skb+2169
>   tcp_write_xmit+959
>   __tcp_push_pending_frames+55
>   tcp_push+264
>   tcp_sendmsg_locked+661
>   tcp_sendmsg+45
>   inet_sendmsg+67
>   sock_sendmsg+98
>   sock_write_iter+147
>   vfs_write+786
>   ksys_write+181
>   __x64_sys_write+25
>   do_syscall_64+56
>   entry_SYSCALL_64_after_hwframe+100
>
> The problem occurs when multiple tasks share a single qdisc. In such cases,
> __qdisc_run() may transmit skbs created by different tasks. Consequently,
> task_get_classid() retrieves an incorrect classid since it references the
> current task's context rather than the skb's originating task.
>
> Given that dev_queue_xmit() always executes with bh disabled, we can safely
> use in_softirq() instead of in_serving_softirq() to properly identify the
> softirq context and obtain the correct classid.
>

nit: you are no longer using in_softirq() in v2, you should update the
commit message as well.

[snip]

Cheers,
 Nik