On Fri, Aug 29, 2025 at 4:14 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > On 8/29/25 5:23 AM, Yafang Shao wrote: > > On Thu, Aug 28, 2025 at 3:55 PM Paolo Abeni <pabeni@xxxxxxxxxx> wrote: > >> On 8/22/25 8:42 AM, Yafang Shao wrote: > >>> During recent testing with the netem qdisc to inject delays into TCP > >>> traffic, we observed that our CLS BPF program failed to function correctly > >>> due to incorrect classid retrieval from task_get_classid(). The issue > >>> manifests in the following call stack: > >>> > >>> bpf_get_cgroup_classid+5 > >>> cls_bpf_classify+507 > >>> __tcf_classify+90 > >>> tcf_classify+217 > >>> __dev_queue_xmit+798 > >>> bond_dev_queue_xmit+43 > >>> __bond_start_xmit+211 > >>> bond_start_xmit+70 > >>> dev_hard_start_xmit+142 > >>> sch_direct_xmit+161 > >>> __qdisc_run+102 <<<<< Issue location > >>> __dev_xmit_skb+1015 > >>> __dev_queue_xmit+637 > >>> neigh_hh_output+159 > >>> ip_finish_output2+461 > >>> __ip_finish_output+183 > >>> ip_finish_output+41 > >>> ip_output+120 > >>> ip_local_out+94 > >>> __ip_queue_xmit+394 > >>> ip_queue_xmit+21 > >>> __tcp_transmit_skb+2169 > >>> tcp_write_xmit+959 > >>> __tcp_push_pending_frames+55 > >>> tcp_push+264 > >>> tcp_sendmsg_locked+661 > >>> tcp_sendmsg+45 > >>> inet_sendmsg+67 > >>> sock_sendmsg+98 > >>> sock_write_iter+147 > >>> vfs_write+786 > >>> ksys_write+181 > >>> __x64_sys_write+25 > >>> do_syscall_64+56 > >>> entry_SYSCALL_64_after_hwframe+100 > >>> > >>> The problem occurs when multiple tasks share a single qdisc. In such cases, > >>> __qdisc_run() may transmit skbs created by different tasks. Consequently, > >>> task_get_classid() retrieves an incorrect classid since it references the > >>> current task's context rather than the skb's originating task. > >>> > >>> Given that dev_queue_xmit() always executes with bh disabled, we can safely > >>> use in_softirq() instead of in_serving_softirq() to properly identify the > >>> softirq context and obtain the correct classid. > >>> > >>> The simple steps to reproduce this issue: > >>> 1. Add network delay to the network interface: > >>> such as: tc qdisc add dev bond0 root netem delay 1.5ms > >>> 2. Create two distinct net_cls cgroups, each running a network-intensive task > >>> 3. Initiate parallel TCP streams from both tasks to external servers. > >>> > >>> Under this specific condition, the issue reliably occurs. The kernel > >>> eventually dequeues an SKB that originated from Task-A while executing in > >>> the context of Task-B. > >>> > >>> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > >>> Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx> > >>> Cc: Thomas Graf <tgraf@xxxxxxx> > >>> Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> > >>> > >>> v1->v2: use softirq_count() instead of in_softirq() > >>> --- > >>> include/net/cls_cgroup.h | 2 +- > >>> 1 file changed, 1 insertion(+), 1 deletion(-) > >>> > >>> diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h > >>> index 7e78e7d6f015..668aeee9b3f6 100644 > >>> --- a/include/net/cls_cgroup.h > >>> +++ b/include/net/cls_cgroup.h > >>> @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb) > >>> * calls by looking at the number of nested bh disable calls because > >>> * softirqs always disables bh. > >>> */ > >>> - if (in_serving_softirq()) { > >>> + if (softirq_count()) { > >>> struct sock *sk = skb_to_full_sk(skb); > >>> > >>> /* If there is an sock_cgroup_classid we'll use that. */ > >> > >> AFAICS the above changes the established behavior for a slightly > >> different scenario: > > > > right. > > > >> <sock S is created by task A> > >> <class ID for task A is changed> > >> <skb is created by sock S xmit and classified> > >> > >> prior to this patch the skb will be classified with the 'new' task A > >> classid, now with the old/original one. > >> > >> I'm unsure if such behavior change is acceptable; > > > > The classid of a skb is only meaningful within its original network > > context, not from a random task. > > Do you mean by original network context original netns? We also have > bpf_skb_cgroup_classid() as well as bpf_get_cgroup_classid_curr(), both > exposed to tcx, which kind of detangles what task_get_classid() is doing. > I guess if you have apps in its own netns and the skb->sk is retained all > the way to phys dev in hostns then bpf_skb_cgroup_classid() might be a > better choice (assuming classid stays constant from container orchestrator > PoV). Right. We have replaced bpf_get_cgroup_classid() with bpf_skb_cgroup_classid() to handle this case. Nonetheless, I believe we still need to fix bpf_get_cgroup_classid(), since this function can easily mislead users. -- Regards Yafang