Re: [PATCH v2] net/cls_cgroup: Fix task_get_classid() during qdisc run

Yafang Shao <laoar.shao@xxxxxxxxx> · Fri, 29 Aug 2025 11:23:05 +0800

On Thu, Aug 28, 2025 at 3:55 PM Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
>
> On 8/22/25 8:42 AM, Yafang Shao wrote:
> > During recent testing with the netem qdisc to inject delays into TCP
> > traffic, we observed that our CLS BPF program failed to function correctly
> > due to incorrect classid retrieval from task_get_classid(). The issue
> > manifests in the following call stack:
> >
> >         bpf_get_cgroup_classid+5
> >         cls_bpf_classify+507
> >         __tcf_classify+90
> >         tcf_classify+217
> >         __dev_queue_xmit+798
> >         bond_dev_queue_xmit+43
> >         __bond_start_xmit+211
> >         bond_start_xmit+70
> >         dev_hard_start_xmit+142
> >         sch_direct_xmit+161
> >         __qdisc_run+102             <<<<< Issue location
> >         __dev_xmit_skb+1015
> >         __dev_queue_xmit+637
> >         neigh_hh_output+159
> >         ip_finish_output2+461
> >         __ip_finish_output+183
> >         ip_finish_output+41
> >         ip_output+120
> >         ip_local_out+94
> >         __ip_queue_xmit+394
> >         ip_queue_xmit+21
> >         __tcp_transmit_skb+2169
> >         tcp_write_xmit+959
> >         __tcp_push_pending_frames+55
> >         tcp_push+264
> >         tcp_sendmsg_locked+661
> >         tcp_sendmsg+45
> >         inet_sendmsg+67
> >         sock_sendmsg+98
> >         sock_write_iter+147
> >         vfs_write+786
> >         ksys_write+181
> >         __x64_sys_write+25
> >         do_syscall_64+56
> >         entry_SYSCALL_64_after_hwframe+100
> >
> > The problem occurs when multiple tasks share a single qdisc. In such cases,
> > __qdisc_run() may transmit skbs created by different tasks. Consequently,
> > task_get_classid() retrieves an incorrect classid since it references the
> > current task's context rather than the skb's originating task.
> >
> > Given that dev_queue_xmit() always executes with bh disabled, we can safely
> > use in_softirq() instead of in_serving_softirq() to properly identify the
> > softirq context and obtain the correct classid.
> >
> > The simple steps to reproduce this issue:
> > 1. Add network delay to the network interface:
> >   such as: tc qdisc add dev bond0 root netem delay 1.5ms
> > 2. Create two distinct net_cls cgroups, each running a network-intensive task
> > 3. Initiate parallel TCP streams from both tasks to external servers.
> >
> > Under this specific condition, the issue reliably occurs. The kernel
> > eventually dequeues an SKB that originated from Task-A while executing in
> > the context of Task-B.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > Cc: Thomas Graf <tgraf@xxxxxxx>
> > Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> >
> > ---
> >
> > v1->v2: use softirq_count() instead of in_softirq()
> > ---
> >  include/net/cls_cgroup.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
> > index 7e78e7d6f015..668aeee9b3f6 100644
> > --- a/include/net/cls_cgroup.h
> > +++ b/include/net/cls_cgroup.h
> > @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
> >        * calls by looking at the number of nested bh disable calls because
> >        * softirqs always disables bh.
> >        */
> > -     if (in_serving_softirq()) {
> > +     if (softirq_count()) {
> >               struct sock *sk = skb_to_full_sk(skb);
> >
> >               /* If there is an sock_cgroup_classid we'll use that. */
>
> AFAICS the above changes the established behavior for a slightly
> different scenario:

right.

>
> <sock S is created by task A>
> <class ID for task A is changed>
> <skb is created by sock S xmit and classified>
>
> prior to this patch the skb will be classified with the 'new' task A
> classid, now with the old/original one.
>
> I'm unsure if such behavior change is acceptable;

The classid of a skb is only meaningful within its original network
context, not from a random task.

> I think at very least
> it should be mentioned in the changelog and likely this change should
> target net-next.

Will add this to the commit log and tag it for net-next in the next version.

-- 
Regards
Yafang