On Fri, Sep 05, 2025 at 10:33:45AM -0700, Amery Hung wrote: > xdp programs can change the layout of an xdp_buff through > bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver > cannot assume the size of the linear data area nor fragments. Fix the > bug in mlx5 by generating skb according to xdp_buff after xdp programs > run. > Shouldn't this patch be a fix for net then? > Currently, when handling multi-buf xdp, the mlx5 driver assumes the > layout of an xdp_buff to be unchanged. That is, the linear data area > continues to be empty and fragments remains the same. This may cause > the driver to generate erroneous skb or triggering a kernel > warning. When an xdp program added linear data through > bpf_xdp_adjust_head(), the linear data will be ignored as > mlx5e_build_linear_skb() builds an skb without linear data and then > pull data from fragments to fill the linear data area. When an xdp > program has shrunk the non-linear data through bpf_xdp_adjust_tail(), > the delta passed to __pskb_pull_tail() may exceed the actual nonlinear > data size and trigger the BUG_ON in it. > > To fix the issue, first record the original number of fragments. If the > number of fragments changes after the xdp program runs, rewind the end > fragment pointer by the difference and recalculate the truesize. Then, > build the skb with linear data area matching the xdp_buff. Finally, only > pull data in if there is non-linear data and fill the linear part up to > 256 bytes. > > Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ") Your fix covers both Legacy RQ and Striding RQ. So the tag is only 1/2 correct. Normally we have separate patches for each mode. > Signed-off-by: Amery Hung <ameryhung@xxxxxxxxx> > --- > .../net/ethernet/mellanox/mlx5/core/en_rx.c | 38 +++++++++++++++++-- > 1 file changed, 35 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > index b8c609d91d11..6b6bb90cf003 100644 > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > @@ -1729,6 +1729,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi > struct mlx5e_wqe_frag_info *head_wi = wi; > u16 rx_headroom = rq->buff.headroom; > struct mlx5e_frag_page *frag_page; > + u8 nr_frags_free, old_nr_frags; > struct skb_shared_info *sinfo; > u32 frag_consumed_bytes; > struct bpf_prog *prog; > @@ -1772,17 +1773,27 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi > wi++; > } > > + old_nr_frags = sinfo->nr_frags; > + > prog = rcu_dereference(rq->xdp_prog); > if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) { > if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) { > struct mlx5e_wqe_frag_info *pwi; > > + wi -= old_nr_frags - sinfo->nr_frags; > + > for (pwi = head_wi; pwi < wi; pwi++) > pwi->frag_page->frags++; > } > return NULL; /* page/packet was consumed by XDP */ > } > > + nr_frags_free = old_nr_frags - sinfo->nr_frags; > + if (unlikely(nr_frags_free)) { Even with with a branch prediction hint, is it really worth it? > + wi -= nr_frags_free; > + truesize -= nr_frags_free * frag_info->frag_stride; > + } > + > skb = mlx5e_build_linear_skb( > rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz, > mxbuf->xdp.data - mxbuf->xdp.data_hard_start, > @@ -2004,6 +2015,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > u32 byte_cnt = cqe_bcnt; > struct skb_shared_info *sinfo; > unsigned int truesize = 0; > + u32 pg_consumed_bytes; > struct bpf_prog *prog; > struct sk_buff *skb; > u32 linear_frame_sz; > @@ -2057,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > while (byte_cnt) { > /* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */ > - u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt); > + pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt); > > if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state)) > truesize += pg_consumed_bytes; > @@ -2073,10 +2085,15 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > } > > if (prog) { > + u8 nr_frags_free, old_nr_frags = sinfo->nr_frags; > + u32 len; > + > if (mlx5e_xdp_handle(rq, prog, mxbuf)) { > if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) { > struct mlx5e_frag_page *pfp; > > + frag_page -= old_nr_frags - sinfo->nr_frags; > + > for (pfp = head_page; pfp < frag_page; pfp++) > pfp->frags++; > > @@ -2087,9 +2104,22 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > return NULL; /* page/packet was consumed by XDP */ > } > > + len = mxbuf->xdp.data_end - mxbuf->xdp.data; > + > + nr_frags_free = old_nr_frags - sinfo->nr_frags; > + if (unlikely(nr_frags_free)) { Same question about the if. > + frag_page -= nr_frags_free; > + > + /* the last frag is always freed first */ > + truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)); > + while (--nr_frags_free) > + truesize -= nr_frags_free * > + ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz)); > + } > + This doesn't seem correct. It seems to remove too much from truesize when nr_frags_free > 2. I think it should be: truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) - (nr_frags_free - 1) * ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz)); And PAGE_SIZE is aligned to stride size so you can shorted it to: truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) - (nr_frags_free - 1) * PAGE_SIZE; Thanks, Dragos