On Mon, Sep 8, 2025 at 7:42 AM Dragos Tatulea <dtatulea@xxxxxxxxxx> wrote: Resending the reply to the list again as some html stuff accidentally got mixed in > > On Fri, Sep 05, 2025 at 10:33:45AM -0700, Amery Hung wrote: > > xdp programs can change the layout of an xdp_buff through > > bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver > > cannot assume the size of the linear data area nor fragments. Fix the > > bug in mlx5 by generating skb according to xdp_buff after xdp programs > > run. > > > Shouldn't this patch be a fix for net then? Make sense. I will separate the mlx5 patch from this set and target net. > > > Currently, when handling multi-buf xdp, the mlx5 driver assumes the > > layout of an xdp_buff to be unchanged. That is, the linear data area > > continues to be empty and fragments remains the same. This may cause > > the driver to generate erroneous skb or triggering a kernel > > warning. When an xdp program added linear data through > > bpf_xdp_adjust_head(), the linear data will be ignored as > > mlx5e_build_linear_skb() builds an skb without linear data and then > > pull data from fragments to fill the linear data area. When an xdp > > program has shrunk the non-linear data through bpf_xdp_adjust_tail(), > > the delta passed to __pskb_pull_tail() may exceed the actual nonlinear > > data size and trigger the BUG_ON in it. > > > > To fix the issue, first record the original number of fragments. If the > > number of fragments changes after the xdp program runs, rewind the end > > fragment pointer by the difference and recalculate the truesize. Then, > > build the skb with linear data area matching the xdp_buff. Finally, only > > pull data in if there is non-linear data and fill the linear part up to > > 256 bytes. > > > > Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ") > Your fix covers both Legacy RQ and Striding RQ. So the tag is only 1/2 > correct. Normally we have separate patches for each mode. Will split the patch into two. > > > > > Signed-off-by: Amery Hung <ameryhung@xxxxxxxxx> > > --- > > .../net/ethernet/mellanox/mlx5/core/en_rx.c | 38 +++++++++++++++++-- > > 1 file changed, 35 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > index b8c609d91d11..6b6bb90cf003 100644 > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > @@ -1729,6 +1729,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi > > struct mlx5e_wqe_frag_info *head_wi = wi; > > u16 rx_headroom = rq->buff.headroom; > > struct mlx5e_frag_page *frag_page; > > + u8 nr_frags_free, old_nr_frags; > > struct skb_shared_info *sinfo; > > u32 frag_consumed_bytes; > > struct bpf_prog *prog; > > @@ -1772,17 +1773,27 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi > > wi++; > > } > > > > + old_nr_frags = sinfo->nr_frags; > > + > > prog = rcu_dereference(rq->xdp_prog); > > if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) { > > if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) { > > struct mlx5e_wqe_frag_info *pwi; > > > > + wi -= old_nr_frags - sinfo->nr_frags; > > + > > for (pwi = head_wi; pwi < wi; pwi++) > > pwi->frag_page->frags++; > > } > > return NULL; /* page/packet was consumed by XDP */ > > } > > > > + nr_frags_free = old_nr_frags - sinfo->nr_frags; > > + if (unlikely(nr_frags_free)) { > Even with with a branch prediction hint, is it really worth it? > [...] > > > + wi -= nr_frags_free; > > + truesize -= nr_frags_free * frag_info->frag_stride; > > + } > > + > > skb = mlx5e_build_linear_skb( > > rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz, > > mxbuf->xdp.data - mxbuf->xdp.data_hard_start, > > @@ -2004,6 +2015,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > u32 byte_cnt = cqe_bcnt; > > struct skb_shared_info *sinfo; > > unsigned int truesize = 0; > > + u32 pg_consumed_bytes; > > struct bpf_prog *prog; > > struct sk_buff *skb; > > u32 linear_frame_sz; > > @@ -2057,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > while (byte_cnt) { > > /* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */ > > - u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt); > > + pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt); > > > > if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state)) > > truesize += pg_consumed_bytes; > > @@ -2073,10 +2085,15 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > } > > > > if (prog) { > > + u8 nr_frags_free, old_nr_frags = sinfo->nr_frags; > > + u32 len; > > + > > if (mlx5e_xdp_handle(rq, prog, mxbuf)) { > > if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) { > > struct mlx5e_frag_page *pfp; > > > > + frag_page -= old_nr_frags - sinfo->nr_frags; > > + > > for (pfp = head_page; pfp < frag_page; pfp++) > > pfp->frags++; > > > > @@ -2087,9 +2104,22 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > return NULL; /* page/packet was consumed by XDP */ > > } > > > > + len = mxbuf->xdp.data_end - mxbuf->xdp.data; > > + > > + nr_frags_free = old_nr_frags - sinfo->nr_frags; > > + if (unlikely(nr_frags_free)) { > Same question about the if. I see. I will make the recalculation unconditional. > > > + frag_page -= nr_frags_free; > > + > > + /* the last frag is always freed first */ > > + truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)); > > + while (--nr_frags_free) > > + truesize -= nr_frags_free * > > + ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz)); > > + } > > + > This doesn't seem correct. It seems to remove too much from truesize > when nr_frags_free > 2. I think it should be: > > truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) - > (nr_frags_free - 1) * ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz)); > > And PAGE_SIZE is aligned to stride size so you can shorted it to: > > truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) - > (nr_frags_free - 1) * PAGE_SIZE; Sorry that I was being sloppy here. You are correct, and I think you probably meant "+" instead of "-". truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) + (nr_frags_free - 1) * PAGE_SIZE; > > Thanks, > Dragos >