On Thu, Aug 28, 2025 at 9:23 AM Dragos Tatulea <dtatulea@xxxxxxxxxx> wrote: > > On Wed, Aug 27, 2025 at 08:44:24PM -0700, Amery Hung wrote: > > On Wed, Aug 27, 2025 at 6:45 AM Dragos Tatulea <dtatulea@xxxxxxxxxx> wrote: > > > > > > On Mon, Aug 25, 2025 at 12:39:12PM -0700, Amery Hung wrote: > > > > [...] > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > index b8c609d91d11..c5173f1ccb4e 100644 > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c > > > > @@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi > > > > struct mlx5_cqe64 *cqe, u32 cqe_bcnt) > > > > { > > > > struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0]; > > > > + struct mlx5e_wqe_frag_info *pwi, *head_wi = wi; > > > > struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf; > > > > - struct mlx5e_wqe_frag_info *head_wi = wi; > > > > u16 rx_headroom = rq->buff.headroom; > > > > struct mlx5e_frag_page *frag_page; > > > > struct skb_shared_info *sinfo; > > > > - u32 frag_consumed_bytes; > > > > + u32 frag_consumed_bytes, i; > > > > struct bpf_prog *prog; > > > > struct sk_buff *skb; > > > > dma_addr_t addr; > > > > u32 truesize; > > > > + u8 nr_frags; > > > > void *va; > > > > > > > > frag_page = wi->frag_page; > > > > @@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi > > > > prog = rcu_dereference(rq->xdp_prog); > > > > if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) { > > > > if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) { > > > > - struct mlx5e_wqe_frag_info *pwi; > > > > + pwi = head_wi; > > > > + while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi) > > > > + pwi++; > > > > > > > Is this trying to skip counting the frags for the linear part? If yes, > > > don't understand the reasoning. If not, I don't follow the code. > > > > > > AFAIU frags have to be counted for the linear part + sinfo->nr_frags. > > > Frags could be less after xdp program execution, but the linear part is > > > still there. > > > > > > > This is to search the first frag after xdp runs because I thought it > > is possible that the first frag (head_wi+1) might be released by > > bpf_xdp_pull_data() and then the frag will start from head_wi+2. > > > > After sleeping on it a bit, it seems it is not possible as there is > > not enough room in the linear to completely pull PAGE_SIZE byte of > > data from the first frag to the linear area. Is this correct? > > > Right. AFAIU the usable linear part is smaller due to headroom and > tailroom. > > [...] > > > > if (unlikely(!skb)) { > > > > mlx5e_page_release_fragmented(rq->page_pool, > > > > @@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w > > > > mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page); > > > > > > > > if (xdp_buff_has_frags(&mxbuf->xdp)) { > > > > - struct mlx5e_frag_page *pagep; > > > > + struct mlx5e_frag_page *pagep = head_page; > > > > + > > > > + truesize = nr_frags * PAGE_SIZE; > > > I am not sure that this is accurate. The last fragment might be smaller > > > than page size. It should be aligned to BIT(rq->mpwqe.log_stride_sz). > > > > > > > According to the truesize calculation in > > mlx5e_skb_from_cqe_mpwrq_nonlinear() just before mlx5e_xdp_handle(). > > After the first frag, the frag_offset is always 0 and > > pg_consumed_bytes will be PAGE_SIZE. Therefore the last page also > > consumes a page, no? > > > My understanding is that the last pg_consumed_bytes will be a byte_cnt > that is smaller than PAGE_SIZE as there is a min operation. The remaining byte_cnt will then be aligned to BIT(rq->mpwqe.log_stride_sz), which is PAGE_SHIFT if there is xdp program (per mlx5e_mpwqe_get_log_stride_size()). Therefore, it still adds a page to truesize. I will change to ALIGN(..., BIT(rq->mpwqe.log_stride_sz)) to be consistent with existing code. > > > If the last page has variable size, I wonder how can > > bpf_xdp_adjust_tail() handle a dynamic tailroom. > That is a good point. So this can stay as is I guess. > > > bpf_xdp_adjust_tail() > > requires a driver to specify a static frag size (the maximum size a > > frag can grow) when calling __xdp_rxq_info_reg(), which seem to be a > > page in mlx5. > > > This is an issue raised by Nimrod as well. Currently striding rq sets > rxq->frag_size to 0. It is set to PAGE_SIZE only in legacy rq mode. > I see. > > > > > > > > > > /* sinfo->nr_frags is reset by build_skb, calculate again. */ > > > > - xdp_update_skb_shared_info(skb, frag_page - head_page, > > > > + xdp_update_skb_shared_info(skb, nr_frags, > > > > sinfo->xdp_frags_size, truesize, > > > > xdp_buff_is_frag_pfmemalloc( > > > > &mxbuf->xdp)); > > > > > > > > - pagep = head_page; > > > > - do > > > > + while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page) > > > > + pagep++; > > > > + > > > > + for (i = 0; i < nr_frags; i++, pagep++) > > > > pagep->frags++; > > > > - while (++pagep < frag_page); > > > > + > > > > + headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size); > > > > + __pskb_pull_tail(skb, headlen); > > > > } > > > > - __pskb_pull_tail(skb, headlen); > > > What happens when there are no more frags? (bpf_xdp_frags_shrink_tail() > > > shrinked them out). Is that at all possible? > > > > It is possible for bpf_xdp_frags_shrink_tail() to release all frags. > > There is no limit of how much they can shrink. If there is linear > > data, the kfunc allows shrinking data_end until ETH_HLEN. Before this > > patchset, it could trigger a BUG_ON in __pskb_pull_tail(). After this > > set, the driver will pass a empty skb to the upper layer. > > > I see what you mean. > > > For bpf_xdp_pull_data(), in the case of mlx5, I think it is only > > possible to release all frags when the first and only frag contains > > less than 256 bytes, which is the free space in the linear page. > > > Why would only 256 bytes be free in the linear area? My understanding > is that we have PAGE_SIZE - headroom - tailroom which should be more? > mlx5e_skb_from_cqe_mpwrq_nonlinear() currently sets xdp->frame_sz to be XDP_PACKET_HEADROOM (256) + MLX5E_RX_MAX_HEAD (256) + sizeof(struct skb_shared_info), so only 256 is available to xdp programs to pull data in. > > > > > > In general, I think the code would be nicer if it would do a rewind of > > > the end pointer based on the diff between the old and new nr_frags. > > > > > > > Not sure if I get this. Do you mean calling __pskb_pull_tail() some > > how based on the difference between sinfo->nr_frags and nr_frags? > > > > Thanks for reviewing the patch! > > > I was suggesting an approach for the whole patch that might be cleaner. > > Roll back frag_page to the last used fragment after program execution: > > frag_page -= old_nr_frags - new_nr_frags; > > ... and after that you won't need to touch the frag counting loops > and the xdp_update_skb_shared_info(). > Got it. This makes the change quite cleaner. > Thanks, > Dragos