On 06/24, Jason Xing wrote:
> On Mon, Jun 23, 2025 at 10:18 PM Stanislav Fomichev
> <stfomichev@xxxxxxxxx> wrote:
> >
> > On 06/21, Jason Xing wrote:
> > > On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
> > > <stfomichev@xxxxxxxxx> wrote:
> > > >
> > > > On 06/21, Jason Xing wrote:
> > > > > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > > > > <stfomichev@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On 06/19, Jakub Kicinski wrote:
> > > > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > > > >  	rcu_read_lock();
> > > > > > > >  again:
> > > > > > > >  	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > > > -		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > > > +		int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > > > +
> > > > > > > > +		if (xs->tx_budget_spent >= max_budget) {
> > > > > > > >  			budget_exhausted = true;
> > > > > > > >  			continue;
> > > > > > > >  		}
> > > > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > > > >  {
> > > > > > > >  	struct xdp_sock *xs = xdp_sk(sk);
> > > > > > > > -	u32 max_batch = TX_BATCH_SIZE;
> > > > > > > > +	u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > >
> > > > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > > > >
> > > > > > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > > > > > whether we want to affect zc case given the fact that Jason seemingly
> > > > > > cares about copy mode is a good question.
> > > > >
> > > > > Allow me to ask the similar question that you asked me before: even though I
> > > > > didn't see the necessity to set the max budget for zc mode (just
> > > > > because I didn't spot it happening), would it be better if we separate
> > > > > both of them because it's an uAPI interface. IIUC, if the setsockopt
> > > > > is set, we will not separate it any more in the future?
> > > > >
> > > > > We can keep using the hardcoded value (32) in the zc mode like
> > > > > before and __only__ touch the copy mode? Later if someone or I found
> > > > > the significance of making it tunable, then another parameter of
> > > > > setsockopt can be added? Does it make sense?
> > > >
> > > > Related suggestion: maybe we don't need this limit at all for the copy mode?
> > > > If the user, with a socket option, can arbitrarily change it, what is the
> > > > point of this limit? Keep it on the zc side to make sure one socket doesn't
> > > > starve the rest and drop from the copy mode.. Any reason not to do it?
> > >
> > > Thanks for bringing up the same question that I had in this thread. I
> > > saw the commit[1] mentioned it is used to avoid the burst as DPDK
> > > does, so my thought is that it might be used to prevent such a case
> > > where multiple sockets try to send packets through a shared umem
> > > nearly at the same time?
> > >
> > > Making it tunable is to provide a chance to let users seek for a good
> > > solution that is the best fit for them. It doesn't mean we
> > > allow/expect to see the burst situation.
> >
> > The users can choose to moderate their batches by submitting less
> > with each sendmsg call. I see why having a batch limit might be useful for
> > zerocopy to tx in batches to interleave multiple sockets, but not
> > sure how this limit helps for the copy mode. Since we are not running
> > qdisc layer on tx, we don't really have a good answer for multiple
> > sockets sharing the same device/queue..
>
> It's worth mentioning that the xsk still holds the tx queue lock in
> the non-zc mode. So I assume getting rid of the limit might be harmful
> for other non xsk flows. That is what I know about the burst concern.

But one still needs NET_RAW to use it, right? So it's not like some
random process will suddenly start ddos-ing tx queues.. Maybe we should
add need_resched() / signal_pending() to the loop to break it?
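Something along these lines, untested and only to illustrate the idea
(the loop below is paraphrased from __xsk_generic_xmit() in
net/xdp/xsk.c rather than copied verbatim):

	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
		/* Instead of a fixed per-sendmsg batch limit for copy
		 * mode, bail out when the task should yield or has a
		 * signal pending, so one socket can't hog the CPU or
		 * the tx queue lock indefinitely.
		 */
		if (need_resched() || signal_pending(current)) {
			err = -EAGAIN;
			goto out;
		}

		/* ... existing skb build + transmit path unchanged ... */
	}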
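And just for reference on the uAPI question above, from the userspace
side the knob would presumably be exercised roughly like below. This is
only a sketch of the shape of the interface: the XDP_MAX_TX_BUDGET name
and number are placeholders I made up, not something in today's uapi
headers.

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef XDP_MAX_TX_BUDGET
/* Placeholder identifier and value; the real ones are whatever the
 * final patch adds to include/uapi/linux/if_xdp.h.
 */
#define XDP_MAX_TX_BUDGET 13
#endif

/* Bump (or shrink) the per-sendmsg descriptor budget of a copy-mode
 * AF_XDP socket instead of relying on the hardcoded 32.
 */
static int xsk_set_tx_budget(int xsk_fd, int budget)
{
	if (setsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
		       &budget, sizeof(budget)) < 0) {
		perror("setsockopt(XDP_MAX_TX_BUDGET)");
		return -1;
	}
	return 0;
}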