On 06/24, Jason Xing wrote:
> On Mon, Jun 23, 2025 at 10:18 PM Stanislav Fomichev
> <stfomichev@xxxxxxxxx> wrote:
> >
> > On 06/21, Jason Xing wrote:
> > > On Sat, Jun 21, 2025 at 12:47 AM Stanislav Fomichev
> > > <stfomichev@xxxxxxxxx> wrote:
> > > >
> > > > On 06/21, Jason Xing wrote:
> > > > > On Fri, Jun 20, 2025 at 10:25 PM Stanislav Fomichev
> > > > > <stfomichev@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On 06/19, Jakub Kicinski wrote:
> > > > > > > On Thu, 19 Jun 2025 17:04:40 +0800 Jason Xing wrote:
> > > > > > > > @@ -424,7 +421,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> > > > > > > >  	rcu_read_lock();
> > > > > > > >  again:
> > > > > > > >  	list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> > > > > > > > -		if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET) {
> > > > > > > > +		int max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > > > +
> > > > > > > > +		if (xs->tx_budget_spent >= max_budget) {
> > > > > > > >  			budget_exhausted = true;
> > > > > > > >  			continue;
> > > > > > > >  		}
> > > > > > > > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > > > >  static int __xsk_generic_xmit(struct sock *sk)
> > > > > > > >  {
> > > > > > > >  	struct xdp_sock *xs = xdp_sk(sk);
> > > > > > > > -	u32 max_batch = TX_BATCH_SIZE;
> > > > > > > > +	u32 max_budget = READ_ONCE(xs->max_tx_budget);
> > > > > > >
> > > > > > > Hm, maybe a question to Stan / Willem & other XSK experts but are these
> > > > > > > two max values / code paths really related? Question 2 -- is generic
> > > > > > > XSK a legit optimization target, legit enough to add uAPI?
> > > > > >
> > > > > > 1) xsk_tx_peek_desc is for zc case and xsk_build_skb is copy mode;
> > > > > > whether we want to affect zc case given the fact that Jason seemingly
> > > > > > cares about copy mode is a good question.
> > > > >
> > > > > Allow me to ask the similar question that you asked me before: even though I
> > > > > didn't see the necessity to set the max budget for zc mode (just
> > > > > because I didn't spot it happening), would it be better if we separate
> > > > > both of them because it's an uAPI interface. IIUC, if the setsockopt
> > > > > is set, we will not separate it any more in the future?
> > > > >
> > > > > We can keep using the hardcoded value (32) in the zc mode like
> > > > > before and __only__ touch the copy mode? Later if someone or I found
> > > > > the significance of making it tunable, then another parameter of
> > > > > setsockopt can be added? Does it make sense?
> > > >
> > > > Related suggestion: maybe we don't need this limit at all for the copy mode?
> > > > If the user, with a socket option, can arbitrarily change it, what is the
> > > > point of this limit? Keep it on the zc side to make sure one socket doesn't
> > > > starve the rest and drop from the copy mode.. Any reason not to do it?
> > >
> > > Thanks for bringing up the same question that I had in this thread. I
> > > saw the commit[1] mentioned it is used to avoid the burst as DPDK
> > > does, so my thought is that it might be used to prevent such a case
> > > where multiple sockets try to send packets through a shared umem
> > > nearly at the same time?
> > >
> > > Making it tunable is to provide a chance to let users seek for a good
> > > solution that is the best fit for them. It doesn't mean we
> > > allow/expect to see the burst situation.
> >
> > The users can choose to moderate their batches by submitting less
> > with each sendmsg call. I see why having a batch limit might be useful for
> > zerocopy to tx in batches to interleave multiple sockets, but not
> > sure how this limit helps for the copy mode. Since we are not running
> > qdisc layer on tx, we don't really have a good answer for multiple
> > sockets sharing the same device/queue..
>
> It's worth mentioning that the xsk still holds the tx queue lock in
> the non-zc mode. So I assume getting rid of the limit might be harmful
> for other non xsk flows. That is what I know about the burst concern.

But one still needs NET_RAW to use it, right? So it's not like some
random process will suddenly start ddos-ing tx queues.. Maybe we should
add need_resched() / signal_pending() to the loop to break it?
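Something along these lines, untested and only to illustrate the idea
(the loop below is paraphrased from __xsk_generic_xmit() in
net/xdp/xsk.c rather than copied verbatim):

	while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
		/* Instead of a fixed per-sendmsg batch limit for copy
		 * mode, bail out when the task should yield or has a
		 * signal pending, so one socket can't hog the CPU or
		 * the tx queue lock indefinitely.
		 */
		if (need_resched() || signal_pending(current)) {
			err = -EAGAIN;
			goto out;
		}

		/* ... existing skb build + transmit path unchanged ... */
	}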
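And just for reference on the uAPI question above, from the userspace
side the knob would presumably be exercised roughly like below. This is
only a sketch of the shape of the interface: the XDP_MAX_TX_BUDGET name
and number are placeholders I made up, not something in today's uapi
headers.

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef XDP_MAX_TX_BUDGET
/* Placeholder identifier and value; the real ones are whatever the
 * final patch adds to include/uapi/linux/if_xdp.h.
 */
#define XDP_MAX_TX_BUDGET 13
#endif

/* Bump (or shrink) the per-sendmsg descriptor budget of a copy-mode
 * AF_XDP socket instead of relying on the hardcoded 32.
 */
static int xsk_set_tx_budget(int xsk_fd, int budget)
{
	if (setsockopt(xsk_fd, SOL_XDP, XDP_MAX_TX_BUDGET,
		       &budget, sizeof(budget)) < 0) {
		perror("setsockopt(XDP_MAX_TX_BUDGET)");
		return -1;
	}
	return 0;
}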