On Thu, Jul 3, 2025 at 9:09 PM Jason Xing <kerneljasonxing@xxxxxxxxx> wrote: > > Hi Maciej, > > Thanks for the review. > > On Thu, Jul 3, 2025 at 8:26 PM Maciej Fijalkowski > <maciej.fijalkowski@xxxxxxxxx> wrote: > > > > On Fri, Jun 27, 2025 at 07:01:21PM +0800, Jason Xing wrote: > > > From: Jason Xing <kernelxing@xxxxxxxxxxx> > > > > > > This patch provides a setsockopt method to let applications leverage to > > > adjust how many descs to be handled at most in one send syscall. It > > > mitigates the situation where the default value (32) that is too small > > > leads to higher frequency of triggering send syscall. > > > > > > Considering the prosperity/complexity the applications have, there is no > > > absolutely ideal suggestion fitting all cases. So keep 32 as its default > > > value like before. > > > > > > The patch does the following things: > > > - Add XDP_MAX_TX_BUDGET socket option. > > > - Convert TX_BATCH_SIZE to tx_budget_spent. > > > - Set tx_budget_spent to 32 by default in the initialization phase as a > > > per-socket granular control. 32 is also the min value for > > > tx_budget_spent. > > > - Set the range of tx_budget_spent as [32, xs->tx->nentries]. > > > > > > The idea behind this comes out of real workloads in production. We use a > > > user-level stack with xsk support to accelerate sending packets and > > > minimize triggering syscalls. When the packets are aggregated, it's not > > > hard to hit the upper bound (namely, 32). The moment user-space stack > > > fetches the -EAGAIN error number passed from sendto(), it will loop to try > > > again until all the expected descs from tx ring are sent out to the driver. > > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequency of > > > sendto() and higher throughput/PPS. > > > > > > Here is what I did in production, along with some numbers as follows: > > > For one application I saw lately, I suggested using 128 as max_tx_budget > > > because I saw two limitations without changing any default configuration: > > > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by > > > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind > > > this was I counted how many descs are transmitted to the driver at one > > > time of sendto() based on [1] patch and then I calculated the > > > possibility of hitting the upper bound. Finally I chose 128 as a > > > suitable value because 1) it covers most of the cases, 2) a higher > > > number would not bring evident results. After twisting the parameters, > > > a stable improvement of around 4% for both PPS and throughput and less > > > resources consumption were found to be observed by strace -c -p xxx: > > > 1) %time was decreased by 7.8% > > > 2) error counter was decreased from 18367 to 572 > > > > > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@xxxxxxxxx/ > > > > > > Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx> > > > --- > > > v6 > > > Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@xxxxxxxxx/ > > > 1. use [32, xs->tx->nentries] range > > > 2. Since setsockopt may generate a different value, add getsockopt to help > > > application know what value takes effect finally. > > > > > > v5 > > > Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@xxxxxxxxx/ > > > 1. remove changes around zc mode > > > > > > v4 > > > Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@xxxxxxxxx/ > > > 1. remove getsockopt as it seems no real use case. > > > 2. adjust the position of max_tx_budget to make sure it stays with other > > > read-most fields in one cacheline. > > > 3. set one as the lower bound of max_tx_budget > > > 4. add more descriptions/performance data in Doucmentation and commit message. > > > > > > V3 > > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@xxxxxxxxx/ > > > 1. use a per-socket control (suggested by Stanislav) > > > 2. unify both definitions into one > > > 3. support setsockopt and getsockopt > > > 4. add more description in commit message > > > > > > V2 > > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@xxxxxxxxx/ > > > 1. use a per-netns sysctl knob > > > 2. use sysctl_xsk_max_tx_budget to unify both definitions. > > > --- > > > Documentation/networking/af_xdp.rst | 9 +++++++ > > > include/net/xdp_sock.h | 1 + > > > include/uapi/linux/if_xdp.h | 1 + > > > net/xdp/xsk.c | 39 ++++++++++++++++++++++++++--- > > > tools/include/uapi/linux/if_xdp.h | 1 + > > > 5 files changed, 47 insertions(+), 4 deletions(-) > > > > > > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst > > > index dceeb0d763aa..253afee00162 100644 > > > --- a/Documentation/networking/af_xdp.rst > > > +++ b/Documentation/networking/af_xdp.rst > > > @@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one. > > > Once the option is set, kernel will refuse attempts to bind that socket > > > to a different interface. Updating the value requires CAP_NET_RAW. > > > > > > +XDP_MAX_TX_BUDGET setsockopt > > > +---------------------------- > > > + > > > +This setsockopt sets the maximum number of descriptors that can be handled > > > +and passed to the driver at one send syscall. It is applied in the non-zero > > > > just 'copy mode' would be enough IMHO. > > Okay. > > > > > > +copy mode to allow application to tune the per-socket maximum iteration for > > > +better throughput and less frequency of send syscall. > > > +Allowed range is [32, xs->tx->nentries]. > > > + > > > XDP_STATISTICS getsockopt > > > ------------------------- > > > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h > > > index e8bd6ddb7b12..ce587a225661 100644 > > > --- a/include/net/xdp_sock.h > > > +++ b/include/net/xdp_sock.h > > > @@ -84,6 +84,7 @@ struct xdp_sock { > > > struct list_head map_list; > > > /* Protects map_list */ > > > spinlock_t map_list_lock; > > > + u32 max_tx_budget; > > > /* Protects multiple processes in the control path */ > > > struct mutex mutex; > > > struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */ > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h > > > index 44f2bb93e7e6..07c6d21c2f1c 100644 > > > --- a/include/uapi/linux/if_xdp.h > > > +++ b/include/uapi/linux/if_xdp.h > > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets { > > > #define XDP_UMEM_COMPLETION_RING 6 > > > #define XDP_STATISTICS 7 > > > #define XDP_OPTIONS 8 > > > +#define XDP_MAX_TX_BUDGET 9 > > > > > > struct xdp_umem_reg { > > > __u64 addr; /* Start of packet data area */ > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > > index 72c000c0ae5f..41efe7b27b0e 100644 > > > --- a/net/xdp/xsk.c > > > +++ b/net/xdp/xsk.c > > > @@ -33,8 +33,8 @@ > > > #include "xdp_umem.h" > > > #include "xsk.h" > > > > > > -#define TX_BATCH_SIZE 32 > > > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE) > > > +#define TX_BUDGET_SIZE 32 > > > +#define MAX_PER_SOCKET_BUDGET 32 > > > > that could have stayed as it was before? > > Sorry, I don't think so because one definition has nothing to do with > the other one. Why do we need to bind them together? > > > > > > > > > void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) > > > { > > > @@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > > > static int __xsk_generic_xmit(struct sock *sk) > > > { > > > struct xdp_sock *xs = xdp_sk(sk); > > > - u32 max_batch = TX_BATCH_SIZE; > > > + u32 max_budget = READ_ONCE(xs->max_tx_budget); > > > > you're breaking RCT here... > > What's the RCT? Never heard of this abbreviation. > > > maybe you could READ_ONCE() the budget right > > before the while() loop and keep the declaration only at the top of func? > > > > also nothing wrong with keeping @max_batch as a name for this budget. > > > > > bool sent_frame = false; > > > struct xdp_desc desc; > > > struct sk_buff *skb; > > > @@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk) > > > goto out; > > > > > > while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { > > > - if (max_batch-- == 0) { > > > + if (max_budget-- == 0) { > > > err = -EAGAIN; > > > goto out; > > > } > > > @@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname, > > > mutex_unlock(&xs->mutex); > > > return err; > > > } > > > + case XDP_MAX_TX_BUDGET: > > > + { > > > + unsigned int budget; > > > + > > > + if (optlen != sizeof(budget)) > > > + return -EINVAL; > > > + if (copy_from_sockptr(&budget, optval, sizeof(budget))) > > > + return -EFAULT; > > > + if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE) > > > + return -EACCES; > > > + > > > + WRITE_ONCE(xs->max_tx_budget, > > > + clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries)); > > > > I think it would be better to throw errno when budget set by user is > > bigger than nentries. you do it for min case, let's do it for max case as > > well and skip the clamp() altogether? > > Okay. > > After this, do you think it's necessary to keep the getsockopt()? > > > > > this is rather speculative that someone would ever set such a big budget > > but silently setting a lower value than provided might be confusing to me. > > i'd rather get an error thrown at my face and find out the valid range. > > On point. > > > > > > + return 0; > > > + } > > > default: > > > break; > > > } > > > @@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname, > > > > > > return 0; > > > } > > > + case XDP_MAX_TX_BUDGET: > > > + { > > > + unsigned int budget; > > > + > > > + if (len < sizeof(budget)) > > > + return -EINVAL; > > > + > > > + budget = READ_ONCE(xs->max_tx_budget); > > > + if (copy_to_user(optval, &budget, sizeof(budget))) > > > + return -EFAULT; > > > + if (put_user(sizeof(budget), optlen)) > > > + return -EFAULT; > > > + > > > + return 0; > > > + } > > > default: > > > break; > > > } > > > @@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol, > > > > > > xs = xdp_sk(sk); > > > xs->state = XSK_READY; > > > + xs->max_tx_budget = TX_BUDGET_SIZE; > > > mutex_init(&xs->mutex); > > > > > > INIT_LIST_HEAD(&xs->map_list); > > > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h > > > index 44f2bb93e7e6..07c6d21c2f1c 100644 > > > --- a/tools/include/uapi/linux/if_xdp.h > > > +++ b/tools/include/uapi/linux/if_xdp.h > > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets { > > > #define XDP_UMEM_COMPLETION_RING 6 > > > #define XDP_STATISTICS 7 > > > #define XDP_OPTIONS 8 > > > +#define XDP_MAX_TX_BUDGET 9 > > > > could we have 'SKB' in name? > > Like XDP_MAX_TX_BUDGET_SKB ? XDP_MAX_TX_SKB_BUDGET sounds better, I assume. Thanks, Jason