Jason Xing wrote: > From: Jason Xing <kernelxing@xxxxxxxxxxx> > > This patch provides a setsockopt method to let applications leverage to > adjust how many descs to be handled at most in one send syscall. It > mitigates the situation where the default value (32) that is too small > leads to higher frequency of triggering send syscall. > > Considering the prosperity/complexity the applications have, there is no > absolutely ideal suggestion fitting all cases. So keep 32 as its default > value like before. > > The patch does the following things: > - Add XDP_MAX_TX_BUDGET socket option. > - Convert TX_BATCH_SIZE tx_budget_spent. > - Set tx_budget_spent to 32 by default in the initialization phase as a > per-socket granular control. > > The idea behind this comes out of real workloads in production. We use a > user-level stack with xsk support to accelerate sending packets and > minimize triggering syscalls. When the packets are aggregated, it's not > hard to hit the upper bound (namely, 32). The moment user-space stack > fetches the -EAGAIN error number passed from sendto(), it will loop to try > again until all the expected descs from tx ring are sent out to the driver. > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequency of > sendto() and higher throughput/PPS. > > Here is what I did in production, along with some numbers as follows: > For one application I saw lately, I suggested using 128 as max_tx_budget > because I saw two limitations without changing any default configuration: > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind > this was I counted how many descs are transmitted to the driver at one > time of sendto() based on [1] patch and then I calculated the > possibility of hitting the upper bound. Finally I chose 128 as a > suitable value because 1) it covers most of the cases, 2) a higher > number would not bring evident results. After twisting the parameters, > a stable improvement of around 4% for both PPS and throughput and less > resources consumption were found to be observed by strace -c -p xxx: > 1) %time was decreased by 7.8% > 2) error counter was decreased from 18367 to 572 > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@xxxxxxxxx/ > > Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx> > --- > v5 > Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@xxxxxxxxx/ > 1. remove changes around zc mode > > v4 > Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@xxxxxxxxx/ > 1. remove getsockopt as it seems no real use case. > 2. adjust the position of max_tx_budget to make sure it stays with other > read-most fields in one cacheline. > 3. set one as the lower bound of max_tx_budget > 4. add more descriptions/performance data in Doucmentation and commit message. > > V3 > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@xxxxxxxxx/ > 1. use a per-socket control (suggested by Stanislav) > 2. unify both definitions into one > 3. support setsockopt and getsockopt > 4. add more description in commit message > > V2 > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@xxxxxxxxx/ > 1. use a per-netns sysctl knob > 2. use sysctl_xsk_max_tx_budget to unify both definitions. > --- > Documentation/networking/af_xdp.rst | 8 ++++++++ > include/net/xdp_sock.h | 1 + > include/uapi/linux/if_xdp.h | 1 + > net/xdp/xsk.c | 20 ++++++++++++++++---- > tools/include/uapi/linux/if_xdp.h | 1 + > 5 files changed, 27 insertions(+), 4 deletions(-) > > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst > index dceeb0d763aa..9eb6f7b630a5 100644 > --- a/Documentation/networking/af_xdp.rst > +++ b/Documentation/networking/af_xdp.rst > @@ -442,6 +442,14 @@ is created by a privileged process and passed to a non-privileged one. > Once the option is set, kernel will refuse attempts to bind that socket > to a different interface. Updating the value requires CAP_NET_RAW. > > +XDP_MAX_TX_BUDGET setsockopt > +---------------------------- > + > +This setsockopt sets the maximum number of descriptors that can be handled > +and passed to the driver at one send syscall. It is applied in the non-zero > +copy mode to allow application to tune the per-socket maximum iteration for > +better throughput and less frequency of send syscall. Default is 32. > + > XDP_STATISTICS getsockopt > ------------------------- > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h > index e8bd6ddb7b12..ce587a225661 100644 > --- a/include/net/xdp_sock.h > +++ b/include/net/xdp_sock.h > @@ -84,6 +84,7 @@ struct xdp_sock { > struct list_head map_list; > /* Protects map_list */ > spinlock_t map_list_lock; > + u32 max_tx_budget; > /* Protects multiple processes in the control path */ > struct mutex mutex; > struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */ > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h > index 44f2bb93e7e6..07c6d21c2f1c 100644 > --- a/include/uapi/linux/if_xdp.h > +++ b/include/uapi/linux/if_xdp.h > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets { > #define XDP_UMEM_COMPLETION_RING 6 > #define XDP_STATISTICS 7 > #define XDP_OPTIONS 8 > +#define XDP_MAX_TX_BUDGET 9 > > struct xdp_umem_reg { > __u64 addr; /* Start of packet data area */ > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 72c000c0ae5f..97aded3555c1 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -33,8 +33,7 @@ > #include "xdp_umem.h" > #include "xsk.h" > > -#define TX_BATCH_SIZE 32 > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE) > +#define MAX_PER_SOCKET_BUDGET 32 > > void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) > { > @@ -779,7 +778,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > static int __xsk_generic_xmit(struct sock *sk) > { > struct xdp_sock *xs = xdp_sk(sk); > - u32 max_batch = TX_BATCH_SIZE; > + u32 max_budget = READ_ONCE(xs->max_tx_budget); > bool sent_frame = false; > struct xdp_desc desc; > struct sk_buff *skb; > @@ -797,7 +796,7 @@ static int __xsk_generic_xmit(struct sock *sk) > goto out; > > while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { > - if (max_batch-- == 0) { > + if (max_budget-- == 0) { > err = -EAGAIN; > goto out; > } > @@ -1437,6 +1436,18 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname, > mutex_unlock(&xs->mutex); > return err; > } > + case XDP_MAX_TX_BUDGET: > + { > + unsigned int budget; > + > + if (optlen != sizeof(budget)) > + return -EINVAL; > + if (copy_from_sockptr(&budget, optval, sizeof(budget))) > + return -EFAULT; > + > + WRITE_ONCE(xs->max_tx_budget, max(budget, 1)); I still think that this needs a more sane upper bound than U32_MAX. One limiting factor is the XSK TxQ length. At least it should be possible to fail if trying to set beyond that. > + return 0; > + } > default: > break; > } > @@ -1734,6 +1745,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol, > > xs = xdp_sk(sk); > xs->state = XSK_READY; > + xs->max_tx_budget = 32; > mutex_init(&xs->mutex); > > INIT_LIST_HEAD(&xs->map_list); > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h > index 44f2bb93e7e6..07c6d21c2f1c 100644 > --- a/tools/include/uapi/linux/if_xdp.h > +++ b/tools/include/uapi/linux/if_xdp.h > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets { > #define XDP_UMEM_COMPLETION_RING 6 > #define XDP_STATISTICS 7 > #define XDP_OPTIONS 8 > +#define XDP_MAX_TX_BUDGET 9 > > struct xdp_umem_reg { > __u64 addr; /* Start of packet data area */ > -- > 2.41.3 >