Hi Maciej, Thanks for the review. On Thu, Jul 3, 2025 at 8:26 PM Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> wrote: > > On Fri, Jun 27, 2025 at 07:01:21PM +0800, Jason Xing wrote: > > From: Jason Xing <kernelxing@xxxxxxxxxxx> > > > > This patch provides a setsockopt method to let applications leverage to > > adjust how many descs to be handled at most in one send syscall. It > > mitigates the situation where the default value (32) that is too small > > leads to higher frequency of triggering send syscall. > > > > Considering the prosperity/complexity the applications have, there is no > > absolutely ideal suggestion fitting all cases. So keep 32 as its default > > value like before. > > > > The patch does the following things: > > - Add XDP_MAX_TX_BUDGET socket option. > > - Convert TX_BATCH_SIZE to tx_budget_spent. > > - Set tx_budget_spent to 32 by default in the initialization phase as a > > per-socket granular control. 32 is also the min value for > > tx_budget_spent. > > - Set the range of tx_budget_spent as [32, xs->tx->nentries]. > > > > The idea behind this comes out of real workloads in production. We use a > > user-level stack with xsk support to accelerate sending packets and > > minimize triggering syscalls. When the packets are aggregated, it's not > > hard to hit the upper bound (namely, 32). The moment user-space stack > > fetches the -EAGAIN error number passed from sendto(), it will loop to try > > again until all the expected descs from tx ring are sent out to the driver. > > Enlarging the XDP_MAX_TX_BUDGET value contributes to less frequency of > > sendto() and higher throughput/PPS. > > > > Here is what I did in production, along with some numbers as follows: > > For one application I saw lately, I suggested using 128 as max_tx_budget > > because I saw two limitations without changing any default configuration: > > 1) XDP_MAX_TX_BUDGET, 2) socket sndbuf which is 212992 decided by > > net.core.wmem_default. As to XDP_MAX_TX_BUDGET, the scenario behind > > this was I counted how many descs are transmitted to the driver at one > > time of sendto() based on [1] patch and then I calculated the > > possibility of hitting the upper bound. Finally I chose 128 as a > > suitable value because 1) it covers most of the cases, 2) a higher > > number would not bring evident results. After twisting the parameters, > > a stable improvement of around 4% for both PPS and throughput and less > > resources consumption were found to be observed by strace -c -p xxx: > > 1) %time was decreased by 7.8% > > 2) error counter was decreased from 18367 to 572 > > > > [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@xxxxxxxxx/ > > > > Signed-off-by: Jason Xing <kernelxing@xxxxxxxxxxx> > > --- > > v6 > > Link: https://lore.kernel.org/all/20250625123527.98209-1-kerneljasonxing@xxxxxxxxx/ > > 1. use [32, xs->tx->nentries] range > > 2. Since setsockopt may generate a different value, add getsockopt to help > > application know what value takes effect finally. > > > > v5 > > Link: https://lore.kernel.org/all/20250623021345.69211-1-kerneljasonxing@xxxxxxxxx/ > > 1. remove changes around zc mode > > > > v4 > > Link: https://lore.kernel.org/all/20250619090440.65509-1-kerneljasonxing@xxxxxxxxx/ > > 1. remove getsockopt as it seems no real use case. > > 2. adjust the position of max_tx_budget to make sure it stays with other > > read-most fields in one cacheline. > > 3. set one as the lower bound of max_tx_budget > > 4. add more descriptions/performance data in Doucmentation and commit message. > > > > V3 > > Link: https://lore.kernel.org/all/20250618065553.96822-1-kerneljasonxing@xxxxxxxxx/ > > 1. use a per-socket control (suggested by Stanislav) > > 2. unify both definitions into one > > 3. support setsockopt and getsockopt > > 4. add more description in commit message > > > > V2 > > Link: https://lore.kernel.org/all/20250617002236.30557-1-kerneljasonxing@xxxxxxxxx/ > > 1. use a per-netns sysctl knob > > 2. use sysctl_xsk_max_tx_budget to unify both definitions. > > --- > > Documentation/networking/af_xdp.rst | 9 +++++++ > > include/net/xdp_sock.h | 1 + > > include/uapi/linux/if_xdp.h | 1 + > > net/xdp/xsk.c | 39 ++++++++++++++++++++++++++--- > > tools/include/uapi/linux/if_xdp.h | 1 + > > 5 files changed, 47 insertions(+), 4 deletions(-) > > > > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst > > index dceeb0d763aa..253afee00162 100644 > > --- a/Documentation/networking/af_xdp.rst > > +++ b/Documentation/networking/af_xdp.rst > > @@ -442,6 +442,15 @@ is created by a privileged process and passed to a non-privileged one. > > Once the option is set, kernel will refuse attempts to bind that socket > > to a different interface. Updating the value requires CAP_NET_RAW. > > > > +XDP_MAX_TX_BUDGET setsockopt > > +---------------------------- > > + > > +This setsockopt sets the maximum number of descriptors that can be handled > > +and passed to the driver at one send syscall. It is applied in the non-zero > > just 'copy mode' would be enough IMHO. Okay. > > > +copy mode to allow application to tune the per-socket maximum iteration for > > +better throughput and less frequency of send syscall. > > +Allowed range is [32, xs->tx->nentries]. > > + > > XDP_STATISTICS getsockopt > > ------------------------- > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h > > index e8bd6ddb7b12..ce587a225661 100644 > > --- a/include/net/xdp_sock.h > > +++ b/include/net/xdp_sock.h > > @@ -84,6 +84,7 @@ struct xdp_sock { > > struct list_head map_list; > > /* Protects map_list */ > > spinlock_t map_list_lock; > > + u32 max_tx_budget; > > /* Protects multiple processes in the control path */ > > struct mutex mutex; > > struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */ > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h > > index 44f2bb93e7e6..07c6d21c2f1c 100644 > > --- a/include/uapi/linux/if_xdp.h > > +++ b/include/uapi/linux/if_xdp.h > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets { > > #define XDP_UMEM_COMPLETION_RING 6 > > #define XDP_STATISTICS 7 > > #define XDP_OPTIONS 8 > > +#define XDP_MAX_TX_BUDGET 9 > > > > struct xdp_umem_reg { > > __u64 addr; /* Start of packet data area */ > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > index 72c000c0ae5f..41efe7b27b0e 100644 > > --- a/net/xdp/xsk.c > > +++ b/net/xdp/xsk.c > > @@ -33,8 +33,8 @@ > > #include "xdp_umem.h" > > #include "xsk.h" > > > > -#define TX_BATCH_SIZE 32 > > -#define MAX_PER_SOCKET_BUDGET (TX_BATCH_SIZE) > > +#define TX_BUDGET_SIZE 32 > > +#define MAX_PER_SOCKET_BUDGET 32 > > that could have stayed as it was before? Sorry, I don't think so because one definition has nothing to do with the other one. Why do we need to bind them together? > > > > > void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) > > { > > @@ -779,7 +779,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > > static int __xsk_generic_xmit(struct sock *sk) > > { > > struct xdp_sock *xs = xdp_sk(sk); > > - u32 max_batch = TX_BATCH_SIZE; > > + u32 max_budget = READ_ONCE(xs->max_tx_budget); > > you're breaking RCT here... What's the RCT? Never heard of this abbreviation. > maybe you could READ_ONCE() the budget right > before the while() loop and keep the declaration only at the top of func? > > also nothing wrong with keeping @max_batch as a name for this budget. > > > bool sent_frame = false; > > struct xdp_desc desc; > > struct sk_buff *skb; > > @@ -797,7 +797,7 @@ static int __xsk_generic_xmit(struct sock *sk) > > goto out; > > > > while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { > > - if (max_batch-- == 0) { > > + if (max_budget-- == 0) { > > err = -EAGAIN; > > goto out; > > } > > @@ -1437,6 +1437,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname, > > mutex_unlock(&xs->mutex); > > return err; > > } > > + case XDP_MAX_TX_BUDGET: > > + { > > + unsigned int budget; > > + > > + if (optlen != sizeof(budget)) > > + return -EINVAL; > > + if (copy_from_sockptr(&budget, optval, sizeof(budget))) > > + return -EFAULT; > > + if (!xs->tx || xs->tx->nentries < TX_BUDGET_SIZE) > > + return -EACCES; > > + > > + WRITE_ONCE(xs->max_tx_budget, > > + clamp(budget, TX_BUDGET_SIZE, xs->tx->nentries)); > > I think it would be better to throw errno when budget set by user is > bigger than nentries. you do it for min case, let's do it for max case as > well and skip the clamp() altogether? Okay. After this, do you think it's necessary to keep the getsockopt()? > > this is rather speculative that someone would ever set such a big budget > but silently setting a lower value than provided might be confusing to me. > i'd rather get an error thrown at my face and find out the valid range. On point. > > > + return 0; > > + } > > default: > > break; > > } > > @@ -1588,6 +1603,21 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname, > > > > return 0; > > } > > + case XDP_MAX_TX_BUDGET: > > + { > > + unsigned int budget; > > + > > + if (len < sizeof(budget)) > > + return -EINVAL; > > + > > + budget = READ_ONCE(xs->max_tx_budget); > > + if (copy_to_user(optval, &budget, sizeof(budget))) > > + return -EFAULT; > > + if (put_user(sizeof(budget), optlen)) > > + return -EFAULT; > > + > > + return 0; > > + } > > default: > > break; > > } > > @@ -1734,6 +1764,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol, > > > > xs = xdp_sk(sk); > > xs->state = XSK_READY; > > + xs->max_tx_budget = TX_BUDGET_SIZE; > > mutex_init(&xs->mutex); > > > > INIT_LIST_HEAD(&xs->map_list); > > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h > > index 44f2bb93e7e6..07c6d21c2f1c 100644 > > --- a/tools/include/uapi/linux/if_xdp.h > > +++ b/tools/include/uapi/linux/if_xdp.h > > @@ -79,6 +79,7 @@ struct xdp_mmap_offsets { > > #define XDP_UMEM_COMPLETION_RING 6 > > #define XDP_STATISTICS 7 > > #define XDP_OPTIONS 8 > > +#define XDP_MAX_TX_BUDGET 9 > > could we have 'SKB' in name? Like XDP_MAX_TX_BUDGET_SKB ? Thanks, Jason > > > > > struct xdp_umem_reg { > > __u64 addr; /* Start of packet data area */ > > -- > > 2.41.3 > >