On 16/04/2025 14.44, Toshiaki Makita wrote:
On 2025/04/15 22:45, Jesper Dangaard Brouer wrote:
In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.
The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).
This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.
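For context, the consumer-side restart can be sketched as follows
(illustrative sketch only - the helper name veth_wake_peer_txq() and its
exact placement in the veth_poll() path are assumptions, not the literal
patch code):

/* Sketch: called from the veth_poll() path after draining entries
 * from rq->xdp_ring; "txq" is the peer txq that veth_xmit() may
 * have stopped via netif_tx_stop_queue().
 */
static void veth_wake_peer_txq(struct netdev_queue *txq)
{
	if (unlikely(netif_tx_queue_stopped(txq)))
		netif_tx_wake_queue(txq);
}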
Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.
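The gate uses qdisc_txq_has_no_queue(); conceptually it tests for the
"noqueue" case, roughly like the sketch below (assumed helper body,
hypothetical name - the real implementation may differ):

/* Sketch: a noqueue qdisc has no ->enqueue, so nothing could requeue
 * a packet after NETDEV_TX_BUSY; caller holds rcu_read_lock().
 */
static inline bool txq_has_no_queue_sketch(const struct netdev_queue *txq)
{
	struct Qdisc *qdisc = rcu_dereference(txq->qdisc);

	return qdisc->enqueue == NULL;
}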
With a qdisc in place (e.g. fq or sfq), this allows Active Queue
Management (AQM) to schedule packets fairly across flows and reduce
collateral damage from elephant flows.
A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.
In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which adapts the number of in-flight bytes to the
physical link rate. However, for virtual drivers like veth there is no
fixed bandwidth constraint - the bottleneck is CPU availability and the
scheduler's ability to run the NAPI thread. It is unclear how effective
BQL would be in this context.
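For comparison, the BQL pattern in a hardware driver looks roughly like
the sketch below (generic netdev BQL API usage, not veth code;
pkts_done/bytes_done are made-up names):

/* ndo_start_xmit path: account bytes handed to the hardware */
netdev_tx_sent_queue(txq, skb->len);

/* TX completion path (IRQ/NAPI cleanup): report finished work so BQL
 * can adapt its in-flight byte limit to the observed drain rate.
 */
netdev_tx_completed_queue(txq, pkts_done, bytes_done);

The limit adapts because completions arrive at the physical link rate;
with veth the drain rate is just how often the scheduler runs the NAPI
consumer.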
This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.
Thank you for the patch.
Reported-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
---
 drivers/net/veth.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 41 insertions(+), 8 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7bb53961c0ea..a419d5e198d8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -308,11 +308,10 @@ static void __veth_xdp_flush(struct veth_rq *rq)
static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
{
if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
- dev_kfree_skb_any(skb);
- return NET_RX_DROP;
+ return NETDEV_TX_BUSY; /* signal qdisc layer */
}
You don't need these braces any more?
if (...)
return ...;
Correct, fixed for V5.
- return NET_RX_SUCCESS;
+ return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
}
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
@@ -346,11 +345,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct veth_rq *rq = NULL;
- int ret = NETDEV_TX_OK;
+ struct netdev_queue *txq;
struct net_device *rcv;
int length = skb->len;
bool use_napi = false;
- int rxq;
+ int ret, rxq;
rcu_read_lock();
rcv = rcu_dereference(priv->peer);
@@ -373,17 +372,41 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
}
skb_tx_timestamp(skb);
-	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
+
+ ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	switch (ret) {
+ case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
if (!use_napi)
dev_sw_netstats_tx_add(dev, 1, length);
else
__veth_xdp_flush(rq);
- } else {
+ break;
+ case NETDEV_TX_BUSY:
+ /* If a qdisc is attached to our virtual device, returning
+ * NETDEV_TX_BUSY is allowed.
+ */
+ txq = netdev_get_tx_queue(dev, rxq);
+
+ if (qdisc_txq_has_no_queue(txq)) {
+ dev_kfree_skb_any(skb);
+ goto drop;
+ }
+ netif_tx_stop_queue(txq);
+ /* Restore Eth hdr pulled by dev_forward_skb/eth_type_trans */
+ __skb_push(skb, ETH_HLEN);
+ if (use_napi)
+ __veth_xdp_flush(rq);
You did not add a packet to the ring.
No need for flush here?
IMHO we do need a flush here.
This is related to the netif_tx_stop_queue(txq) call, which stops the
TXQ; the TXQ then needs to be started again by the NAPI side.
This is needed to handle a very unlikely race, but if the race happens
it can cause the TXQ to stay stopped (blocking all traffic).
Given we arrive at NETDEV_TX_BUSY when the ptr_ring is full, it is very
likely that someone else has already called flush and NAPI veth_poll is
running. Thus, the extra flush will likely be a no-op, as
rx_notify_masked is true.
The race is that, before calling netif_tx_stop_queue(txq), the other CPU
running NAPI veth_poll manages to complete NAPI and empty the ptr_ring.
In this case the flush avoids the race: it will have an effect, because
rx_notify_masked will be false.
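To make the rx_notify_masked interplay concrete, the flush path is
roughly the following (paraphrased from veth.c; treat it as a sketch
rather than a verbatim quote):

static void __veth_xdp_flush(struct veth_rq *rq)
{
	/* Write ptr_ring before reading rx_notify_masked */
	smp_mb();
	if (!READ_ONCE(rq->rx_notify_masked) &&
	    napi_schedule_prep(&rq->xdp_napi)) {
		WRITE_ONCE(rq->rx_notify_masked, true);
		__napi_schedule(&rq->xdp_napi);
	}
}

So when rx_notify_masked is already true the extra flush is a cheap
no-op, and when the consumer just completed NAPI it reschedules
veth_poll, which will restart the TXQ.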
Looking closer at the code: there is still a possible race in
veth_poll, after calling veth_xdp_rcv() and until rq->rx_notify_masked
is set to false (via smp_store_mb). If netif_tx_stop_queue(txq) is
executed in this window, then we still have the race where the TXQ
stays stopped forever. (There is an optional call to xdp_do_flush that
can widen the race window.)
I'll add something in V5 that also handles the second race window.
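One possible shape for that - purely a sketch of the classic
stop-then-recheck pattern on the producer side, not the actual V5 code:

	netif_tx_stop_queue(txq);

	/* Order the queue-stop against the consumer's ring updates;
	 * conceptually paired with the smp_store_mb() in veth_poll().
	 */
	smp_mb();

	/* If the consumer drained the ring before it could observe the
	 * stopped queue, wake the queue here so the TXQ cannot stay
	 * stopped forever. (__ptr_ring_empty() is used loosely here; a
	 * real fix must respect ptr_ring's locking rules.)
	 */
	if (__ptr_ring_empty(&rq->xdp_ring))
		netif_tx_wake_queue(txq);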
--Jesper