Jesper Dangaard Brouer <hawk@xxxxxxxxxx> writes:

> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.

Right, I definitely agree that this is the right solution; having no
backpressure and a fixed-size ringbuffer is obviously not ideal.

> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.

Not sure I like this bit, though; see below.

> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.

So the BQL algorithm tries to tune the maximum number of outstanding
bytes to be ~twice the maximum that can be completed in one batch. Since
we're not really limited by bytes in the same sense here (as you point
out), an approximate equivalent would be the NAPI budget, I guess?

I.e., as a first approximation, we could have veth stop the queue once
the ringbuffer has 2x the NAPI budget packets in it? (Rough sketch of
what I mean further down, after the quoted hunk.)

> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

[...]

> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
> +{
> +	struct Qdisc *q;
> +
> +	q = rcu_dereference(txq->qdisc);
> +	if (q->enqueue)
> +		return true;
> +	else
> +		return false;
> +}

This seems like a pretty ugly layering violation, inspecting the qdisc
like this in the driver?
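To make the 2x-budget idea a bit more concrete, here is a rough,
untested sketch of the kind of thing I have in mind. To be clear, none
of this is in Jesper's patch or in the current veth code: the
'outstanding' counter, the struct/function names and the wake threshold
are all made up for illustration.

#include <linux/atomic.h>
#include <linux/netdevice.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>

/* Stop the txq once ~2x a NAPI budget worth of packets is queued up,
 * instead of waiting for the ptr_ring to actually fill.
 */
#define VETH_TX_STOP_THRESH	(2 * NAPI_POLL_WEIGHT)

/* Hypothetical per-queue state; today's struct veth_rq only carries the
 * ptr_ring, so the counter is an assumption made for this example.
 */
struct veth_sketch_rq {
	struct ptr_ring xdp_ring;
	atomic_t outstanding;		/* produced but not yet consumed */
};

/* Producer side (ndo_start_xmit path). */
static netdev_tx_t veth_sketch_produce(struct veth_sketch_rq *rq,
				       struct netdev_queue *txq,
				       struct sk_buff *skb)
{
	if (ptr_ring_produce(&rq->xdp_ring, skb)) {
		/* Ring completely full: stop the txq and let the stack
		 * requeue instead of dropping (the core of this patch).
		 */
		netif_tx_stop_queue(txq);
		return NETDEV_TX_BUSY;
	}

	/* BQL-ish part: stop early once the backlog hits the threshold,
	 * so the FIFO sitting in front of the qdisc stays short.
	 */
	if (atomic_inc_return(&rq->outstanding) >= VETH_TX_STOP_THRESH)
		netif_tx_stop_queue(txq);

	return NETDEV_TX_OK;
}

/* Consumer side, called at the end of the NAPI poll after 'done'
 * packets have been consumed from the ring.
 */
static void veth_sketch_poll_done(struct veth_sketch_rq *rq,
				  struct netdev_queue *txq, int done)
{
	if (atomic_sub_return(done, &rq->outstanding) < VETH_TX_STOP_THRESH / 2 &&
	    netif_tx_queue_stopped(txq))
		netif_tx_wake_queue(txq);
}

Waking at half the stop threshold is just to avoid ping-ponging the
queue state; the exact constants would obviously need experimentation.

Coming back to the qdisc check itself: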
AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
emits a warning (looks like this, around line 4640 in dev.c):

	net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
			     dev->name);

As this patch shows, it can clearly be appropriate for a virtual device
to stop the queue even if there's no qdisc, so how about we just get rid
of that warning? Then this logic won't be needed at all in the driver..

-Toke
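For reference, a boiled-down paraphrase of what that no-qdisc path in
__dev_queue_xmit() does with a stopped queue - heavily simplified from
net/core/dev.c (locking, xmit-recursion protection and stats are left
out), so don't read it as the actual code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Roughly what the core does for a device without a qdisc: try to
 * transmit directly; if the txq is stopped (or the driver refuses the
 * packet), warn and drop.
 */
static int noqueue_xmit_sketch(struct sk_buff *skb, struct net_device *dev,
			       struct netdev_queue *txq)
{
	int rc = NETDEV_TX_BUSY;

	if (!netif_xmit_stopped(txq)) {
		skb = dev_hard_start_xmit(skb, dev, txq, &rc);
		if (dev_xmit_complete(rc))
			return rc;	/* packet handed to the driver */
	}

	/* This is the warning the suggestion above would remove; the
	 * packet is dropped either way.
	 */
	net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
			     dev->name);
	kfree_skb_list(skb);
	return -ENETDOWN;
}

With the warning gone (or demoted), a virtual driver could stop its txq
without a qdisc attached and without generating log noise, which is all
the veth side would need here.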