Jesper Dangaard Brouer <hawk@xxxxxxxxxx> writes:

> In production, we're seeing TX drops on veth devices when the ptr_ring
> fills up. This can occur when NAPI mode is enabled, though it's
> relatively rare. However, with threaded NAPI - which we use in
> production - the drops become significantly more frequent.
>
> The underlying issue is that with threaded NAPI, the consumer often runs
> on a different CPU than the producer. This increases the likelihood of
> the ring filling up before the consumer gets scheduled, especially under
> load, leading to drops in veth_xmit() (ndo_start_xmit()).
>
> This patch introduces backpressure by returning NETDEV_TX_BUSY when the
> ring is full, signaling the qdisc layer to requeue the packet. The txq
> (netdev queue) is stopped in this condition and restarted once
> veth_poll() drains entries from the ring, ensuring coordination between
> NAPI and qdisc.

Right, I definitely agree that this is the right solution; having no
backpressure and a fixed-size ringbuffer is obviously not ideal.

> Backpressure is only enabled when a qdisc is attached. Without a qdisc,
> the driver retains its original behavior - dropping packets immediately
> when the ring is full. This avoids unexpected behavior changes in setups
> without a configured qdisc.

Not sure I like this bit, though; see below.

> With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
> (AQM) to fairly schedule packets across flows and reduce collateral
> damage from elephant flows.
>
> A known limitation of this approach is that the full ring sits in front
> of the qdisc layer, effectively forming a FIFO buffer that introduces
> base latency. While AQM still improves fairness and mitigates flow
> dominance, the latency impact is measurable.
>
> In hardware drivers, this issue is typically addressed using BQL (Byte
> Queue Limits), which tracks in-flight bytes needed based on physical link
> rate. However, for virtual drivers like veth, there is no fixed bandwidth
> constraint - the bottleneck is CPU availability and the scheduler's ability
> to run the NAPI thread. It is unclear how effective BQL would be in this
> context.

So the BQL algorithm tries to tune the maximum number of outstanding
bytes to be ~twice the maximum that can be completed in one batch. Since
we're not really limited by bytes in the same sense here (as you point
out), an approximate equivalent would be the NAPI budget, I guess?

I.e., as a first approximation, we could have veth stop the queue once
the ringbuffer has 2x the NAPI budget packets in it? (Rough sketch of
what I mean further down, after the quoted hunk.)

> This patch serves as a first step toward addressing TX drops. Future work
> may explore adapting a BQL-like mechanism to better suit virtual devices
> like veth.
>
> Reported-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

[...]

> +/* Does specific txq have a real qdisc attached? - see noqueue_init() */
> +static inline bool txq_has_qdisc(struct netdev_queue *txq)
> +{
> +	struct Qdisc *q;
> +
> +	q = rcu_dereference(txq->qdisc);
> +	if (q->enqueue)
> +		return true;
> +	else
> +		return false;
> +}

This seems like a pretty ugly layering violation, inspecting the qdisc
like this in the driver?
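To make the 2x-budget idea a bit more concrete, here is a rough,
untested sketch of the kind of thing I have in mind. To be clear, none
of this is in Jesper's patch or in the current veth code: the
'outstanding' counter, the struct/function names and the wake threshold
are all made up for illustration.

#include <linux/atomic.h>
#include <linux/netdevice.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>

/* Stop the txq once ~2x a NAPI budget worth of packets is queued up,
 * instead of waiting for the ptr_ring to actually fill.
 */
#define VETH_TX_STOP_THRESH	(2 * NAPI_POLL_WEIGHT)

/* Hypothetical per-queue state; today's struct veth_rq only carries the
 * ptr_ring, so the counter is an assumption made for this example.
 */
struct veth_sketch_rq {
	struct ptr_ring xdp_ring;
	atomic_t outstanding;		/* produced but not yet consumed */
};

/* Producer side (ndo_start_xmit path). */
static netdev_tx_t veth_sketch_produce(struct veth_sketch_rq *rq,
				       struct netdev_queue *txq,
				       struct sk_buff *skb)
{
	if (ptr_ring_produce(&rq->xdp_ring, skb)) {
		/* Ring completely full: stop the txq and let the stack
		 * requeue instead of dropping (the core of this patch).
		 */
		netif_tx_stop_queue(txq);
		return NETDEV_TX_BUSY;
	}

	/* BQL-ish part: stop early once the backlog hits the threshold,
	 * so the FIFO sitting in front of the qdisc stays short.
	 */
	if (atomic_inc_return(&rq->outstanding) >= VETH_TX_STOP_THRESH)
		netif_tx_stop_queue(txq);

	return NETDEV_TX_OK;
}

/* Consumer side, called at the end of the NAPI poll after 'done'
 * packets have been consumed from the ring.
 */
static void veth_sketch_poll_done(struct veth_sketch_rq *rq,
				  struct netdev_queue *txq, int done)
{
	if (atomic_sub_return(done, &rq->outstanding) < VETH_TX_STOP_THRESH / 2 &&
	    netif_tx_queue_stopped(txq))
		netif_tx_wake_queue(txq);
}

Waking at half the stop threshold is just to avoid ping-ponging the
queue state; the exact constants would obviously need experimentation.

Coming back to the qdisc check itself: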
AFAICT, __dev_queue_xmit() turns a stopped queue into drops anyway, but
emits a warning (looks like this, around line 4640 in dev.c):

	net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
			     dev->name);

As this patch shows, it can clearly be appropriate for a virtual device
to stop the queue even if there's no qdisc, so how about we just get rid
of that warning? Then this logic won't be needed at all in the driver..

-Toke
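For reference, a boiled-down paraphrase of what that no-qdisc path in
__dev_queue_xmit() does with a stopped queue - heavily simplified from
net/core/dev.c (locking, xmit-recursion protection and stats are left
out), so don't read it as the actual code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Roughly what the core does for a device without a qdisc: try to
 * transmit directly; if the txq is stopped (or the driver refuses the
 * packet), warn and drop.
 */
static int noqueue_xmit_sketch(struct sk_buff *skb, struct net_device *dev,
			       struct netdev_queue *txq)
{
	int rc = NETDEV_TX_BUSY;

	if (!netif_xmit_stopped(txq)) {
		skb = dev_hard_start_xmit(skb, dev, txq, &rc);
		if (dev_xmit_complete(rc))
			return rc;	/* packet handed to the driver */
	}

	/* This is the warning the suggestion above would remove; the
	 * packet is dropped either way.
	 */
	net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
			     dev->name);
	kfree_skb_list(skb);
	return -ENETDOWN;
}

With the warning gone (or demoted), a virtual driver could stop its txq
without a qdisc attached and without generating log noise, which is all
the veth side would need here.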