On Tue, Jul 1, 2025 at 2:57 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote: > > On Mon, Jun 16, 2025 at 04:25:17PM +0800, Jason Wang wrote: > > This patch implements in order support for both split virtqueue and > > packed virtqueue. > > I'd like to see more motivation for this work, documented. > It's not really performance, not as it stands, see below: > > > > > Benchmark with KVM guest + testpmd on the host shows: > > > > For split virtqueue: no obvious differences were noticed > > > > For packed virtqueue: > > > > 1) RX gets 3.1% PPS improvements from 6.3 Mpps to 6.5 Mpps > > 2) TX gets 4.6% PPS improvements from 8.6 Mpps to 9.0 Mpps > > > > That's a very modest improvement for a lot of code. > I also note you put in some batching just for in-order. > Which could also explain the gains maybe? > What if you just put in a simple implementation with no > batching tricks? do you still see a gain? It is used to implement the batch used updating. """ Some devices always use descriptors in the same order in which they have been made available. These devices can offer the VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows devices to notify the use of a batch of buffers to the driver by only writing out a single used ring entry with the id corresponding to the head entry of the descriptor chain describing the last buffer in the batch. """ DPDK implements this behavior, so it's a must for the virtio driver. > Does any hardware implement this? Maybe that can demonstrate > bigger gains. Maybe but I don't have one in my hand. For performance, I think it should be sufficient as a starter. I can say in the next version something like "more optimizations could be done on top" Note that the patch that introduces packed virtqueue, there's not even any numbers: commit 1ce9e6055fa0a9043405c5604cf19169ec5379ff Author: Tiwei Bie <tiwei.bie@xxxxxxxxx> Date: Wed Nov 21 18:03:27 2018 +0800 virtio_ring: introduce packed ring support Introduce the packed ring support. Packed ring can only be created by vring_create_virtqueue() and each chunk of packed ring will be allocated individually. Packed ring can not be created on preallocated memory by vring_new_virtqueue() or the likes currently. Signed-off-by: Tiwei Bie <tiwei.bie@xxxxxxxxx> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx> > > > > Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx> > > --- > > drivers/virtio/virtio_ring.c | 423 +++++++++++++++++++++++++++++++++-- > > 1 file changed, 402 insertions(+), 21 deletions(-) > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > index 27a9459a0555..21d456392ba0 100644 > > --- a/drivers/virtio/virtio_ring.c > > +++ b/drivers/virtio/virtio_ring.c > > @@ -70,11 +70,14 @@ > > enum vq_layout { > > SPLIT = 0, > > PACKED, > > + SPLIT_IN_ORDER, > > + PACKED_IN_ORDER, > > VQ_TYPE_MAX, > > }; > > > > struct vring_desc_state_split { > > void *data; /* Data for callback. */ > > + u32 total_len; /* Buffer Length */ > > > > /* Indirect desc table and extra table, if any. These two will be > > * allocated together. So we won't stress more to the memory allocator. > > @@ -84,6 +87,7 @@ struct vring_desc_state_split { > > > > struct vring_desc_state_packed { > > void *data; /* Data for callback. */ > > + u32 total_len; /* Buffer Length */ > > > > /* Indirect desc table and extra table, if any. These two will be > > * allocated together. So we won't stress more to the memory allocator. > > We are bloating up the cache footprint for everyone, > so there's a chance of regressions. > Pls include benchmark for in order off, to make sure we > are not regressing. Ok. > How big was the ring? 256. > Worth trying with a biggish one, where there is more cache > pressure. Ok. > > > Why not have a separate state for in-order? It can work. > > > > > @@ -206,6 +210,12 @@ struct vring_virtqueue { > > > > /* Head of free buffer list. */ > > unsigned int free_head; > > + > > + /* Head of the batched used buffers, vq->num means no batching */ > > + unsigned int batch_head; > > + > > + unsigned int batch_len; > > + > > Are these two only used for in-order? Please document that. Yes, I will do that. > I also want some documentation about the batching trickery > used please. > What is batched, when, how is batching flushed, why are we > only batching in-order ... I'm not sure I get things like this, what you want seems to be the behaviour of the device which has been stated by the spec or I may miss something here. > > > > > > /* Number we've added since last sync. */ > > unsigned int num_added; > > > > @@ -256,10 +266,14 @@ static void vring_free(struct virtqueue *_vq); > > > > #define to_vvq(_vq) container_of_const(_vq, struct vring_virtqueue, vq) > > > > - > > static inline bool virtqueue_is_packed(const struct vring_virtqueue *vq) > > { > > - return vq->layout == PACKED; > > + return vq->layout == PACKED || vq->layout == PACKED_IN_ORDER; > > +} > > + > > +static inline bool virtqueue_is_in_order(const struct vring_virtqueue *vq) > > +{ > > + return vq->layout == SPLIT_IN_ORDER || vq->layout == PACKED_IN_ORDER; > > } > > > > static bool virtqueue_use_indirect(const struct vring_virtqueue *vq, > > @@ -570,7 +584,7 @@ static inline int virtqueue_add_split(struct vring_virtqueue *vq, > > struct vring_desc_extra *extra; > > struct scatterlist *sg; > > struct vring_desc *desc; > > - unsigned int i, n, c, avail, descs_used, err_idx; > > + unsigned int i, n, c, avail, descs_used, err_idx, total_len = 0; > > > I would add a comment here: > > /* Total length for in-order */ > unsigned int total_len = 0; Ok. Thanks