On Thu, Jul 3, 2025 at 9:50 PM Jonah Palmer <jonah.palmer@xxxxxxxxxx> wrote:
>
> On 7/2/25 11:13 PM, Jason Wang wrote:
> > On Wed, Jul 2, 2025 at 6:57 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Jul 02, 2025 at 05:29:18PM +0800, Jason Wang wrote:
> >>> On Tue, Jul 1, 2025 at 2:57 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Mon, Jun 16, 2025 at 04:25:17PM +0800, Jason Wang wrote:
> >>>>> This patch implements in order support for both split virtqueue and
> >>>>> packed virtqueue.
> >>>>
> >>>> I'd like to see more motivation for this work, documented.
> >>>> It's not really performance, not as it stands, see below:
> >>>>
> >>>>> Benchmark with KVM guest + testpmd on the host shows:
> >>>>>
> >>>>> For split virtqueue: no obvious differences were noticed
> >>>>>
> >>>>> For packed virtqueue:
> >>>>>
> >>>>> 1) RX gets 3.1% PPS improvements from 6.3 Mpps to 6.5 Mpps
> >>>>> 2) TX gets 4.6% PPS improvements from 8.6 Mpps to 9.0 Mpps
> >>>>
> >>>> That's a very modest improvement for a lot of code.
> >>>> I also note you put in some batching just for in-order.
> >>>> Which could also explain the gains, maybe?
> >>>> What if you just put in a simple implementation with no
> >>>> batching tricks? Do you still see a gain?
> >>>
> >>> It is used to implement the batched updating of used entries:
> >>>
> >>> """
> >>> Some devices always use descriptors in the same order in which they
> >>> have been made available. These devices can offer the
> >>> VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows
> >>> devices to notify the use of a batch of buffers to the driver by only
> >>> writing out a single used ring entry with the id corresponding to the
> >>> head entry of the descriptor chain describing the last buffer in the
> >>> batch.
> >>> """
> >>>
> >>> DPDK implements this behavior, so it's a must for the virtio driver.
> >>>
> >>>> Does any hardware implement this? Maybe that can demonstrate
> >>>> bigger gains.
> >>>
> >>> Maybe, but I don't have one at hand.
> >>>
> >>> For performance, I think it should be sufficient as a starter. I can
> >>> say in the next version something like "more optimizations could be
> >>> done on top".
> >>
> >> What are some optimizations you have in mind?
> >
> > One thing on my mind: the spec currently says:
> >
> > """
> > If negotiated, this knowledge allows devices to notify the use of a
> > batch of buffers to the driver by only writing out a single used
> > descriptor with the Buffer ID corresponding to the last descriptor in
> > the batch.
> > """
> >
> > If the device writes the last descriptor ID instead of the buffer ID
> > and skips that number of descriptors in the used ring, then for the
> > split virtqueue the avail ring is not needed anymore: the device can
> > learn the availability of buffers via avail_idx alone. In this way, we
> > completely eliminate accesses to the available ring. This reduces
> > memory accesses, which are expensive for both:
> >
> > 1) kernel vhost-net, where small user space memory accesses are expensive
> > 2) hardware PCI transactions
> >
> > Does this make sense?
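To illustrate the idea (a rough device-side sketch with made-up names
like device_vq and next_avail; it is not part of this series): with
IN_ORDER the device can walk the descriptor table sequentially and
only ever read avail->idx, never the avail ring entries themselves:

/*
 * Hypothetical sketch: under IN_ORDER, buffers are made available and
 * used strictly sequentially, so the next chain always starts right
 * after the previous one.  avail->idx alone tells the device how many
 * buffers exist.  (Assumes one descriptor per buffer for brevity.)
 */
static bool fetch_next_buffer(struct device_vq *dvq, struct vring_desc *out)
{
        u16 avail_idx = le16_to_cpu(dvq->avail->idx);   /* the only avail access */

        if (dvq->next_avail == avail_idx)
                return false;                           /* nothing new */

        /* In order: the next chain starts right after the previous one. */
        *out = dvq->desc[dvq->next_avail++ % dvq->num];
        return true;
}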
> >>
> >>
> >>> Note that the patch that introduced packed virtqueue didn't even
> >>> include any numbers:
> >>>
> >>> commit 1ce9e6055fa0a9043405c5604cf19169ec5379ff
> >>> Author: Tiwei Bie <tiwei.bie@xxxxxxxxx>
> >>> Date:   Wed Nov 21 18:03:27 2018 +0800
> >>>
> >>>     virtio_ring: introduce packed ring support
> >>>
> >>>     Introduce the packed ring support. Packed ring can only be
> >>>     created by vring_create_virtqueue() and each chunk of packed
> >>>     ring will be allocated individually. Packed ring can not be
> >>>     created on preallocated memory by vring_new_virtqueue() or
> >>>     the likes currently.
> >>>
> >>>     Signed-off-by: Tiwei Bie <tiwei.bie@xxxxxxxxx>
> >>>     Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> >>
> >> I think the assumption there was that Intel has hardware that
> >> requires packed. That's why Dave merged this.
> >
> > I think it should, according to this Qemu patch:
> >
> > commit c03213fdc9d7b680cc575cd1e725750702a10b09
> > Author: Jonah Palmer <jonah.palmer@xxxxxxxxxx>
> > Date:   Wed Jul 10 08:55:18 2024 -0400
> >
> >     vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
> >
> >     Add support for the VIRTIO_F_IN_ORDER feature across a variety of vhost
> >     devices.
> >
> >     The inclusion of VIRTIO_F_IN_ORDER in the feature bits arrays for these
> >     devices ensures that the backend is capable of offering and providing
> >     support for this feature, and that it can be disabled if the backend
> >     does not support it.
> >
> >     Acked-by: Eugenio Pérez <eperezma@xxxxxxxxxx>
> >     Signed-off-by: Jonah Palmer <jonah.palmer@xxxxxxxxxx>
> >     Message-Id: <20240710125522.4168043-6-jonah.palmer@xxxxxxxxxx>
> >     Reviewed-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >     Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >
> > Adding Jonah for more thought here.
>
> Hi. By "it should", are you referring to Intel having hardware requiring
> this feature? Sorry, just having a bit of trouble following.

I meant that I guess you may have hardware that supports IN_ORDER.

> In any case, this looks like a good first implementation that can be
> used as a foundation for future implementations to further improve its
> performance.

Right.

> This was the thought process I had when I implemented this feature in
> Qemu. That is, get a solid framework down that supports this feature
> (which had some small, modest performance improvements for some devices)
> and then add on future patches (perhaps device-specific and/or general
> implementations) that take advantage of the fact that data follows this
> FIFO model.
>
> As Jason mentioned, one such future implementation could remove the need
> for the avail ring in the split VQ case, since the device wouldn't need
> to read it to learn which descriptor comes next.

Right.

> Another example could be with vhost-net / vhost-vdpa. Currently each
> queue tracks 3 separate indices and keeps a per-descriptor bookkeeping
> table to handle buffers completing out of order. If the backend knows
> data is FIFO, we might be able to drop these trackers and just use a
> head and tail counter with a single contiguous iovec ring. This could
> result in a smaller cache footprint and fewer DMAs.

Exactly. So I had a quick hack on vhost-net to implement this
optimization (it bypasses reading the available ring). It gives us a
15.8% improvement with this series. DPDK gives me a 10% PPS
improvement as well.

https://github.com/jasowang/net/commit/719ce1d24e3369cde9c5fcc8c6d7518ca9022e5f

I think this should be sufficient to move forward even without real
hardware support. Maybe the vhost-net part can go first.
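Just to sketch what you describe (made-up types, not the actual hack
above): FIFO completion means a head/tail pair over one contiguous
ring is enough, with no per-descriptor bookkeeping table:

/*
 * Hypothetical sketch: with IN_ORDER, buffers complete in the order
 * they were made available, so retiring N buffers is just a tail
 * advance over a single contiguous ring.
 */
struct inorder_queue {
        struct iovec *ring;     /* one contiguous ring of iovecs */
        unsigned int num;       /* ring size */
        unsigned int head;      /* next slot to fill */
        unsigned int tail;      /* oldest in-flight buffer */
};

static void complete_buffers(struct inorder_queue *q, unsigned int count)
{
        /* FIFO: no search, no table lookup, just advance the tail. */
        q->tail = (q->tail + count) % q->num;
}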
Thanks

> Jonah
>
> >>
> >>>>
> >>>>> Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
> >>>>> ---
> >>>>>  drivers/virtio/virtio_ring.c | 423 +++++++++++++++++++++++++++++++++--
> >>>>>  1 file changed, 402 insertions(+), 21 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>>>> index 27a9459a0555..21d456392ba0 100644
> >>>>> --- a/drivers/virtio/virtio_ring.c
> >>>>> +++ b/drivers/virtio/virtio_ring.c
> >>>>> @@ -70,11 +70,14 @@
> >>>>>  enum vq_layout {
> >>>>>          SPLIT = 0,
> >>>>>          PACKED,
> >>>>> +        SPLIT_IN_ORDER,
> >>>>> +        PACKED_IN_ORDER,
> >>>>>          VQ_TYPE_MAX,
> >>>>>  };
> >>>>>
> >>>>>  struct vring_desc_state_split {
> >>>>>          void *data;             /* Data for callback. */
> >>>>> +        u32 total_len;          /* Buffer Length */
> >>>>>
> >>>>>          /* Indirect desc table and extra table, if any. These two will be
> >>>>>           * allocated together. So we won't stress more to the memory allocator.
> >>>>> @@ -84,6 +87,7 @@ struct vring_desc_state_split {
> >>>>>
> >>>>>  struct vring_desc_state_packed {
> >>>>>          void *data;             /* Data for callback. */
> >>>>> +        u32 total_len;          /* Buffer Length */
> >>>>>
> >>>>>          /* Indirect desc table and extra table, if any. These two will be
> >>>>>           * allocated together. So we won't stress more to the memory allocator.
> >>>>
> >>>> We are bloating up the cache footprint for everyone,
> >>>> so there's a chance of regressions.
> >>>> Pls include a benchmark with in-order off, to make sure we
> >>>> are not regressing.
> >>>
> >>> Ok.
> >>>
> >>>> How big was the ring?
> >>>
> >>> 256.
> >>
> >> that is very modest, you want to fill at least one cache way,
> >> preferably more.
> >
> > I can test larger queue sizes.
> >
> >>>> Worth trying with a biggish one, where there is more cache
> >>>> pressure.
> >>>
> >>> Ok.
> >>>
> >>>> Why not have a separate state for in-order?
> >>>
> >>> It can work.
> >>>
> >>>>> @@ -206,6 +210,12 @@ struct vring_virtqueue {
> >>>>>
> >>>>>          /* Head of free buffer list. */
> >>>>>          unsigned int free_head;
> >>>>> +
> >>>>> +        /* Head of the batched used buffers, vq->num means no batching */
> >>>>> +        unsigned int batch_head;
> >>>>> +
> >>>>> +        unsigned int batch_len;
> >>>>> +
> >>>>
> >>>> Are these two only used for in-order? Please document that.
> >>>
> >>> Yes, I will do that.
> >>>
> >>>> I also want some documentation about the batching trickery
> >>>> used, please.
> >>>> What is batched, when, how is batching flushed, why are we
> >>>> only batching in-order ...
> >>>
> >>> I'm not sure I get this: what you want seems to be the behaviour
> >>> of the device, which has been stated by the spec, or I may be
> >>> missing something here.
> >>
> >> "a single used ring entry with the id corresponding to the
> >> head entry of the descriptor chain describing the last buffer in the
> >> batch"
> >> ?
> >
> > Exactly.
> >
> >> so together they form this used ring entry describing the last buffer?
> >> "head" is the id and "len" the length?
> >
> > Yes.
> >
> >> maybe
> >>
> >> /*
> >>  * With IN_ORDER, devices write a single used ring entry with
> >>  * the id corresponding to the head entry of the descriptor chain
> >>  * describing the last buffer in the batch
> >>  */
> >> struct used_entry {
> >>         u32 id;
> >>         u32 len;
> >> } batch_last;
> >>
> >> ?
> >
> > This should be fine.
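And for the flush side, roughly (a sketch with hypothetical names —
next_used_head, report_used(), next_in_order() — the actual patch
differs in the details):

/*
 * Sketch: one used entry can retire a whole batch.  Walk the
 * outstanding heads in order until the id from the used entry is
 * reached; only that last buffer carries a device-written len, so
 * earlier buffers fall back to the length recorded at add time
 * (which is what the new total_len field holds).
 */
static void flush_batch_in_order(struct vring_virtqueue *vq,
                                 struct used_entry *batch_last)
{
        u32 head, len;

        do {
                head = vq->next_used_head;              /* hypothetical counter */
                len = head == batch_last->id ?
                      batch_last->len :
                      vq->split.desc_state[head].total_len;
                report_used(vq, head, len);             /* hypothetical helper */
                vq->next_used_head = next_in_order(vq, head);   /* hypothetical */
        } while (head != batch_last->id);
}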
> >>
> >>>>
> >>>>>          /* Number we've added since last sync. */
> >>>>>          unsigned int num_added;
> >>>>>
> >>>>> @@ -256,10 +266,14 @@ static void vring_free(struct virtqueue *_vq);
> >>>>>
> >>>>>  #define to_vvq(_vq) container_of_const(_vq, struct vring_virtqueue, vq)
> >>>>>
> >>>>> -
> >>>>>  static inline bool virtqueue_is_packed(const struct vring_virtqueue *vq)
> >>>>>  {
> >>>>> -        return vq->layout == PACKED;
> >>>>> +        return vq->layout == PACKED || vq->layout == PACKED_IN_ORDER;
> >>>>> +}
> >>>>> +
> >>>>> +static inline bool virtqueue_is_in_order(const struct vring_virtqueue *vq)
> >>>>> +{
> >>>>> +        return vq->layout == SPLIT_IN_ORDER || vq->layout == PACKED_IN_ORDER;
> >>>>>  }
> >>>>>
> >>>>>  static bool virtqueue_use_indirect(const struct vring_virtqueue *vq,
> >>>>> @@ -570,7 +584,7 @@ static inline int virtqueue_add_split(struct vring_virtqueue *vq,
> >>>>>          struct vring_desc_extra *extra;
> >>>>>          struct scatterlist *sg;
> >>>>>          struct vring_desc *desc;
> >>>>> -        unsigned int i, n, c, avail, descs_used, err_idx;
> >>>>> +        unsigned int i, n, c, avail, descs_used, err_idx, total_len = 0;
> >>>>
> >>>> I would add a comment here:
> >>>>
> >>>>         /* Total length for in-order */
> >>>>         unsigned int total_len = 0;
> >>>
> >>> Ok.
> >>>
> >>> Thanks