On Thu, Jul 3, 2025 at 9:50 PM Jonah Palmer <jonah.palmer@xxxxxxxxxx> wrote:
>
> On 7/2/25 11:13 PM, Jason Wang wrote:
> > On Wed, Jul 2, 2025 at 6:57 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Jul 02, 2025 at 05:29:18PM +0800, Jason Wang wrote:
> >>> On Tue, Jul 1, 2025 at 2:57 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Mon, Jun 16, 2025 at 04:25:17PM +0800, Jason Wang wrote:
> >>>>> This patch implements in order support for both split virtqueue and
> >>>>> packed virtqueue.
> >>>>
> >>>> I'd like to see more motivation for this work, documented.
> >>>> It's not really performance, not as it stands, see below:
> >>>>
> >>>>> Benchmark with KVM guest + testpmd on the host shows:
> >>>>>
> >>>>> For split virtqueue: no obvious differences were noticed
> >>>>>
> >>>>> For packed virtqueue:
> >>>>>
> >>>>> 1) RX gets 3.1% PPS improvements from 6.3 Mpps to 6.5 Mpps
> >>>>> 2) TX gets 4.6% PPS improvements from 8.6 Mpps to 9.0 Mpps
> >>>>
> >>>> That's a very modest improvement for a lot of code.
> >>>> I also note you put in some batching just for in-order.
> >>>> Which could also explain the gains, maybe?
> >>>> What if you just put in a simple implementation with no
> >>>> batching tricks? Do you still see a gain?
> >>>
> >>> It is used to implement the batched updating of used entries:
> >>>
> >>> """
> >>> Some devices always use descriptors in the same order in which they
> >>> have been made available. These devices can offer the
> >>> VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows
> >>> devices to notify the use of a batch of buffers to the driver by only
> >>> writing out a single used ring entry with the id corresponding to the
> >>> head entry of the descriptor chain describing the last buffer in the
> >>> batch.
> >>> """
> >>>
> >>> DPDK implements this behavior, so it's a must for the virtio driver.
> >>>
> >>>> Does any hardware implement this? Maybe that can demonstrate
> >>>> bigger gains.
> >>>
> >>> Maybe, but I don't have one at hand.
> >>>
> >>> For performance, I think it should be sufficient as a starter. I can
> >>> say in the next version something like "more optimizations could be
> >>> done on top".
> >>
> >> What are some optimizations you have in mind?
> >
> > One thing on my mind: the spec currently says:
> >
> > """
> > If negotiated, this knowledge allows devices to notify the use of a
> > batch of buffers to the driver by only writing out a single used
> > descriptor with the Buffer ID corresponding to the last descriptor in
> > the batch.
> > """
> >
> > If the device writes the last descriptor ID instead of the buffer ID
> > and skips that number of descriptors in the used ring, then for the
> > split virtqueue the avail ring is not needed anymore: the device can
> > learn the availability of buffers via avail_idx alone. In this way, we
> > completely eliminate accesses to the available ring. This reduces
> > memory accesses, which are expensive for both:
> >
> > 1) kernel vhost-net, where small user space memory accesses are expensive
> > 2) hardware PCI transactions
> >
> > Does this make sense?
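To illustrate the idea (a rough device-side sketch with made-up names
like device_vq and next_avail; it is not part of this series): with
IN_ORDER the device can walk the descriptor table sequentially and
only ever read avail->idx, never the avail ring entries themselves:

/*
 * Hypothetical sketch: under IN_ORDER, buffers are made available and
 * used strictly sequentially, so the next chain always starts right
 * after the previous one.  avail->idx alone tells the device how many
 * buffers exist.  (Assumes one descriptor per buffer for brevity.)
 */
static bool fetch_next_buffer(struct device_vq *dvq, struct vring_desc *out)
{
        u16 avail_idx = le16_to_cpu(dvq->avail->idx);   /* the only avail access */

        if (dvq->next_avail == avail_idx)
                return false;                           /* nothing new */

        /* In order: the next chain starts right after the previous one. */
        *out = dvq->desc[dvq->next_avail++ % dvq->num];
        return true;
}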
> >>
> >>
> >>> Note that the patch that introduced packed virtqueue didn't even
> >>> include any numbers:
> >>>
> >>> commit 1ce9e6055fa0a9043405c5604cf19169ec5379ff
> >>> Author: Tiwei Bie <tiwei.bie@xxxxxxxxx>
> >>> Date:   Wed Nov 21 18:03:27 2018 +0800
> >>>
> >>>     virtio_ring: introduce packed ring support
> >>>
> >>>     Introduce the packed ring support. Packed ring can only be
> >>>     created by vring_create_virtqueue() and each chunk of packed
> >>>     ring will be allocated individually. Packed ring can not be
> >>>     created on preallocated memory by vring_new_virtqueue() or
> >>>     the likes currently.
> >>>
> >>>     Signed-off-by: Tiwei Bie <tiwei.bie@xxxxxxxxx>
> >>>     Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
> >>
> >> I think the assumption there was that Intel has hardware that
> >> requires packed. That's why Dave merged this.
> >
> > I think it should, according to this Qemu patch:
> >
> > commit c03213fdc9d7b680cc575cd1e725750702a10b09
> > Author: Jonah Palmer <jonah.palmer@xxxxxxxxxx>
> > Date:   Wed Jul 10 08:55:18 2024 -0400
> >
> >     vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
> >
> >     Add support for the VIRTIO_F_IN_ORDER feature across a variety of vhost
> >     devices.
> >
> >     The inclusion of VIRTIO_F_IN_ORDER in the feature bits arrays for these
> >     devices ensures that the backend is capable of offering and providing
> >     support for this feature, and that it can be disabled if the backend
> >     does not support it.
> >
> >     Acked-by: Eugenio Pérez <eperezma@xxxxxxxxxx>
> >     Signed-off-by: Jonah Palmer <jonah.palmer@xxxxxxxxxx>
> >     Message-Id: <20240710125522.4168043-6-jonah.palmer@xxxxxxxxxx>
> >     Reviewed-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >     Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >
> > Adding Jonah for more thought here.
>
> Hi. By "it should", are you referring to Intel having hardware requiring
> this feature? Sorry, just having a bit of trouble following.

I meant that I guess you may have hardware that supports IN_ORDER.

> In any case, this looks like a good first implementation that can be
> used as a foundation for future implementations to further improve its
> performance.

Right.

> This was the thought process I had when I implemented this feature in
> Qemu. That is, get a solid framework down that supports this feature
> (which had some small, modest performance improvements for some devices)
> and then add on future patches (perhaps device-specific and/or general
> implementations) that take advantage of the fact that data follows this
> FIFO model.
>
> As Jason mentioned, one such future implementation could remove the need
> for the avail ring in the split VQ case, since the device wouldn't need
> to read it to learn which descriptor comes next.

Right.

> Another example could be with vhost-net / vhost-vdpa. Currently each
> queue tracks 3 separate indices and keeps a per-descriptor bookkeeping
> table to handle buffers completing out of order. If the backend knows
> data is FIFO, we might be able to drop these trackers and just use a
> head and tail counter with a single contiguous iovec ring. This could
> result in a smaller cache footprint and fewer DMAs.

Exactly. So I had a quick hack on vhost-net to implement this
optimization (it bypasses reading the available ring). It gives us a
15.8% improvement with this series. DPDK gives me a 10% PPS
improvement as well.

https://github.com/jasowang/net/commit/719ce1d24e3369cde9c5fcc8c6d7518ca9022e5f

I think this should be sufficient to move forward even without real
hardware support. Maybe the vhost-net part can go first.
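Just to sketch what you describe (made-up types, not the actual hack
above): FIFO completion means a head/tail pair over one contiguous
ring is enough, with no per-descriptor bookkeeping table:

/*
 * Hypothetical sketch: with IN_ORDER, buffers complete in the order
 * they were made available, so retiring N buffers is just a tail
 * advance over a single contiguous ring.
 */
struct inorder_queue {
        struct iovec *ring;     /* one contiguous ring of iovecs */
        unsigned int num;       /* ring size */
        unsigned int head;      /* next slot to fill */
        unsigned int tail;      /* oldest in-flight buffer */
};

static void complete_buffers(struct inorder_queue *q, unsigned int count)
{
        /* FIFO: no search, no table lookup, just advance the tail. */
        q->tail = (q->tail + count) % q->num;
}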
Thanks

> Jonah
>
> >>
> >>>>
> >>>>> Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
> >>>>> ---
> >>>>>  drivers/virtio/virtio_ring.c | 423 +++++++++++++++++++++++++++++++++--
> >>>>>  1 file changed, 402 insertions(+), 21 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>>>> index 27a9459a0555..21d456392ba0 100644
> >>>>> --- a/drivers/virtio/virtio_ring.c
> >>>>> +++ b/drivers/virtio/virtio_ring.c
> >>>>> @@ -70,11 +70,14 @@
> >>>>>  enum vq_layout {
> >>>>>          SPLIT = 0,
> >>>>>          PACKED,
> >>>>> +        SPLIT_IN_ORDER,
> >>>>> +        PACKED_IN_ORDER,
> >>>>>          VQ_TYPE_MAX,
> >>>>>  };
> >>>>>
> >>>>>  struct vring_desc_state_split {
> >>>>>          void *data;             /* Data for callback. */
> >>>>> +        u32 total_len;          /* Buffer Length */
> >>>>>
> >>>>>          /* Indirect desc table and extra table, if any. These two will be
> >>>>>           * allocated together. So we won't stress more to the memory allocator.
> >>>>> @@ -84,6 +87,7 @@ struct vring_desc_state_split {
> >>>>>
> >>>>>  struct vring_desc_state_packed {
> >>>>>          void *data;             /* Data for callback. */
> >>>>> +        u32 total_len;          /* Buffer Length */
> >>>>>
> >>>>>          /* Indirect desc table and extra table, if any. These two will be
> >>>>>           * allocated together. So we won't stress more to the memory allocator.
> >>>>
> >>>> We are bloating up the cache footprint for everyone,
> >>>> so there's a chance of regressions.
> >>>> Pls include a benchmark with in-order off, to make sure we
> >>>> are not regressing.
> >>>
> >>> Ok.
> >>>
> >>>> How big was the ring?
> >>>
> >>> 256.
> >>
> >> that is very modest, you want to fill at least one cache way,
> >> preferably more.
> >
> > I can test larger queue sizes.
> >
> >>>> Worth trying with a biggish one, where there is more cache
> >>>> pressure.
> >>>
> >>> Ok.
> >>>
> >>>> Why not have a separate state for in-order?
> >>>
> >>> It can work.
> >>>
> >>>>> @@ -206,6 +210,12 @@ struct vring_virtqueue {
> >>>>>
> >>>>>          /* Head of free buffer list. */
> >>>>>          unsigned int free_head;
> >>>>> +
> >>>>> +        /* Head of the batched used buffers, vq->num means no batching */
> >>>>> +        unsigned int batch_head;
> >>>>> +
> >>>>> +        unsigned int batch_len;
> >>>>> +
> >>>>
> >>>> Are these two only used for in-order? Please document that.
> >>>
> >>> Yes, I will do that.
> >>>
> >>>> I also want some documentation about the batching trickery
> >>>> used, please.
> >>>> What is batched, when, how is batching flushed, why are we
> >>>> only batching in-order ...
> >>>
> >>> I'm not sure I get this: what you want seems to be the behaviour
> >>> of the device, which has been stated by the spec, or I may be
> >>> missing something here.
> >>
> >> "a single used ring entry with the id corresponding to the
> >> head entry of the descriptor chain describing the last buffer in the
> >> batch"
> >> ?
> >
> > Exactly.
> >
> >> so together they form this used ring entry describing the last buffer?
> >> "head" is the id and "len" the length?
> >
> > Yes.
> >
> >> maybe
> >>
> >> /*
> >>  * With IN_ORDER, devices write a single used ring entry with
> >>  * the id corresponding to the head entry of the descriptor chain
> >>  * describing the last buffer in the batch
> >>  */
> >> struct used_entry {
> >>         u32 id;
> >>         u32 len;
> >> } batch_last;
> >>
> >> ?
> >
> > This should be fine.
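And for the flush side, roughly (a sketch with hypothetical names —
next_used_head, report_used(), next_in_order() — the actual patch
differs in the details):

/*
 * Sketch: one used entry can retire a whole batch.  Walk the
 * outstanding heads in order until the id from the used entry is
 * reached; only that last buffer carries a device-written len, so
 * earlier buffers fall back to the length recorded at add time
 * (which is what the new total_len field holds).
 */
static void flush_batch_in_order(struct vring_virtqueue *vq,
                                 struct used_entry *batch_last)
{
        u32 head, len;

        do {
                head = vq->next_used_head;              /* hypothetical counter */
                len = head == batch_last->id ?
                      batch_last->len :
                      vq->split.desc_state[head].total_len;
                report_used(vq, head, len);             /* hypothetical helper */
                vq->next_used_head = next_in_order(vq, head);   /* hypothetical */
        } while (head != batch_last->id);
}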
> >>
> >>>>
> >>>>>          /* Number we've added since last sync. */
> >>>>>          unsigned int num_added;
> >>>>>
> >>>>> @@ -256,10 +266,14 @@ static void vring_free(struct virtqueue *_vq);
> >>>>>
> >>>>>  #define to_vvq(_vq) container_of_const(_vq, struct vring_virtqueue, vq)
> >>>>>
> >>>>> -
> >>>>>  static inline bool virtqueue_is_packed(const struct vring_virtqueue *vq)
> >>>>>  {
> >>>>> -        return vq->layout == PACKED;
> >>>>> +        return vq->layout == PACKED || vq->layout == PACKED_IN_ORDER;
> >>>>> +}
> >>>>> +
> >>>>> +static inline bool virtqueue_is_in_order(const struct vring_virtqueue *vq)
> >>>>> +{
> >>>>> +        return vq->layout == SPLIT_IN_ORDER || vq->layout == PACKED_IN_ORDER;
> >>>>>  }
> >>>>>
> >>>>>  static bool virtqueue_use_indirect(const struct vring_virtqueue *vq,
> >>>>> @@ -570,7 +584,7 @@ static inline int virtqueue_add_split(struct vring_virtqueue *vq,
> >>>>>          struct vring_desc_extra *extra;
> >>>>>          struct scatterlist *sg;
> >>>>>          struct vring_desc *desc;
> >>>>> -        unsigned int i, n, c, avail, descs_used, err_idx;
> >>>>> +        unsigned int i, n, c, avail, descs_used, err_idx, total_len = 0;
> >>>>
> >>>> I would add a comment here:
> >>>>
> >>>>         /* Total length for in-order */
> >>>>         unsigned int total_len = 0;
> >>>
> >>> Ok.
> >>>
> >>> Thanks