On 3/13/25 14:18, Mike Christie wrote:
> The following patches were made over Linus's tree. They implement
> a virtual PCI NVMe device using mdev/vfio. The device can be used
> by QEMU and in the guest will look like a normal old local PCI
> NVMe drive.
>
> They are based on Maxim Levitsky's mdev patches:
>
> https://lore.kernel.org/lkml/20190506125752.GA5288@xxxxxx/t/
>
> but instead of trying to export a physical NVMe device to a guest, they
> are only focused on exporting a virtual device using the nvmet layer.
>
> Why another driver when we have so many? Performance.
> =====================================================
> Without any tuning, and with major locks still in the main IO path, 4K
> IOPS for a single controller with a single namespace are higher than
> those of the kernel vhost-scsi driver and the SPDK userspace
> vhost-scsi/blk targets at lower numbers of queues/cpus/jobs. At just 2
> queues, we are able to hit 1M IOPS:
>
> Note: the nvme mdev values below have the shadow doorbell enabled.
>
> numjobs    mdev     vhost-scsi    vhost-scsi-usr    vhost-blk-usr
> 1          518K     198K          332K              301K
> 2          1037K    363K          609K              664K
> 4          974K     633K          1369K             1383K
> 8          813K     1788K         1358K             1363K
>
> However, by default we can't scale. But with mdev tuned to pre-pin
> pages (this requires patches to the vfio layer), it also performs
> better at both lower and higher numbers of queues/cpus/jobs, reaching
> 2.3M IOPS with only 4 cpus/queues used:
>
> numjobs    mdev
> 1          505K
> 2          1037K
> 4          2375K
> 8          2162K
>
> If we agree on a new virtual NVMe driver being ok, why mdev vs vhost?
> =====================================================================
> The problem with a vhost nvme is:
>
> 1. If we do a fully vhost nvmet solution, it will require new guest
> drivers that present NVMe interfaces to userspace and then speak the
> vhost spec on the backend, like vhost-scsi does.
>
> I don't want to implement a Windows or even a Linux nvme vhost
> driver. I don't think anyone wants the extra headache.
>
> 2. We can do a hybrid approach where in the guest it looks like a
> normal old local NVMe drive and we use the guest's native NVMe driver.
> However, in QEMU we would need a vhost nvme module that handles
> virtual PCI memory accesses instead of using vhost virtqueues, as well
> as a vhost nvme kernel or user driver to process IO.
>
> That is not as much extra code as option 1, since we don't have to
> worry about the guest, but it is still extra QEMU code.
>
> 3. The mdev based solution does not have these drawbacks, as it can
> look like a normal old local NVMe drive to the guest and can use QEMU's
> existing vfio layer. So it just requires the kernel driver.
>
> Why not a new blk driver or why not vdpa blk?
> =============================================
> Applications want standardized interfaces for things like persistent
> reservations. They already have to support them for SCSI and NVMe
> and don't want to have to support a new virtio block interface.
>
> Also, the nvmet-mdev-pci driver in this patchset can perform as well
> as SPDK vhost blk, so that option doesn't have the perf advantage it
> used to.
>
> Status
> ======
> This patchset is RFC quality only. You can discover a drive and do
> IO, but it's not stable. There are several TODO items mentioned in the
> last patch. However, I think the patches are at the point where I
> wanted to get some feedback about whether this is even acceptable,
> because the last time they were posted some people did not like how
> they hooked into drivers/nvme/host (this has been fixed in this
> posting).
> There are some other issues, like:
>
> 1. Should the driver integrate with pci-epf (the drivers work very
> differently but could share some code)?

Will have a look.

> 2. Should it try to fit into the existing configfs interface or
> implement its own like pci-epf did? I did an attempt at this but it
> feels wrong.

Note that the configfs for pci-epf is supported by the PCI endpoint
infrastructure. It is not all implemented by the driver alone.

-- 
Damien Le Moal
Western Digital Research
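
As background for question 2: the "existing configfs interface" is the
one the other nvmet transports are already driven with from userspace.
Below is a minimal, hedged sketch of that flow using the loop transport
purely for illustration; the subsystem NQN "testnqn", the backing device
/dev/nvme0n1, and port 1 are placeholder values, and how an mdev-backed
controller would be expressed here (e.g. as a new transport type on a
port) is exactly the open question.

  # Create an nvmet subsystem and allow any host to connect to it.
  mkdir /sys/kernel/config/nvmet/subsystems/testnqn
  echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host

  # Back namespace 1 with an existing block device and enable it.
  mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
  echo -n /dev/nvme0n1 > \
      /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
  echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable

  # Create a port for a transport and expose the subsystem through it.
  mkdir /sys/kernel/config/nvmet/ports/1
  echo loop > /sys/kernel/config/nvmet/ports/1/addr_trtype
  ln -s /sys/kernel/config/nvmet/subsystems/testnqn \
        /sys/kernel/config/nvmet/ports/1/subsystems/testnqn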
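
Regarding point 3 in the cover letter (the mdev approach reusing QEMU's
existing vfio layer, so only a kernel driver is needed): the generic
flow for any mdev-based device, as described in the kernel's vfio
mediated device documentation, looks roughly like the following. The
parent device path and type name are placeholders, since what
nvmet-mdev-pci actually exposes is defined by the patches themselves.

  # List the mdev types advertised by a parent device (the parent path
  # and type name below are placeholders, not from this patchset).
  ls /sys/class/mdev_bus/<parent-device>/mdev_supported_types/

  # Create a mediated device instance of one of those types.
  UUID=$(uuidgen)
  echo "$UUID" > \
      /sys/class/mdev_bus/<parent-device>/mdev_supported_types/<type>/create

  # Pass the mdev to a guest with QEMU's existing vfio-pci device;
  # no QEMU-side changes are needed for this step.
  qemu-system-x86_64 ... \
      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID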