On Tue, 20 May 2025 19:41:19 +0200
David Hildenbrand <david@xxxxxxxxxx> wrote:

> On 20.05.25 18:59, Peter Xu wrote:
> > Hi, Alex,
> >
> > On Tue, May 20, 2025 at 08:07:19AM -0600, Alex Williamson wrote:
> >> Peter, David, if you wouldn't mind double checking the folio usage
> >> here, I'd appreciate it. The underlying assumption used here is that
> >> folios always have physically contiguous pages, so we can increment at
> >> the remainder of the folio_nr_pages() rather than iterate each page.
> >
> > Yes I think so. E.g., there's comment above folio definition too:
>
> It has consecutive PFNs, yes (i.e., pfn++). The "struct page" might not
> be consecutive (i.e., page++ does not work for larger folios).

The former, contiguous PFNs, is all we need here. We're feeding the
IOMMU mapping, so we're effectively just looking for the largest extent
of contiguous PFNs for mapping a given IOVA. The struct page is really
just for GUP, finding the next pfn, and with this, finding the offset
into the large folio.

> > /**
> >  * struct folio - Represents a contiguous set of bytes.
> >  * ...
> >  * A folio is a physically, virtually and logically contiguous set
> >  * of bytes...
> >  */
> >
> > For 1G, I wonder if in the future vfio can also use memfd_pin_folios()
> > internally when possible, e.g. after stumbled on top of a hugetlb folio
> > when filling the batch.
>
> Yeah, or have a better GUP interface that gives us folio ranges instead
> of individual pages.
>
> Using memfd directly is obviously better where possible.

Yeah, we brought up some of these issues during previous reviews.
Ultimately we want to move to IOMMUFD, which already has these
features, but it still lacks P2P DMA mapping and isn't as well
supported by various management stacks. This leaves some performance
on the table, but has a pretty high return for a relatively trivial
change. Thanks,

Alex
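
As a rough illustration of the folio-remainder idea discussed above (this
is only a sketch, not the actual vfio change; the helper name
contig_pages_in_folio() is hypothetical), the number of physically
contiguous pages left from a GUP-returned page to the end of its folio
can be derived from the standard folio helpers in <linux/mm.h>:

#include <linux/mm.h>

/*
 * Sketch only: given a page returned by GUP, report how many physically
 * contiguous pages remain from it through the end of its folio.  PFNs
 * within a folio are consecutive, so a caller building an IOMMU mapping
 * can extend the current IOVA extent by this many pages instead of
 * testing each page individually.
 */
static long contig_pages_in_folio(struct page *page)
{
        struct folio *folio = page_folio(page);

        return folio_nr_pages(folio) - folio_page_idx(folio, page);
}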