On Wed, May 21, 2025 at 08:35:47AM +0200, David Hildenbrand wrote:
> On 21.05.25 00:21, Alex Williamson wrote:
> > On Tue, 20 May 2025 19:38:45 +0200
> > David Hildenbrand <david@xxxxxxxxxx> wrote:
> > 
> > > On 20.05.25 09:00, lizhe.67@xxxxxxxxxxxxx wrote:
> > > > From: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> > > 
> > > Subject: "huge folio" -> "large folios"
> > > 
> > > > 
> > > > When vfio_pin_pages_remote() is called with a range of addresses that
> > > > includes huge folios, the function currently performs individual
> > > 
> > > Similar, we call it a "large" folio.
> > > 
> > > > statistics counting operations for each page. This can lead to significant
> > > > performance overheads, especially when dealing with large ranges of pages.
> > > > 
> > > > This patch optimizes this process by batching the statistics counting
> > > > operations.
> > > > 
> > > > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > > address space has been mapped to physical memory using hugetlbfs with
> > > > pagesize=2M.
> > > > 
> > > > Before this patch:
> > > > funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> > > > 
> > > > After this patch:
> > > > funcgraph_entry:      # 15635.055 us |  vfio_pin_map_dma();
> > > > 
> > > > Signed-off-by: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > ---
> > > > Changelogs:
> > > > 
> > > > v2->v3:
> > > > - Code simplification.
> > > > - Fix some issues in comments.
> > > > 
> > > > v1->v2:
> > > > - Fix some issues in comments and formatting.
> > > > - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> > > > - Move the processing logic for huge folio into the while(true) loop
> > > >   and use a variable with a default value of 1 to indicate the number
> > > >   of consecutive pages.
> > > > 
> > > > v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@xxxxxxxxxxxxx/
> > > > v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@xxxxxxxxxxxxx/
> > > > 
> > > >  drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> > > >  1 file changed, 37 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index 0ac56072af9f..48f06ce0e290 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> > > >  /*
> > > >   * Helper Functions for host iova-pfn list
> > > >   */
> > > > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +
> > > > +/*
> > > > + * Find the first vfio_pfn that overlapping the range
> > > > + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> > > > + */
> > > > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > > > +		dma_addr_t iova, unsigned long npage)
> > > >  {
> > > >  	struct vfio_pfn *vpfn;
> > > >  	struct rb_node *node = dma->pfn_list.rb_node;
> > > > +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> > > >  
> > > >  	while (node) {
> > > >  		vpfn = rb_entry(node, struct vfio_pfn, node);
> > > >  
> > > > -		if (iova < vpfn->iova)
> > > > +		if (end_iova <= vpfn->iova)
> > > >  			node = node->rb_left;
> > > >  		else if (iova > vpfn->iova)
> > > >  			node = node->rb_right;
> > > > @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +{
> > > > +	return vfio_find_vpfn_range(dma, iova, 1);
> > > > +}
> > > > +
> > > >  static void vfio_link_pfn(struct vfio_dma *dma,
> > > >  			  struct vfio_pfn *new)
> > > >  {
> > > > @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > >  		 * and rsvd here, and therefore continues to use the batch.
> > > >  		 */
> > > >  		while (true) {
> > > > +			struct folio *folio = page_folio(batch->pages[batch->offset]);
> > > > +			long nr_pages;
> > > > +
> > > >  			if (pfn != *pfn_base + pinned ||
> > > >  			    rsvd != is_invalid_reserved_pfn(pfn))
> > > >  				goto out;
> > > >  
> > > > +			/*
> > > > +			 * Note: The current nr_pages does not achieve the optimal
> > > > +			 * performance in scenarios where folio_nr_pages() exceeds
> > > > +			 * batch->capacity. It is anticipated that future enhancements
> > > > +			 * will address this limitation.
> > > > +			 */
> > > > +			nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> > > > +				folio_page_idx(folio, batch->pages[batch->offset]));
> > > > +			if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > > > +				nr_pages = 1;
> > > 
> > > 
> > > You seem to assume that the batch really contains the consecutive pages
> > > of that folio.
> > 
> > I don't think we are.  We're iterating through our batch of pages from
> > GUP to find consecutive pfns.  We use the page to get the pfn, the
> > folio, and immediately above, the offset into the folio.  batch->size is
> > the remaining length of the page array from GUP and batch->offset is our
> > current index into that array.
> 
> Let me try again using an example below ....
> 
> > > This is not the case if we obtained the pages through GUP and we have
> > > 
> > > (a) A MAP_PRIVATE mapping
> > > 
> > > (b) We span multiple different VMAs
> > > 
> > > 
> > > Are we sure we can rule out (a) and (b)?
> > > 
> > > A more future-proof approach would be at least looking whether the
> > > pages/pfns are actually consecutive.
> > 
> > The unmodified (pfn != *pfn_base + pinned) test is where we verify we
> > have the next consecutive pfn.  Maybe I'm not catching the dependency
> > you're seeing on consecutive pages, I think there isn't one unless
> > we're somehow misusing folio_page_idx() to get the offset into the
> > folio.
> 
> Assume our page tables look like this (case (a), a partially mapped large
> pagecache folio mixed with COW'ed anonymous folios):
> 
> + page[0] of folio 0
> |              + COWed anonymous folio (folio 1)
> |              |    + page[4] of folio 0
> |              |    |
> v              v    v
> F0P0 F0P1 F0P2 F1P0 F0P4 F0P5 F0P6 F0P7
> 
> If we GUP that range, we get exactly these pages, except that the PFNs are
> not consecutive, because F0P3 was replaced by another page. The large folio
> is partially mapped.
> 
> Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0
> because we assume the batch would contain F1P0, where it really contains
> F0P4?

Looks like a real issue (even if unlikely setup)..  Before a next-gen GUP..

Maybe we should stick with memfd_pin_folios().  That'll require the mmap
read lock to be taken when seeing a hugetlb folio, so it'll be a fast path
to try to pin hugetlb-only vmas.

VFIO will also need to check in the fast path:

(1) double check it's a hugetlb VMA (in case the vma layout changed after
    GUP and before taking the mmap read lock),
(2) VMA boundary check, making sure the range is not out-of-bound,
(3) stick with VM_SHARED only for now (I don't think anyone uses
    MAP_PRIVATE anyway that will also care about how fast it pins..).

Then 1G will also work there..

Thanks,

-- 
Peter Xu
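
As an illustration of David's "check whether the pages/pfns are actually
consecutive" point, here is a minimal sketch: before batching the
accounting, count how many entries of the GUP batch are physically
contiguous with the first one.  The helper name and where it would be
called from are assumptions for illustration only; this is not part of
the posted patch.

    #include <linux/mm.h>

    /*
     * Hypothetical helper (not from the posted patch): starting at
     * pages[offset], count how many entries in the GUP batch have
     * physically consecutive pfns.  In David's example above this
     * returns 3 for F0P0..F0P2, so the F1P0 hole is never skipped.
     */
    static long vfio_batch_contig_pages(struct page **pages, long offset,
                                        long max)
    {
            unsigned long first_pfn = page_to_pfn(pages[offset]);
            long i;

            for (i = 1; i < max; i++) {
                    if (page_to_pfn(pages[offset + i]) != first_pfn + i)
                            break;
            }
            return i;
    }

Capping nr_pages by such a count, rather than only by folio_nr_pages()
and batch->size, would make the batched accounting fall back to smaller
runs whenever a COWed page or a VMA boundary breaks contiguity.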
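
And a rough sketch of the memfd_pin_folios() fast path described above,
with the three checks folded in.  The function name, the locking
placement, and the way the file offset is derived are all assumptions
made to keep the idea concrete; memfd_pin_folios() itself pins the
folios backing a file range and returns how many were pinned.

    /*
     * Hypothetical fast path for pinning a hugetlb-backed, VM_SHARED
     * range via memfd_pin_folios().  Sketch only; error handling and
     * integration with vfio_pin_pages_remote() are omitted.
     */
    static long vfio_pin_hugetlb_range(struct mm_struct *mm,
                                       unsigned long vaddr,
                                       unsigned long npages,
                                       struct folio **folios,
                                       unsigned int max_folios)
    {
            struct vm_area_struct *vma;
            long ret = -EFAULT;
            pgoff_t offset;
            loff_t start, end;

            mmap_read_lock(mm);
            vma = vma_lookup(mm, vaddr);

            /* (1) double check it is still a hugetlb VMA */
            if (!vma || !is_vm_hugetlb_page(vma) || !vma->vm_file)
                    goto out;
            /* (2) boundary check: the whole range must stay inside this VMA */
            if (vaddr + (npages << PAGE_SHIFT) > vma->vm_end)
                    goto out;
            /* (3) VM_SHARED only for now */
            if (!(vma->vm_flags & VM_SHARED))
                    goto out;

            /* Translate the user range into a file range and pin its folios. */
            start = ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
            end = start + (npages << PAGE_SHIFT) - 1;
            ret = memfd_pin_folios(vma->vm_file, start, end, folios,
                                   max_folios, &offset);
    out:
            mmap_read_unlock(mm);
            return ret;
    }

With 1G hugetlb folios such a path would pin the whole gigantic folio in
one call, which is what the "Then 1G will also work there" remark above
is pointing at.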