Re: [PATCH] vfio/type1: optimize vfio_pin_pages_remote() for hugetlbfs folio

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 15 May 2025 15:19:46 -0600

On Tue, 13 May 2025 11:57:30 +0800
lizhe.67@xxxxxxxxxxxxx wrote:

> From: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> 
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes hugetlbfs folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
> 
> This patch optimize this process by batching the statistics counting
> operations.
> 
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
> 
> Before this patch:
> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> 
> After this patch:
> funcgraph_entry:      # 15635.055 us |  vfio_pin_map_dma();
> 
> Signed-off-by: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)

Hi,

Thanks for looking at improvements in this area...

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..bafa7f8c4cc6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -337,6 +337,30 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	return NULL;
>  }
>  
> +/*
> + * Find a random vfio_pfn that belongs to the range
> + * [iova, iova + PAGE_SIZE * npage)
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> +		dma_addr_t iova, unsigned long npage)
> +{
> +	struct vfio_pfn *vpfn;
> +	struct rb_node *node = dma->pfn_list.rb_node;
> +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (end_iova <= vpfn->iova)
> +			node = node->rb_left;
> +		else if (iova > vpfn->iova)
> +			node = node->rb_right;
> +		else
> +			return vpfn;
> +	}
> +	return NULL;
> +}

This essentially duplicates vfio_find_vpfn(), where the existing
function only finds a single page.  The existing function should be
extended for this new use case and callers updated.  Also the vfio_pfn
is not "random", it's the first vfio_pfn overlapping the range.

> +
>  static void vfio_link_pfn(struct vfio_dma *dma,
>  			  struct vfio_pfn *new)
>  {
> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>  				iova += (PAGE_SIZE * ret);
>  				continue;
>  			}
> +

Spurious new blank line.

> +		}

A new blank line here would be appreciated.

> +		/* Handle hugetlbfs page */
> +		if (likely(!disable_hugepages) &&

Isn't this already accounted for with npage = 1?

> +				folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {

I don't follow how this guarantees the entire batch->size is
contiguous.  Isn't it possible that a batch could contain multiple
hugetlb folios?  Is the assumption here only true if folio_nr_pages()
(or specifically the pages remaining) is >= batch->capacity?  What
happens if we try to map the last half of one 2MB hugetlb page and
first half of the non-physically-contiguous next page?  Or what if the
hugetlb size is 64KB and the batch contains multiple folios that are
not physically contiguous?

> +			if (pfn != *pfn_base + pinned)
> +				goto out;
> +
> +			if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
> +				if (!dma->lock_cap &&
> +				    mm->locked_vm + lock_acct + batch->size > limit) {
> +					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> +						__func__, limit << PAGE_SHIFT);
> +					ret = -ENOMEM;
> +					goto unpin_out;
> +				}
> +				pinned += batch->size;
> +				npage -= batch->size;
> +				vaddr += PAGE_SIZE * batch->size;
> +				iova += PAGE_SIZE * batch->size;
> +				lock_acct += batch->size;
> +				batch->offset += batch->size;
> +				batch->size = 0;
> +				continue;
> +			}

There's a lot of duplication with the existing page-iterative loop.  I
think they could be consolidated if we extract the number of known
contiguous pages based on the folio into a variable, default 1.

Also, while this approach is an improvement, it leaves a lot on the
table in scenarios where folio_nr_pages() exceeds batch->capacity.  For
example we're at best incrementing 1GB hugetlb pages in 2MB increments.
We're also wasting a lot of cycles to fill pages points we mostly don't
use.  Thanks,

Alex