On Thu, 22 May 2025 14:52:07 -0600, alex.williamson@xxxxxxxxxx wrote:

> On Thu, 22 May 2025 16:25:24 +0800
> lizhe.67@xxxxxxxxxxxxx wrote:
> 
> > On Thu, 22 May 2025 09:22:50 +0200, david@xxxxxxxxxx wrote:
> > 
> > >On 22.05.25 05:49, lizhe.67@xxxxxxxxxxxxx wrote:
> > >> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@xxxxxxxxxx wrote:
> > >>
> > >>>> From: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> > >>>>
> > >>>> When vfio_pin_pages_remote() is called with a range of addresses that
> > >>>> includes large folios, the function currently performs individual
> > >>>> statistics counting operations for each page. This can lead to significant
> > >>>> performance overheads, especially when dealing with large ranges of pages.
> > >>>>
> > >>>> This patch optimize this process by batching the statistics counting
> > >>>> operations.
> > >>>>
> > >>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > >>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > >>>> address space has been mapped to physical memory using hugetlbfs with
> > >>>> pagesize=2M.
> > >>>>
> > >>>> Before this patch:
> > >>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> > >>>>
> > >>>> After this patch:
> > >>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> > >>>>
> > >>>> Signed-off-by: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> > >>>> Co-developed-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > >>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > >>>> ---
> > >>>
> > >>> Given the discussion on v3, this is currently a Nak. Follow-up in that
> > >>> thread if there are further ideas how to salvage this. Thanks,
> > >>
> > >> How about considering the solution David mentioned to check whether the
> > >> pages or PFNs are actually consecutive?
> > >>
> > >> I have conducted a preliminary attempt, and the performance testing
> > >> revealed that the time consumption is approximately 18,000 microseconds.
> > >> Compared to the previous 33,000 microseconds, this also represents a
> > >> significant improvement.
> > >>
> > >> The modification is quite straightforward. The code below reflects the
> > >> changes I have made based on this patch.
> > >>
> > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >> index bd46ed9361fe..1cc1f76d4020 100644
> > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > >>  	return ret;
> > >>  }
> > >>  
> > >> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> > >> +{
> > >> +	long i;
> > >> +	unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> > >> +
> > >> +	for (i = 1; i < npage; ++i) {
> > >> +		if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> > >> +			break;
> > >> +		next_pfn++;
> > >> +	}
> > >> +	return i;
> > >> +}
> > >
> > >
> > >What might be faster is obtaining the folio, and then calculating the
> > >next expected page pointer, comparing whether the page pointers match.
> > >
> > >Essentially, using folio_page() to calculate the expected next page.
> > >
> > >nth_page() is a simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP,
> > >so that might be rather fast.
> > >
> > >
> > >So we'd obtain
> > >
> > >start_idx = folio_idx(folio, batch->pages[batch->offset]);
> > 
> > Do you mean using folio_page_idx()?
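
(For the record, I assumed folio_page_idx() here and used it in the updated
code below. My understanding of why the folio_page() comparison should be
cheap, which is only my reading of the suggestion and not something taken
from any patch, is that with CONFIG_SPARSEMEM_VMEMMAP both folio_page() and
nth_page() boil down to plain pointer arithmetic on the vmemmap array, so
the inner check becomes a single pointer comparison per page, roughly:

	/* sketch only, assuming CONFIG_SPARSEMEM_VMEMMAP; not part of any patch */
	struct page *expected = &folio->page + start_idx + i;	/* folio_page() */

	if (batch->pages[batch->offset + i] != expected)
		break;

Please correct me if that is not what was meant.)
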
> > 
> > >and then check for
> > >
> > >batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)
> > 
> > Thank you for your reminder. This is indeed a better solution.
> > The updated code might look like this:
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index bd46ed9361fe..f9a11b1d8433 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >  	return ret;
> >  }
> >  
> > +static inline long continuous_pages_num(struct folio *folio,
> > +		struct vfio_batch *batch, long npage)
> 
> Note this becomes long enough that we should just let the compiler
> decide whether to inline or not.

Thank you! The 'inline' here indeed needs to be removed.

> > +{
> > +	long i;
> > +	unsigned long start_idx =
> > +		folio_page_idx(folio, batch->pages[batch->offset]);
> > +
> > +	for (i = 1; i < npage; ++i)
> > +		if (batch->pages[batch->offset + i] !=
> > +				folio_page(folio, start_idx + i))
> > +			break;
> > +	return i;
> > +}
> > +
> >  /*
> >   * Attempt to pin pages. We really don't want to track all the pfns and
> >   * the iommu can only map chunks of consecutive pfns anyway, so get the
> > @@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >  		 */
> >  		nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
> >  				folio_page_idx(folio, batch->pages[batch->offset]));
> > -		if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > -			nr_pages = 1;
> > +		if (nr_pages > 1) {
> > +			if (vfio_find_vpfn_range(dma, iova, nr_pages))
> > +				nr_pages = 1;
> > +			else
> > +				nr_pages = continuous_pages_num(folio, batch, nr_pages);
> > +		}
> 
> 
> I think we can refactor this a bit better and maybe if we're going to
> the trouble of comparing pages we can be a bit more resilient to pages
> already accounted as vpfns.  I took a shot at it, compile tested only,
> is there still a worthwhile gain?
> 
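
Just to make sure I follow the intended accounting, with made-up numbers
rather than anything I have measured: for one iteration that covers a whole
2M hugetlb folio, contig_pages() would return 512. If, say, 3 of those pages
were already tracked as vpfns (and assuming vpfn_pages() counts all of them,
see my question further down), the accounting would collapse into a single

	lock_acct += 512 - 3;	/* acct_pages = nr_pages - vpfn_pages() */

while pinned/vaddr/iova and the batch window still advance by the full 512
pages. My version above would instead fall back to nr_pages = 1 for the
whole folio as soon as one vpfn is found in the range, so this approach
should behave better in that case.
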
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..e8bba32148f7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -319,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +
> +/*
> + * Find the first vfio_pfn that overlapping the range
> + * [iova_start, iova_end) in rb tree.
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> +		dma_addr_t iova_start, dma_addr_t iova_end)
>  {
>  	struct vfio_pfn *vpfn;
>  	struct rb_node *node = dma->pfn_list.rb_node;
> @@ -327,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	while (node) {
>  		vpfn = rb_entry(node, struct vfio_pfn, node);
>  
> -		if (iova < vpfn->iova)
> +		if (iova_end <= vpfn->iova)
>  			node = node->rb_left;
> -		else if (iova > vpfn->iova)
> +		else if (iova_start > vpfn->iova)
>  			node = node->rb_right;
>  		else
>  			return vpfn;
> @@ -337,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	return NULL;
>  }
>  
> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +{
> +	return vfio_find_vpfn_range(dma, iova, iova + PAGE_SIZE);
> +}
> +
>  static void vfio_link_pfn(struct vfio_dma *dma,
>  			  struct vfio_pfn *new)
>  {
> @@ -615,6 +626,43 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>  	return ret;
>  }
>  
> +static long contig_pages(struct vfio_dma *dma,
> +			 struct vfio_batch *batch, dma_addr_t iova)
> +{
> +	struct page *page = batch->pages[batch->offset];
> +	struct folio *folio = page_folio(page);
> +	long idx = folio_page_idx(folio, page);
> +	long max = min_t(long, batch->size, folio_nr_pages(folio) - idx);
> +	long nr_pages;
> +
> +	for (nr_pages = 1; nr_pages < max; nr_pages++) {
> +		if (batch->pages[batch->offset + nr_pages] !=
> +		    folio_page(folio, idx + nr_pages))
> +			break;
> +	}
> +
> +	return nr_pages;
> +}
> +
> +static long vpfn_pages(struct vfio_dma *dma,
> +		       dma_addr_t iova_start, long nr_pages)
> +{
> +	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> +	struct vfio_pfn *vpfn;
> +	long count = 0;
> +
> +	do {
> +		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);

I am somewhat confused here. Isn't vfio_find_vpfn_range() designed to walk
the rbtree and return whichever node inside [iova_start, iova_end) is
closest to the root, rather than the node closest to iova_start? Or perhaps
I have misunderstood something?
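
To make the concern concrete, here is a hand-constructed case (I have not
actually run this, so please double-check my reading). Assume 4K pages and
vpfns at iovas 0x1000, 0x2000 and 0x3000, with 0x2000 at the root of the
rbtree and the other two as its children, and vpfn_pages() asked about the
range [0x1000, 0x4000). The first lookup returns the root node 0x2000, so
iova_start is advanced to 0x3000 and the vpfn at 0x1000 is never visited.
The second lookup returns 0x3000 and the loop exits with count == 2 instead
of 3, so acct_pages ends up one page too large. If that reading is correct,
maybe the search should keep descending left after a hit so that it returns
the lowest-iova overlapping node, along the lines of this untested sketch:

	/* untested sketch: return the lowest-iova vpfn in [iova_start, iova_end) */
	static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
			dma_addr_t iova_start, dma_addr_t iova_end)
	{
		struct rb_node *node = dma->pfn_list.rb_node;
		struct vfio_pfn *ret = NULL;

		while (node) {
			struct vfio_pfn *vpfn = rb_entry(node, struct vfio_pfn, node);

			if (iova_end <= vpfn->iova) {
				node = node->rb_left;
			} else if (iova_start > vpfn->iova) {
				node = node->rb_right;
			} else {
				ret = vpfn;	/* in range; a smaller iova may exist on the left */
				node = node->rb_left;
			}
		}
		return ret;
	}

With that, the skip-ahead in the vpfn_pages() loop could not jump over an
earlier vpfn.
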
> +		if (likely(!vpfn))
> +			break;
> +
> +		count++;
> +		iova_start = vpfn->iova + PAGE_SIZE;
> +	} while (iova_start < iova_end);
> +
> +	return count;
> +}
> +
>  /*
>   * Attempt to pin pages. We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> @@ -681,32 +729,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>  	 * and rsvd here, and therefore continues to use the batch.
>  	 */
>  	while (true) {
> +		long nr_pages, acct_pages = 0;
> +
>  		if (pfn != *pfn_base + pinned ||
>  		    rsvd != is_invalid_reserved_pfn(pfn))
>  			goto out;
>  
> +		nr_pages = contig_pages(dma, batch, iova);
> +		if (!rsvd) {
> +			acct_pages = nr_pages;
> +			acct_pages -= vpfn_pages(dma, iova, nr_pages);
> +		}
> +
>  		/*
>  		 * Reserved pages aren't counted against the user,
>  		 * externally pinned pages are already counted against
>  		 * the user.
>  		 */
> -		if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> +		if (acct_pages) {
>  			if (!dma->lock_cap &&
> -			    mm->locked_vm + lock_acct + 1 > limit) {
> +			    mm->locked_vm + lock_acct + acct_pages > limit) {
>  				pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>  					__func__, limit << PAGE_SHIFT);
>  				ret = -ENOMEM;
>  				goto unpin_out;
>  			}
> -			lock_acct++;
> +			lock_acct += acct_pages;
>  		}
>  
> -		pinned++;
> -		npage--;
> -		vaddr += PAGE_SIZE;
> -		iova += PAGE_SIZE;
> -		batch->offset++;
> -		batch->size--;
> +		pinned += nr_pages;
> +		npage -= nr_pages;
> +		vaddr += PAGE_SIZE * nr_pages;
> +		iova += PAGE_SIZE * nr_pages;
> +		batch->offset += nr_pages;
> +		batch->size -= nr_pages;
>  
>  		if (!batch->size)
>  			break;

Thanks,
Zhe