On Wed, May 21, 2025 at 08:35:47AM +0200, David Hildenbrand wrote:
> On 21.05.25 00:21, Alex Williamson wrote:
> > On Tue, 20 May 2025 19:38:45 +0200
> > David Hildenbrand <david@xxxxxxxxxx> wrote:
> > 
> > > On 20.05.25 09:00, lizhe.67@xxxxxxxxxxxxx wrote:
> > > > From: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> > > 
> > > Subject: "huge folio" -> "large folios"
> > > 
> > > > 
> > > > When vfio_pin_pages_remote() is called with a range of addresses that
> > > > includes huge folios, the function currently performs individual
> > > 
> > > Similar, we call it a "large" folio.
> > > 
> > > > statistics counting operations for each page. This can lead to significant
> > > > performance overheads, especially when dealing with large ranges of pages.
> > > > 
> > > > This patch optimizes this process by batching the statistics counting
> > > > operations.
> > > > 
> > > > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > > address space has been mapped to physical memory using hugetlbfs with
> > > > pagesize=2M.
> > > > 
> > > > Before this patch:
> > > > funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> > > > 
> > > > After this patch:
> > > > funcgraph_entry:      # 15635.055 us |  vfio_pin_map_dma();
> > > > 
> > > > Signed-off-by: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
> > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > ---
> > > > Changelogs:
> > > > 
> > > > v2->v3:
> > > > - Code simplification.
> > > > - Fix some issues in comments.
> > > > 
> > > > v1->v2:
> > > > - Fix some issues in comments and formatting.
> > > > - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> > > > - Move the processing logic for huge folio into the while(true) loop
> > > >   and use a variable with a default value of 1 to indicate the number
> > > >   of consecutive pages.
> > > > 
> > > > v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@xxxxxxxxxxxxx/
> > > > v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@xxxxxxxxxxxxx/
> > > > 
> > > >  drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> > > >  1 file changed, 37 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index 0ac56072af9f..48f06ce0e290 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> > > >  /*
> > > >   * Helper Functions for host iova-pfn list
> > > >   */
> > > > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +
> > > > +/*
> > > > + * Find the first vfio_pfn that overlapping the range
> > > > + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> > > > + */
> > > > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > > > +		dma_addr_t iova, unsigned long npage)
> > > >  {
> > > >  	struct vfio_pfn *vpfn;
> > > >  	struct rb_node *node = dma->pfn_list.rb_node;
> > > > +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> > > >  
> > > >  	while (node) {
> > > >  		vpfn = rb_entry(node, struct vfio_pfn, node);
> > > >  
> > > > -		if (iova < vpfn->iova)
> > > > +		if (end_iova <= vpfn->iova)
> > > >  			node = node->rb_left;
> > > >  		else if (iova > vpfn->iova)
> > > >  			node = node->rb_right;
> > > > @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +{
> > > > +	return vfio_find_vpfn_range(dma, iova, 1);
> > > > +}
> > > > +
> > > >  static void vfio_link_pfn(struct vfio_dma *dma,
> > > >  			  struct vfio_pfn *new)
> > > >  {
> > > > @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > >  		 * and rsvd here, and therefore continues to use the batch.
> > > >  		 */
> > > >  		while (true) {
> > > > +			struct folio *folio = page_folio(batch->pages[batch->offset]);
> > > > +			long nr_pages;
> > > > +
> > > >  			if (pfn != *pfn_base + pinned ||
> > > >  			    rsvd != is_invalid_reserved_pfn(pfn))
> > > >  				goto out;
> > > >  
> > > > +			/*
> > > > +			 * Note: The current nr_pages does not achieve the optimal
> > > > +			 * performance in scenarios where folio_nr_pages() exceeds
> > > > +			 * batch->capacity. It is anticipated that future enhancements
> > > > +			 * will address this limitation.
> > > > +			 */
> > > > +			nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> > > > +				folio_page_idx(folio, batch->pages[batch->offset]));
> > > > +			if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > > > +				nr_pages = 1;
> > > 
> > > 
> > > You seem to assume that the batch really contains the consecutive pages
> > > of that folio.
> > 
> > I don't think we are.  We're iterating through our batch of pages from
> > GUP to find consecutive pfns.  We use the page to get the pfn, the
> > folio, and immediately above, the offset into the folio.  batch->size is
> > the remaining length of the page array from GUP and batch->offset is our
> > current index into that array.
> 
> Let me try again using an example below ....
> 
> > > This is not the case if we obtained the pages through GUP and we have
> > > 
> > > (a) A MAP_PRIVATE mapping
> > > 
> > > (b) We span multiple different VMAs
> > > 
> > > 
> > > Are we sure we can rule out (a) and (b)?
> > > 
> > > A more future-proof approach would be at least looking whether the
> > > pages/pfns are actually consecutive.
> > 
> > The unmodified (pfn != *pfn_base + pinned) test is where we verify we
> > have the next consecutive pfn.  Maybe I'm not catching the dependency
> > you're seeing on consecutive pages, I think there isn't one unless
> > we're somehow misusing folio_page_idx() to get the offset into the
> > folio.
> 
> Assume our page tables look like this (case (a), a partially mapped large
> pagecache folio mixed with COW'ed anonymous folios):
> 
> + page[0] of folio 0
> |              + COWed anonymous folio (folio 1)
> |              |    + page[4] of folio 0
> |              |    |
> v              v    v
> F0P0 F0P1 F0P2 F1P0 F0P4 F0P5 F0P6 F0P7
> 
> If we GUP that range, we get exactly these pages, except that the PFNs are
> not consecutive, because F0P3 was replaced by another page. The large folio
> is partially mapped.
> 
> Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0
> because we assume the batch would contain F1P0, where it really contains
> F0P4?

Looks like a real issue (even if unlikely setup)..  Before a next-gen GUP..

Maybe we should stick with memfd_pin_folios().  That'll require the mmap
read lock to be taken when seeing a hugetlb folio, so it'll be a fast path
to try to pin hugetlb-only vmas.

VFIO will also need to check in the fast path:

(1) double check it's a hugetlb VMA (in case the vma layout changed after
    GUP and before taking the mmap read lock),
(2) VMA boundary check, making sure the range is not out-of-bound,
(3) stick with VM_SHARED only for now (I don't think anyone uses
    MAP_PRIVATE anyway that will also care about how fast it pins..).

Then 1G will also work there..

Thanks,

-- 
Peter Xu
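
As an illustration of David's "check whether the pages/pfns are actually
consecutive" point, here is a minimal sketch: before batching the
accounting, count how many entries of the GUP batch are physically
contiguous with the first one.  The helper name and where it would be
called from are assumptions for illustration only; this is not part of
the posted patch.

    #include <linux/mm.h>

    /*
     * Hypothetical helper (not from the posted patch): starting at
     * pages[offset], count how many entries in the GUP batch have
     * physically consecutive pfns.  In David's example above this
     * returns 3 for F0P0..F0P2, so the F1P0 hole is never skipped.
     */
    static long vfio_batch_contig_pages(struct page **pages, long offset,
                                        long max)
    {
            unsigned long first_pfn = page_to_pfn(pages[offset]);
            long i;

            for (i = 1; i < max; i++) {
                    if (page_to_pfn(pages[offset + i]) != first_pfn + i)
                            break;
            }
            return i;
    }

Capping nr_pages by such a count, rather than only by folio_nr_pages()
and batch->size, would make the batched accounting fall back to smaller
runs whenever a COWed page or a VMA boundary breaks contiguity.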
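
And a rough sketch of the memfd_pin_folios() fast path described above,
with the three checks folded in.  The function name, the locking
placement, and the way the file offset is derived are all assumptions
made to keep the idea concrete; memfd_pin_folios() itself pins the
folios backing a file range and returns how many were pinned.

    /*
     * Hypothetical fast path for pinning a hugetlb-backed, VM_SHARED
     * range via memfd_pin_folios().  Sketch only; error handling and
     * integration with vfio_pin_pages_remote() are omitted.
     */
    static long vfio_pin_hugetlb_range(struct mm_struct *mm,
                                       unsigned long vaddr,
                                       unsigned long npages,
                                       struct folio **folios,
                                       unsigned int max_folios)
    {
            struct vm_area_struct *vma;
            long ret = -EFAULT;
            pgoff_t offset;
            loff_t start, end;

            mmap_read_lock(mm);
            vma = vma_lookup(mm, vaddr);

            /* (1) double check it is still a hugetlb VMA */
            if (!vma || !is_vm_hugetlb_page(vma) || !vma->vm_file)
                    goto out;
            /* (2) boundary check: the whole range must stay inside this VMA */
            if (vaddr + (npages << PAGE_SHIFT) > vma->vm_end)
                    goto out;
            /* (3) VM_SHARED only for now */
            if (!(vma->vm_flags & VM_SHARED))
                    goto out;

            /* Translate the user range into a file range and pin its folios. */
            start = ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
            end = start + (npages << PAGE_SHIFT) - 1;
            ret = memfd_pin_folios(vma->vm_file, start, end, folios,
                                   max_folios, &offset);
    out:
            mmap_read_unlock(mm);
            return ret;
    }

With 1G hugetlb folios such a path would pin the whole gigantic folio in
one call, which is what the "Then 1G will also work there" remark above
is pointing at.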