On Thu, 15 May 2025 15:19:46 -0600, alex.williamson@xxxxxxxxxx wrote:

>> +/*
>> + * Find a random vfio_pfn that belongs to the range
>> + * [iova, iova + PAGE_SIZE * npage)
>> + */
>> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
>> +		dma_addr_t iova, unsigned long npage)
>> +{
>> +	struct vfio_pfn *vpfn;
>> +	struct rb_node *node = dma->pfn_list.rb_node;
>> +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>> +
>> +	while (node) {
>> +		vpfn = rb_entry(node, struct vfio_pfn, node);
>> +
>> +		if (end_iova <= vpfn->iova)
>> +			node = node->rb_left;
>> +		else if (iova > vpfn->iova)
>> +			node = node->rb_right;
>> +		else
>> +			return vpfn;
>> +	}
>> +	return NULL;
>> +}
>
> This essentially duplicates vfio_find_vpfn(), where the existing
> function only finds a single page.  The existing function should be
> extended for this new use case and callers updated.  Also the vfio_pfn

How about implementing it this way?

static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma,
					      dma_addr_t iova)
{
	return vfio_find_vpfn_range(dma, iova, 1);
}

With this implementation, existing callers of vfio_find_vpfn() would not
need to be modified.

> is not "random", it's the first vfio_pfn overlapping the range.

Yes, you are right. In v2 I will change the comment to "Find the first
vfio_pfn overlapping the range [iova, iova + PAGE_SIZE * npage) in the
rb tree".

>> +
>>  static void vfio_link_pfn(struct vfio_dma *dma,
>>  			  struct vfio_pfn *new)
>>  {
>> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>>  			iova += (PAGE_SIZE * ret);
>>  			continue;
>>  		}
>> +
>
> Spurious new blank line.
>
>> +		}
>
> A new blank line here would be appreciated.

Thank you. I will address both in v2.

>> +		/* Handle hugetlbfs page */
>> +		if (likely(!disable_hugepages) &&
>
> Isn't this already accounted for with npage = 1?

Yes, this check is not necessary. I will remove it in v2.
>> +		    folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {
>
> I don't follow how this guarantees the entire batch->size is
> contiguous.  Isn't it possible that a batch could contain multiple
> hugetlb folios?  Is the assumption here only true if folio_nr_pages()
> (or specifically the pages remaining) is >= batch->capacity?  What

Yes.

> happens if we try to map the last half of one 2MB hugetlb page and
> first half of the non-physically-contiguous next page?  Or what if the
> hugetlb size is 64KB and the batch contains multiple folios that are
> not physically contiguous?

Sorry, the current implementation is flawed in those cases. I will fix
it in v2.

>> +			if (pfn != *pfn_base + pinned)
>> +				goto out;
>> +
>> +			if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
>> +				if (!dma->lock_cap &&
>> +				    mm->locked_vm + lock_acct + batch->size > limit) {
>> +					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>> +						__func__, limit << PAGE_SHIFT);
>> +					ret = -ENOMEM;
>> +					goto unpin_out;
>> +				}
>> +				pinned += batch->size;
>> +				npage -= batch->size;
>> +				vaddr += PAGE_SIZE * batch->size;
>> +				iova += PAGE_SIZE * batch->size;
>> +				lock_acct += batch->size;
>> +				batch->offset += batch->size;
>> +				batch->size = 0;
>> +				continue;
>> +			}
>
> There's a lot of duplication with the existing page-iterative loop.  I
> think they could be consolidated if we extract the number of known
> contiguous pages based on the folio into a variable, default 1.

Good idea! I will try to implement this consolidation in v2.

> Also, while this approach is an improvement, it leaves a lot on the
> table in scenarios where folio_nr_pages() exceeds batch->capacity.  For
> example we're at best incrementing 1GB hugetlb pages in 2MB increments.
> We're also wasting a lot of cycles to fill pages points we mostly don't
> use.  Thanks,

Yes, this is the ideal case.
However, we cannot know in advance whether the pages to be pinned belong
to a hugetlb folio, and increasing batch->capacity would significantly
increase memory usage. Additionally, pin_user_pages_remote() currently
lacks a variant that avoids the overhead of filling the pages array for
a hugetlb folio (if I understand correctly). I suggest proceeding with
these additional optimizations after further infrastructure
improvements.