On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@xxxxxxxxx> writes:
> 
> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> >> > ...
> >> > >
> >> > > I might be wrongly throwing out some terminologies here then.
> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> > >
> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> > OK. So from the point of view of the rest of the kernel, those pfns are not
> >> > regarded as memory.
> >> >
> >> > > shareability changes, the users will get notified and will have to
> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> > > special handling/lack of page structs.
> >> > My concern is that a failable invalidation notifier may not be ideal.
> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> > start shareability changes, with a failable invalidation notifier, some users
> >> > may fail the invalidation and the shareability change, even after other users
> >> > have successfully unmapped a range.
> >>
> >> Even if one user fails to invalidate its mappings, I don't see a
> >> reason to go ahead with shareability change. Shareability should not
> >> change unless all existing users let go of their soon-to-be-invalid
> >> view of memory.
> 
> Hi Yan,
> 
> While working on the 1G (aka HugeTLB) page support for guest_memfd
> series [1], we took into account conversion failures too. The steps are
> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> series from GitHub [2] because the steps for conversion changed in two
> separate patches.)
> 
> We do need to handle errors across ranges to be converted, possibly from
> different memslots. The goal is to either have the entire conversion
> happen (including page split/merge) or nothing at all when the ioctl
> returns.
> 
> We try to undo the restructuring (whether split or merge) and undo any
> shareability changes on error (barring ENOMEM, in which case we leave a
> WARNing).

Since the undo itself can fail (the case where you leave a WARNing, in
patch 38 of [1]), it can lead to WARNings in the kernel with folios not
being properly added back to the filemap.

> The part we don't restore is the presence of the pages in the host or
> guest page tables. For that, our idea is that if unmapped, the next
> access will just map it in, so there's no issue there.

I don't think so. As in patch 38 of [1], on failure, it may fail to
- restore the shareability
- restore the folio's filemap status
- restore the folio's hugetlb stash metadata
- restore the folio's merged/split status
Also, the host page table is not restored.

> > My thinking is that:
> >
> > 1. guest_memfd starts shared-to-private conversion
> > 2. guest_memfd sends invalidation notifications
> >    2.1 invalidate notification --> A --> Unmap and return success
> >    2.2 invalidate notification --> B --> Unmap and return success
> >    2.3 invalidate notification --> C --> return failure
> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >    shareability as shared
> >
> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> > 2.2. Even if additional notifications could be sent to A and B to ask for
> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> > not be able to restore the mapped status of the GFN.
> 
> For conversion we don't attempt to restore mappings anywhere (whether in
> guest or host page tables). What do you think of not restoring the
> mappings?

It could cause problems if the mappings in the S-EPT can't be restored.

For TDX private-to-shared conversion, if
kvm_gmem_convert_should_proceed() --> kvm_gmem_unmap_private() -->
kvm_mmu_unmap_gfn_range() fails in the end, then the GFN's shareability is
restored to private. The next guest access to the partially unmapped
private memory can then hit a fatal error: "access before acceptance".

It could occur in a scenario like this:
1. The TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared.
2. The conversion fails in KVM.
3. set_memory_decrypted() fails in the TD.
4. The TD thinks the GFN is still accepted as private and accesses it.

> > For IOMMU mappings, this
> > could result in DMAR failure following a failed attempt to do shared-to-private
> > conversion.
> 
> I believe the current conversion setup guards against this because after
> unmapping from the host, we check for any unexpected refcounts.

Right, it's fine if we check for any unexpected refcounts.

> (This unmapping is not the unmapping we're concerned about, since this is
> shared memory, and unmapping doesn't go through TDX.)
> 
> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> are "unexpected". The conversion ioctl will return to userspace with an
> error.
> 
> IO can continue to happen, since the memory is still mapped in the
> IOMMU. The memory state is still shared. No issue there.
> 
> In RFCv2 [1], we expect userspace to see the error, then try and remove
> the memory from the IOMMU, and then try conversion again.

I don't think it's right to depend on userspace always behaving in the way
the kernel expects, i.e. retrying the conversion until it succeeds. We need
to restore the previous status (which includes the host page tables) if the
conversion can't be done.

That said, in my view, a better flow would be:

1. guest_memfd sends a pre-invalidation request to users (users here means
   the in-kernel consumers of memory allocated from guest_memfd).

2. Users (A, B, ..., X) perform pre-checks to determine whether the
   invalidation can proceed. For example, in the case of TDX, this might
   involve memory allocation and page splitting.

3. Based on the pre-check results, guest_memfd either aborts the
   invalidation or proceeds by sending the actual invalidation request.

4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail.
   For TDX, the unmap must succeed unless there are bugs in KVM or the TDX
   module. In such cases, TDX can call back into guest_memfd to report the
   page's poison status or elevate the page's reference count.

5. guest_memfd completes the invalidation process. If the memory is marked
   as "poisoned", guest_memfd can handle it accordingly. If the page has an
   elevated reference count, guest_memfd may not need to take special
   action, as the elevated count prevents the OS from reallocating the page.
   (But from your reply below, it seems a callback to guest_memfd is the
   better approach.)
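To make the flow above more concrete, here is a minimal sketch of what such
a two-phase interface could look like. All names below (gmem_invalidate_ops,
gmem_user, gmem_file, gmem_convert_range, ...) are made up for illustration
and are not from any posted series:

/*
 * Minimal sketch only; all names are hypothetical. It just illustrates
 * the 5 steps above: a failable pre-check pass, then a no-fail unmap pass.
 */
struct gmem_invalidate_ops {
        /*
         * Step 2: failable pre-check. May allocate memory or pre-split
         * page tables, but must not unmap anything yet.
         */
        int (*pre_invalidate)(struct gmem_user *user, pgoff_t start,
                              pgoff_t end);
        /*
         * Step 4: the actual unmap. Must not fail. On a KVM/TDX bug, the
         * user marks the page as poisoned or holds an extra refcount
         * instead of returning an error.
         */
        void (*invalidate)(struct gmem_user *user, pgoff_t start,
                           pgoff_t end);
};

static int gmem_convert_range(struct gmem_file *f, pgoff_t start, pgoff_t end)
{
        struct gmem_user *user;
        int ret;

        /* Steps 1-3: abort if any pre-check fails; nothing is unmapped yet. */
        list_for_each_entry(user, &f->users, node) {
                ret = user->ops->pre_invalidate(user, start, end);
                if (ret)
                        return ret;
        }

        /* Step 4: no error path from here on. */
        list_for_each_entry(user, &f->users, node)
                user->ops->invalidate(user, start, end);

        /* Step 5: update shareability, handle poisoned/refcounted pages. */
        return 0;
}

The point of the split is that any operation that can fail is moved in
front of the first unmap, so a failure never leaves some users unmapped
while others are not.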
> The part in concern here is unmapping failures of private pages, for
> private-to-shared conversions, since that part goes through TDX and
> might fail.

IMO, even for TDX, the real unmap must not fail unless there are bugs in
KVM or the TDX module. So, for page splitting in the S-EPT, I prefer to try
the splitting in the pre-invalidation phase, before conducting any real
unmap.

> One other thing about taking refcounts is that in RFCv2,
> private-to-shared conversions assume that there are no refcounts on the
> private pages at all. (See filemap_remove_folio_for_restructuring() in
> [3])
> 
> Haven't had a chance to think about all the edge cases, but for now I
> think on unmapping failure, in addition to taking a refcount, we should
> return an error at least up to guest_memfd, so that guest_memfd could
> perhaps keep the refcount on that page, but drop the page from the
> filemap. Another option could be to track messed up addresses and always
> check that on conversion or something - not sure yet.

That sounds good to me. See bullet 4 in my proposed flow above.

> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> next conversion request, the conversion will just spin in
> filemap_remove_folio_for_restructuring().

That makes sense.

> What do you think of this part about informing guest_memfd of the
> failure to unmap?

So, do you want to add a guest_memfd callback for this purpose?

BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range() and
mmu_notifier_invalidate_range_start() fail, based on the repo
https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
'hardening-v6.16-rc1-fix1-take2' of
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux").

1. Status of mmu notifier
-------------------------------
(1) There are 34 direct callers of mmu_notifier_invalidate_range_start():

1. clear_refs_write
2. do_pagemap_scan
3. uprobe_write_opcode
4. do_huge_zero_wp_pmd
5. __split_huge_pmd (N)
6. __split_huge_pud (N)
7. move_pages_huge_pmd
8. copy_hugetlb_page_range
9. hugetlb_unshare_pmds (N)
10. hugetlb_change_protection
11. hugetlb_wp
12. unmap_hugepage_range (N)
13. move_hugetlb_page_tables
14. collapse_huge_page
15. retract_page_tables
16. collapse_pte_mapped_thp
17. write_protect_page
18. replace_page
19. madvise_free_single_vma
20. wp_clean_pre_vma
21. wp_page_copy
22. zap_page_range_single_batched (N)
23. unmap_vmas (N)
24. copy_page_range
25. remove_device_exclusive_entry
26. migrate_vma_collect
27. __migrate_device_pages
28. change_pud_range
29. move_page_tables
30. page_vma_mkclean_one
31. try_to_unmap_one
32. try_to_migrate_one
33. make_device_exclusive
34. move_pages_pte

Of these 34 direct callers, those marked with (N) cannot tolerate
mmu_notifier_invalidate_range_start() failing. I have not yet investigated
all 34 direct callers one by one, so the list of (N) is incomplete.

For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
"Note that split_huge_pmd() doesn't have any limitations on refcounting:
pmd can be split at any point and never fails."
This is because split_huge_pmd() is designed as a graceful fallback for
code that walks page tables but is unaware of huge PMDs.

(2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
__oom_reap_task_mm(), which expects only the error -EAGAIN. In
mn_hlist_invalidate_range_start():
"WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"

(3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In
that case, they don't need to register an mmu notifier. Alternatively,
device drivers can pin pages via get_user_pages*() and register mmu
notifier callbacks for the memory range; then, upon receiving an
"invalidate range" callback, they stop the device from using the range and
unpin the pages. See Documentation/core-api/pin_user_pages.rst.
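For reference, the asymmetry between (1) and (2) is visible in the entry
points themselves: only the _nonblock variant returns an error at all.
Roughly (simplified from include/linux/mmu_notifier.h at the commit above;
lockdep annotations omitted, not a verbatim copy):

static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
        might_sleep();
        if (mm_has_notifiers(range->mm)) {
                range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
                /* any notifier return value is not propagated to the caller */
                __mmu_notifier_invalidate_range_start(range);
        }
}

static inline int
mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range)
{
        int ret = 0;

        if (mm_has_notifiers(range->mm)) {
                range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
                ret = __mmu_notifier_invalidate_range_start(range);
        }
        return ret;     /* -EAGAIN only, from the OOM reaper path */
}

So making the blockable path failable would mean adding an error return to
mmu_notifier_invalidate_range_start() and auditing all 34 direct callers
above, many of which (the (N) ones) have no way to handle it.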
2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
-------------------------------
(1) Error fallback cases.

1. split_huge_pmd(), as mentioned in Documentation/mm/transhuge.rst, is
   designed as a graceful fallback without failure.

split_huge_pmd
|->__split_huge_pmd
   |->mmu_notifier_range_init
   |  mmu_notifier_invalidate_range_start
   |  split_huge_pmd_locked
   |  mmu_notifier_invalidate_range_end

2. In fs/iomap/buffered-io.c, iomap_write_failed() is itself error handling.

iomap_write_failed
|->truncate_pagecache_range
|->unmap_mapping_range
|  |->unmap_mapping_pages
|  |->unmap_mapping_range_tree
|  |->unmap_mapping_range_vma
|  |->zap_page_range_single
|  |->zap_page_range_single_batched
|  |->mmu_notifier_range_init
|  |  mmu_notifier_invalidate_range_start
|  |  unmap_single_vma
|  |  mmu_notifier_invalidate_range_end
|->truncate_inode_pages_range
|->truncate_cleanup_folio
|->if (folio_mapped(folio))
|      unmap_mapping_folio(folio);
|->unmap_mapping_range_tree
|->unmap_mapping_range_vma
|->zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

3. In mm/memory.c, zap_page_range_single() is invoked to handle errors.

remap_pfn_range_notrack
|->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
|  if (!error)
|      return 0;
|  zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

4. In kernel/events/core.c, zap_page_range_single() is invoked to clear any
   partial mappings on error.

perf_mmap
|->ret = map_range(rb, vma);
|      err = remap_pfn_range
|->if (err)
|      zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

5. In mm/memory.c, unmap_mapping_folio() is invoked to unmap a poisoned
   page.

__do_fault
|->if (unlikely(PageHWPoison(vmf->page))) {
|      vm_fault_t poisonret = VM_FAULT_HWPOISON;
|      if (ret & VM_FAULT_LOCKED) {
|          if (page_mapped(vmf->page))
|              unmap_mapping_folio(folio);
|              |->unmap_mapping_range_tree
|              |->unmap_mapping_range_vma
|              |->zap_page_range_single
|              |->zap_page_range_single_batched
|              |->mmu_notifier_range_init
|              |  mmu_notifier_invalidate_range_start
|              |  unmap_single_vma
|              |  mmu_notifier_invalidate_range_end
|          if (mapping_evict_folio(folio->mapping, folio))
|              poisonret = VM_FAULT_NOPAGE;
|          folio_unlock(folio);
|      }
|      folio_put(folio);
|      vmf->page = NULL;
|      return poisonret;
|  }

6. In mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
   partial mapping done by a device driver.

__mmap_new_vma
|->__mmap_new_file_vma(map, vma);
|->error = mmap_file(vma->vm_file, vma);
|  if (error)
|      unmap_region
|->unmap_vmas
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end
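Taking case 3 as a concrete example, the pattern in mm/memory.c looks
roughly like this (paraphrased, not a verbatim copy): the zap, and hence
the mmu_notifier_invalidate_range_start() it issues, runs on the error
path itself, so there is no caller left to report a failure to:

int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
                            unsigned long pfn, unsigned long size,
                            pgprot_t prot)
{
        int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);

        if (!error)
                return 0;

        /*
         * A partial pfn range mapping is dangerous, so the partial
         * mapping is zapped right here on the error path -- this zap
         * (and the invalidation notification inside it) must not fail.
         */
        zap_page_range_single(vma, addr, size, NULL);
        return error;
}

The other error fallback cases above follow the same shape: the
invalidation is part of the cleanup, not of an operation that can be
retried or reported.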
(2) No-fail cases
-------------------------------
1. iput() cannot fail.

iput
|->iput_final
|->WRITE_ONCE(inode->i_state, state | I_FREEING);
|  inode_lru_list_del(inode);
|  evict(inode);
|->op->evict_inode(inode);
|->shmem_evict_inode
|->shmem_truncate_range
|->truncate_inode_pages_range
|->truncate_cleanup_folio
|->if (folio_mapped(folio))
|      unmap_mapping_folio(folio);
|->unmap_mapping_range_tree
|->unmap_mapping_range_vma
|->zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

2. exit_mmap() cannot fail.

exit_mmap
|->mmu_notifier_release(mm);
|->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

3. KVM cases that cannot tolerate unmap failure
-------------------------------
Allowing unmap operations to fail in the following scenarios would make it
very difficult or even impossible to handle the failure:

(1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow
page without expecting any failure.

mmu_alloc_direct_roots
|->mmu_alloc_root
|->kvm_mmu_get_shadow_page
|->__kvm_mmu_get_shadow_page
|->kvm_mmu_alloc_shadow_page
|->account_shadowed
|->kvm_mmu_slot_gfn_write_protect
|->kvm_tdp_mmu_write_protect_gfn
|->write_protect_gfn
|->tdp_mmu_iter_set_spte

(2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail.

kvm_vfio_release/kvm_vfio_file_del
|->kvm_vfio_update_coherency
|->kvm_arch_unregister_noncoherent_dma
|->kvm_noncoherent_dma_assignment_start_or_stop
|->kvm_zap_gfn_range
|->kvm_tdp_mmu_zap_leafs
|->tdp_mmu_zap_leafs
|->tdp_mmu_iter_set_spte

(3) There are lots of callers of __kvm_set_or_clear_apicv_inhibit() that
currently never expect the unmap to fail.

__kvm_set_or_clear_apicv_inhibit
|->kvm_zap_gfn_range
|->kvm_tdp_mmu_zap_leafs
|->tdp_mmu_zap_leafs
|->tdp_mmu_iter_set_spte

4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (which updates
an SPTE under the write mmu_lock) failable
-------------------------------
(1) kvm_vcpu_flush_tlb_guest()

kvm_vcpu_flush_tlb_guest
|->kvm_mmu_sync_roots
|->mmu_sync_children
|->kvm_vcpu_write_protect_gfn
|->kvm_mmu_slot_gfn_write_protect
|->kvm_tdp_mmu_write_protect_gfn
|->write_protect_gfn
|->tdp_mmu_iter_set_spte
|->tdp_mmu_set_spte

(2) handle_removed_pt() and handle_changed_spte().

Thanks
Yan