On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@xxxxxxxxx> writes:
> 
> > On Mon, May 12, 2025 at 09:53:43AM -0700, Vishal Annapurve wrote:
> >> On Sun, May 11, 2025 at 7:18 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> >> > ...
> >> > >
> >> > > I might be wrongly throwing out some terminologies here then.
> >> > > VM_PFNMAP flag can be set for memory backed by folios/page structs.
> >> > > udmabuf seems to be working with pinned "folios" in the backend.
> >> > >
> >> > > The goal is to get to a stage where guest_memfd is backed by pfn
> >> > > ranges unmanaged by kernel that guest_memfd owns and distributes to
> >> > > userspace, KVM, IOMMU subject to shareability attributes. if the
> >> > OK. So from the point of view of the rest of the kernel, those pfns are not
> >> > regarded as memory.
> >> >
> >> > > shareability changes, the users will get notified and will have to
> >> > > invalidate their mappings. guest_memfd will allow mmaping such ranges
> >> > > with VM_PFNMAP flag set by default in the VMAs to indicate the need of
> >> > > special handling/lack of page structs.
> >> > My concern is that a failable invalidation notifier may not be ideal.
> >> > Instead of relying on ref counts (or other mechanisms) to determine whether to
> >> > start shareability changes, with a failable invalidation notifier, some users
> >> > may fail the invalidation and the shareability change, even after other users
> >> > have successfully unmapped a range.
> >>
> >> Even if one user fails to invalidate its mappings, I don't see a
> >> reason to go ahead with shareability change. Shareability should not
> >> change unless all existing users let go of their soon-to-be-invalid
> >> view of memory.
> 
> Hi Yan,
> 
> While working on the 1G (aka HugeTLB) page support for guest_memfd
> series [1], we took into account conversion failures too. The steps are
> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> series from GitHub [2] because the steps for conversion changed in two
> separate patches.)
> 
> We do need to handle errors across ranges to be converted, possibly from
> different memslots. The goal is to either have the entire conversion
> happen (including page split/merge) or nothing at all when the ioctl
> returns.
> 
> We try to undo the restructuring (whether split or merge) and undo any
> shareability changes on error (barring ENOMEM, in which case we leave a
> WARNing).

Since the undo itself can fail (the case where you leave a WARNing, in
patch 38 of [1]), it can lead to WARNings in the kernel with folios not
being properly added back to the filemap.

> The part we don't restore is the presence of the pages in the host or
> guest page tables. For that, our idea is that if unmapped, the next
> access will just map it in, so there's no issue there.

I don't think so. As in patch 38 of [1], on failure, it may fail to
- restore the shareability
- restore the folio's filemap status
- restore the folio's hugetlb stash metadata
- restore the folio's merged/split status
Also, the host page table is not restored.

> > My thinking is that:
> >
> > 1. guest_memfd starts shared-to-private conversion
> > 2. guest_memfd sends invalidation notifications
> >    2.1 invalidate notification --> A --> Unmap and return success
> >    2.2 invalidate notification --> B --> Unmap and return success
> >    2.3 invalidate notification --> C --> return failure
> > 3. guest_memfd finds 2.3 fails, fails shared-to-private conversion and keeps
> >    shareability as shared
> >
> > Though the GFN remains shared after 3, it's unmapped in user A and B in 2.1 and
> > 2.2. Even if additional notifications could be sent to A and B to ask for
> > mapping the GFN back, the map operation might fail. Consequently, A and B might
> > not be able to restore the mapped status of the GFN.
> 
> For conversion we don't attempt to restore mappings anywhere (whether in
> guest or host page tables). What do you think of not restoring the
> mappings?

It could cause problems if the mappings in the S-EPT can't be restored.

For TDX private-to-shared conversion, if
kvm_gmem_convert_should_proceed() --> kvm_gmem_unmap_private() -->
kvm_mmu_unmap_gfn_range() fails in the end, then the GFN's shareability is
restored to private. The next guest access to the partially unmapped
private memory can then hit a fatal error: "access before acceptance".

It could occur in a scenario like this:
1. The TD issues a TDVMCALL_MAP_GPA to convert a private GFN to shared.
2. The conversion fails in KVM.
3. set_memory_decrypted() fails in the TD.
4. The TD thinks the GFN is still accepted as private and accesses it.

> > For IOMMU mappings, this
> > could result in DMAR failure following a failed attempt to do shared-to-private
> > conversion.
> 
> I believe the current conversion setup guards against this because after
> unmapping from the host, we check for any unexpected refcounts.

Right, it's fine if we check for any unexpected refcounts.

> (This unmapping is not the unmapping we're concerned about, since this is
> shared memory, and unmapping doesn't go through TDX.)
> 
> Coming back to the refcounts, if the IOMMU had mappings, these refcounts
> are "unexpected". The conversion ioctl will return to userspace with an
> error.
> 
> IO can continue to happen, since the memory is still mapped in the
> IOMMU. The memory state is still shared. No issue there.
> 
> In RFCv2 [1], we expect userspace to see the error, then try and remove
> the memory from the IOMMU, and then try conversion again.

I don't think it's right to depend on userspace always behaving in the way
the kernel expects, i.e. retrying the conversion until it succeeds. We need
to restore the previous status (which includes the host page tables) if the
conversion can't be done.

That said, in my view, a better flow would be:

1. guest_memfd sends a pre-invalidation request to users (users here means
   the in-kernel consumers of memory allocated from guest_memfd).

2. Users (A, B, ..., X) perform pre-checks to determine whether the
   invalidation can proceed. For example, in the case of TDX, this might
   involve memory allocation and page splitting.

3. Based on the pre-check results, guest_memfd either aborts the
   invalidation or proceeds by sending the actual invalidation request.

4. Users (A-X) perform the actual unmap operation, ensuring it cannot fail.
   For TDX, the unmap must succeed unless there are bugs in KVM or the TDX
   module. In such cases, TDX can call back into guest_memfd to report the
   page's poison status or elevate the page's reference count.

5. guest_memfd completes the invalidation process. If the memory is marked
   as "poisoned", guest_memfd can handle it accordingly. If the page has an
   elevated reference count, guest_memfd may not need to take special
   action, as the elevated count prevents the OS from reallocating the page.
   (But from your reply below, it seems a callback to guest_memfd is the
   better approach.)
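To make the flow above more concrete, here is a minimal sketch of what such
a two-phase interface could look like. All names below (gmem_invalidate_ops,
gmem_user, gmem_file, gmem_convert_range, ...) are made up for illustration
and are not from any posted series:

/*
 * Minimal sketch only; all names are hypothetical. It just illustrates
 * the 5 steps above: a failable pre-check pass, then a no-fail unmap pass.
 */
struct gmem_invalidate_ops {
        /*
         * Step 2: failable pre-check. May allocate memory or pre-split
         * page tables, but must not unmap anything yet.
         */
        int (*pre_invalidate)(struct gmem_user *user, pgoff_t start,
                              pgoff_t end);
        /*
         * Step 4: the actual unmap. Must not fail. On a KVM/TDX bug, the
         * user marks the page as poisoned or holds an extra refcount
         * instead of returning an error.
         */
        void (*invalidate)(struct gmem_user *user, pgoff_t start,
                           pgoff_t end);
};

static int gmem_convert_range(struct gmem_file *f, pgoff_t start, pgoff_t end)
{
        struct gmem_user *user;
        int ret;

        /* Steps 1-3: abort if any pre-check fails; nothing is unmapped yet. */
        list_for_each_entry(user, &f->users, node) {
                ret = user->ops->pre_invalidate(user, start, end);
                if (ret)
                        return ret;
        }

        /* Step 4: no error path from here on. */
        list_for_each_entry(user, &f->users, node)
                user->ops->invalidate(user, start, end);

        /* Step 5: update shareability, handle poisoned/refcounted pages. */
        return 0;
}

The point of the split is that any operation that can fail is moved in
front of the first unmap, so a failure never leaves some users unmapped
while others are not.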
> The part in concern here is unmapping failures of private pages, for
> private-to-shared conversions, since that part goes through TDX and
> might fail.

IMO, even for TDX, the real unmap must not fail unless there are bugs in
KVM or the TDX module. So, for page splitting in the S-EPT, I prefer to try
the splitting in the pre-invalidation phase, before conducting any real
unmap.

> One other thing about taking refcounts is that in RFCv2,
> private-to-shared conversions assume that there are no refcounts on the
> private pages at all. (See filemap_remove_folio_for_restructuring() in
> [3])
> 
> Haven't had a chance to think about all the edge cases, but for now I
> think on unmapping failure, in addition to taking a refcount, we should
> return an error at least up to guest_memfd, so that guest_memfd could
> perhaps keep the refcount on that page, but drop the page from the
> filemap. Another option could be to track messed up addresses and always
> check that on conversion or something - not sure yet.

That sounds good to me. See bullet 4 in my proposed flow above.

> Either way, guest_memfd must know. If guest_memfd is not informed, on a
> next conversion request, the conversion will just spin in
> filemap_remove_folio_for_restructuring().

That makes sense.

> What do you think of this part about informing guest_memfd of the
> failure to unmap?

So, do you want to add a guest_memfd callback for this purpose?

BTW, here's an analysis of why we can't let kvm_mmu_unmap_gfn_range() and
mmu_notifier_invalidate_range_start() fail, based on the repo
https://github.com/torvalds/linux.git, commit cd2e103d57e5 ("Merge tag
'hardening-v6.16-rc1-fix1-take2' of
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux").

1. Status of mmu notifier
-------------------------------
(1) There are 34 direct callers of mmu_notifier_invalidate_range_start():

1. clear_refs_write
2. do_pagemap_scan
3. uprobe_write_opcode
4. do_huge_zero_wp_pmd
5. __split_huge_pmd (N)
6. __split_huge_pud (N)
7. move_pages_huge_pmd
8. copy_hugetlb_page_range
9. hugetlb_unshare_pmds (N)
10. hugetlb_change_protection
11. hugetlb_wp
12. unmap_hugepage_range (N)
13. move_hugetlb_page_tables
14. collapse_huge_page
15. retract_page_tables
16. collapse_pte_mapped_thp
17. write_protect_page
18. replace_page
19. madvise_free_single_vma
20. wp_clean_pre_vma
21. wp_page_copy
22. zap_page_range_single_batched (N)
23. unmap_vmas (N)
24. copy_page_range
25. remove_device_exclusive_entry
26. migrate_vma_collect
27. __migrate_device_pages
28. change_pud_range
29. move_page_tables
30. page_vma_mkclean_one
31. try_to_unmap_one
32. try_to_migrate_one
33. make_device_exclusive
34. move_pages_pte

Of these 34 direct callers, those marked with (N) cannot tolerate
mmu_notifier_invalidate_range_start() failing. I have not yet investigated
all 34 direct callers one by one, so the list of (N) is incomplete.

For 5. __split_huge_pmd(), Documentation/mm/transhuge.rst says:
"Note that split_huge_pmd() doesn't have any limitations on refcounting:
pmd can be split at any point and never fails."
This is because split_huge_pmd() is designed as a graceful fallback for
code that walks page tables but is unaware of huge PMDs.

(2) There's 1 direct caller of mmu_notifier_invalidate_range_start_nonblock(),
__oom_reap_task_mm(), which expects only the error -EAGAIN. In
mn_hlist_invalidate_range_start():
"WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);"

(3) For DMAs, drivers need to invoke pin_user_pages() to pin memory. In
that case, they don't need to register an mmu notifier. Alternatively,
device drivers can pin pages via get_user_pages*() and register mmu
notifier callbacks for the memory range; then, upon receiving an
"invalidate range" callback, they stop the device from using the range and
unpin the pages. See Documentation/core-api/pin_user_pages.rst.
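For reference, the asymmetry between (1) and (2) is visible in the entry
points themselves: only the _nonblock variant returns an error at all.
Roughly (simplified from include/linux/mmu_notifier.h at the commit above;
lockdep annotations omitted, not a verbatim copy):

static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
        might_sleep();
        if (mm_has_notifiers(range->mm)) {
                range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
                /* any notifier return value is not propagated to the caller */
                __mmu_notifier_invalidate_range_start(range);
        }
}

static inline int
mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range)
{
        int ret = 0;

        if (mm_has_notifiers(range->mm)) {
                range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
                ret = __mmu_notifier_invalidate_range_start(range);
        }
        return ret;     /* -EAGAIN only, from the OOM reaper path */
}

So making the blockable path failable would mean adding an error return to
mmu_notifier_invalidate_range_start() and auditing all 34 direct callers
above, many of which (the (N) ones) have no way to handle it.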
2. Cases that cannot tolerate failure of mmu_notifier_invalidate_range_start()
-------------------------------
(1) Error fallback cases.

1. split_huge_pmd(), as mentioned in Documentation/mm/transhuge.rst, is
   designed as a graceful fallback without failure.

split_huge_pmd
|->__split_huge_pmd
   |->mmu_notifier_range_init
   |  mmu_notifier_invalidate_range_start
   |  split_huge_pmd_locked
   |  mmu_notifier_invalidate_range_end

2. In fs/iomap/buffered-io.c, iomap_write_failed() is itself error handling.

iomap_write_failed
|->truncate_pagecache_range
|->unmap_mapping_range
|  |->unmap_mapping_pages
|  |->unmap_mapping_range_tree
|  |->unmap_mapping_range_vma
|  |->zap_page_range_single
|  |->zap_page_range_single_batched
|  |->mmu_notifier_range_init
|  |  mmu_notifier_invalidate_range_start
|  |  unmap_single_vma
|  |  mmu_notifier_invalidate_range_end
|->truncate_inode_pages_range
|->truncate_cleanup_folio
|->if (folio_mapped(folio))
|      unmap_mapping_folio(folio);
|->unmap_mapping_range_tree
|->unmap_mapping_range_vma
|->zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

3. In mm/memory.c, zap_page_range_single() is invoked to handle errors.

remap_pfn_range_notrack
|->int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
|  if (!error)
|      return 0;
|  zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

4. In kernel/events/core.c, zap_page_range_single() is invoked to clear any
   partial mappings on error.

perf_mmap
|->ret = map_range(rb, vma);
|      err = remap_pfn_range
|->if (err)
|      zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

5. In mm/memory.c, unmap_mapping_folio() is invoked to unmap a poisoned
   page.

__do_fault
|->if (unlikely(PageHWPoison(vmf->page))) {
|      vm_fault_t poisonret = VM_FAULT_HWPOISON;
|      if (ret & VM_FAULT_LOCKED) {
|          if (page_mapped(vmf->page))
|              unmap_mapping_folio(folio);
|              |->unmap_mapping_range_tree
|              |->unmap_mapping_range_vma
|              |->zap_page_range_single
|              |->zap_page_range_single_batched
|              |->mmu_notifier_range_init
|              |  mmu_notifier_invalidate_range_start
|              |  unmap_single_vma
|              |  mmu_notifier_invalidate_range_end
|          if (mapping_evict_folio(folio->mapping, folio))
|              poisonret = VM_FAULT_NOPAGE;
|          folio_unlock(folio);
|      }
|      folio_put(folio);
|      vmf->page = NULL;
|      return poisonret;
|  }

6. In mm/vma.c, in __mmap_region(), unmap_region() is invoked to undo any
   partial mapping done by a device driver.

__mmap_new_vma
|->__mmap_new_file_vma(map, vma);
|->error = mmap_file(vma->vm_file, vma);
|  if (error)
|      unmap_region
|->unmap_vmas
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end
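Taking case 3 as a concrete example, the pattern in mm/memory.c looks
roughly like this (paraphrased, not a verbatim copy): the zap, and hence
the mmu_notifier_invalidate_range_start() it issues, runs on the error
path itself, so there is no caller left to report a failure to:

int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
                            unsigned long pfn, unsigned long size,
                            pgprot_t prot)
{
        int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);

        if (!error)
                return 0;

        /*
         * A partial pfn range mapping is dangerous, so the partial
         * mapping is zapped right here on the error path -- this zap
         * (and the invalidation notification inside it) must not fail.
         */
        zap_page_range_single(vma, addr, size, NULL);
        return error;
}

The other error fallback cases above follow the same shape: the
invalidation is part of the cleanup, not of an operation that can be
retried or reported.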
(2) No-fail cases
-------------------------------
1. iput() cannot fail.

iput
|->iput_final
|->WRITE_ONCE(inode->i_state, state | I_FREEING);
|  inode_lru_list_del(inode);
|  evict(inode);
|->op->evict_inode(inode);
|->shmem_evict_inode
|->shmem_truncate_range
|->truncate_inode_pages_range
|->truncate_cleanup_folio
|->if (folio_mapped(folio))
|      unmap_mapping_folio(folio);
|->unmap_mapping_range_tree
|->unmap_mapping_range_vma
|->zap_page_range_single
|->zap_page_range_single_batched
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

2. exit_mmap() cannot fail.

exit_mmap
|->mmu_notifier_release(mm);
|->unmap_vmas(&tlb, &vmi.mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
|->mmu_notifier_range_init
|  mmu_notifier_invalidate_range_start
|  unmap_single_vma
|  mmu_notifier_invalidate_range_end

3. KVM cases that cannot tolerate unmap failure
-------------------------------
Allowing unmap operations to fail in the following scenarios would make it
very difficult or even impossible to handle the failure:

(1) __kvm_mmu_get_shadow_page() is designed to reliably obtain a shadow
page without expecting any failure.

mmu_alloc_direct_roots
|->mmu_alloc_root
|->kvm_mmu_get_shadow_page
|->__kvm_mmu_get_shadow_page
|->kvm_mmu_alloc_shadow_page
|->account_shadowed
|->kvm_mmu_slot_gfn_write_protect
|->kvm_tdp_mmu_write_protect_gfn
|->write_protect_gfn
|->tdp_mmu_iter_set_spte

(2) kvm_vfio_release() and kvm_vfio_file_del() cannot fail.

kvm_vfio_release/kvm_vfio_file_del
|->kvm_vfio_update_coherency
|->kvm_arch_unregister_noncoherent_dma
|->kvm_noncoherent_dma_assignment_start_or_stop
|->kvm_zap_gfn_range
|->kvm_tdp_mmu_zap_leafs
|->tdp_mmu_zap_leafs
|->tdp_mmu_iter_set_spte

(3) There are lots of callers of __kvm_set_or_clear_apicv_inhibit() that
currently never expect the unmap to fail.

__kvm_set_or_clear_apicv_inhibit
|->kvm_zap_gfn_range
|->kvm_tdp_mmu_zap_leafs
|->tdp_mmu_zap_leafs
|->tdp_mmu_iter_set_spte

4. Cases in KVM where it's hard to make tdp_mmu_set_spte() (which updates
an SPTE under the write mmu_lock) failable
-------------------------------
(1) kvm_vcpu_flush_tlb_guest()

kvm_vcpu_flush_tlb_guest
|->kvm_mmu_sync_roots
|->mmu_sync_children
|->kvm_vcpu_write_protect_gfn
|->kvm_mmu_slot_gfn_write_protect
|->kvm_tdp_mmu_write_protect_gfn
|->write_protect_gfn
|->tdp_mmu_iter_set_spte
|->tdp_mmu_set_spte

(2) handle_removed_pt() and handle_changed_spte().

Thanks
Yan