On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> 
> Don't explicitly pin pages when mapping pages into the S-EPT; guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().
> 
> Eliminating TDX's explicit pinning will also enable guest_memfd to support
> in-place conversion between shared and private memory[1][2].  Because KVM
> cannot distinguish between speculative/transient refcounts and the
> intentional refcount taken by TDX on private pages[3], failing to release
> a private page's refcount in TDX could cause guest_memfd to wait
> indefinitely for the refcount to drop before splitting the page.
> 
> Under normal conditions, not holding an extra page refcount in TDX is safe
> because guest_memfd ensures pages are retained until its invalidation
> notification to the KVM MMU is completed.  However, if there are bugs in
> KVM or the TDX module, not holding an extra refcount when a page is mapped
> in the S-EPT could result in a page being released from guest_memfd while
> still mapped in the S-EPT.  But doing work to make a fatal error slightly
> less fatal is a net negative when that extra work adds complexity and
> confusion.
> 
> Several approaches were considered to address the refcount issue, including:
> - Attempting to modify the KVM unmap operation to return a failure,
>   which was deemed too complex and potentially incorrect[4].
> - Increasing the folio reference count only upon S-EPT zapping failure[5].
> - Using page flags or page_ext to indicate a page is still in use by
>   TDX[6], which does not work with HVO (HugeTLB Vmemmap Optimization).
> - Setting the HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
> 
> Due to the complexity or inappropriateness of these approaches, and the
> fact that S-EPT zapping failure is currently only possible when there are
> bugs in KVM or the TDX module, which is very rare in a production kernel,
> the straightforward approach of simply not holding the page reference
> count in TDX was chosen[8].
> 
> When an S-EPT zapping error occurs, KVM_BUG_ON() is invoked to kick all
> vCPUs out of the guest and mark the VM as dead.  Although there is a
> potential window during which a private page still mapped in the S-EPT
> could be reallocated and used outside the VM, the loud warning from
> KVM_BUG_ON() should provide sufficient debug information.
> 
Yea, in the case of a bug, there could be a use-after-free.  This logic
applies to all code that has allocations, including the entire KVM MMU.  But
in this case we can actually catch the use-after-free scenario under
scrutiny rather than have it happen silently, which does not apply to all
code.  The special case here is that the use-after-free depends on TDX
module logic, which is not part of the kernel.

Yan, can you clarify what you mean by "there could be a small window"?  I'm
thinking this is a hypothetical window around vm_dead races?  Or something
more concrete?

I *don't* want to re-open the debate on whether to go with this approach,
but I think this is a good teaching edge case for settling how we want to
treat similar issues.  So I just want to make sure we have the
justification right.

> To be robust against bugs, the user can enable panic_on_warn
> as normal.
> 
> Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@xxxxxxxxxx [1]
> Link: https://youtu.be/UnBKahkAon4 [2]
> Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@xxxxxxxxxxxxxx [3]
> Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@xxxxxxxxxxxxxxxxxxxxxxxxx [4]
> Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@xxxxxxxxxxxxxxxxxxxxxxxxx [5]
> Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@xxxxxxxxxxxxxxxxxxxxxxxxx [6]
> Link: https://lore.kernel.org/all/diqzy0tikran.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx [7]
> Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@xxxxxxxxx [8]
> Suggested-by: Vishal Annapurve <vannapurve@xxxxxxxxxx>
> Suggested-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> Reviewed-by: Kai Huang <kai.huang@xxxxxxxxx>
> [sean: extract out of hugepage series, massage changelog accordingly]
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---

Discussion aside,

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
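
For readers following the thread, the "kick all vCPUs and mark the VM as
dead" behavior that both the changelog and the vm_dead question above lean
on is KVM_BUG_ON().  Below is a rough, paraphrased sketch of that machinery
(based on include/linux/kvm_host.h; details may differ from the exact
upstream definitions), plus a purely illustrative, hypothetical caller in
an S-EPT zap path, which is not code from this patch:

	/*
	 * Paraphrased sketch of KVM's "bug the VM" machinery; the real
	 * definitions live in include/linux/kvm_host.h.
	 */
	static inline void kvm_vm_dead(struct kvm *kvm)
	{
		kvm->vm_dead = true;
		/* Kick every vCPU out of the guest; re-entry is refused. */
		kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
	}

	static inline void kvm_vm_bugged(struct kvm *kvm)
	{
		kvm->vm_bugged = true;
		kvm_vm_dead(kvm);
	}

	#define KVM_BUG_ON(cond, kvm)					\
	({								\
		bool __ret = !!(cond);					\
									\
		/* Warn loudly once, then bug the VM. */		\
		if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))		\
			kvm_vm_bugged(kvm);				\
		unlikely(__ret);					\
	})

	/* Illustrative (hypothetical) use on S-EPT zap failure: */
	if (KVM_BUG_ON(err, kvm))
		return -EIO;

The loud WARN_ON_ONCE() is what the changelog relies on for debug
information, and enabling panic_on_warn (as noted above) turns that warning
into a hard stop.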