This is an RFC series to support huge pages in TDX. It's an evolution
of the previous patches from Isaku [0]. (Please find the main changes
to [0] in a later section.)

As the series enabling guest_memfd to support 1GB huge pages with
in-place conversion [1] is still under development, we temporarily
based the TDX work on top of the series from Michael Roth that enables
basic 2M guest_memfd support without in-place conversion [2]. The goal
is to have an early review and discussion of the TDX huge page work
(including changes to the KVM core MMU and the TDX-specific code),
which should remain stable, with only minor adjustments, regardless of
the changes coming in guest_memfd.

The series is currently focused on supporting 2MB huge pages only.

Tip folks, there are some SEAMCALL wrapper changes in this series, but
we still need some discussion on the KVM side to figure out what is
needed. Please feel free to ignore them for now.

Design
======
guest_memfd
-----------
TDX huge page support makes a basic assumption about guest_memfd:
guest_memfd allocates private huge pages whenever the alignment of
GFN/index, the range size, and the consistency of page attributes
allow. Patch 01 (based on [2]) in this RFC acts as glue code to ensure
this assumption is met for TDX. It can be absorbed into any future
guest_memfd series (e.g., a future in-place conversion series) in any
form.

TDX interacts with guest_memfd through the interfaces
kvm_gmem_populate() and kvm_gmem_get_pfn(), obtaining the allocated
page and its order. The remaining TDX code should stay stable despite
future changes in guest_memfd.

Basic huge page mapping/unmapping
---------------------------------
- TD build time

  This series enforces that all private mappings be 4KB during the TD
  build phase, due to the TDX module's requirement that
  tdh_mem_page_add(), the SEAMCALL for adding private pages during TD
  build time, only supports 4KB mappings. Enforcing 4KB mappings also
  simplifies the implementation of the TD build time code by
  eliminating the need to consider merging or splitting in the mirror
  page table during that phase.

  The underlying pages allocated from guest_memfd during the TD build
  phase can still be large, allowing for potential merging into 2MB
  mappings once the TD is running.

- TD runtime

  This series allows a private fault's max_level to be 2MB after the
  TD is running. The KVM core MMU maps/unmaps 2MB mappings in the
  mirror page table according to a fault's goal_level, as is done for
  normal VMs. Changes in the mirror page table are then propagated to
  the S-EPT.

  For transitions from non-present to a huge leaf in the mirror page
  table, the hook set_external_spte is invoked, leading to the
  execution of tdh_mem_page_aug() to install a huge leaf in the S-EPT.
  Conversely, during transitions from a huge leaf to non-present, the
  remove_external_spte hook is invoked to execute SEAMCALLs that
  remove the huge leaf from the S-EPT.

  (For transitions from a huge leaf to non-leaf, or from non-leaf to a
  huge leaf, SPTE splitting/merging is triggered. More details are in
  later sections.)

- Specify fault max_level

  In the TDP MMU, a fault's max_level is initially set to the 1GB
  level for x86. KVM then updates the fault's max_level to the lowest
  level among fault->max_level, the level matching the order of the
  allocated private page, and the TDX-specified max_level from the
  hook private_max_mapping_level.

  For TDX, a private fault's req_level and goal_level finally equal
  the fault's max_level, as TDX platforms do not have the NX huge page
  flaw.
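  For illustration only, this narrowing could look roughly like the
  sketch below. The helper order_to_level() is made up for the sketch,
  and the hook invocation just mirrors this series' addition of "vcpu"
  and "gfn" parameters, so treat the exact signature as an assumption:

  /*
   * Illustrative sketch: clamp a private fault's max_level to the
   * smallest of the TDP MMU's initial value, the level implied by the
   * guest_memfd allocation order, and the vendor-reported limit.
   * order_to_level() is a hypothetical helper.
   */
  static u8 private_fault_max_level(struct kvm_vcpu *vcpu,
                                    struct kvm_page_fault *fault,
                                    int gmem_order)
  {
          u8 max_level = fault->max_level;  /* starts at the 1GB level */

          /* Clamp to what guest_memfd actually allocated. */
          max_level = min_t(u8, max_level, order_to_level(gmem_order));

          /* Clamp to the vendor limit, e.g. TDX's ACCEPT level. */
          max_level = min_t(u8, max_level,
                            kvm_x86_call(private_max_mapping_level)(vcpu,
                                          fault->pfn, fault->gfn));
          return max_level;
  }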
So, if TDX has specific requirements to influence a fault's goal_level
for private memory (e.g., if it knows an EPT violation is caused by a
TD's ACCEPT operation, mapping at the ACCEPT's level is preferred),
this can be achieved either by affecting the initial value of
fault->max_level or through the private_max_mapping_level hook. The
former approach requires more changes in the KVM core (e.g., by using
some bits in the error_code passed to kvm_mmu_page_fault() and having
KVM check for them). This RFC opts for the latter, simpler method,
using the private_max_mapping_level hook.

Page splitting (page demotion)
------------------------------
Page splitting occurs in two paths:

(a) With exclusive kvm->mmu_lock, triggered by zapping operations.

    For normal VMs, if zapping a narrow region would require splitting
    a huge page, KVM can simply zap the surrounding GFNs rather than
    splitting the huge page. The pages can then be faulted back in,
    where KVM can handle mapping them at a 4KB level.

    The reason TDX can't use the normal-VM solution is that zapping
    private memory that has been accepted cannot easily be re-faulted,
    since it can only be re-faulted as unaccepted. So KVM sometimes has
    to do page splitting as part of the zapping operations.

    These zapping operations can occur for a few reasons:
    1. VM teardown.
    2. Memslot removal.
    3. Conversion of private pages to shared.
    4. Userspace does a hole punch to guest_memfd for some reason.

    For cases 1 and 2, splitting before zapping is unnecessary because
    either the entire range will be zapped or huge pages do not span
    memslots. Cases 3 and 4 require splitting, which is also followed
    by a backend page splitting in guest_memfd.

(b) With shared kvm->mmu_lock, triggered by fault.

    Splitting in this path is not accompanied by a backend page
    splitting (since backend page splitting necessitates a splitting
    and zapping operation in the former path). It is triggered when
    KVM finds that a non-leaf entry is replacing a huge entry in the
    fault path, which is usually caused by vCPUs' concurrent ACCEPT
    operations at different levels.

    This series simply ignores the splitting request in the fault path
    to avoid unnecessary bounces between levels. The vCPU that
    performs ACCEPT at a lower level will eventually figure out that
    the page has been accepted at a higher level by another vCPU.

    A rare case that could lead to splitting in the fault path is when
    a TD is configured to receive #VE and accesses memory before the
    ACCEPT operation. By the time a vCPU accesses a private GFN, due
    to the lack of any guest-preferred level, KVM could create a
    mapping at the 2MB level. If the TD then only performs the ACCEPT
    operation at the 4KB level, splitting in the fault path will be
    triggered. However, this is not regarded as a typical use case, as
    a TD usually accepts pages in the order 1GB->2MB->4KB. The worst
    outcome of ignoring the resulting splitting request is an endless
    EPT violation. This would not happen for a Linux guest, which does
    not expect any #VE.

- Splitting for private-to-shared conversion or punch hole

  Splitting of a huge mapping requires the allocation of a page table
  page and the corresponding shadow structures. This memory allocation
  can fail. So, while the zapping operations in the two scenarios have
  no notion of failure, the overall operations do. Therefore, the RFC
  introduces a separate step kvm_split_boundary_leafs() to split huge
  mappings ahead of the zapping operation, as sketched below. Patches
  16-17 implement this change.
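  For illustration, the split-before-zap ordering could look roughly
  like the sketch below. The wrapper function is made up for the
  example; kvm_split_boundary_leafs() and kvm_mmu_unmap_gfn_range()
  are the names used in this series:

  /*
   * Illustrative-only flow: split huge leafs crossing the range
   * boundary before zapping, so that an allocation failure can be
   * reported to the ioctl before anything has been zapped.
   * (TLB flush handling is elided for brevity.)
   */
  static int split_then_zap(struct kvm *kvm, struct kvm_gfn_range *range)
  {
          int ret;

          /* May allocate page table pages and can therefore fail. */
          ret = kvm_split_boundary_leafs(kvm, range);
          if (ret)
                  return ret;

          /* The zapping itself is not allowed to fail. */
          kvm_mmu_unmap_gfn_range(kvm, range);
          return 0;
  }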
As noted in the patch log, the downside of the current approach is
that, although kvm_split_boundary_leafs() is invoked before
kvm_unmap_gfn_range() for each GFN range, the entire zapping range may
consist of several GFN ranges. If an out-of-memory error occurs during
the splitting of one GFN range, some previous GFN ranges may have been
successfully split and zapped, even though their page attributes
remain unchanged due to the splitting failure. This may not be a
significant issue, as the user can retry the ioctl to split and zap
the full range. However, if it becomes problematic, further
modifications to invoke kvm_unmap_gfn_range() after executing
kvm_mmu_invalidate_range_add() and kvm_split_boundary_leafs() for all
GFN ranges could address the problem.

Alternatively, a possible solution could be to pre-allocate
sufficiently large splitting caches at the start of the
private-to-shared conversion or hole punch process. The downside is
that this may allocate more memory than necessary and require more
code changes.

- The full call stack for huge page splitting

  With exclusive kvm->mmu_lock:

  kvm_vm_set_mem_attributes/kvm_gmem_punch_hole
  |kvm_split_boundary_leafs
  | |kvm_tdp_mmu_gfn_range_split_boundary
  |   |tdp_mmu_split_boundary_leafs
  |     |tdp_mmu_alloc_sp_for_split
  |     |tdp_mmu_split_huge_page
  |       |tdp_mmu_link_sp
  |         |tdp_mmu_iter_set_spte
  |           |tdp_mmu_set_spte
  |             |split_external_spt
  |               |kvm_x86_split_external_spt
  |                 BLOCK, TRACK, DEMOTION
  |kvm_mmu_unmap_gfn_range

  With shared kvm->mmu_lock:

  kvm_tdp_mmu_map
  |tdp_mmu_alloc_sp
  | |kvm_mmu_alloc_external_spt
  |tdp_mmu_split_huge_page
    |tdp_mmu_link_sp
      |tdp_mmu_set_spte_atomic
        |__tdp_mmu_set_spte_atomic
          |set_external_spte_present
            |split_external_spt
              |kvm_x86_split_external_spt

- Handle busy & errors

  Splitting huge mappings in the S-EPT requires executing
  tdh_mem_range_block(), tdh_mem_track(), kicking off vCPUs, and
  tdh_mem_page_demote() in sequence. Possible errors during the
  process include TDX_OPERAND_BUSY or TDX_INTERRUPTED_RESTARTABLE.

  With exclusive kvm->mmu_lock, TDX_OPERAND_BUSY can be handled
  similarly to removing a private page, i.e., by kicking off all vCPUs
  and retrying, which should succeed on the second attempt.

  TDX_INTERRUPTED_RESTARTABLE occurs when there is a pending interrupt
  on the host side during the SEAMCALL tdh_mem_page_demote(). The
  approach is to retry indefinitely in KVM on
  TDX_INTERRUPTED_RESTARTABLE, because the interrupts are for the host
  only in the current exclusive kvm->mmu_lock path.
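  A rough sketch of this retry policy follows. The
  tdh_mem_page_demote() arguments are assumptions for the sketch, and
  real code would also need to mask operand details out of the error
  code before comparing:

  /*
   * Illustrative retry loop for huge page demotion under exclusive
   * kvm->mmu_lock.
   */
  static int tdx_demote_with_retries(struct kvm *kvm, gfn_t gfn,
                                     int level, struct page *sept_page)
  {
          bool kicked = false;
          u64 err;

          for (;;) {
                  /* Hypothetical wrapper signature. */
                  err = tdh_mem_page_demote(kvm, gfn, level, sept_page);

                  /* A host interrupt was pending; simply retry. */
                  if (err == TDX_INTERRUPTED_RESTARTABLE)
                          continue;

                  /*
                   * On contention, kick all vCPUs out of the guest and
                   * retry; the second attempt should succeed.
                   */
                  if (err == TDX_OPERAND_BUSY && !kicked) {
                          kvm_make_all_cpus_request(kvm,
                                          KVM_REQ_OUTSIDE_GUEST_MODE);
                          kicked = true;
                          continue;
                  }

                  return err ? -EIO : 0;
          }
  }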
Page merging (page promotion)
-----------------------------
The RFC disallows page merging on the mirror page table.

Unlike normal VMs, private memory in TDX requires the guest's ACCEPT
operation. Therefore, transitioning from a non-leaf entry to a huge
leaf entry in the S-EPT requires the non-leaf entry to be initially
populated with small child entries, all in PENDING or ACCEPTED status.
Subsequently, the merged huge leaf can be set to either PENDING or
ACCEPTED status. Therefore, counter-intuitively, converting a partial
range (e.g., one 4KB page) of a 2MB range from private to shared and
then converting it back to private does not result in a successful
page promotion in the S-EPT.

After converting a shared 4KB page back to private:
a) Linux guest: accepts the 4KB page prior to accessing memory,
   prompting KVM to map it at the 4KB level, which prevents further
   EPT violations and avoids triggering page promotion.
b) Non-Linux guest: may access the page before executing the ACCEPT
   operation. KVM identifies that the physical page is 2MB contiguous
   and maps it at 2MB, causing a non-leaf to leaf transition in the
   mirror page table. However, after the preparation step, only 511
   child entries in the S-EPT are in ACCEPTED status, with 1 newly
   mapped entry in PENDING status. The promotion request to the S-EPT
   fails due to this mixed status. If KVM re-enters the guest and
   triggers a #VE for the guest to accept the page, the guest must
   accept the page at the 4KB level, as no 2MB mapping is available.
   After the ACCEPT operation, no further EPT violations occur to
   trigger page promotion.

So, also to avoid the comprehensive BUSY handling and rollback code
required under the shared kvm->mmu_lock, the RFC disallows page
merging on the mirror page table. This should have minimal performance
impact in practice, as no page merging has been observed so far for a
real guest, except in the selftests.

Patches layout
==============
Patch 01: Glue code to [2]. It allows kvm_gmem_populate() and
          kvm_gmem_get_pfn() to get a 2MB private huge page from
          guest_memfd whenever GFN/index alignment, remaining size,
          and page attribute layout allow. Though this patch may not
          be needed once guest_memfd supports in-place conversion in
          the future, guest_memfd will need to ensure something
          similar.
Patches 02-03: SEAMCALL changes under x86/virt.
Patches 04-09: Basic private huge page mapping/unmapping.
          04: For build time, no huge pages, forced to 4KB.
          05-07: Enhancements of tdx_clear_page(), tdx_reclaim_page(),
                 and tdx_wbinvd_page() to handle huge pages.
          08: Inc/dec folio ref count for huge pages. The increasing
              of the private folio ref count should be dropped once
              guest_memfd supports in-place conversion. TDX will then
              only acquire the private folio ref count upon errors
              during the page removing/reclaiming stage.
          09: Turn on mapping/unmapping of huge pages for TD runtime.
Patch 10: Disallow page merging in the mirror page table.
Patches 11-12: Allow the guest's ACCEPT level to determine page
          mapping size.
Patches 13-19: Basic page splitting support (with exclusive
          kvm->mmu_lock).
          13: Enhance tdp_mmu_alloc_sp_split() for external page
              tables.
          14: Add code to propagate splitting requests to the external
              page table in tdp_mmu_set_spte(), which updates SPTEs
              under exclusive kvm->mmu_lock.
          15: TDX's counterpart to patch 14. Implementation of the
              hook split_external_spt.
          16-19: Split private huge pages for private-to-shared
                 conversion and punch hole.
Patches 20-21: Ignore page splitting requests with shared
          kvm->mmu_lock.

Main changes to [0]:
====================
- Disallow huge mappings at TD build time.
- Use the hook private_max_mapping_level to convey TDX's mapping level
  info, instead of having the KVM MMU core check certain bits in
  error_code to determine a fault's max_level.
- Move tdh_mem_range_block() for page splitting to TDX's
  implementation of the hook split_external_spt.
- Do page splitting before tdp_mmu_zap_leafs(). So, instead of
  BUG_ON() in tdp_mmu_zap_leafs(), an out-of-memory failure during
  splitting can fail the ioctl KVM_SET_MEMORY_ATTRIBUTES or the punch
  hole.
- Restrict page splitting to be under exclusive kvm->mmu_lock and
  ignore page splitting requests under shared kvm->mmu_lock.
- Drop page merging support.

Testing
-------
The series is based on kvm/next. This patchset is also available
at: [3]

It is able to launch TDs with page demotion working correctly. Though
we are still unable to trigger page promotion with a Linux guest, the
page promotion code has been tested working with a selftest.
It's possible to check the huge mapping count in KVM at runtime via
/sys/kernel/debug/kvm/pages_2m. (Though this node includes the huge
mapping count for both shared and private memory, currently there are
not many shared huge pages. In the future, guest_memfd in-place
conversion will require all shared pages to be 4KB, so there is no
need to expand this interface.)

[0] https://lore.kernel.org/all/cover.1708933624.git.isaku.yamahata@xxxxxxxxx
[1] https://lore.kernel.org/lkml/cover.1726009989.git.ackerleytng@xxxxxxxxxx
[2] https://lore.kernel.org/all/20241212063635.712877-1-michael.roth@xxxxxxx
[3] https://github.com/intel/tdx/tree/huge_page_kvm_next_2025_04_23

Edgecombe, Rick P (1):
  KVM: x86/mmu: Disallow page merging (huge page adjustment) for
    mirror root

Isaku Yamahata (1):
  KVM: x86/tdp_mmu: Alloc external_spt page for mirror page table
    splitting

Xiaoyao Li (5):
  x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_page_demote()
  KVM: TDX: Enhance tdx_clear_page() to support huge pages
  KVM: TDX: Assert the reclaimed pages were mapped as expected
  KVM: TDX: Add a helper for WBINVD on huge pages with TD's keyID
  KVM: TDX: Support huge page splitting with exclusive kvm->mmu_lock

Yan Zhao (14):
  KVM: gmem: Allocate 2M huge page from guest_memfd backend
  x86/virt/tdx: Enhance tdh_mem_page_aug() to support huge pages
  KVM: TDX: Enforce 4KB mapping level during TD build time
  KVM: TDX: Increase/decrease folio ref for huge pages
  KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE
  KVM: x86: Add "vcpu" "gfn" parameters to x86 hook
    private_max_mapping_level
  KVM: TDX: Determine max mapping level according to vCPU's ACCEPT
    level
  KVM: x86/tdp_mmu: Invoke split_external_spt hook with exclusive
    mmu_lock
  KVM: x86/mmu: Introduce kvm_split_boundary_leafs() to split boundary
    leafs
  KVM: Change the return type of gfn_handler_t() from bool to int
  KVM: x86: Split huge boundary leafs before private to shared
    conversion
  KVM: gmem: Split huge boundary leafs for punch hole of private
    memory
  KVM: x86: Force a prefetch fault's max mapping level to 4KB for TDX
  KVM: x86: Ignore splitting huge pages in fault path for TDX

 arch/arm64/kvm/mmu.c               |   4 +-
 arch/loongarch/kvm/mmu.c           |   4 +-
 arch/mips/kvm/mmu.c                |   4 +-
 arch/powerpc/kvm/book3s.c          |   4 +-
 arch/powerpc/kvm/e500_mmu_host.c   |   4 +-
 arch/riscv/kvm/mmu.c               |   4 +-
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   7 +-
 arch/x86/include/asm/tdx.h         |   2 +
 arch/x86/kvm/mmu/mmu.c             |  67 +++++---
 arch/x86/kvm/mmu/mmu_internal.h    |   2 +-
 arch/x86/kvm/mmu/paging_tmpl.h     |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 200 +++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h         |   1 +
 arch/x86/kvm/svm/sev.c             |   5 +-
 arch/x86/kvm/svm/svm.h             |   5 +-
 arch/x86/kvm/vmx/main.c            |   8 +-
 arch/x86/kvm/vmx/tdx.c             | 244 +++++++++++++++++++++++------
 arch/x86/kvm/vmx/tdx.h             |   4 +
 arch/x86/kvm/vmx/tdx_arch.h        |   3 +
 arch/x86/kvm/vmx/tdx_errno.h       |   1 +
 arch/x86/kvm/vmx/x86_ops.h         |  14 +-
 arch/x86/virt/vmx/tdx/tdx.c        |  31 +++-
 arch/x86/virt/vmx/tdx/tdx.h        |   1 +
 include/linux/kvm_host.h           |  13 +-
 virt/kvm/guest_memfd.c             | 183 ++++++++++-------------
 virt/kvm/kvm_main.c                |  38 +++--
 27 files changed, 612 insertions(+), 244 deletions(-)

-- 
2.43.2