On 7/21/25 08:08, Tom Lendacky wrote:
> On 7/17/25 16:46, Kai Huang wrote:
>> This series is the latest attempt to support kexec on TDX host following
>> Dave's suggestion to use a percpu boolean to control WBINVD during
>> kexec.
>>
>> Hi Boris/Tom,
>>
>> As requested, I added the first patch to cleanup the last two 'unsigned
>> int' parameters of the relocate_kernel() into one 'unsigned int' and pass
>> flags instead. The patch 2 (patch 1 in v3) also gets updated based on
>> that. Would you help to review? Thanks.
>>
>> I tested that both normal kexec and preserve_context kexec works (using
>> the tools/testing/selftests/kexec/test_kexec_jump.sh). But I don't have
>> SME capable machine to test.
>>
>> Hi Tom, I added your Reviewed-by and Tested-by in the patch 2 anyway
>> since I believe the change is trivial and straightforward). But due to
>> the cleanup patch, I appreciate if you can help to test the first two
>> patches again. Thanks a lot!
>
> Everything is working, Thanks!

See my comments in patch #1. I didn't test with context preservation, so
that bit was never set. If it was, I think things would have failed.

Thanks,
Tom

>
> Tom
>
>>
>> v3 -> v4:
>>  - Rebase to latest tip/master.
>>  - Add a cleanup patch to consolidate relocate_kernel()'s last two
>>    function parameters -- Boris.
>>  - Address comments received -- please see individual patches.
>>  - Collect tags (Tom, Rick, binbin).
>>
>> v3: https://lore.kernel.org/kvm/cover.1750934177.git.kai.huang@xxxxxxxxx/
>>
>> v2 -> v3 (all trivial changes):
>>
>>  - Rebase on latest tip/master
>>  - change to use __always_inline for do_seamcall() in patch 2
>>  - Update patch 2 (changelog and code comment) to remove the sentence
>>    which says "not all SEAMCALLs generate dirty cachelines of TDX
>>    private memory but just treat all of them do." -- Dave.
>>  - Add Farrah's Tested-by for all TDX patches.
>>
>> The v2 had one informal RFC patch appended to show "some optimization"
>> which can move WBINVD from the kexec phase to an early stage in KVM.
>> Paolo commented and Acked that patch (thanks!), so this v3 made that
>> patch as a formal one (patch 6). But technically it is not absolutely
>> needed in this series but can be done in the future.
>>
>> More history info can be found in v2:
>>
>> https://lore.kernel.org/lkml/cover.1746874095.git.kai.huang@xxxxxxxxx/
>>
>> === More information ===
>>
>> TDX private memory is memory that is encrypted with private Host Key IDs
>> (HKID). If the kernel has ever enabled TDX, part of system memory
>> remains TDX private memory when kexec happens. E.g., the PAMT (Physical
>> Address Metadata Table) pages used by the TDX module to track each TDX
>> memory page's state are never freed once the TDX module is initialized.
>> TDX guests also have guest private memory and secure-EPT pages.
>>
>> After kexec, the new kernel will have no knowledge of which memory page
>> was used as TDX private page and can use all memory as regular memory.
>>
>> 1) Cache flush
>>
>> Per TDX 1.5 base spec "8.6.1.Platforms not Using ACT: Required Cache
>> Flush and Initialization by the Host VMM", to support kexec for TDX, the
>> kernel needs to flush cache to make sure there's no dirty cachelines of
>> TDX private memory left over to the new kernel (when the TDX module
>> reports TDX_FEATURES.CLFLUSH_BEFORE_ALLOC as 1 in the global metadata for
>> the platform). The kernel also needs to make sure there's no more TDX
>> activity (no SEAMCALL) after cache flush so that no new dirty cachelines
>> of TDX private memory are generated.
>>
>> SME has similar requirement. SME kexec support uses WBINVD to do the
>> cache flush. WBINVD is able to flush cachelines associated with any
>> HKID. Reuse the WBINVD introduced by SME to flush cache for TDX.
>>
>> Currently the kernel explicitly checks whether the hardware supports SME
>> and only does WBINVD if true. Instead of adding yet another TDX
>> specific check, this series uses a percpu boolean to indicate whether
>> WBINVD is needed on that CPU during kexec.
>>
>> 2) Reset TDX private memory using MOVDIR64B
>>
>> The TDX spec (the aforementioned section) also suggests the kernel
>> *should* use MOVDIR64B to clear TDX private page before the kernel
>> reuses it as regular one.
>>
>> However, in reality the situation can be more flexible. Per TDX 1.5
>> base spec ("Table 16.2: Non-ACT Platforms Checks on Memory Reads in Ci
>> Mode" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
>> Mode"), the read/write to TDX private memory using shared KeyID without
>> integrity check enabled will not poison the memory and cause machine
>> check.
>>
>> Note on the platforms with ACT (Access Control Table), there's no
>> integrity check involved thus no machine check is possible to happen due
>> to memory read/write using different KeyIDs.
>>
>> KeyID 0 (TME key) doesn't support integrity check. This series chooses
>> to NOT reset TDX private memory but leave TDX private memory as-is to the
>> new kernel. As mentioned above, in practice it is safe to do so.
>>
>> 3) One limitation
>>
>> If the kernel has ever enabled TDX, after kexec the new kernel won't be
>> able to use TDX anymore. This is because when the new kernel tries to
>> initialize TDX module it will fail on the first SEAMCALL due to the
>> module has already been initialized by the old kernel.
>>
>> More (non-trivial) work will be needed for the new kernel to use TDX,
>> e.g., one solution is to just reload the TDX module from the location
>> where BIOS loads the TDX module (/boot/efi/EFI/TDX/). This series
>> doesn't cover this, but leave this as future work.
>>
>> 4) Kdump support
>>
>> This series also enables kdump with TDX, but no special handling is
>> needed for crash kexec (except turning on the Kconfig option):
>>
>>  - kdump kernel uses reserved memory from the old kernel as system ram,
>>    and the old kernel will never use the reserved memory as TDX memory.
>>  - /proc/vmcore contains TDX private memory pages. It's meaningless to
>>    read them, but it doesn't do any harm either.
>>
>> 5) TDX "partial write machine check" erratum
>>
>> On the platform with TDX erratum, a partial write (a write transaction
>> of less than a cacheline lands at memory controller) to TDX private
>> memory poisons that memory, and a subsequent read triggers machine
>> check. On those platforms, the kernel needs to reset TDX private memory
>> before jumping to the new kernel otherwise the new kernel may see
>> unexpected machine check.
>>
>> The kernel currently doesn't track which page is TDX private memory.
>> It's not trivial to reset TDX private memory. For simplicity, this
>> series simply disables kexec/kdump for such platforms. This can be
>> enhanced in the future.
>>
>>
>>
>> Kai Huang (7):
>>   x86/kexec: Consolidate relocate_kernel() function parameters
>>   x86/sme: Use percpu boolean to control WBINVD during kexec
>>   x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
>>   x86/kexec: Disable kexec/kdump on platforms with TDX partial write
>>     erratum
>>   x86/virt/tdx: Remove the !KEXEC_CORE dependency
>>   x86/virt/tdx: Update the kexec section in the TDX documentation
>>   KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
>>
>>  Documentation/arch/x86/tdx.rst       | 14 ++++-----
>>  arch/x86/Kconfig                     |  1 -
>>  arch/x86/include/asm/kexec.h         | 12 ++++++--
>>  arch/x86/include/asm/processor.h     |  2 ++
>>  arch/x86/include/asm/tdx.h           | 31 +++++++++++++++++++-
>>  arch/x86/kernel/cpu/amd.c            | 17 +++++++++++
>>  arch/x86/kernel/machine_kexec_64.c   | 43 ++++++++++++++++++++++------
>>  arch/x86/kernel/process.c            | 24 +++++++---------
>>  arch/x86/kernel/relocate_kernel_64.S | 30 +++++++++++--------
>>  arch/x86/kvm/vmx/tdx.c               | 12 ++++++++
>>  arch/x86/virt/vmx/tdx/tdx.c          | 16 +++++++++--
>>  11 files changed, 155 insertions(+), 47 deletions(-)
>>
>>
>> base-commit: e180b3a224cb519388c2f61ca7bc1eaf94cec1fb
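
For anyone skimming the thread, the two mechanisms the cover letter
describes (packing relocate_kernel()'s last two 'unsigned int' parameters
into one flags word, and a per-CPU boolean that decides whether WBINVD is
needed at kexec time) can be sketched in userspace C roughly as below.
This is not code from the series. Every sketch_* and SKETCH_* name is
hypothetical, a plain array stands in for a real per-CPU variable, and a
counter stands in for the WBINVD instruction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the names used in the actual patch may differ. */
#define SKETCH_RELOC_PRESERVE_CONTEXT (1u << 0)
#define SKETCH_RELOC_HOST_MEM_ENC     (1u << 1)

#define SKETCH_NR_CPUS 4

/* Stand-in for a per-CPU boolean: one slot per CPU. */
static bool sketch_cache_incoherent[SKETCH_NR_CPUS];

/* Counts simulated cache flushes; a kernel would execute WBINVD instead. */
static unsigned int sketch_wbinvd_count;

/* Fold the two former 'unsigned int' arguments into a single flags word. */
static unsigned int sketch_build_flags(bool preserve_context, bool mem_enc)
{
	unsigned int flags = 0;

	if (preserve_context)
		flags |= SKETCH_RELOC_PRESERVE_CONTEXT;
	if (mem_enc)
		flags |= SKETCH_RELOC_HOST_MEM_ENC;
	return flags;
}

/* Called where a CPU may have dirtied encrypted cachelines (SME or TDX). */
static void sketch_mark_incoherent(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	sketch_cache_incoherent[cpu] = true;
}

/* Called on each CPU on the kexec path: flush only if that CPU needs it. */
static void sketch_flush_if_needed(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	if (sketch_cache_incoherent[cpu]) {
		sketch_wbinvd_count++;
		sketch_cache_incoherent[cpu] = false;
	}
}
```

The point of the per-CPU flag in this sketch is that the kexec path no
longer asks "is this SME?" or "is this TDX?"; whichever feature can leave
dirty encrypted cachelines sets the flag, and the flush site checks only
the flag.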