On 03/07/2025 20:06, Vishal Annapurve wrote:
> On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> Avoid clearing reclaimed TDX private pages unless the platform is affected
>> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
>> time on unaffected systems.
>>
>> Background
>>
>> KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
>>
>>    - Clears the TD Owner bit (which identifies TDX private memory) and
>>      integrity metadata without triggering integrity violations.
>>    - Clears poison from cache lines without consuming it, avoiding MCEs on
>>      access (refer TDX Module Base spec. 16.5. Handling Machine Check
>>      Events during Guest TD Operation).
>>
>> The TDX module also uses MOVDIR64B to initialize private pages before use.
>> If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
>> However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
>> ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
>>
>> In contrast, when private pages are reclaimed, the TDX Module handles
>> flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
>>
>> Problem
>>
>> Clearing all private pages during VM shutdown is costly. For guests
>> with a large amount of memory it can take minutes.
>>
>> Solution
>>
>> TDX Module Base Architecture spec. documents that private pages reclaimed
>> from a TD should be initialized using MOVDIR64B, in order to avoid
>> integrity violation or TD bit mismatch detection when later being read
>> using a shared HKID, refer April 2025 spec. "Page Initialization" in
>> section "8.6.2. Platforms not Using ACT: Required Cache Flush and
>> Initialization by the Host VMM"
>>
>> That is an overstatement and will be clarified in coming versions of the
>> spec.
>> In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
>> Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
>> Mode" in the same spec, there is no issue accessing such reclaimed pages
>> using a shared key that does not have integrity enabled. Linux always uses
>> KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
>> which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
>> description in "Intel Architecture Memory Encryption Technologies" spec
>> version 1.6 April 2025. So there is no need to clear pages to avoid
>> integrity violations.
>>
>> There remains a risk of poison consumption. However, in the context of
>> TDX, it is expected that there would be a machine check associated with the
>> original poisoning. On some platforms that results in a panic. However
>> platforms may support "SEAM_NR" Machine Check capability, in which case
>> Linux machine check handler marks the page as poisoned, which prevents it
>> from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
>> Implement recovery for errors in TDX/SEAM non-root mode")
>>
>> Improvement
>>
>> By skipping the clearing step on unaffected platforms, shutdown time
>> can improve by up to 40%.
>
> This patch looks good to me.
>
> I would like to raise a related topic, is there any requirement for
> zeroing pages on conversion from private to shared before
> userspace/guest faults in the gpa ranges as shared?

For TDX, clearing must still be done for platforms with the partial-write
erratum (SPR and EMR).

>
> If the answer is no for all CoCo architectures then guest_memfd can
> simply just zero pages on allocation for all it's users and not worry
> about zeroing later.

In fact TDX does not need private pages to be zeroed on allocation
because the TDX Module always does that.
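
To make the resulting policy concrete, here is a rough user-space model of
it, not the kernel code itself. All names are hypothetical:
model_has_pw_mce_erratum stands in for a check of the X86_BUG_TDX_PW_MCE
bug flag, and memset() stands in for the MOVDIR64B-based clearing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Hypothetical stand-in for the platform erratum check
 * (X86_BUG_TDX_PW_MCE in the patch description).
 */
static bool model_has_pw_mce_erratum;

/*
 * Model of handling a private page reclaimed from a TD.
 * Returns true if the page was cleared, false if clearing was skipped.
 */
static bool model_reset_reclaimed_page(unsigned char *page)
{
	assert(page != NULL);

	/* Unaffected platforms: skip the costly clearing step. */
	if (!model_has_pw_mce_erratum)
		return false;

	/*
	 * Affected platforms (partial-write erratum): the page must still
	 * be cleared. The real code uses MOVDIR64B; memset() is only a
	 * placeholder in this model.
	 */
	memset(page, 0, MODEL_PAGE_SIZE);
	return true;
}
```

With the flag clear, the page contents are left alone and only the
erratum case pays for the clearing, which is where the shutdown time
saving comes from.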