On Thu, Jul 3, 2025 at 10:38 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 03/07/2025 20:06, Vishal Annapurve wrote:
> > On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >>
> >> Avoid clearing reclaimed TDX private pages unless the platform is affected
> >> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> >> time on unaffected systems.
> >>
> >> Background
> >>
> >> KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
> >>
> >> - Clears the TD Owner bit (which identifies TDX private memory) and
> >>   integrity metadata without triggering integrity violations.
> >> - Clears poison from cache lines without consuming it, avoiding MCEs on
> >>   access (refer TDX Module Base spec. 16.5. Handling Machine Check
> >>   Events during Guest TD Operation).
> >>
> >> The TDX module also uses MOVDIR64B to initialize private pages before use.
> >> If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
> >> However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
> >> ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
> >>
> >> In contrast, when private pages are reclaimed, the TDX Module handles
> >> flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
> >>
> >> Problem
> >>
> >> Clearing all private pages during VM shutdown is costly. For guests
> >> with a large amount of memory it can take minutes.
> >>
> >> Solution
> >>
> >> TDX Module Base Architecture spec. documents that private pages reclaimed
> >> from a TD should be initialized using MOVDIR64B, in order to avoid
> >> integrity violation or TD bit mismatch detection when later being read
> >> using a shared HKID, refer April 2025 spec. "Page Initialization" in
> >> section "8.6.2. Platforms not Using ACT: Required Cache Flush and
> >> Initialization by the Host VMM"
> >>
> >> That is an overstatement and will be clarified in coming versions of the
> >> spec.
> >> In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
> >> Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
> >> Mode" in the same spec, there is no issue accessing such reclaimed pages
> >> using a shared key that does not have integrity enabled. Linux always uses
> >> KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
> >> which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
> >> description in "Intel Architecture Memory Encryption Technologies" spec
> >> version 1.6 April 2025. So there is no need to clear pages to avoid
> >> integrity violations.
> >>
> >> There remains a risk of poison consumption. However, in the context of
> >> TDX, it is expected that there would be a machine check associated with the
> >> original poisoning. On some platforms that results in a panic. However
> >> platforms may support "SEAM_NR" Machine Check capability, in which case
> >> Linux machine check handler marks the page as poisoned, which prevents it
> >> from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
> >> Implement recovery for errors in TDX/SEAM non-root mode")
> >>
> >> Improvement
> >>
> >> By skipping the clearing step on unaffected platforms, shutdown time
> >> can improve by up to 40%.
> >
> > This patch looks good to me.
> >
> > I would like to raise a related topic, is there any requirement for
> > zeroing pages on conversion from private to shared before
> > userspace/guest faults in the gpa ranges as shared?
>
> For TDX, clearing must still be done for platforms with the
> partial-write errata (SPR and EMR).
>

So I take it that vmm/guest_memfd can safely assume no responsibility of
clearing contents on conversion outside of the X86_BUG_TDX_PW_MCE
scenario, given that the spec doesn't dictate initial contents of
converted memory and no guest/host software should depend on the
initial values after conversion.

> >
> > If the answer is no for all CoCo architectures then guest_memfd can
> > simply just zero pages on allocation for all it's users and not worry
> > about zeroing later.
>
> In fact TDX does not need private pages to be zeroed on allocation
> because the TDX Module always does that.
>

guest_memfd allocated pages may get faulted in as shared first. To keep
things simple, guest_memfd can start with the "just zero on
allocation" policy which works for all current/future CoCo/non-CoCo
users of guest_memfd and we can later iterate with any arch-specific
optimizations as needed.
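
To make the proposed policy concrete, here is a minimal userspace sketch,
not kernel code. It assumes a toy page structure and models the two
decisions discussed in this thread: zero once at allocation, and clear on
private-to-shared conversion only when the platform has the partial-write
erratum. All names (struct gmem_page, gmem_alloc_page(),
gmem_convert_to_shared(), platform_has_pw_mce_erratum, GMEM_PAGE_SIZE) are
hypothetical and do not correspond to existing guest_memfd or KVM symbols.
memset() merely stands in for the MOVDIR64B-based clearing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define GMEM_PAGE_SIZE 4096

/*
 * Hypothetical stand-in for a check such as
 * boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) in the kernel.
 */
static bool platform_has_pw_mce_erratum;

/* Toy model of a guest_memfd-backed page (assumption, not a real type). */
struct gmem_page {
	unsigned char data[GMEM_PAGE_SIZE];
	bool is_private;
};

/*
 * "Just zero on allocation": the page is zeroed exactly once, so it is
 * safe whether it is first faulted in as shared or as private.
 */
static void gmem_alloc_page(struct gmem_page *page)
{
	memset(page->data, 0, GMEM_PAGE_SIZE);
	page->is_private = false;
}

/*
 * Private-to-shared conversion: clear only on platforms with the
 * erratum. Returns true if the page contents were cleared.
 */
static bool gmem_convert_to_shared(struct gmem_page *page)
{
	bool cleared = false;

	if (page->is_private && platform_has_pw_mce_erratum) {
		/* Stands in for MOVDIR64B clearing of the reclaimed page. */
		memset(page->data, 0, GMEM_PAGE_SIZE);
		cleared = true;
	}

	page->is_private = false;
	return cleared;
}
```

On an unaffected platform the conversion path leaves the contents as they
are, which matches the assumption above that neither guest nor host software
depends on the initial values after conversion.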