Re: [PATCH V2 2/2] x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present

On Thu, Jul 3, 2025 at 10:38 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 03/07/2025 20:06, Vishal Annapurve wrote:
> > On Thu, Jul 3, 2025 at 8:37 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >>
> >> Avoid clearing reclaimed TDX private pages unless the platform is affected
> >> by the X86_BUG_TDX_PW_MCE erratum. This significantly reduces VM shutdown
> >> time on unaffected systems.
> >>
> >> Background
> >>
> >> KVM currently clears reclaimed TDX private pages using MOVDIR64B, which:
> >>
> >>    - Clears the TD Owner bit (which identifies TDX private memory) and
> >>      integrity metadata without triggering integrity violations.
> >>    - Clears poison from cache lines without consuming it, avoiding MCEs on
> >>      access (refer TDX Module Base spec. 16.5. Handling Machine Check
> >>      Events during Guest TD Operation).
> >>
> >> The TDX module also uses MOVDIR64B to initialize private pages before use.
> >> If cache flushing is needed, it sets TDX_FEATURES.CLFLUSH_BEFORE_ALLOC.
> >> However, KVM currently flushes unconditionally, refer commit 94c477a751c7b
> >> ("x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages")
> >>
> >> In contrast, when private pages are reclaimed, the TDX Module handles
> >> flushing via the TDH.PHYMEM.CACHE.WB SEAMCALL.
> >>
> >> Problem
> >>
> >> Clearing all private pages during VM shutdown is costly. For guests
> >> with a large amount of memory it can take minutes.
> >>
> >> Solution
> >>
> >> TDX Module Base Architecture spec. documents that private pages reclaimed
> >> from a TD should be initialized using MOVDIR64B, in order to avoid
> >> integrity violation or TD bit mismatch detection when later being read
> >> using a shared HKID, refer April 2025 spec. "Page Initialization" in
> >> section "8.6.2. Platforms not Using ACT: Required Cache Flush and
> >> Initialization by the Host VMM"
> >>
> >> That is an overstatement and will be clarified in coming versions of the
> >> spec. In fact, as outlined in "Table 16.2: Non-ACT Platforms Checks on
> >> Memory" and "Table 16.3: Non-ACT Platforms Checks on Memory Reads in Li
> >> Mode" in the same spec, there is no issue accessing such reclaimed pages
> >> using a shared key that does not have integrity enabled. Linux always uses
> >> KeyID 0 which never has integrity enabled. KeyID 0 is also the TME KeyID
> >> which disallows integrity, refer "TME Policy/Encryption Algorithm" bit
> >> description in "Intel Architecture Memory Encryption Technologies" spec
> >> version 1.6 April 2025. So there is no need to clear pages to avoid
> >> integrity violations.
> >>
> >> There remains a risk of poison consumption. However, in the context of
> >> TDX, it is expected that there would be a machine check associated with the
> >> original poisoning. On some platforms that results in a panic. However
> >> platforms may support "SEAM_NR" Machine Check capability, in which case
> >> Linux machine check handler marks the page as poisoned, which prevents it
> >> from being allocated anymore, refer commit 7911f145de5fe ("x86/mce:
> >> Implement recovery for errors in TDX/SEAM non-root mode")
> >>
> >> Improvement
> >>
> >> By skipping the clearing step on unaffected platforms, shutdown time
> >> can improve by up to 40%.
> >
> > This patch looks good to me.
> >
> > I would like to raise a related topic, is there any requirement for
> > zeroing pages on conversion from private to shared before
> > userspace/guest faults in the gpa ranges as shared?
>
> For TDX, clearing must still be done for platforms with the
> partial-write erratum (SPR and EMR).
>

So I take it that the VMM/guest_memfd can safely assume no responsibility
for clearing contents on conversion outside of the X86_BUG_TDX_PW_MCE
scenario, given that the spec doesn't dictate the initial contents of
converted memory and no guest/host software should depend on the
initial values after conversion.
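
To make that rule concrete, here is a minimal userspace C sketch of
"clear on reclaim only when the erratum is present". It is not the kernel
code from the patch: struct platform, has_tdx_pw_mce_bug, reclaim_page()
and reclaim_demo() are hypothetical names, and memset() only stands in for
the MOVDIR64B-based clearing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096

/* Hypothetical stand-in for a CPU bug flag such as X86_BUG_TDX_PW_MCE. */
struct platform {
	bool has_tdx_pw_mce_bug;
};

/*
 * Model of reclaiming a private page. Returns true if the page was cleared.
 * memset() is only a stand-in for the MOVDIR64B clearing discussed above.
 */
bool reclaim_page(const struct platform *p, unsigned char *page)
{
	if (!p->has_tdx_pw_mce_bug)
		return false;	/* unaffected platform: skip the costly clear */

	memset(page, 0, PAGE_SIZE_BYTES);
	return true;
}

/* Small self-check exercising both branches; returns 0 on success. */
int reclaim_demo(void)
{
	static unsigned char page[PAGE_SIZE_BYTES];
	struct platform affected = { .has_tdx_pw_mce_bug = true };
	struct platform unaffected = { .has_tdx_pw_mce_bug = false };

	page[0] = 0xAA;

	/* Unaffected platform: contents are left as they were. */
	assert(!reclaim_page(&unaffected, page));
	assert(page[0] == 0xAA);

	/* Affected platform: the page is cleared. */
	assert(reclaim_page(&affected, page));
	assert(page[0] == 0);

	return 0;
}
```

The point of the sketch is only the shape of the check: the expensive
clear sits behind a single platform flag, so unaffected systems pay
nothing at reclaim time.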

> >
> > If the answer is no for all CoCo architectures then guest_memfd can
> > simply just zero pages on allocation for all its users and not worry
> > about zeroing later.
>
> In fact TDX does not need private pages to be zeroed on allocation
> because the TDX Module always does that.
>

guest_memfd allocated pages may get faulted in as shared first. To
keep things simple, guest_memfd can start with the "just zero on
allocation" policy which works for all current/future CoCo/non-CoCo
users of guest_memfd and we can later iterate with any arch-specific
optimizations as needed.
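
A rough userspace C sketch of the "just zero on allocation" policy, under
the assumption that conversion itself never touches page contents. All
names here (struct gmem_page, gmem_alloc_page(), gmem_convert(),
gmem_free_page(), gmem_demo()) are hypothetical and do not correspond to
the guest_memfd implementation; the sketch also ignores the
X86_BUG_TDX_PW_MCE case, where clearing is still required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define GMEM_PAGE_SIZE 4096

/* Hypothetical model of one guest_memfd-backed page. */
struct gmem_page {
	unsigned char *data;
	bool is_private;
};

/* Zero exactly once, at allocation. Returns 0 on success, -1 on failure. */
int gmem_alloc_page(struct gmem_page *pg)
{
	pg->data = calloc(1, GMEM_PAGE_SIZE);	/* calloc() returns zeroed memory */
	pg->is_private = false;
	return pg->data ? 0 : -1;
}

/* Conversion only flips the state; page contents are not touched. */
void gmem_convert(struct gmem_page *pg, bool to_private)
{
	pg->is_private = to_private;
}

void gmem_free_page(struct gmem_page *pg)
{
	free(pg->data);
	pg->data = NULL;
}

/* Self-check: zeroed at allocation, not re-zeroed on conversion. */
int gmem_demo(void)
{
	struct gmem_page pg;
	size_t i;

	if (gmem_alloc_page(&pg) != 0)
		return -1;

	for (i = 0; i < GMEM_PAGE_SIZE; i++)
		assert(pg.data[i] == 0);

	pg.data[0] = 0x5A;
	gmem_convert(&pg, true);	/* shared -> private */
	gmem_convert(&pg, false);	/* private -> shared */
	assert(pg.data[0] == 0x5A);
	assert(!pg.is_private);

	gmem_free_page(&pg);
	return 0;
}
```

Arch-specific optimizations (for example skipping the zeroing where the
TDX Module initializes private pages anyway) would then be refinements of
gmem_alloc_page() rather than extra work at conversion time.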




