On 27/03/25 10:14, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>> ...
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is being used when the TD is being
>> shut down, for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so that VM destruction
>> cannot happen until after they are released.
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the
>> total reclaim time.  For example:
>>
>>     VCPUs  Size (GB)  Before (secs)  After (secs)
>>         4         18             72            24
>>        32        107            517           134
>
> If the time for reclaim grows linearly with memory size, then this is
> a significantly high value for TD cleanup (~21 minutes for a 1TB VM).
>
>>
>> Note, the V19 patch set:
>>
>>     https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/
>>
>> did not have this issue because the HKID was released early, something
>> that Sean effectively NAK'ed:
>>
>>     "No, the right answer is to not release the HKID until the VM is
>>     destroyed."
>>
>>     https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@xxxxxxxxxx/
>
> IIUC, Sean is suggesting to treat S-EPT page removal and page reclaim
> separately. Under his proposal:

Thanks for looking at this!

It seems I have been using the term "reclaim" wrongly. Sorry! I am
talking about taking private memory away from the guest, not about what
happens to it subsequently.

When the TDX VM is in the "Runnable" state, taking private memory away
is slow (slow S-EPT removal).

When the TDX VM is in the "Teardown" state, taking private memory away
is faster (via a TDX SEAMCALL named TDH.PHYMEM.PAGE.RECLAIM, which is
where I picked up the term "reclaim"). A sketch contrasting the two
paths is at the end of this mail.

Once guest memory has been removed from the S-EPT, no further action is
needed to reclaim it. It belongs to KVM at that point.

guest_memfd memory can be added directly to the S-EPT; no intermediate
state or step is used. Any guest_memfd memory not given to the MMU
(S-EPT) can be freed directly if userspace/KVM wants to. Again, there is
no intermediate state or (reclaim) step.

> 1) If userspace drops the last reference on the gmem inode
> before/after dropping the VM reference
>    -> slow S-EPT removal and slow page reclaim

Currently, slow S-EPT removal happens when the file is released.

> 2) If memslots are removed before closing the gmem and dropping the VM
> reference
>    -> slow S-EPT page removal and no page reclaim while the gmem is
> around.
>
> Reclaim should ideally happen when the host wants to use that memory,
> i.e. in the following scenarios:
> 1) Truncation of private guest_memfd ranges
> 2) Conversion of private guest_memfd ranges to shared when supporting
> in-place conversion (could be deferred to the faulting in as shared as
> well).
>
> Would it be possible for you to provide the split of the time spent in
> slow S-EPT page removal vs page reclaim?

Based on what I wrote above, all the time is spent removing pages from
the S-EPT. Greater than 99% of shutdown time is spent in
kvm_gmem_release().

>
> It might be worth exploring the possibility of parallelizing, or
> giving userspace the flexibility to parallelize, both these operations
> to bring the cleanup time down (to be comparable with non-confidential
> VM cleanup time, for example).
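
For illustration, here is a minimal sketch of the two paths described
above. This is not the actual patch: td_is_teardown(), kick_all_vcpus()
and the tdh_*() wrappers are assumed names, loosely modelled on the
TDX-module SEAMCALL names, and the real KVM helpers and signatures
differ between versions of the series.

/*
 * Minimal sketch only, not the actual patch: contrast the fast
 * TD_TEARDOWN path with Dynamic Page Removal on a running TD.
 * td_is_teardown(), kick_all_vcpus() and these tdh_*() wrappers are
 * assumed names for the purpose of illustration.
 */
static int td_take_page_away(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa)
{
        int r;

        if (td_is_teardown(kvm_tdx)) {
                /*
                 * Teardown state: a single SEAMCALL
                 * (TDH.PHYMEM.PAGE.RECLAIM) hands the page back to the
                 * host. No TLB tracking cycle is needed because the TD
                 * can no longer run.
                 */
                return tdh_phymem_page_reclaim(hpa);
        }

        /*
         * Runnable state (Dynamic Page Removal): block the mapping,
         * run a TLB tracking cycle (TDH.MEM.TRACK plus forcing all
         * vCPUs to exit the TD), then remove the page. This per-page
         * cycle is what dominates shutdown time today.
         */
        r = tdh_mem_range_block(kvm_tdx, gpa);
        if (r)
                return r;

        tdh_mem_track(kvm_tdx);         /* advance the TLB flush epoch */
        kick_all_vcpus(kvm_tdx);        /* make vCPUs exit and flush */

        return tdh_mem_page_remove(kvm_tdx, gpa);
}

The point of the series is to steer shutdown onto the first branch:
once the TD is in the Teardown state, every page can take the cheap
reclaim path instead of the per-page block/track/remove cycle, which is
where the 72 -> 24 and 517 -> 134 second improvements above come from.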