On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > operations is not a concern.
>
> Just was my recollection of the discussion. I found it:
> https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@xxxxxxxxxx/

Ugh, another case where an honest question gets interpreted as "do it this way". :-(

> > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > code is also broken in the sense that there are no cond_resched() calls.  The
> > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > explicit cond_resched(), there's no practical difference between extending the
> > measurement under mmu_lock versus outside of mmu_lock.
> >
> > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > still do tdh_mem_page_add() under mmu_lock.
>
> I just did a quick test and we should be on the order of <1 ms per page for the
> full loop. I can try to get some more formal test data if it matters. But that
> doesn't sound too horrible?

1ms is totally reasonable.  I wouldn't bother with any more testing.

> tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> inside it.

Agreed, and it would eliminate the need for a "flags" argument.  But keeping it
in the mmu_lock critical section means KVM can WARN on failures.  If it's moved
out, then zapping S-EPT entries could induce failure, and I don't think it's
worth going through the effort to ensure it's impossible to trigger S-EPT
removal.

Note, removing S-EPT entries during initialization of the image isn't something
I want to officially support, rather it's an endless stream of whack-a-mole due
to obscure edge cases.

Hmm, actually, maybe I take that back.  slots_lock prevents memslot updates,
filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events
shouldn't ever hit S-EPT.  I was worried about kvm_zap_gfn_range(), but the call
from sev.c is obviously mutually exclusive, TDX disallows
KVM_X86_QUIRK_IGNORE_GUEST_PAT so same goes for
kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99% certain there's
a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page has its own
non-guest_memfd memslot and so can't be used for the initial image, which means
that too is mutually exclusive.

So yeah, let's give it a shot.  Worst case scenario we're wrong and
TDH_MR_EXTEND errors can be triggered by userspace.

> But maybe a better reason is that we could better handle errors
> outside the fault. (i.e. no 5 line comment about why not to return an error in
> tdx_mem_page_add() due to code in another file).
>
> I wonder if Yan can give an analysis of any zapping races if we do that.

As above, I think we're good?
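
For illustration only, the ordering being discussed (add the page under
mmu_lock, extend the measurement after dropping it, all while slots_lock is
held) can be modeled in userspace roughly as below.  This is a sketch, not
kernel code: every model_* name is hypothetical, pthread mutexes stand in for
slots_lock and mmu_lock as a simplification, and the asserts only mimic what a
lockdep_assert_held() style check would verify.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-VM state; not a KVM struct. */
struct model_vm {
	pthread_mutex_t slots_lock;	/* models slots_lock (blocks memslot updates) */
	pthread_mutex_t mmu_lock;	/* models mmu_lock (simplified to a mutex) */
	bool mmu_lock_held;		/* tracked by hand so the asserts can check it */
	size_t pages_added;
	size_t pages_measured;
};

void model_vm_init(struct model_vm *vm)
{
	pthread_mutex_init(&vm->slots_lock, NULL);
	pthread_mutex_init(&vm->mmu_lock, NULL);
	vm->mmu_lock_held = false;
	vm->pages_added = 0;
	vm->pages_measured = 0;
}

/* Models the page-add step: must run inside the mmu_lock critical section. */
static int model_page_add(struct model_vm *vm)
{
	assert(vm->mmu_lock_held);
	vm->pages_added++;
	return 0;
}

/* Models the measurement-extend step: runs after mmu_lock is dropped. */
static int model_mr_extend(struct model_vm *vm)
{
	assert(!vm->mmu_lock_held);
	vm->pages_measured++;
	return 0;
}

/* One page: add under mmu_lock, then optionally extend outside of it. */
static int model_populate_one(struct model_vm *vm, bool measure)
{
	int ret;

	pthread_mutex_lock(&vm->mmu_lock);
	vm->mmu_lock_held = true;
	ret = model_page_add(vm);
	vm->mmu_lock_held = false;
	pthread_mutex_unlock(&vm->mmu_lock);

	if (ret || !measure)
		return ret;

	/* Errors here can be returned to the caller instead of WARNing. */
	return model_mr_extend(vm);
}

/* Whole image: slots_lock is held across the loop, as in the discussion. */
int model_populate(struct model_vm *vm, size_t npages, bool measure)
{
	int ret = 0;

	pthread_mutex_lock(&vm->slots_lock);
	for (size_t i = 0; i < npages && !ret; i++)
		ret = model_populate_one(vm, measure);
	pthread_mutex_unlock(&vm->slots_lock);

	return ret;
}
```

The point of the model is only that the extend step no longer depends on
mmu_lock, so a failure there becomes an ordinary error return rather than
something that has to be handled inside the fault path.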