On Tue, Jul 29, 2025, Ira Weiny wrote:
> Yan Zhao wrote:
> > On Mon, Jul 28, 2025 at 05:45:35PM -0700, Vishal Annapurve wrote:
> > > On Mon, Jul 28, 2025 at 2:49 AM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> > > > > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > > > > > folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > > > > GFN+1 will return is_prepared == true.
> > > > > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > > > > Unless I've misunderstood where we stand, the correctness of hugepage support is
> > > > > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > > > > the common code in guest_memfd :)
> > > > > > >
> > > > > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > > > > make this all work with per-folio granularity just isn't possible, no?
> > > > > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > > > > >
> > > > > > > So, huge pages still has dependency on the implementation for preparedness.
> > > > > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > > > > >
> > > > > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > > > > as prerequisites for TDX huge page v2?
> > > > > > So, maybe I can use [1][2][3] as the base.
> > > > > >
> > > > > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@xxxxxxxxxx/
> > > > > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@xxxxxxxxxx/
> > > >
> > > > From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
> > > > [3] soon.
> > > >
> > > > hi, Sean, is this understanding correct?
> > > >
> > > > > IMO, unless there is any objection to [1], it's unnecessary to
> > > > > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > > > > initial memory population logic needs is the stable pfn for a given
> > > > > gfn, which ideally should be available using the standard mechanisms
> > > > > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > > > > already demonstrates it to be working).
> > > > >
> > > > > It will be hard to clean-up this logic once we have all the
> > > > > architectures using this path.
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@xxxxxxxxxxxxxx/
> > > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > > For TDX, it means adopting the approach in this RFC patch, right?
> > > Yes, IMO this RFC is following the right approach as posted.

I don't think we want to abandon kvm_gmem_populate().  Unless I'm missing something,
SNP has the same AB-BA problem as TDX: the copy_from_user() on @src can trigger a
page fault, and resolving that page fault may require taking mm->mmap_lock.

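For reference, the ordering hazard looks roughly like the below.  This is purely an
illustrative sketch: the function names are made up, and the second chain stands in
for any path that takes mmap_lock before a mapping's invalidate_lock (e.g. a file
fault); it is not actual guest_memfd code.

  /* Hypothetical stand-in for the populate path: invalidate_lock -> mmap_lock. */
  static int gmem_populate_like_path(struct address_space *mapping, void *dst,
                                     void __user *src, unsigned long len)
  {
          int ret = 0;

          filemap_invalidate_lock(mapping);               /* lock A */
          /*
           * Faulting in @src here means the page fault handler takes
           * mm->mmap_lock (lock B) while lock A is already held.
           */
          if (copy_from_user(dst, src, len))
                  ret = -EFAULT;
          filemap_invalidate_unlock(mapping);
          return ret;
  }

  /* Hypothetical stand-in for a fault path: mmap_lock -> invalidate_lock. */
  static void file_fault_like_path(struct mm_struct *mm,
                                   struct address_space *mapping)
  {
          mmap_read_lock(mm);                             /* lock B */
          filemap_invalidate_lock_shared(mapping);        /* lock A, now under B */
          /* e.g. filling in the faulting page */
          filemap_invalidate_unlock_shared(mapping);
          mmap_read_unlock(mm);
  }

A-before-B in the first chain vs. B-before-A in the second is exactly the kind of
inversion lockdep will flag.
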
Fundamentally, TDX and SNP are doing the same thing: copying from source to guest
memory.  The only differences are in the mechanics of the copy+encrypt, everything
else is the same.  I.e. I don't expect that we'll find a magic solution that works
well for one and not the other.

I also don't want to end up with wildly different ABI for SNP vs. everything else.
E.g. cond_resched() needs to be called if the to-be-initialized range is large,
which means dropping mmu_lock between pages, whereas kvm_gmem_populate() can yield
without dropping invalidate_lock, which means that the behavior of populating
guest_memfd memory will be quite different with respect to guest_memfd operations.

Pulling in the RFC text:

: I think the only different scenario is SNP, where the host must write
: initial contents to guest memory.
:
: Will this work for all cases CCA/SNP/TDX during initial memory
: population from within KVM:
: 1) Simulate stage2 fault
: 2) Take a KVM mmu read lock

Doing all of this under mmu_lock is pretty much a non-starter.

: 3) Check that the needed gpa is mapped in EPT/NPT entries

No, KVM's page tables are not the source of truth.  S-EPT is a special snowflake,
and I'd like to avoid foisting the same requirements on NPT.

: 4) For SNP, if src != null, make the target pfn to be shared, copy
: contents and then make the target pfn back to private.

Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
memory might_fault() and thus might_sleep() (see the sketch at the end of this
mail).

: 5) For TDX, if src != null, pass the same address for source and
: target (likely this works for CCA too)
: 6) Invoke appropriate memory encryption operations
: 7) measure contents
: 8) release the KVM mmu read lock
:
: If this scheme works, ideally we should also not call RMP table
: population logic from guest_memfd, but from KVM NPT fault handling
: logic directly (a bit of cosmetic change).

LOL, that's not a cosmetic change.  It would be a non-trivial ABI change, as KVM's
ABI (ignoring S-EPT) is that userspace can delete and recreate memslots at will.

: Ideally any outgoing interaction from guest_memfd to KVM should be only via invalidation notifiers.

Why?  It's all KVM code.  I don't see how this is any different than e.g. common
code, arch code, and vendor code all calling into one another.  Artificially
limiting guest_memfd to a super generic interface pretty much defeats the whole
purpose of having KVM provide a backing store.

> > Ira has been investigating this for a while, see if he has any comment.
>
> So far I have not seen any reason to keep kvm_gmem_populate() either.
>
> Sean, did yall post the patch you suggested here and I missed it?

No, I have a partially baked patch, but I've been trying to finish up v5 of the
mediated PMU series and haven't had time to focus on this.  Hopefully I'll post a
compile-tested patch later this week.
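
P.S. To spell out the might_sleep() objection to 4) above, here's a deliberately
broken sketch of copying from userspace under mmu_lock.  It's hypothetical code,
not anything from this RFC, and it assumes mmu_lock is the rwlock it is on x86.

  static int populate_under_mmu_lock(struct kvm *kvm, void *dst,
                                     void __user *src, unsigned long len)
  {
          int ret = 0;

          read_lock(&kvm->mmu_lock);      /* rwlock => non-sleepable context */
          /*
           * copy_from_user() does might_fault() -> might_sleep(), which will
           * splat here with CONFIG_DEBUG_ATOMIC_SLEEP, and actually faulting
           * in @src would sleep in the fault handler.  Ditto for calling
           * cond_resched() between pages: not possible without dropping the
           * lock.
           */
          if (copy_from_user(dst, src, len))
                  ret = -EFAULT;
          read_unlock(&kvm->mmu_lock);
          return ret;
  }

kvm_gmem_populate(), by contrast, holds only the guest_memfd invalidate_lock (a
sleepable rw_semaphore), so the arch callback is free to fault on @src and to yield
between pages without dropping the lock.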