On Fri, Jul 11, 2025 at 5:11 PM Michael Roth <michael.roth@xxxxxxx> wrote:
> >
> > Wishful thinking on my part: It would be great to figure out a way to
> > promote these pagetable entries without relying on the guest, if
> > possible with ABI updates, as I think the host should have some
> > control over EPT/NPT granularities even for Confidential VMs. Along
>
> I'm not sure how much it would buy us. For example, for a 2MB hugetlb
> SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
> split, but only about 30 or so of those get merged back to 2MB folios
> during guest run-time. These are presumably the set of 2MB regions we
> could promote back up, but it's not much given that we wouldn't expect
> that value to grow proportionally for larger guests: it's really
> separate things like the number of vCPUs (for shared GHCB pages), number
> of virtio buffers, etc. that end up determining the upper bound on how
> many pages might get split due to 4K private->shared conversion, and
> these wouldn't vary all that much from guest to guest outside maybe vCPU
> count.
>
> For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get merged
> during run-time and would be candidates for promotion.
>

Thanks for the great analysis here. I think we will need to repeat such
analysis for other scenarios such as usage with accelerators.

> This could be greatly improved from the guest side by using
> higher-order allocations to create pools of shared memory that could
> then be used to reduce the number of splits caused by doing
> private->shared conversions on random ranges of malloc'd memory,
> and this could be done even without special promotion support on the
> host for pretty much the entirety of guest memory. The idea there would
> be to just make optimized guests avoid the splits completely, rather
> than relying on the limited subset that hardware can optimize without
> guest cooperation.

Yes, it would be great to improve the situation from the guest side,
e.g. I tried with a rough draft [1]; the conclusion there was that we
need to set aside "enough" guest memory as CMA to cause all the DMA to
go through 2M-aligned buffers. It's hard to figure out how much is
"enough", but we could start somewhere. (A rough sketch of the kind of
2M-aligned shared pool I have in mind is at the end of this mail.)

That being said, the host still has to manage memory this way by
splitting/merging at runtime, because I don't think it's possible to
enforce that all conversions happen at 2M (or any at 1G) granularity.
So it's also very likely that even if guests do a significant chunk of
conversions at hugepage granularity, the host still needs to split
pages all the way to 4K for all shared regions, unless we can bake
another restriction into the conversion ABI that guests can only
convert the same ranges back to private as were converted to shared
before.

[1] https://lore.kernel.org/lkml/20240112055251.36101-1-vannapurve@xxxxxxxxxx/

>
> > the similar lines, it would be great to have "page struct"-less memory
> > working for Confidential VMs, which should greatly reduce the toil
> > with merge/split operations and will render the conversions mostly to
> > be pagetable manipulations.
>
> FWIW, I did some profiling of split/merge vs. overall conversion time
> (by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
> and while split/merge does take quite a few more cycles than your
> average conversion operation (~100x more), the total cycles spent
> splitting/merging ended up being about 7% of the total cycles spent
> handling conversions (1043938460 cycles in this case).
>
> For 1GB, a split/merge takes >1000x more than a normal conversion
> operation (46475980 cycles vs 320 in this sample), but it's probably
> still not too bad vs the overall conversion path, and as mentioned above
> it only happens about 6x for a 16GB SNP guest, so I don't think
> split/merge overhead is a huge deal for current guests, especially if we
> work toward optimizing guest-side usage of shared memory in the future.
> (There is potential for this to crater performance for a very
> poorly-optimized guest, but I think the guest should bear some burden
> for that sort of thing: e.g. flipping the same page back-and-forth
> between shared/private vs. caching it for continued usage as a shared
> page in the guest driver path isn't something we should put too much
> effort into optimizing.)
>

As per discussions in the past, guest_memfd private pages are managed
solely by guest_memfd. We don't need, and effectively don't want, the
kernel to manage guest private memory. So, in theory, we can get rid of
page structs for private pages and allocate page structs only for shared
memory on conversion, deallocating them on conversion back to private.
And when we have base core-mm allocators that hand out raw pfns to start
with, we don't even need shared memory ranges to be backed by page
structs.

A few hurdles we need to cross:
1) Invent a new filemap equivalent that maps guest_memfd offsets to pfns
   (a rough sketch of what I mean is below).
2) Modify TDX EPT management to work with pfns and not page structs.
3) Modify generic KVM NPT/EPT management logic to work with pfns and not
   rely on page structs.
4) Modify memory error/hwpoison handling to route all memory errors on
   such pfns to guest_memfd.

I believe there are obvious benefits (reduced complexity, reduced memory
footprint, etc.) if we go this route, and we are very likely to go this
route for future use cases even if we decide to live with conversion
costs today.
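For (1), what I'm imagining is roughly a per-file xarray keyed by the
guest_memfd offset that stores raw pfns as value entries instead of
folio pointers. This is just an untested sketch with made-up names to
illustrate the shape of it, not a proposal of the actual interface:

/* Hypothetical sketch, names made up: guest_memfd offset -> raw pfn map */
#include <linux/xarray.h>

#define GMEM_NO_PFN	(~0UL)

struct gmem_pfn_map {
	struct xarray pfns;	/* index: offset >> PAGE_SHIFT, entry: pfn */
};

static void gmem_pfn_map_init(struct gmem_pfn_map *map)
{
	xa_init(&map->pfns);
}

/* Record the pfn backing a given file index (returns 0 or -errno) */
static int gmem_pfn_map_store(struct gmem_pfn_map *map, pgoff_t index,
			      unsigned long pfn)
{
	return xa_err(xa_store(&map->pfns, index, xa_mk_value(pfn),
			       GFP_KERNEL));
}

/* Look up the pfn for a file index, GMEM_NO_PFN if nothing is present */
static unsigned long gmem_pfn_map_lookup(struct gmem_pfn_map *map,
					 pgoff_t index)
{
	void *entry = xa_load(&map->pfns, index);

	return xa_is_value(entry) ? xa_to_value(entry) : GMEM_NO_PFN;
}

KVM would then consume the pfn directly when building NPT/EPT entries,
and hwpoison handling would need some way to get from a pfn back to the
owning guest_memfd, which is where most of the real work in (2)-(4) is.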
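And for reference on the guest-side direction mentioned earlier in this
mail, the kind of pool I have in mind is roughly the following (again an
untested sketch with made-up names): carve shared buffers out of a
reserved CMA area in 2M-aligned chunks and convert each chunk to shared
exactly once, so private->shared conversions never force the host to
split a 2M mapping:

/* Hypothetical guest-side sketch, names made up: grow a pool of shared
 * memory in 2M-aligned chunks so conversions happen at 2M granularity.
 */
#include <linux/cma.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

#define SHARED_POOL_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* 2M chunks */
#define SHARED_POOL_PAGES	(1UL << SHARED_POOL_ORDER)

static struct page *shared_pool_grow(struct cma *cma)
{
	struct page *page;

	/* 2M-aligned, 2M-sized allocation from the reserved CMA area */
	page = cma_alloc(cma, SHARED_POOL_PAGES, SHARED_POOL_ORDER, false);
	if (!page)
		return NULL;

	/* One 2M-granular private->shared conversion for the whole chunk */
	if (set_memory_decrypted((unsigned long)page_address(page),
				 SHARED_POOL_PAGES)) {
		cma_release(cma, page, SHARED_POOL_PAGES);
		return NULL;
	}

	/* Callers hand out 4K pieces of this chunk for bounce/DMA buffers */
	return page;
}

The hard part remains sizing the CMA area up front, but even a
conservative pool would cut down most of the run-time splits described
above.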