On Thu, Jul 3, 2025 at 1:41 PM Michael Roth <michael.roth@xxxxxxx> wrote:
> > > > > >
> > > > > > Because shared pages are split once any memory is allocated, having a
> > > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > > do you think?
> > > > > >
> > > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > > before converting.
> > >
> > > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > > populate memory (in which case you can pretty much ignore my previous
> > > response))
> >
> > I am assuming in-place conversion with huge page backing for the
> > discussion below.
> >
> > Looks like there are three scenarios/usecases we are discussing here:
> > 1) Pre-allocating guest_memfd file offsets
> >     - Userspace can use fallocate to do this for hugepages by keeping
> >       the file ranges marked private.
> > 2) Prefaulting guest EPT/NPT entries
> > 3) Populating initial guest payload into guest_memfd memory
> >     - Userspace can mark certain ranges as shared, populate the
> >       contents and convert the ranges back to private. So mmap will come in
> >       handy here.
> >
> > >
> > > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > > userspace already defaults to marking everything private pretty close to
> > > guest_memfd creation time, so the potential for allocations to occur
> > > in-between seems small, but worth confirming.
> >
> > Ok, I am not much worried about whether the INIT_PRIVATE flag gets
> > supported or not, but more about the default setting that different
> > CVMs start with.
> > To me, it looks like all CVMs should start as
> > everything private by default and if there is a way to bake that
> > configuration during guest_memfd creation time that would be good to
> > have instead of doing "create and convert" operations and there is a
> > fairly low cost to support this flag.
> >
> > >
> > > But I know in the past there was a desire to ensure TDX/SNP could
> > > support pre-allocating guest_memfd memory (and even pre-faulting via
> > > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > > fallocate() handling could still avoid the split if the whole hugepage
> > > is private, though there is a bit more potential for that fallocate()
> > > to happen before userspace does the "manually" shared->private
> > > conversion. I'll double-check on that aspect, but otherwise, is there
> > > still any other need for it?
> >
> > This usecase of being able to preallocate should still work with
> > in-place conversion assuming all ranges are private before
> > pre-population.
>
> Ok, I think I was missing that the merge logic here will then restore it
> to 1GB before the guest starts, so the folio isn't permanently split if
> we do the mmap() and that gives us more flexibility on how we can use
> it.
>
> I was thinking we needed to avoid the split from the start by avoiding
> paths like mmap() which might trigger the split. I was trying to avoid
> any merge->unsplit logic in the THP case (or unsplit in general), in
> which case we'd get permanent splits via the mmap() approach, but for
> 2MB that's probably not a big deal.

After initial payload population, the guest can cause different
hugepages to get split during its runtime, and these can remain split
even after the guest converts them back to private. For THP there may
not be much benefit in merging those pages back together, especially if
NPT/EPT entries can't be promoted back to hugepage mappings, and there
is no memory penalty as THP doesn't use HVO.

Wishful thinking on my part: It would be great to figure out a way to
promote these pagetable entries without relying on the guest, if
possible with ABI updates, as I think the host should have some control
over EPT/NPT granularities even for Confidential VMs. Along similar
lines, it would be great to have "page struct"-less memory working for
Confidential VMs, which should greatly reduce the toil with merge/split
operations and would reduce conversions to mostly pagetable
manipulations.

That being said, memory split and merge seem to be relatively
lightweight for THP (with no memory allocation/freeing), and reusing the
memory files after a reboot of the guest VM will require pages to be
merged to start with a clean slate. One option is to always merge as
early as possible; a second option is to invent a new UAPI to do it on
demand.

For 1G pages, even if we go with 1G -> 2M -> 4K split stages, page
splits result in higher memory usage with HVO around, and it becomes
useful to merge them back as early as possible as the guest proceeds to
convert subranges of different hugepages over its lifetime. Merging
pages as early as possible also allows reusing memory files during the
next reboot without having to invent a new UAPI.

Caveats with "merge as early as possible":
- Shared to private conversions will be slower for hugetlb pages.
   * Counter argument: These conversions are already slow as we need
     the ranges getting converted to reach safe refcounts.
- If guests convert a particular range often then extra merge/split
  operations will result in overhead.
   * Counter argument: Since conversions are slow anyway, it's
     beneficial for guests to avoid such a scenario and keep back and
     forth conversions as infrequent as possible.
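
To make the second caveat concrete, here is a toy userspace model of
the "merge as early as possible" policy. It is only a sketch, not
kernel code: the HugePage class, its method names and the 512-subpage
size are all made up for illustration, and it assumes that any shared
subpage forces a split and that a merge happens as soon as the last
shared subpage goes back to private.

```python
from dataclasses import dataclass, field

# Assumption for illustration: a 2M hugepage made of 4K subpages.
SUBPAGES_PER_HUGEPAGE = 512


@dataclass
class HugePage:
    """Toy model of one hugepage-backed range (hypothetical, not a kernel API)."""
    shared: set = field(default_factory=set)  # subpage indices currently shared
    is_split: bool = False
    splits: int = 0   # number of split operations performed so far
    merges: int = 0   # number of merge operations performed so far

    def _check(self, idx: int) -> None:
        if not 0 <= idx < SUBPAGES_PER_HUGEPAGE:
            raise ValueError(f"subpage index {idx} out of range")

    def convert_to_shared(self, idx: int) -> None:
        self._check(idx)
        # Assumed rule: the first shared subpage forces the hugepage to split.
        if not self.is_split:
            self.is_split = True
            self.splits += 1
        self.shared.add(idx)

    def convert_to_private(self, idx: int) -> None:
        self._check(idx)
        self.shared.discard(idx)
        # "Merge as early as possible": merge once nothing is shared anymore.
        if self.is_split and not self.shared:
            self.is_split = False
            self.merges += 1


# A guest that flips the same subpage back and forth pays for one
# split and one merge on every round trip.
page = HugePage()
for _ in range(3):
    page.convert_to_shared(7)
    page.convert_to_private(7)
print(f"splits={page.splits} merges={page.merges}")
```

In this model a guest bouncing a single subpage three times triggers
three splits and three merges, whereas converting several subpages to
shared in one batch and back later costs only one of each. That is the
behavior the counter argument relies on.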