On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > > Right, I read that. I still don't see why pKVM needs to do normal
> > > > private/shared conversion for data provisioning. Vs a dedicated
> > > > operation/flag to make it a special case.
> > >
> > > It's dictated by pKVM usecases, memory contents need to be preserved
> > > for every conversion not just for initial payload population.
> >
> > We are weighing pros/cons between:
> > - Unifying this uABI across all gmemfd VM types
> > - Userspace for one VM type passing a flag for its special non-shared use case
> >
> > I don't see how passing a flag or not is dictated by pKVM use case.
>
> Yep. Baking the behavior of a single usecase into the kernel's ABI is rarely a
> good idea. Just because pKVM's current usecases always want contents to be
> preserved doesn't mean that pKVM will never change.
>
> As a general rule, KVM should push policy to userspace whenever possible.
>
> > P.S. This doesn't really impact TDX I think. Except that TDX development needs
> > to work in the code without bumping anything. So just wishing to work in code
> > with less conditionals.
> > > >
> > > >
> > > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > > behave the same. If conversions are always content preserving for pKVM, why
> > > > can't userspace always use the operation that says preserve content? Vs
> > > > changing the behavior of the common operations?
> > >
> > > I don't see a benefit of userspace passing a flag that's kind of
> > > default for the VM type (assuming pKVM will use a special VM type).
> >
> > The benefit is that we don't need to have special VM default behavior for
> > gmemfd.
> > Think about if some day (very hypothetical and made up) we want to add a
> > mode for TDX that adds new private data to a running guest (with special accept
> > on the guest side or something). Then we might want to add a flag to override
> > the default destructive behavior. Then maybe pKVM wants to add a "don't
> > preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> > lots of VM specific flags. The point of this example is to show how unified uABI
> > can be helpful.
>
> Yep again. Pivoting on the VM type would be completely inflexible. If pKVM gains
> a usecase that wants to zero memory on conversions, we're hosed. If SNP or TDX
> gains the ability to preserve data on conversions, we're hosed.
>
> The VM type may restrict what is possible, but (a) that should be abstracted,
> e.g. by defining the allowed flags during guest_memfd creation, and (b) the
> capabilities of the guest_memfd instance need to be communicated to userspace.

Ok, I concur with this: it's beneficial to keep a unified ABI that lets
guest_memfd make runtime decisions without relying on the VM type, as far
as possible.

A few points that seem important here:
1) The only thing userspace can, and should be able to, dictate is whether
   memory contents are preserved on shared to private conversion.
   -> For SNP/TDX VMs:
      * The only usecase for preserving contents is initial memory
        population, which can be achieved by:
        - Userspace converting the ranges to shared, populating the
          contents, converting them back to private and then calling the
          existing SNP/TDX specific ABI functions.
      * For runtime conversions, guest_memfd can't ensure memory contents
        are preserved during shared to private conversions, as the
        architectures don't support that behavior.
      * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs.
        Even if we add this flag, guest_memfd should today mark it as
        unsupported based on what the backing architecture supports.
2) For pKVM, if userspace wants to specify a "preserve" flag, then the flag
   can be allowed based on the known capabilities of the backing
   architecture.

So this topic is still orthogonal to "zeroing on private to shared
conversion".

>
> > > Common operations in guest_memfd will need to either check for the
> > > userspace passed flag or the VM type, so no major change in
> > > guest_memfd implementation for either mechanism.
> >
> > While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> > fd tied to a VM?
>
> Yes.
>
> > I think there is interest in de-coupling it?
>
> No? Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
> owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.
>
> > Is the VM type sticky?
> >
> > It seems the more they are separate, the better it will be to not have VM-aware
> > behavior living in gmem.
>
> Ya. A guest_memfd instance may have capabilities/features that are restricted
> and/or defined based on the properties of the owning VM, but we should do our
> best to make guest_memfd itself blissfully unaware of the VM type.
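
To make the "allowed flags defined at creation, capabilities reported to
userspace" idea concrete, here is a rough sketch written as plain C so it
compiles standalone. Everything in it is hypothetical and only for
illustration: GMEM_CONVERT_PRESERVE, struct gmem_instance, gmem_create(),
gmem_query_caps() and gmem_convert_to_private() are made-up names, not
existing KVM or guest_memfd uAPI, and the error code choice is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical conversion flag; NOT an existing KVM uAPI definition. */
#define GMEM_CONVERT_PRESERVE (1ULL << 0)

/* Hypothetical per-instance state, fixed when the guest_memfd is created. */
struct gmem_instance {
	uint64_t supported_convert_flags;
};

/*
 * Creation-time step: the owner of the VM (arch code in a real kernel)
 * decides which conversion flags this instance may accept. The VM type
 * is consulted here, once, and nowhere else.
 */
static struct gmem_instance gmem_create(uint64_t supported_convert_flags)
{
	struct gmem_instance g = {
		.supported_convert_flags = supported_convert_flags,
	};
	return g;
}

/* Lets userspace discover what this instance supports. */
static uint64_t gmem_query_caps(const struct gmem_instance *g)
{
	return g->supported_convert_flags;
}

/*
 * Shared -> private conversion. The common path only compares the
 * requested flags against the instance's allowed set; it never looks
 * at the VM type. Returns 0 on success, -EOPNOTSUPP for a flag the
 * instance does not allow.
 */
static int gmem_convert_to_private(const struct gmem_instance *g,
				   uint64_t flags)
{
	if (flags & ~g->supported_convert_flags)
		return -EOPNOTSUPP;

	/* The actual conversion work would go here. */
	return 0;
}
```

With this shape, a pKVM-like owner would create the instance with
GMEM_CONVERT_PRESERVE allowed, while an SNP/TDX-like owner would pass 0 today,
so a "preserve" request on such an instance fails with -EOPNOTSUPP without the
common code ever checking the VM type.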