On Tue, May 20, 2025 at 7:34 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
>
> Hi Vishal,
>
> On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
> >
> > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> > >
> > > Hi Vishal,
> > >
> > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi Ackerley,
> > > > >
> > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > > > and private respectively.
> > > > >
> > > > > I have a high-level question about this particular patch and this
> > > > > approach for conversion: why do we need IOCTLs to manage conversion
> > > > > between private and shared?
> > > > >
> > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > > > series that performs in-place conversion [3] and the associated (by
> > > > > now outdated) state diagram [4], I didn't see the need to have a
> > > > > userspace-facing interface to manage that. KVM has all the information
> > > > > it needs to handle conversions, which are triggered by the guest. To
> > > > > me this seems like it adds additional complexity, as well as a
> > > > > user-facing interface that we would need to maintain.
> > > > >
> > > > > There are various ways we could handle conversion without explicit
> > > > > interference from userspace. What I had in mind is the following (as
> > > > > an example; details can vary according to VM type). I will use the
> > > > > case of conversion from shared to private because that is the more
> > > > > complicated (interesting) case:
> > > > >
> > > > > - Guest issues a hypercall to request that a shared folio become private.
> > > > >
> > > > > - The hypervisor receives the call, and passes it to KVM.
> > > > >
> > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > > > parlance), and unmaps it from the host. The host, however, could still
> > > > > have references (e.g., GUP).
> > > > >
> > > > > - KVM exits to the host (hypervisor call exit), with the information
> > > > > that the folio has been unshared from it.
> > > > >
> > > > > - A well-behaving host would now get rid of all of its references
> > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > > > running as normal. I expect this to be the common case.
> > > > >
> > > > > But to handle the more interesting situation, let's say that the host
> > > > > doesn't do it immediately, and for some reason it holds on to some
> > > > > references to that folio.
> > > > >
> > > > > - Even if that's the case, the guest can still run *. If the guest
> > > > > tries to access the folio, KVM detects that access when it tries to
> > > > > fault it into the guest, sees that the host still has references to
> > > > > that folio, and exits back to the host with a memory fault exit. At
> > > > > this point, the VCPU that has tried to fault in that particular folio
> > > > > cannot continue running as long as it cannot fault in that folio.
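To make the flow described above concrete, the host-side loop it implies
would look roughly like the sketch below. KVM_EXIT_UNSHARE, struct
unshare_exit, and the two helper functions are placeholders taken from the
description; they are not existing KVM uAPI:

#include <stdint.h>

#define KVM_EXIT_UNSHARE        1001    /* illustrative value, not uAPI */

struct unshare_exit {
        uint64_t gpa;           /* start of the range the guest unshared */
        uint64_t size;
};

/* VMM helpers assumed to exist elsewhere: */
int vcpu_run_once(int vcpu_fd, int *exit_reason, struct unshare_exit *info);
void release_refs(uint64_t gpa, uint64_t size); /* drop GUPs, IOMMU maps, ... */

static void run_vcpu(int vcpu_fd)
{
        int reason;
        struct unshare_exit info;

        while (vcpu_run_once(vcpu_fd, &reason, &info) == 0) {
                if (reason != KVM_EXIT_UNSHARE)
                        break;  /* other exits handled elsewhere */
                /*
                 * Guest asked for shared -> private: a well-behaving host
                 * drops its references and simply re-enters the guest; KVM
                 * flips the state to private once the refcounts are gone.
                 */
                release_refs(info.gpa, info.size);
        }
}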
> > > >
> > > > Are you talking about the following scheme?
> > > > 1) guest_memfd checks shareability on each get_pfn, and if there is
> > > > a mismatch, exits to the host.
> > >
> > > I think we are not really on the same page here (no pun intended :) ).
> > > I'll try to answer your questions anyway...
> > >
> > > Which get_pfn? Are you referring to get_pfn when faulting the page
> > > into the guest or into the host?
> >
> > I am referring to guest fault handling in KVM.
> >
> > > > 2) host userspace has to guess whether it's a pending refcount or
> > > > whether it's an actual mismatch.
> > >
> > > No need to guess. VCPU run will let it know exactly why it's exiting.
> > >
> > > > 3) guest_memfd will maintain a third state,
> > > > "pending_private_conversion" or equivalent, which will transition to
> > > > private upon the last refcount drop of each page.
> > > >
> > > > If conversion is triggered by userspace (in the case of pKVM, it will
> > > > be triggered from within KVM (?)):
> > >
> > > Why would conversion be triggered by userspace? As far as I know, it's
> > > the guest that triggers the conversion.
> > >
> > > > * Conversion will just fail if there are extra refcounts, and
> > > > userspace can try to get rid of the extra refcounts on the range
> > > > while it has enough context, without hitting any ambiguity with
> > > > memory fault exits.
> > > > * guest_memfd will not have to deal with the extra state from 3 above,
> > > > and overall guest_memfd conversion handling becomes relatively
> > > > simpler.
> > >
> > > That's not really related. The extra state isn't necessary any more
> > > once we agreed in the previous discussion that we will retry instead.
> >
> > Who is *we* here? Which entity will retry conversion?
>
> Userspace will re-attempt the VCPU run.

Then KVM will have to keep track of the ranges that need conversion
across exits. I think it's cleaner to let userspace make the decision
and invoke conversion without carrying additional state in KVM about
the guest's request.

>
> >
> > > > Note that for x86 CoCo cases, memory conversion is already triggered
> > > > by userspace using a KVM ioctl; this series is proposing to use a
> > > > guest_memfd ioctl to do the same.
> > >
> > > The reason why conversion is already triggered by userspace using a KVM
> > > ioctl for the x86 CoCo cases is that it has to be, since shared memory
> > > and private memory are two separate pages, and userspace needs to
> > > manage that. Sharing memory in place removes the need for that.
> >
> > Userspace still needs to clean up memory usage before conversion is
> > successful, e.g. remove IOMMU mappings for shared-to-private
> > conversion. I would think that memory conversion should not succeed
> > before all existing users let go of the guest_memfd pages for the
> > range being converted.
>
> Yes. Userspace will know that it needs to do that on the VCPU exit,
> which informs it of the guest's hypervisor request to unshare (convert
> from shared to private) the page.
>
> > In x86 CoCo use cases, userspace can also decide not to allow
> > conversion for scenarios where ranges are still under active use by
> > the host and the guest is erroneously trying to take away memory. Both
> > the SNP and TDX specs allow conversion to fail due to in-use memory.
>
> How can the guest erroneously try to take away memory? If the guest
> sends a hypervisor request asking for a conversion of memory that
> doesn't belong to it, then I would expect the hypervisor to prevent
> that.

Marking a range as private effectively disallows the host from
accessing that range -> so it is taking away memory.
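For reference, this is the existing x86 ioctl-based path referred to
above: conversion is requested from userspace with
KVM_SET_MEMORY_ATTRIBUTES on the VM fd, which is current KVM uAPI. A
minimal sketch, assuming vm_fd is a VM of a type that supports private
memory:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Convert [gpa, gpa + size) to private the way x86 CoCo VMMs do it
 * today. Passing attributes = 0 instead converts back to shared.
 */
static int set_range_private(int vm_fd, uint64_t gpa, uint64_t size)
{
        struct kvm_memory_attributes attrs = {
                .address    = gpa,
                .size       = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}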
>
> I don't see how having an IOCTL to trigger the conversion is needed to
> allow conversion failure. How is that different from userspace
> ignoring or delaying releasing all the references it holds for the
> conversion request?
>
> >
> > > This series isn't using the same ioctl; it's introducing new ones to
> > > perform a task that, as far as I can tell so far, KVM can handle by
> > > itself.
> >
> > I would like to understand this better. How will KVM handle the
> > conversion process for guest_memfd pages? Can you help walk through an
> > example sequence for shared-to-private conversion, specifically around
> > guest_memfd offset states?
>
> To make sure that we are discussing the same scenario, can you do the
> same as well please: walk me through an example sequence for
> shared-to-private conversion, specifically around guest_memfd offset
> states, with the IOCTLs involved?
>
> Here is an example that I have implemented and tested with pKVM. Note
> that there are alternatives; the flow below is architecture- or even
> VM-type-dependent. None of this is core KVM code, and the behaviour
> could vary.
>
> Assuming the folio is shared with the host:
>
> Guest sends an unshare hypercall to the hypervisor.
> Hypervisor forwards the request to KVM (gmem) (having done due diligence).
> KVM (gmem) performs an unmap_folio(), exits to userspace with

For the x86 CoCo VM use cases I was talking about, userspace would like
to avoid unmap_mapping_range() on the range before it's safe to unshare
the range.

> KVM_EXIT_UNSHARE and all the information about the folio being
> unshared.
>
> Case 1:
> Userspace removes any remaining references (GUPs, IOMMU mappings, etc.).
> Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
> references, sets the state to PRIVATE.
>
> Case 2 (alternative 1):
> Userspace doesn't release its references.
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, exits back to userspace with KVM_EXIT_UNSHARE.
>
> Case 2 (alternative 2):
> Userspace doesn't release its references.
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, unmaps the folio from the guest, but allows it to run
> (until it tries to fault in the folio).
> Guest tries to fault in the folio that still has references; KVM does
> not allow that (it sees that the folio is shared, and it doesn't fault
> shared folios into confidential guests).
> KVM exits back to userspace with KVM_EXIT_UNSHARE.
>
> As I mentioned, the alternatives above are _not_ set in core KVM code.
> They can vary by architecture or VM type, depending on the policy,
> support, etc.
>
> Now for your example, please, on how this would work with IOCTLs :)
>
> Thanks,
> /fuad
>
> >
> > > > - Allows not having to keep track of separate shared/private range
> > > > information in KVM.
> > >
> > > This patch series is already tracking shared/private range information
> > > in KVM.
> > >
> > > > - Simpler handling of the conversion process, done per guest_memfd
> > > > rather than for the full range.
> > > > - Userspace can handle the rollback as needed, simplifying error
> > > > handling in guest_memfd.
> > > > - guest_memfd is the single source of truth and notifies its users of
> > > > shareability changes.
> > > >   - e.g. IOMMU, userspace, and the KVM MMU can all register for
> > > > notifications from guest_memfd directly and will get notified
> > > > for invalidation upon shareability attribute updates.
> > >
> > > All of these can still be done without introducing a new ioctl.
> > >
> > > Cheers,
> > > /fuad
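For reference, the shared-to-private sequence with the proposed IOCTLs
would look roughly like this from the userspace side. The ioctl name is
the one introduced by this series; the argument struct below is only
illustrative (the exact layout and the ioctl definition come from the
series' headers, which are assumed to be available):

#include <stdint.h>
#include <sys/ioctl.h>

/* Illustrative layout; the series defines the real uAPI struct. */
struct gmem_convert {
        uint64_t offset;        /* offset into the guest_memfd, in bytes */
        uint64_t size;
};

static int convert_to_private(int gmem_fd, uint64_t offset, uint64_t size)
{
        struct gmem_convert range = { .offset = offset, .size = size };

        /*
         * Expected to fail while anything (GUPs, IOMMU mappings) still
         * holds references to pages in the range, so userspace can clean
         * up and retry, or refuse the guest's request outright.
         */
        return ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &range);
}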