Re: [PATCH v2 00/13] KVM: Introduce KVM Userfault

James Houghton <jthoughton@xxxxxxxxxx> · Thu, 29 May 2025 12:17:07 -0400

On Thu, May 29, 2025 at 11:28 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, May 28, 2025, James Houghton wrote:
> > The only thing that I want to call out again is that this UAPI works
> > great for when we are going from userfault --> !userfault. That is, it
> > works well for postcopy (both for guest_memfd and for standard
> > memslots where userfaultfd scalability is a concern).
> >
> > But there is another use case worth bringing up: unmapping pages that
> > the VMM is emulating as poisoned.
> >
> > Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for
> > 4K poison within a HugeTLB-backed memslot (if the HugeTLB page remains
> > mapped in userspace), KVM Userfault is the only option (if we don't
> > want to punch holes in memslots). This leaves us with three problems:
> >
> > 1. If using KVM Userfault to emulate poison, we are stuck with small
> > pages in stage 2 for the entire memslot.
> > 2. We must unmap everything when toggling on KVM Userfault just to
> > unmap a single page.
> > 3. If KVM Userfault is already enabled, we have no choice but to
> > toggle KVM Userfault off and on again to unmap the newly poisoned
> > pages (i.e., there is no ioctl to scan the bitmap and unmap
> > newly-userfault pages).
> >
> > All of these are non-issues if we emulate poison by removing memslots,
> > and I think that's possible. But if that proves too slow, we'd need to
> > be a little bit more clever with hugepage recovery and with unmapping
> > newly-userfault pages, both of which I think can be solved by adding
> > some kind of bitmap re-scan ioctl. We can do that later if the need
> > arises.
>
> Hmm.
>
> On the one hand, punching a hole in a memslot is generally gross, e.g. requires
> deleting the entire memslot and thus unmapping large swaths of guest memory (or
> all of guest memory for most x86 VMs).
>
> On the other hand, unless userspace sets KVM_MEM_USERFAULT from time zero, KVM
> will need to unmap guest memory (or demote the mapping size a la eager page
> splitting?) when KVM_MEM_USERFAULT is toggled from 0=>1.
>
> One thought would be to change the behavior of KVM's processing of the userfault
> bitmap, such that KVM doesn't infer *anything* about the mapping sizes, and instead
> give userspace more explicit control over the mapping size.  However, on non-x86
> architectures, implementing such a control would require a non-trivial amount of
> code and complexity, and would incur overhead that doesn't exist today (i.e. we'd
> need to implement equivalent infrastructure to x86's disallow_lpage tracking).
>
> And IIUC, another problem with KVM Userfault is that it wouldn't Just Work for
> KVM accesses to guest memory.  E.g. if the HugeTLB page is still mapped into
> userspace, then depending on the flow that gets hit, I'm pretty sure that emulating
> an access to the poisoned memory would result in KVM_EXIT_INTERNAL_ERROR, whereas
> punching a hole in a memslot would result in a much more friendly KVM_EXIT_MMIO.

Oh, yes, of course. KVM Userfault is not enough for memory poison
emulation for non-guest-memfd memslots. Just as we need userfaultfd to
do post-copy properly for these memslots, we still need userfaultfd to
emulate memory poison. And because userfaultfd on HugeTLB works at
hugepage granularity, 4K emulated poison within a HugeTLB memslot is
not possible that way.

So yeah, in this case (4K poison in a still-mapped HugeTLB page), we
would need to punch a hole and get KVM_EXIT_MMIO. SGTM.
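
To make the hole-punch flow concrete, here is a rough sketch (not taken
from the series) of how a VMM might plan the memslot updates for a
single 4K poisoned page. The `struct region` below only mirrors the
fields of `struct kvm_userspace_memory_region`; `plan_hole_punch()`,
`POISON_PAGE_SIZE` and `new_slot_id` are hypothetical names invented
for illustration, and the sketch assumes the poisoned page lies fully
inside the old slot.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the fields of struct kvm_userspace_memory_region in <linux/kvm.h>. */
struct region {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;	/* 0 asks KVM to delete the slot */
	uint64_t userspace_addr;
};

#define POISON_PAGE_SIZE 4096ULL	/* assumed emulated-poison granularity */

/*
 * Plan the updates that carve a 4K hole at poison_gpa out of *old.
 * out[] receives up to three entries, to be applied in order:
 *   out[0]   deletes the old slot (memory_size == 0)
 *   out[1..] recreate the ranges below and above the hole, if non-empty
 * new_slot_id is a free slot id picked by the caller (hypothetical); it is
 * only used when both a lower and an upper range exist.
 * Returns the number of entries written.
 */
int plan_hole_punch(const struct region *old, uint64_t poison_gpa,
		    uint32_t new_slot_id, struct region out[3])
{
	uint64_t start = old->guest_phys_addr;
	uint64_t end = start + old->memory_size;
	uint64_t hole = poison_gpa & ~(POISON_PAGE_SIZE - 1);
	int n = 0;

	assert(hole >= start && hole + POISON_PAGE_SIZE <= end);

	/* Step 1: delete the whole slot. This unmaps everything it covers. */
	out[n] = *old;
	out[n].memory_size = 0;
	n++;

	/* Step 2: range below the hole keeps the old slot id. */
	if (hole > start) {
		out[n] = *old;
		out[n].memory_size = hole - start;
		n++;
	}

	/* Step 3: range above the hole; GPA and HVA shift by the same offset. */
	if (hole + POISON_PAGE_SIZE < end) {
		uint64_t off = hole + POISON_PAGE_SIZE - start;

		out[n] = *old;
		out[n].slot = (n == 1) ? old->slot : new_slot_id;
		out[n].guest_phys_addr = start + off;
		out[n].userspace_addr = old->userspace_addr + off;
		out[n].memory_size = end - (start + off);
		n++;
	}
	return n;
}
```

Each planned entry would then be handed to KVM_SET_USER_MEMORY_REGION
in order. Step 1 is where the cost discussed in this thread comes
from: the whole slot is unmapped before the two remaining ranges are
recreated. Guest accesses to the hole then hit no memslot, which is
the KVM_EXIT_MMIO path mentioned above.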

For guest_memfd memslots, we can handle uaccess to emulated poison
like tmpfs: with UFFDIO_POISON (Nikita has already started on
UFFDIO_CONTINUE support[1]). We *could* make the gmem page fault
handler (what Fuad is implementing) respect KVM Userfault, but that
isn't necessary (and would look quite like a reimplementation of
userfaultfd).

[1]: https://lore.kernel.org/kvm/20250404154352.23078-1-kalyazin@xxxxxxxxxx/

> All in all, given that KVM needs to correctly handle hugepage vs. memslot
> alignment/size issues no matter what, and that KVM has well-established behavior
> for handling no-memslot accesses, I'm leaning towards saying userspace should
> punch a hole in the memslot in order to emulate a poisoned page.  The only reason
> I can think of for preferring a different approach is if userspace can't provide
> the desired latency/performance characteristics when punching a hole in a memslot.
> Hopefully reacting to a poisoned page is a fairly slow path?

In general, yes it is. Memory poison is rare.

For non-HugeTLB (tmpfs or guest_memfd), I don't think we need to punch
a hole, so that's good. For HugeTLB, there are two circumstances that
are perhaps concerning:

1. Learning about poison during post-copy? This should be vanishingly
rare, as most poison is discovered in the first pre-copy pass. If we
didn't do *any* pre-copy passes, then it could be a concern.
2. Learning about poison during pre-copy after shattering? If doing
lazy page splitting with incremental dirty log clearing, this isn't a
*huge* problem; otherwise it could be.

I think userspace has two ways out: (1) don't make super large
memslots, or (2) don't use HugeTLB.

Just to be clear, this isn't really an issue with KVM Userfault -- in
its current form (not preventing KVM's uaccess), it cannot help here.