On Thu, May 29, 2025 at 11:28 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, May 28, 2025, James Houghton wrote:
> > The only thing that I want to call out again is that this UAPI works
> > great for when we are going from userfault --> !userfault. That is, it
> > works well for postcopy (both for guest_memfd and for standard
> > memslots where userfaultfd scalability is a concern).
> >
> > But there is another use case worth bringing up: unmapping pages that
> > the VMM is emulating as poisoned.
> >
> > Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for
> > 4K poison within a HugeTLB-backed memslot (if the HugeTLB page remains
> > mapped in userspace), KVM Userfault is the only option (if we don't
> > want to punch holes in memslots). This leaves us with three problems:
> >
> > 1. If using KVM Userfault to emulate poison, we are stuck with small
> > pages in stage 2 for the entire memslot.
> > 2. We must unmap everything when toggling on KVM Userfault just to
> > unmap a single page.
> > 3. If KVM Userfault is already enabled, we have no choice but to
> > toggle KVM Userfault off and on again to unmap the newly poisoned
> > pages (i.e., there is no ioctl to scan the bitmap and unmap
> > newly-userfault pages).
> >
> > All of these are non-issues if we emulate poison by removing memslots,
> > and I think that's possible. But if that proves too slow, we'd need to
> > be a little bit more clever with hugepage recovery and with unmapping
> > newly-userfault pages, both of which I think can be solved by adding
> > some kind of bitmap re-scan ioctl. We can do that later if the need
> > arises.
>
> Hmm.
>
> On the one hand, punching a hole in a memslot is generally gross, e.g. requires
> deleting the entire memslot and thus unmapping large swaths of guest memory (or
> all of guest memory for most x86 VMs).
>
> On the other hand, unless userspace sets KVM_MEM_USERFAULT from time zero, KVM
> will need to unmap guest memory (or demote the mapping size a la eager page
> splitting?) when KVM_MEM_USERFAULT is toggled from 0=>1.
>
> One thought would be to change the behavior of KVM's processing of the userfault
> bitmap, such that KVM doesn't infer *anything* about the mapping sizes, and instead
> give userspace more explicit control over the mapping size. However, on non-x86
> architectures, implementing such a control would require a non-trivial amount of
> code and complexity, and would incur overhead that doesn't exist today (i.e. we'd
> need to implement equivalent infrastructure to x86's disallow_lpage tracking).
>
> And IIUC, another problem with KVM Userfault is that it wouldn't Just Work for
> KVM accesses to guest memory. E.g. if the HugeTLB page is still mapped into
> userspace, then depending on the flow that gets hit, I'm pretty sure that emulating
> an access to the poisoned memory would result in KVM_EXIT_INTERNAL_ERROR, whereas
> punching a hole in a memslot would result in a much more friendly KVM_EXIT_MMIO.

Oh, yes, of course. KVM Userfault is not enough for memory poison
emulation for non-guest-memfd memslots. Like how for these memslots we
need userfaultfd to do post-copy properly, for memory poison, we still
need userfaultfd (so 4K emulated poison within a HugeTLB memslot is
not possible).

So yeah in this case (4K poison in a still-mapped HugeTLB page), we
would need to punch a hole and get KVM_EXIT_MMIO. SGTM.

For guest_memfd memslots, we can handle uaccess to emulated poison
like tmpfs: with UFFDIO_POISON (Nikita has already started on
UFFDIO_CONTINUE support[1]). We *could* make the gmem page fault
handler (what Fuad is implementing) respect KVM Userfault, but that
isn't necessary (and would look quite like a reimplementation of
userfaultfd).

[1]: https://lore.kernel.org/kvm/20250404154352.23078-1-kalyazin@xxxxxxxxxx/

> All in all, given that KVM needs to correctly handle hugepage vs. memslot
> alignment/size issues no matter what, and that KVM has well-established behavior
> for handling no-memslot accesses, I'm leaning towards saying userspace should
> punch a hole in the memslot in order to emulate a poisoned page. The only reason
> I can think of for preferring a different approach is if userspace can't provide
> the desired latency/performance characteristics when punching a hole in a memslot.
> Hopefully reacting to a poisoned page is a fairly slow path?

In general, yes it is. Memory poison is rare. For non-HugeTLB (tmpfs
or guest_memfd), I don't think we need to punch a hole, so that's
good. For HugeTLB, there are two circumstances that are perhaps
concerning:

1. Learning about poison during post-copy? This should be vanishingly
rare, as most poison is discovered in the first pre-copy pass. If we
didn't do *any* pre-copy passes, then it could be a concern.
2. Learning about poison during pre-copy after shattering? If doing
lazy page splitting with incremental dirty log clearing, this isn't a
*huge* problem, otherwise it could be.

I think userspace has two ways out: (1) don't make super large
memslots, or (2) don't use HugeTLB.

Just to be clear, this isn't really an issue with KVM Userfault -- in
its current form (not preventing KVM's uaccess), it cannot help here.
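
For illustration only, here is a minimal sketch of the address arithmetic
behind "punching a hole" for a single 4K poisoned page. It is not taken from
the series: the Memslot class, the punch_hole() helper, and the example
addresses are all hypothetical, and the fields only loosely mirror the
guest_phys_addr / memory_size / userspace_addr fields userspace passes to
KVM_SET_USER_MEMORY_REGION. In a real VMM the original memslot would first
have to be deleted (which unmaps all of it, the cost discussed above) before
the surrounding pieces are re-registered.

```python
from dataclasses import dataclass
from typing import List

PAGE_SIZE = 4096  # the 4K poison granule discussed in the thread


@dataclass(frozen=True)
class Memslot:
    """Hypothetical stand-in for the fields of a KVM userspace memory region."""
    guest_phys_addr: int
    memory_size: int
    userspace_addr: int


def punch_hole(slot: Memslot, poisoned_gpa: int,
               hole_size: int = PAGE_SIZE) -> List[Memslot]:
    """Return the memslots left after carving one hole out of `slot`.

    `hole_size` is assumed to be a power of two; the hole is aligned down to
    it. Accesses to the hole would then hit no memslot at all.
    """
    hole_start = poisoned_gpa & ~(hole_size - 1)
    hole_end = hole_start + hole_size
    slot_end = slot.guest_phys_addr + slot.memory_size

    if hole_start < slot.guest_phys_addr or hole_end > slot_end:
        raise ValueError("poisoned page is not inside the memslot")

    remaining = []
    if hole_start > slot.guest_phys_addr:
        # Piece below the hole keeps the original base addresses.
        remaining.append(Memslot(slot.guest_phys_addr,
                                 hole_start - slot.guest_phys_addr,
                                 slot.userspace_addr))
    if hole_end < slot_end:
        # Piece above the hole: shift the HVA by the same offset as the GPA.
        offset = hole_end - slot.guest_phys_addr
        remaining.append(Memslot(hole_end,
                                 slot_end - hole_end,
                                 slot.userspace_addr + offset))
    return remaining


if __name__ == "__main__":
    # Made-up example: a 1 GiB memslot at GPA 4 GiB with one poisoned page.
    slot = Memslot(0x1_0000_0000, 1 << 30, 0x7F00_0000_0000)
    for piece in punch_hole(slot, 0x1_0020_1234):
        print(hex(piece.guest_phys_addr), hex(piece.memory_size),
              hex(piece.userspace_addr))
```

Note that the two resulting memslots end and begin at 4K-aligned rather than
hugepage-aligned boundaries, so the hugepage that contained the hole can no
longer be mapped as a hugepage in stage 2; the rest of each piece is
unaffected. This is the "hugepage vs. memslot alignment/size" handling
referred to above.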