On Wed, Jun 18, 2025 at 12:41 PM Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > On Wed, Jun 18, 2025 at 04:24:13AM +0000, James Houghton wrote: > > KVM Userfault consists of a bitmap in userspace that describes which > > pages the user wants exits on (when KVM_MEM_USERFAULT is enabled). To > > get those exits, the memslot where KVM_MEM_USERFAULT is being enabled > > must drop (at least) all of the translations that the bitmap says should > > generate faults. Today, simply drop all translations for the memslot. Do > > so with a new arch interface, kvm_arch_userfault_enabled(), which can be > > specialized in the future by any architecture for which optimizations > > make sense. > > > > Make some changes to kvm_set_memory_region() to support setting > > KVM_MEM_USERFAULT on KVM_MEM_GUEST_MEMFD memslots, including relaxing > > the retrictions on guest_memfd memslots from only deletion to no moving. > > > > Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx> > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> > > --- > > > +#ifdef CONFIG_KVM_GENERIC_PAGE_FAULT > > +bool kvm_do_userfault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > The polarity of the return here feels weird. If we want a value of 0 to > indicate success then int is a better return type. The way it's written now feels fine to me. I'm happy to change it to an int (where we return -EFAULT instead of 'true' and 0 instead of 'false'). > > +{ > > + struct kvm_memory_slot *slot = fault->slot; > > + unsigned long __user *user_chunk; > > + unsigned long chunk; > > + gfn_t offset; > > + > > + if (!kvm_is_userfault_memslot(slot)) > > + return false; > > + > > + offset = fault->gfn - slot->base_gfn; > > + user_chunk = slot->userfault_bitmap + (offset / BITS_PER_LONG); > > + > > + if (__get_user(chunk, user_chunk)) > > + return true; > > + > > I see that the documentation suggests userspace perform a store-release > to update the bitmap. That's the right idea but we need a load-acquire > on the consumer side for that to do something meaningful. Indeed, the below test_bit() should be test_bit_acquire(), thank you! (N.B. I don't think the current code could result in an observable bug, given that the later write of the PTE has a control dependency here. But it is certainly written incorrectly.) > > + if (!test_bit(offset % BITS_PER_LONG, &chunk)) > > + return false; > > + > > + kvm_prepare_memory_fault_exit(vcpu, fault); > > + vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_USERFAULT; > > + return true; > > +} > > +#endif > > + > > int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > struct kvm_enable_cap *cap) > > { > > -- > > 2.50.0.rc2.692.g299adb8693-goog > > > > Thanks, > Oliver