Hi David,

On Wed, 4 Jun 2025 at 14:17, David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 27.05.25 20:02, Fuad Tabba wrote:
> > Add arm64 support for handling guest page faults on guest_memfd-backed
> > memslots. Until guest_memfd supports huge pages, the fault granule is
> > restricted to PAGE_SIZE.
> >
> > Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
> >
> > ---
> >
> > Note: This patch introduces a new function, gmem_abort(), rather than
> > expanding user_mem_abort() as previous attempts did. This is because
> > there are many differences in how faults are handled when backed by
> > guest_memfd vs. regular memslots with anonymous memory, e.g., the lack
> > of a VMA and, for now, the lack of huge page support for guest_memfd.
> > The function user_mem_abort() is already big and unwieldy; adding more
> > complexity to it made things more difficult to understand.
> >
> > Once larger page size support is added to guest_memfd, we could factor
> > out the common code between these two functions.
> >
> > ---
> >  arch/arm64/kvm/mmu.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 87 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 9865ada04a81..896c56683d88 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1466,6 +1466,87 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> >  	return vma->vm_flags & VM_MTE_ALLOWED;
> >  }
> >
> > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > +		      struct kvm_memory_slot *memslot, bool is_perm)
>
> TBH, I have no idea why the existing function is called "_abort". I am
> sure there is a good reason :)

The reason is ARM. They're called "memory aborts", see D8.15 Memory
aborts in the ARM ARM:
https://developer.arm.com/documentation/ddi0487/latest/

Warning: the PDF is 100MB+ with almost 15k pages :)

> > +{
> > +	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
> > +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > +	bool logging, write_fault, exec_fault, writable;
> > +	struct kvm_pgtable *pgt;
> > +	struct page *page;
> > +	struct kvm *kvm;
> > +	void *memcache;
> > +	kvm_pfn_t pfn;
> > +	gfn_t gfn;
> > +	int ret;
> > +
> > +	if (!is_perm) {
> > +		int min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu);
> > +
> > +		if (!is_protected_kvm_enabled()) {
> > +			memcache = &vcpu->arch.mmu_page_cache;
> > +			ret = kvm_mmu_topup_memory_cache(memcache, min_pages);
> > +		} else {
> > +			memcache = &vcpu->arch.pkvm_memcache;
> > +			ret = topup_hyp_memcache(memcache, min_pages);
> > +		}
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	kvm = vcpu->kvm;
> > +	gfn = fault_ipa >> PAGE_SHIFT;
>
> These two can be initialized directly above.

I was trying to go with reverse Christmas tree order of declarations,
but I'll do that.

> > +
> > +	logging = memslot_is_logging(memslot);
> > +	write_fault = kvm_is_write_fault(vcpu);
> > +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > +	VM_BUG_ON(write_fault && exec_fault);
>
> No VM_BUG_ON please.
>
> VM_WARN_ON_ONCE() maybe. Or just handle it along with the "Unexpected
> L2 read permission error" below cleanly.

I'm following the same pattern as the existing user_mem_abort(), but
I'll change it.
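Concretely, I'm thinking of something like this for both changes
(untested sketch, showing only the relevant lines of gmem_abort()):

        struct kvm *kvm = vcpu->kvm;
        gfn_t gfn = fault_ipa >> PAGE_SHIFT;
        bool logging, write_fault, exec_fault, writable;

        /* ... memcache top-up as in the patch above ... */

        logging = memslot_is_logging(memslot);
        write_fault = kvm_is_write_fault(vcpu);
        exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
        /* An abort is either on a fetch or a data access, never both. */
        VM_WARN_ON_ONCE(write_fault && exec_fault);

i.e., initialize at declaration where possible, and have the
should-be-impossible write+exec combination warn once rather than
bring the host down.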
> > +
> > +	if (is_perm && !write_fault && !exec_fault) {
> > +		kvm_err("Unexpected L2 read permission error\n");
> > +		return -EFAULT;
> > +	}
> > +
> > +	ret = kvm_gmem_get_pfn(vcpu->kvm, memslot, gfn, &pfn, &page, NULL);
> > +	if (ret) {
> > +		kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > +					      write_fault, exec_fault, false);
> > +		return ret;
> > +	}
> > +
> > +	writable = !(memslot->flags & KVM_MEM_READONLY) &&
> > +		   (!logging || write_fault);
> > +
> > +	if (writable)
> > +		prot |= KVM_PGTABLE_PROT_W;
> > +
> > +	if (exec_fault || cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
> > +		prot |= KVM_PGTABLE_PROT_X;
> > +
> > +	pgt = vcpu->arch.hw_mmu->pgt;
>
> Can probably also initialize directly above.

Ack.

> > +
> > +	kvm_fault_lock(kvm);
> > +	if (is_perm) {
> > +		/*
> > +		 * Drop the SW bits in favour of those stored in the
> > +		 * PTE, which will be preserved.
> > +		 */
> > +		prot &= ~KVM_NV_GUEST_MAP_SZ;
> > +		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags);
> > +	} else {
> > +		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
> > +							 __pfn_to_phys(pfn), prot,
> > +							 memcache, flags);
> > +	}
> > +	kvm_release_faultin_page(kvm, page, !!ret, writable);
> > +	kvm_fault_unlock(kvm);
> > +
> > +	if (writable && !ret)
> > +		mark_page_dirty_in_slot(kvm, memslot, gfn);
> > +
> > +	return ret != -EAGAIN ? ret : 0;
> > +}
> > +
>
> Nothing else jumped out at me. But just like on the x86 code, I think
> we need some arch experts to take a look at this one ...

Thanks!
/fuad

> --
> Cheers,
>
> David / dhildenb
>