Ackerley Tng <ackerleytng@xxxxxxxxxx> writes:

> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>
>> On Wed, Jul 23, 2025, Xiaoyao Li wrote:
>>> On 7/23/2025 6:47 PM, Fuad Tabba wrote:
>>
>> ...
>>
>>> > +	if (max_level == PG_LEVEL_4K)
>>> > +		return max_level;
>>> > +
>>> > +	return min(max_level,
>>> > +		   kvm_x86_call(gmem_max_mapping_level)(kvm, pfn));
>>> > }
>>>
>>> I don't mean to ask for a next version.
>>>
>>> But I have to point out that the coco_level stuff in the next patch
>>> should actually be put in this patch, because this patch does the
>>> wrong thing by changing
>>>
>>> 	req_max_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
>>> 	if (req_max_level)
>>> 		max_level = min(max_level, req_max_level);
>>>
>>> to
>>>
>>> 	return min(max_level,
>>> 		   kvm_x86_call(gmem_max_mapping_level)(kvm, pfn));
>>
>> Gah, nice catch. Let's do one more version (knock wood). I have no objection
>> to fixing up my own goof, but the selftest needs to be reworked too, and I think
>> it makes sense for Paolo to grab this directly. The fewer "things" we need to
>> hand off to Paolo, the better.
>>
>> The fixup will generate a minor conflict, but it's trivial to resolve, and the
>> resting state should end up identical.
>>
>> As fixup:
>>
>> ---
>>  arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 6148cc96f7d4..c4ff8b4028df 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3305,9 +3305,9 @@ static u8 kvm_max_level_for_order(int order)
>>  static u8 kvm_max_private_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>>  					const struct kvm_memory_slot *slot, gfn_t gfn)
>>  {
>> +	u8 max_level, coco_level;
>>  	struct page *page;
>>  	kvm_pfn_t pfn;
>> -	u8 max_level;
>>
>>  	/* For faults, use the gmem information that was resolved earlier. */
>>  	if (fault) {
>> @@ -3331,8 +3331,16 @@ static u8 kvm_max_private_mapping_level(struct kvm *kvm, struct kvm_page_fault *
>>  	if (max_level == PG_LEVEL_4K)
>>  		return max_level;
>>
>> -	return min(max_level,
>> -		   kvm_x86_call(gmem_max_mapping_level)(kvm, pfn));
>> +	/*
>> +	 * CoCo may influence the max mapping level, e.g. due to RMP or S-EPT
>> +	 * restrictions. A return of '0' means "no additional restrictions", to
>> +	 * allow for using an optional "ret0" static call.
>> +	 */
>> +	coco_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
>> +	if (coco_level)
>> +		max_level = min(max_level, coco_level);
>> +
>> +	return max_level;
>>  }
>>
>>  int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>>
>> base-commit: f937c99dad18339773f18411f2a0193b5db8b581
>> --
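
Restating Xiaoyao's point, mostly for my own benefit: gmem_max_mapping_level
is an optional "ret0" static call, so for VM types with no extra restrictions
it returns 0, and the unguarded min() would clamp the level to 0 (below even
PG_LEVEL_4K) instead of leaving it untouched. A contrived illustration, not
real code:

	u8 max_level = PG_LEVEL_1G;	/* e.g. from the gmem order */
	u8 coco_level = 0;		/* "no additional restrictions" */

	/*
	 * Unguarded: min(max_level, coco_level) == 0, which is not a usable
	 * mapping level.  With the guard from the fixup, max_level stays
	 * PG_LEVEL_1G and the 0 is treated as "no opinion".
	 */
	if (coco_level)
		max_level = min(max_level, coco_level);
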
>>
>> Or a full patch:
>>
>> From: Sean Christopherson <seanjc@xxxxxxxxxx>
>> Date: Wed, 23 Jul 2025 11:47:06 +0100
>> Subject: [PATCH] KVM: x86/mmu: Enforce guest_memfd's max order when recovering
>>  hugepages
>>
>> Rework kvm_mmu_max_mapping_level() to consult guest_memfd (and relevant
>> vendor code) when recovering hugepages, e.g. after disabling live migration.
>> The flaw has existed since guest_memfd was originally added, but has gone
>> unnoticed due to lack of guest_memfd hugepage support.
>>
>> Get all information on-demand from the memslot and guest_memfd instance,
>> even though KVM could pull the pfn from the SPTE. However, the max
>> order/level needs to come from guest_memfd, and using kvm_gmem_get_pfn()
>> avoids adding a new gmem API, and avoids having to retrieve the pfn and
>> plumb it into kvm_mmu_max_mapping_level() (the pfn is needed for SNP to
>> consult the RMP).
>>
>> Note, calling kvm_mem_is_private() in the non-fault path is safe, so long
>> as mmu_lock is held, as hugepage recovery operates on shadow-present SPTEs,
>> i.e. calling kvm_mmu_max_mapping_level() with @fault=NULL is mutually
>> exclusive with kvm_vm_set_mem_attributes() changing the PRIVATE attribute
>> of the gfn.
>>
>> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
>> ---
>>  arch/x86/kvm/mmu/mmu.c          | 91 ++++++++++++++++++++-------------
>>  arch/x86/kvm/mmu/mmu_internal.h |  2 +-
>>  arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
>>  3 files changed, 58 insertions(+), 37 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 20dd9f64156e..c4ff8b4028df 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3302,31 +3302,63 @@ static u8 kvm_max_level_for_order(int order)
>>  	return PG_LEVEL_4K;
>>  }
>>
>> -static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
>> -					u8 max_level, int gmem_order)
>> +static u8 kvm_max_private_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> +					const struct kvm_memory_slot *slot, gfn_t gfn)
>
> Would you consider renaming this kvm_max_gmem_mapping_level()? Or
> something that doesn't limit the use of this function to private memory?
>
>>  {
>> -	u8 req_max_level;
>> +	u8 max_level, coco_level;
>> +	struct page *page;
>> +	kvm_pfn_t pfn;
>>
>> -	if (max_level == PG_LEVEL_4K)
>> -		return PG_LEVEL_4K;
>> +	/* For faults, use the gmem information that was resolved earlier. */
>> +	if (fault) {
>> +		pfn = fault->pfn;
>> +		max_level = fault->max_level;
>> +	} else {
>> +		/* TODO: Constify the guest_memfd chain. */
>> +		struct kvm_memory_slot *__slot = (struct kvm_memory_slot *)slot;
>> +		int max_order, r;
>> +
>> +		r = kvm_gmem_get_pfn(kvm, __slot, gfn, &pfn, &page, &max_order);
>> +		if (r)
>> +			return PG_LEVEL_4K;
>> +
>> +		if (page)
>> +			put_page(page);
>
> When I was working on this, I added a kvm_gmem_mapping_order() [1] so that
> guest_memfd could return the order that this gfn would be allocated at
> without actually doing the allocation. Is it okay that an allocation may
> be performed here?
>
> [1] https://lore.kernel.org/all/20250717162731.446579-13-tabba@xxxxxxxxxx/
>
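
To make that question concrete, what I had in mind for the non-fault path was
something along these lines; kvm_gmem_mapping_order() is the helper from [1],
and the exact signature here is my assumption, not an established API:

	} else {
		/*
		 * Hypothetical: ask guest_memfd for the order this gfn would
		 * be backed at, without forcing an allocation.
		 */
		int order = kvm_gmem_mapping_order(slot, gfn);

		max_level = kvm_max_level_for_order(order);
	}

That said, this wouldn't hand back a pfn, and per the changelog the pfn is
needed for SNP to consult the RMP, so maybe kvm_gmem_get_pfn() is the
pragmatic choice here after all.
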
>> +
>> +		max_level = kvm_max_level_for_order(max_order);
>> +	}
>>
>> -	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
>>  	if (max_level == PG_LEVEL_4K)
>> -		return PG_LEVEL_4K;
>> +		return max_level;
>
> I think the above line is a git-introduced issue, there probably
> shouldn't be a return here.
>

My bad, this is a correct short-circuiting of the rest of the function
since there's no smaller PG_LEVEL than PG_LEVEL_4K.

>>
>> -	req_max_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
>> -	if (req_max_level)
>> -		max_level = min(max_level, req_max_level);
>> +	/*
>> +	 * CoCo may influence the max mapping level, e.g. due to RMP or S-EPT
>> +	 * restrictions. A return of '0' means "no additional restrictions", to
>> +	 * allow for using an optional "ret0" static call.
>> +	 */
>> +	coco_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
>> +	if (coco_level)
>> +		max_level = min(max_level, coco_level);
>>
>
> This part makes sense :)
>
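
For completeness, my mental model of the vendor-side contract, as a made-up
example rather than the actual TDX/SEV-SNP code (signature inferred from the
caller above):

/*
 * Hypothetical vendor hook: return a PG_LEVEL_* cap when the pfn has
 * additional restrictions (e.g. RMP or S-EPT state), or 0 for "no
 * additional restrictions" so that the optional "ret0" static call
 * default behaves the same as an explicit "no opinion".
 */
static u8 example_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
{
	/* example_pfn_restricted_to_4k() stands in for the platform lookup. */
	if (example_pfn_restricted_to_4k(kvm, pfn))
		return PG_LEVEL_4K;

	return 0;
}
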
>>  	return max_level;
>>  }
>>
>> -static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
>> -				       const struct kvm_memory_slot *slot,
>> -				       gfn_t gfn, int max_level, bool is_private)
>> +int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> +			      const struct kvm_memory_slot *slot, gfn_t gfn)
>>  {
>>  	struct kvm_lpage_info *linfo;
>> -	int host_level;
>> +	int host_level, max_level;
>> +	bool is_private;
>> +
>> +	lockdep_assert_held(&kvm->mmu_lock);
>> +
>> +	if (fault) {
>> +		max_level = fault->max_level;
>> +		is_private = fault->is_private;
>> +	} else {
>> +		max_level = PG_LEVEL_NUM;
>> +		is_private = kvm_mem_is_private(kvm, gfn);
>> +	}
>>
>>  	max_level = min(max_level, max_huge_page_level);
>>  	for ( ; max_level > PG_LEVEL_4K; max_level--) {
>> @@ -3335,25 +3367,16 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
>>  			break;
>>  	}
>>
>> +	if (max_level == PG_LEVEL_4K)
>> +		return PG_LEVEL_4K;
>> +
>>  	if (is_private)
>> -		return max_level;
>> -
>> -	if (max_level == PG_LEVEL_4K)
>> -		return PG_LEVEL_4K;
>> -
>> -	host_level = host_pfn_mapping_level(kvm, gfn, slot);
>> +		host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
>> +	else
>> +		host_level = host_pfn_mapping_level(kvm, gfn, slot);
>>  	return min(host_level, max_level);
>>  }
>>
>> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
>> -			      const struct kvm_memory_slot *slot, gfn_t gfn)
>> -{
>> -	bool is_private = kvm_slot_has_gmem(slot) &&
>> -			  kvm_mem_is_private(kvm, gfn);
>> -
>> -	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_private);
>> -}
>> -
>>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>>  {
>>  	struct kvm_memory_slot *slot = fault->slot;
>> @@ -3374,9 +3397,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>>  	 * Enforce the iTLB multihit workaround after capturing the requested
>>  	 * level, which will be used to do precise, accurate accounting.
>>  	 */
>> -	fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
>> -						       fault->gfn, fault->max_level,
>> -						       fault->is_private);
>> +	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, fault,
>> +						     fault->slot, fault->gfn);
>>  	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>>  		return;
>>
>> @@ -4564,8 +4586,7 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
>>  	}
>>
>>  	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
>> -	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
>> -							 fault->max_level, max_order);
>> +	fault->max_level = kvm_max_level_for_order(max_order);
>>
>>  	return RET_PF_CONTINUE;
>>  }
>> @@ -7165,7 +7186,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>  	 * mapping if the indirect sp has level = 1.
>>  	 */
>>  	if (sp->role.direct &&
>> -	    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn)) {
>> +	    sp->role.level < kvm_mmu_max_mapping_level(kvm, NULL, slot, sp->gfn)) {
>>  		kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>>
>>  		if (kvm_available_flush_remote_tlbs_range())
>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
>> index 65f3c89d7c5d..b776be783a2f 100644
>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>> @@ -411,7 +411,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>  	return r;
>>  }
>>
>> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
>> +int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>>  			      const struct kvm_memory_slot *slot, gfn_t gfn);
>>  void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>>  void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
>> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
>> index 7f3d7229b2c1..740cb06accdb 100644
>> --- a/arch/x86/kvm/mmu/tdp_mmu.c
>> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
>> @@ -1813,7 +1813,7 @@ static void recover_huge_pages_range(struct kvm *kvm,
>>  		if (iter.gfn < start || iter.gfn >= end)
>>  			continue;
>>
>> -		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
>> +		max_mapping_level = kvm_mmu_max_mapping_level(kvm, NULL, slot, iter.gfn);
>>  		if (max_mapping_level < iter.level)
>>  			continue;
>>
>>
>> base-commit: 84ca709e4f4d54aae3b8d4df74490d8d3d2b1272
>> --
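
One last note, to check my understanding of the two calling conventions after
this patch (both lifted from the diff above, not new code):

	/* Fault path: @fault carries the pfn/order already resolved via gmem. */
	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, fault,
						     fault->slot, fault->gfn);

	/*
	 * Recovery path: @fault == NULL, so the PRIVATE attribute and the
	 * gmem order/pfn are re-queried under mmu_lock.
	 */
	max_mapping_level = kvm_mmu_max_mapping_level(kvm, NULL, slot, iter.gfn);

i.e. passing @fault=NULL is what routes the lookup through kvm_mem_is_private()
and kvm_gmem_get_pfn() instead of reusing what the fault already resolved.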