David Hildenbrand <david@xxxxxxxxxx> writes:

> On 03.05.25 00:00, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>>> On Fri, May 02, 2025, David Hildenbrand wrote:
>>>> On 30.04.25 20:58, Ackerley Tng wrote:
>>>>>> -	if (is_private)
>>>>>> +	if (is_gmem)
>>>>>>  		return max_level;
>>>>>
>>>>> I think this renaming isn't quite accurate.
>>>>
>>>> After our discussion yesterday, does that still hold true?
>>>
>>> No.
>>>
>>>>> IIUC in __kvm_mmu_max_mapping_level(), we skip considering
>>>>> host_pfn_mapping_level() if the gfn is private because private memory
>>>>> will not be mapped to userspace, so there's no need to query userspace
>>>>> page tables in host_pfn_mapping_level().
>>>>
>>>> I think the reason was that: for private we won't be walking the user space
>>>> page tables.
>>>>
>>>> Once guest_memfd is also responsible for the shared part, why should this
>>>> here still be private-only, and why should we consider querying a user space
>>>> mapping that might not even exist?
>>>
>>> +1, one of the big selling points for guest_memfd beyond CoCo is that it provides
>>> guest-first memory. It is very explicitly an intended feature that the guest
>>> mappings KVM creates can be a superset of the host userspace mappings. E.g. the
>>> guest can use larger page sizes, have RW while the host has RO, etc.
>>
>> Do you mean that __kvm_mmu_max_mapping_level() should, in addition to
>> the parameter renaming from is_private to is_gmem, do something like
>>
>>     if (is_gmem)
>>         return kvm_gmem_get_max_mapping_level(slot, gfn);
>
> I assume you mean, not looking at lpage_info at all?
>

My bad. I actually meant just to take input from guest_memfd and stop
there without checking host page tables, perhaps something like

    min(kvm_gmem_get_max_mapping_level(slot, gfn), max_level);

> I have limited understanding what lpage_info is or what it does. I
> believe all it adds is a mechanism to *disable* large page mappings.
>

This is my understanding too.

> We want to disable large pages if (using 2M region as example)
>
> (a) Mixed memory attributes. If a PFN falls into a 2M region, and parts
>     of that region are shared vs. private (mixed memory attributes ->
>     KVM_LPAGE_MIXED_FLAG)
>
>     -> With gmem-shared we could have mixed memory attributes, not a PFN
>        fracturing. (PFNs don't depend on memory attributes)
>
> (b) page track: intercepting (mostly write) access to GFNs
>

Could you explain more about the page track case?

>
> So, I wonder if we still have to take care of lpage_info, at least for
> handling (b) correctly [I assume so]. Regarding (a) I am not sure: once
> memory attributes are handled by gmem in the gmem-shared case. IIRC,
> with AMD SEV we might still have to honor it? But gmem itself could
> handle that.
>

For AMD SEV, I believe kvm_max_private_mapping_level() already takes
care of that, at least for the MMU faulting path [1], where guest_memfd
gives input using max_order, then the arch-specific callback contributes
input.

>
> What we could definitely do here for now is:
>
> if (is_gmem)
> 	/* gmem only supports 4k pages for now. */
> 	return PG_LEVEL_4K;
>
> And not worry about lpage_info for the time being, until we actually do
> support larger pages.
>

Perhaps this is better explained as an RFC in code. I'll put in a patch
as part of Fuad's series if Fuad doesn't mind.
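
For now, roughly what I have in mind, sketched on top of the current
__kvm_mmu_max_mapping_level(). This is untested and only meant to
illustrate the idea; kvm_gmem_get_max_mapping_level() is just a
placeholder name for however guest_memfd ends up reporting the max
mapping level it can support for a gfn:

static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
                                       const struct kvm_memory_slot *slot,
                                       gfn_t gfn, int max_level, bool is_gmem)
{
        struct kvm_lpage_info *linfo;
        int host_level;

        max_level = min(max_level, max_huge_page_level);

        /* Keep honoring lpage_info for now, e.g. for the page track case (b). */
        for ( ; max_level > PG_LEVEL_4K; max_level--) {
                linfo = lpage_info_slot(gfn, slot, max_level);
                if (!linfo->disallow_lpage)
                        break;
        }

        /*
         * For gmem-backed gfns, take guest_memfd's input and stop here:
         * don't walk host userspace page tables, since a userspace mapping
         * might not even exist.
         */
        if (is_gmem)
                return min(kvm_gmem_get_max_mapping_level(slot, gfn), max_level);

        if (max_level == PG_LEVEL_4K)
                return PG_LEVEL_4K;

        host_level = host_pfn_mapping_level(kvm, gfn, slot);
        return min(host_level, max_level);
}

(Until guest_memfd actually supports larger pages,
kvm_gmem_get_max_mapping_level() could of course just return
PG_LEVEL_4K, which matches your suggestion above.)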
>>
>> and basically defer to gmem as long as gmem should be used for this gfn?
>>
>> There is another call to __kvm_mmu_max_mapping_level() via
>> kvm_mmu_max_mapping_level() beginning from recover_huge_pages_range(),
>> and IIUC that doesn't go through guest_memfd.
>>
>> Hence, unlike the call to __kvm_mmu_max_mapping_level() from the KVM x86
>> MMU fault path, guest_memfd didn't get a chance to provide its input in
>> the form of returning max_order from kvm_gmem_get_pfn().
>
> Right, we essentially say that "this is a private fault", likely
> assuming that we already verified earlier that the memory is also private.
>
> [I can see that happening when the function is called through
> direct_page_fault()]
>
> We could simply call kvm_mmu_max_mapping_level() from
> kvm_mmu_hugepage_adjust() I guess. (could possibly be optimized later)
>
> --
> Cheers,
>
> David / dhildenb

[1] https://github.com/torvalds/linux/blob/01f95500a162fca88cefab9ed64ceded5afabc12/arch/x86/kvm/mmu/mmu.c#L4480
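
(Also appending a rough sketch of the kvm_mmu_hugepage_adjust() idea
above, since it's easier to discuss in code. Untested;
kvm_slot_has_gmem() and kvm_mem_from_gmem() are placeholder names for
whatever "this gfn is backed by guest_memfd" check we end up with --
IIUC today kvm_mmu_max_mapping_level() derives is_private from
kvm_slot_can_be_private() and kvm_mem_is_private().)

int kvm_mmu_max_mapping_level(struct kvm *kvm,
                              const struct kvm_memory_slot *slot, gfn_t gfn)
{
        /* Placeholder check for "this gfn is backed by guest_memfd". */
        bool is_gmem = kvm_slot_has_gmem(slot) && kvm_mem_from_gmem(kvm, gfn);

        return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_gmem);
}

and then kvm_mmu_hugepage_adjust() would go through the same helper as
recover_huge_pages_range(), instead of calling
__kvm_mmu_max_mapping_level() with fault->is_private directly:

        /*
         * Let the common helper (and through it, guest_memfd) decide, so
         * the fault path and the huge page recovery path agree; max_order
         * from kvm_gmem_get_pfn() is still respected via fault->max_level.
         */
        fault->req_level = min_t(u8, fault->max_level,
                                 kvm_mmu_max_mapping_level(vcpu->kvm, fault->slot,
                                                           fault->gfn));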