Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is RUNNABLE

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Tue, 20 May 2025 17:34:46 +0800

On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote:
> > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept
> > > order,
> > > e.g., they always accept 4K, there could be *endless EPT violation* if I
> > > understand your words correctly.
> > > 
> > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead
> > > of
> > > 2M if no accept level is provided in the fault?
> > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs.
> > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH.
> 
> TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The
> docs say the VMM needs to demote *if* the mapping is large and the accept size
> is small. But if we map at 4k size for non-accept EPT violations, we won't hit
> this case. I also wonder what is preventing the TDX module from handling a 2MB
> accept size at 4k mappings. It could be changed maybe.
> 
> But I think Kai's question was: why are we complicating the code for the case of
> non-Linux TDs that also use #VE for accept? It's not necessary to be functional,
> and there aren't any known TDs like that which are expected to use KVM today.
> (err, except the MMU stress test). So in another form the question is: should we
> optimize KVM for a case we don't even know if anyone will use? The answer seems
> obviously no to me.
So, you want to disallow huge pages for non-Linux TDs, then we have no need
to support splitting in the fault path, right?

I'm OK if we don't care non-Linux TDs for now.
This can simplify the splitting code and we can add the support when there's a
need.

> I think this connects the question of whether we can pass the necessary info
> into fault via synthetic error code. Consider this new design:
> 
>  - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> runnable, otherwise returns 2MB
Why prefetch and pre-runnable faults go the first path, while

>  - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> *yet*).
other faults go the second path?

> What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
I tried to avoid the double paths.
IMHO, it's confusing to specify max_level from two paths.

The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
An existing example in KVM is vcpu->arch.mmio_gfn.

We don't need something like the vcpu->arch.mmio_gen because
tdx->violation_gfn_* and tdx->violation_request_level are reset in each
tdx_handle_ept_violation().

BTW, dug into some history:

In v18 of TDX basic series,
enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
kvm_mmu_map_tdp_page().
https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@xxxxxxxxx/
https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@xxxxxxxxx/

For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@xxxxxxxxxxxxxx/

For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
it's initially acked by Paolo in v2, and Sean's reply is at
https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@xxxxxxxxxx .