On Tue, May 20, 2025 at 12:53:33AM +0800, Edgecombe, Rick P wrote: > On Mon, 2025-05-19 at 16:32 +0800, Yan Zhao wrote: > > > On the opposite, if other non-Linux TDs don't follow 1G->2M->4K accept > > > order, > > > e.g., they always accept 4K, there could be *endless EPT violation* if I > > > understand your words correctly. > > > > > > Isn't this yet-another reason we should choose to return PG_LEVEL_4K instead > > > of > > > 2M if no accept level is provided in the fault? > > As I said, returning PG_LEVEL_4K would disallow huge pages for non-Linux TDs. > > TD's accept operations at size > 4KB will get TDACCEPT_SIZE_MISMATCH. > > TDX_PAGE_SIZE_MISMATCH is a valid error code that the guest should handle. The > docs say the VMM needs to demote *if* the mapping is large and the accept size > is small. But if we map at 4k size for non-accept EPT violations, we won't hit > this case. I also wonder what is preventing the TDX module from handling a 2MB > accept size at 4k mappings. It could be changed maybe. > > But I think Kai's question was: why are we complicating the code for the case of > non-Linux TDs that also use #VE for accept? It's not necessary to be functional, > and there aren't any known TDs like that which are expected to use KVM today. > (err, except the MMU stress test). So in another form the question is: should we > optimize KVM for a case we don't even know if anyone will use? The answer seems > obviously no to me. So, you want to disallow huge pages for non-Linux TDs, then we have no need to support splitting in the fault path, right? I'm OK if we don't care non-Linux TDs for now. This can simplify the splitting code and we can add the support when there's a need. > I think this connects the question of whether we can pass the necessary info > into fault via synthetic error code. Consider this new design: > > - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre- > runnable, otherwise returns 2MB Why prefetch and pre-runnable faults go the first path, while > - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages > *yet*). other faults go the second path? > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu. I tried to avoid the double paths. IMHO, it's confusing to specify max_level from two paths. The fault info in vcpu_tdx isn't a real problem as it's per-vCPU. An existing example in KVM is vcpu->arch.mmio_gfn. We don't need something like the vcpu->arch.mmio_gen because tdx->violation_gfn_* and tdx->violation_request_level are reset in each tdx_handle_ept_violation(). BTW, dug into some history: In v18 of TDX basic series, enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to kvm_mmu_map_tdp_page(). https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@xxxxxxxxx/ https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@xxxxxxxxx/ For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(), and Paolo asked to use the tdx_gmem_private_max_mapping_level() path. https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@xxxxxxxxxxxxxx/ For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level", it's initially acked by Paolo in v2, and Sean's reply is at https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@xxxxxxxxxx .