On Sat, May 17, 2025 at 06:02:14AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2025-05-16 at 14:30 +0800, Yan Zhao wrote:
> > > Looking more closely, I don't see why it's too hard to pass in a
> > > max_fault_level into the fault struct. Totally untested rough idea,
> > > what do you think?
> > Thanks for bringing this up and providing the idea below. In the
> > previous TDX huge page v8, there's a similar implementation [1][2].
> > 
> > This series did not adopt that approach because it requires
> > tdx_handle_ept_violation() to pass in max_fault_level, which is not
> > always available at that stage. e.g.
> > 
> > In patch 19, when vCPU 1 faults on a GFN at 2MB level and then vCPU 2
> > faults on the same GFN at 4KB level, TDX wants to ignore the demotion
> > request caused by vCPU 2's 4KB level fault. So, patch 19 sets
> > tdx->violation_request_level to 2MB in vCPU 2's split callback and
> > fails the split. vCPU 2's __vmx_handle_ept_violation() will see
> > RET_PF_RETRY and either do a local retry or return to the guest.
> I think you mean patch 20 "KVM: x86: Force a prefetch fault's max
> mapping level to 4KB for TDX"?
Sorry. It's patch 21 "KVM: x86: Ignore splitting huge pages in fault path
for TDX".

> > If it retries locally, tdx_gmem_private_max_mapping_level() will
> > return tdx->violation_request_level, causing KVM to fault at 2MB
> > level for vCPU 2, resulting in a spurious fault and eventually
> > returning to the guest.
> > 
> > As tdx->violation_request_level is per-vCPU and is reset in
> > tdx_get_accept_level() in tdx_handle_ept_violation() (meaning it
> > resets on each invocation of tdx_handle_ept_violation() and only
> > affects the TDX local retry loop), it should not hold any stale
> > value.
> > 
> > Alternatively, instead of having tdx_gmem_private_max_mapping_level()
> > return tdx->violation_request_level, tdx_handle_ept_violation() could
> > grab tdx->violation_request_level as the max_fault_level to pass to
> > __vmx_handle_ept_violation().
> > 
> > This series chose to use tdx_gmem_private_max_mapping_level() to
> > avoid modifying the KVM MMU core.
> It sounds like Kirill is suggesting we do have to have demotion in the
> fault path. IIRC it adds a lock, but the cost to skip fault path
> demotion seems to be adding up.
Yes. Though Kirill is suggesting supporting demotion in the fault path, I
still think that using tdx_gmem_private_max_mapping_level() might be more
friendly to other potential scenarios, such as when the KVM core MMU
requests TDX to perform page promotion and TDX finds that promotion would
consistently fail on a GFN.

Another important reason for not passing a max_fault_level into the fault
struct is that the KVM MMU now has the private_max_mapping_level hook to
determine a private fault's maximum mapping level, introduced by commit
f32fb32820b1 ("KVM: x86: Add hook for determining max NPT mapping
level"). We'd better not introduce another mechanism when the same job
can be accomplished via the private_max_mapping_level hook. The code in
TDX huge page v8 [1][2] simply inherited the old implementation from its
v1 [3], where the private_max_mapping_level hook had not yet been
introduced for private faults.

(A simplified sketch of the local-retry flow described above is appended
after the links below.)

[1] https://lore.kernel.org/all/4d61104bff388a081ff8f6ae4ac71e05a13e53c3.1708933624.git.isaku.yamahata@xxxxxxxxx/
[2] https://lore.kernel.org/all/3d2a6bfb033ee1b51f7b875360bd295376c32b54.1708933624.git.isaku.yamahata@xxxxxxxxx/
[3] https://lore.kernel.org/all/cover.1659854957.git.isaku.yamahata@xxxxxxxxx/
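
For clarity, here is a minimal sketch of that local-retry flow. It is
illustrative only: the struct layout, function signatures and return
values below are simplified assumptions, not the code from patch 21.
Only the idea of recording violation_request_level in the split callback
and reporting it from the private_max_mapping_level hook is taken from
the series.

/*
 * Illustrative sketch only -- simplified types and signatures, not the
 * actual code from patch 21.  Levels follow KVM's PG_LEVEL_* convention
 * (4K = 1, 2M = 2).
 */
enum sketch_pg_level {
	SKETCH_PG_LEVEL_4K = 1,
	SKETCH_PG_LEVEL_2M = 2,
};

struct vcpu_tdx_sketch {
	/*
	 * Per-vCPU.  Set by the split callback when TDX declines a
	 * demotion (e.g. vCPU 2's 4KB fault racing with vCPU 1's 2MB
	 * mapping).  Reset at the start of each EPT-violation handling
	 * (in the series, in tdx_get_accept_level() called from
	 * tdx_handle_ept_violation()), so it only influences the local
	 * retry loop of a single violation.
	 */
	enum sketch_pg_level violation_request_level;
};

/* Split callback: decline the demotion and record the wanted level. */
static int sketch_split_callback(struct vcpu_tdx_sketch *tdx)
{
	tdx->violation_request_level = SKETCH_PG_LEVEL_2M;
	return -1;	/* fail the split; the fault then sees RET_PF_RETRY */
}

/*
 * private_max_mapping_level hook: on the local retry, report the
 * recorded level so KVM faults at 2MB again, detects the fault as
 * spurious, and returns to the guest without demoting the page.
 */
static int sketch_private_max_mapping_level(struct vcpu_tdx_sketch *tdx)
{
	return tdx->violation_request_level;
}

With this shape, no max_fault_level needs to be threaded through the
fault struct; the existing hook already carries the information for the
retry.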