On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > Basic huge page mapping/unmapping
> > ---------------------------------
> > - TD build time
> >   This series enforces that all private mappings be 4KB during the TD
> >   build phase, due to the TDX module's requirement that
> >   tdh_mem_page_add(), the SEAMCALL for adding private pages during TD
> >   build time, only supports 4KB mappings. Enforcing 4KB mappings also
> >   simplifies the TD build time code, by eliminating the need to
> >   consider merging or splitting in the mirror page table during TD
> >   build time.
> >
> >   The underlying pages allocated from guest_memfd during the TD build
> >   phase can still be large, allowing for potential merging into 2MB
> >   mappings once the TD is running.
>
> It can be done before the TD is running. Merging is allowed at the TD
> build stage.
>
> But, yes, for simplicity we can skip it for initial enabling.

Yes, to avoid complicating the kvm_tdx->nr_premapped calculation.
I also don't see any benefit to allowing merging during the TD build
stage.

> > Page splitting (page demotion)
> > ------------------------------
> > Page splitting occurs in two paths:
> > (a) with exclusive kvm->mmu_lock, triggered by zapping operations.
> >
> >     For normal VMs, if zapping a narrow region would require
> >     splitting a huge page, KVM can simply zap the surrounding GFNs
> >     rather than splitting the huge page. The pages can then be
> >     faulted back in, where KVM can handle mapping them at 4KB level.
> >
> >     The reason TDX can't use the normal VM solution is that zapped
> >     private memory that was accepted cannot easily be re-faulted,
> >     since it can only be re-faulted as unaccepted. So KVM sometimes
> >     has to do the page splitting as part of the zapping operations.
> >
> >     These zapping operations can occur for a few reasons:
> >     1. VM teardown.
> >     2. Memslot removal.
> >     3. Conversion of private pages to shared.
> >     4. Userspace does a hole punch to guest_memfd for some reason.
> >
> >     For cases 1 and 2, splitting before zapping is unnecessary
> >     because either the entire range will be zapped or huge pages do
> >     not span memslots.
> >
> >     Cases 3 and 4 require splitting, which is also followed by a
> >     backend page splitting in guest_memfd.
> >
> > (b) with shared kvm->mmu_lock, triggered by fault.
> >
> >     Splitting in this path is not accompanied by a backend page
> >     splitting (since backend page splitting necessitates a splitting
> >     and zapping operation in the former path). It is triggered when
> >     KVM finds that a non-leaf entry is replacing a huge entry in the
> >     fault path, which is usually caused by vCPUs' concurrent ACCEPT
> >     operations at different levels.
>
> Hm. This sounds like funky behaviour on the guest side.
>
> You only saw it in a synthetic test, right? No real guest OS should do
> this.

Right. In selftests only. Also in case of any guest bugs.

> It can only be possible if the guest is reckless enough to be exposed
> to double accept attacks.
>
> We should consider putting a warning in KVM if we detect such a case.

Is it acceptable to put warnings in the host kernel in case of guest
bugs or attacks?

> > This series simply ignores the splitting request in the fault path
> > to avoid unnecessary bounces between levels. The vCPU that performs
> > ACCEPT at a lower level will eventually figure out that the page has
> > been accepted at a higher level by another vCPU.
> >
> > A rare case that could lead to splitting in the fault path is when a
> > TD is configured to receive #VE and accesses memory before the
> > ACCEPT operation. By the time a vCPU accesses a private GFN, due to
> > the lack of any guest preferred level, KVM could create a mapping at
> > 2MB level. If the TD then only performs the ACCEPT operation at 4KB
> > level, splitting in the fault path will be triggered.
However, this is not
> > regarded as a typical use case, as usually a TD accepts pages in the
> > order 1GB->2MB->4KB. The worst outcome of ignoring the resulting
> > splitting request is an endless EPT violation. This would not happen
> > for a Linux guest, which does not expect any #VE.
>
> Even if the guest accepts memory in response to #VE, it still has to
> serialize ACCEPT requests to the same memory block. And track what has
> been accepted.
>
> Double accept is a guest bug.

In the rare case, there is no double accept.
1. Guest accesses a private GPA.
2. KVM creates a 2MB mapping in PENDING state and returns to the guest.
3. Guest re-accesses, causing the TDX module to inject a #VE.
4. Guest accepts at 4KB level only.
5. EPT violation to KVM for page splitting.

Here, we expect a normal guest to accept from 1GB->2MB->4KB in step 4.