On Thu, 2025-05-15 at 16:26 +0800, Yan Zhao wrote: > On Wed, May 14, 2025 at 02:19:56AM +0800, Edgecombe, Rick P wrote: > > On Thu, 2025-04-24 at 11:04 +0800, Yan Zhao wrote: > > > From: Xiaoyao Li <xiaoyao.li@xxxxxxxxx> > > > > > > Add a wrapper tdh_mem_page_demote() to invoke SEAMCALL TDH_MEM_PAGE_DEMOTE > > > to demote a huge leaf entry to a non-leaf entry in S-EPT. Currently, the > > > TDX module only supports demotion of a 2M huge leaf entry. After a > > > successful demotion, the old 2M huge leaf entry in S-EPT is replaced with a > > > non-leaf entry, linking to the newly-added page table page. The newly > > > linked page table page then contains 512 leaf entries, pointing to the 2M > > > guest private pages. > > > > > > The "gpa" and "level" direct the TDX module to search and find the old > > > huge leaf entry. > > > > > > As the new non-leaf entry points to a page table page, callers need to > > > pass in the page table page in parameter "page". > > > > > > In case of S-EPT walk failure, the entry, level and state where the error > > > was detected are returned in ext_err1 and ext_err2. > > > > > > On interrupt pending, SEAMCALL TDH_MEM_PAGE_DEMOTE returns error > > > TDX_INTERRUPTED_RESTARTABLE. > > > > > > [Yan: Rebased and split patch, wrote changelog] > > > > We should add the level of detail here like we did for the base series ones. > I'll provide changelog details under "---" of each patch in the next version. I mean the commit log (above the "---") needs the same tip style treatment as the other SEAMCALL wrapper patches. > > > > > > > Signed-off-by: Xiaoyao Li <xiaoyao.li@xxxxxxxxx> > > > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx> > > > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > > --- > > > arch/x86/include/asm/tdx.h | 2 ++ > > > arch/x86/virt/vmx/tdx/tdx.c | 20 ++++++++++++++++++++ > > > arch/x86/virt/vmx/tdx/tdx.h | 1 + > > > 3 files changed, 23 insertions(+) > > > > > > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h > > > index 26ffc792e673..08eff4b2f5e7 100644 > > > --- a/arch/x86/include/asm/tdx.h > > > +++ b/arch/x86/include/asm/tdx.h > > > @@ -177,6 +177,8 @@ u64 tdh_mng_key_config(struct tdx_td *td); > > > u64 tdh_mng_create(struct tdx_td *td, u16 hkid); > > > u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp); > > > u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data); > > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page, > > > + u64 *ext_err1, u64 *ext_err2); > > > u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2); > > > u64 tdh_mr_finalize(struct tdx_td *td); > > > u64 tdh_vp_flush(struct tdx_vp *vp); > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c > > > index a66d501b5677..5699dfe500d9 100644 > > > --- a/arch/x86/virt/vmx/tdx/tdx.c > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c > > > @@ -1684,6 +1684,26 @@ u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data) > > > } > > > EXPORT_SYMBOL_GPL(tdh_mng_rd); > > > > > > +u64 tdh_mem_page_demote(struct tdx_td *td, u64 gpa, int level, struct page *page, > > > + u64 *ext_err1, u64 *ext_err2) > > > +{ > > > + struct tdx_module_args args = { > > > + .rcx = gpa | level, > > > > This will only ever be level 2MB, how about dropping the arg? > Do you mean hardcoding level to be 2MB in tdh_mem_page_demote()? Yea, we don't support 1GB, so the level arg on the wrapper is superfluous. > > The SEAMCALL TDH_MEM_PAGE_DEMOTE supports level of 1GB in current TDX module. > > > > + .rdx = tdx_tdr_pa(td), > > > + .r8 = page_to_phys(page), > > > + }; > > > + u64 ret; > > > + > > > + tdx_clflush_page(page); > > > + ret = seamcall_ret(TDH_MEM_PAGE_DEMOTE, &args); > > > + > > > + *ext_err1 = args.rcx; > > > + *ext_err2 = args.rdx; > > > > How about we just call these entry and level_state, like the caller. > Not sure, but I feel that ext_err* might be better, because > - according to the spec, > a) the args.rcx, args.rdx is unmodified (i.e. still hold the passed-in value) > in case of error TDX_INTERRUPTED_RESTARTABLE. > b) args.rcx, args.rdx can only be interpreted as entry and level_state in case > of EPT walk error. > c) in other cases, they are 0. > - consistent with tdh_mem_page_aug(), tdh_mem_range_block()... Yea, it's consistent. I'm ok leaving it as is. > > > > > + > > > + return ret; > > > +} > > > +EXPORT_SYMBOL_GPL(tdh_mem_page_demote); > > > > Looking in the docs, TDX module gives some somewhat constrained guidance: > > 1. TDH.MEM.PAGE.DEMOTE should be invoked in a loop until it terminates > > successfully. > > 2. The host VMM should be designed to avoid cases where interrupt storms prevent > > successful completion of TDH.MEM.PAGE.DEMOTE. > > > > The caller looks like: > > do { > > err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page, > > &entry, &level_state); > > } while (err == TDX_INTERRUPTED_RESTARTABLE); > > > > if (unlikely(tdx_operand_busy(err))) { > > tdx_no_vcpus_enter_start(kvm); > > err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page, > > &entry, &level_state); > > tdx_no_vcpus_enter_stop(kvm); > > } > > > > The brute force second case could also be subjected to a > > TDX_INTERRUPTED_RESTARTABLE and is not handled. As for interrupt storms, I guess > You are right. > > > we could disable interrupts while we do the second brute force case. So the > > TDX_INTERRUPTED_RESTARTABLE loop could have a max retries, and the brute force > > case could also disable interrupts. > Good idea. > > > Hmm, how to pick the max retries count. It's a tradeoff between interrupt > > latency and DOS/code complexity. Do we have any idea how long demote might take? > I did a brief test on my SPR, where the host was not busy : > tdh_mem_page_demote() was called 142 times, with each invocation taking around > 10us. 10us doesn't seem too bad? Makes me think to not loop and instead just do a single retry with interrupts disabled. We should definitely add the data based reasoning to the log. The counter point is that the SEAMCALL must be supporting TDX_INTERRUPTED_RESTARTABLE for a reason. And the reason probably is that it sometimes takes longer than someone that was reasonable. Maybe we should ask TDX module folks if there is any history. > 2 invocations were due to TDX_INTERRUPTED_RESTARTABLE. > For each GFN, at most 1 retry was performed. > > I will do more investigations.