On Wed, May 14, 2025 at 06:41:10AM -0700, Sean Christopherson wrote:
> On Fri, May 02, 2025, Kirill A. Shutemov wrote:
> > This RFC patchset enables Dynamic PAMT in TDX. It is not intended to be
> > applied, but rather to receive early feedback on the feature design and
> > enabling.
>
> In that case, please describe the design, and specifically *why* you
> chose this particular design, along with the constraints and rules of
> dynamic PAMTs that led to that decision. It would also be very helpful
> to know what options you considered and discarded, so that others don't
> waste time coming up with solutions that you already rejected.

Dynamic PAMT support in TDX module
==================================

Dynamic PAMT is a TDX feature that allows the VMM to allocate PAMT_4K as
needed. PAMT_1G and PAMT_2M are still allocated statically at the time of
TDX module initialization. At the init stage, allocation of PAMT_4K is
replaced with PAMT_PAGE_BITMAP, which currently requires one bit of memory
per 4K page.

The VMM is responsible for allocating and freeing PAMT_4K. There's a pair
of new SEAMCALLs for it: TDH.PHYMEM.PAMT.ADD and TDH.PHYMEM.PAMT.REMOVE.
They add/remove PAMT memory in the form of a page pair. There's no
requirement for these pages to be contiguous.

A page pair supplied via TDH.PHYMEM.PAMT.ADD covers the specified 2M
region. It allows any 4K page from the region to be usable by the TDX
module.

With Dynamic PAMT, a number of SEAMCALLs can now fail due to missing PAMT
memory (TDX_MISSING_PAMT_PAGE_PAIR):

 - TDH.MNG.CREATE
 - TDH.MNG.ADDCX
 - TDH.VP.ADDCX
 - TDH.VP.CREATE
 - TDH.MEM.PAGE.ADD
 - TDH.MEM.PAGE.AUG
 - TDH.MEM.PAGE.DEMOTE
 - TDH.MEM.PAGE.RELOCATE

Basically, if you supply memory to a TD, this memory has to be backed by
PAMT memory.

Once no TD uses the 2M range, the PAMT page pair can be reclaimed with
TDH.PHYMEM.PAMT.REMOVE.

The TDX module tracks PAMT memory usage and can give the VMM a hint that
PAMT memory can be removed.
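
To put rough numbers on the bitmap and page-pair coverage described above,
here is a minimal userspace sketch in C. It is not code from the patchset
or from the TDX module. The helper names (pamt_region_base(),
pages_per_region(), pamt_bitmap_bytes()) are made up for illustration, and
it assumes that the bitmap is exactly one bit per 4K page and that the 2M
region covered by a page pair is naturally aligned.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)

/*
 * Hypothetical helper: base HPA of the 2M region containing @hpa.
 * Assumes the region covered by a PAMT page pair is 2M-aligned.
 */
static inline uint64_t pamt_region_base(uint64_t hpa)
{
	return hpa & ~(SZ_2M - 1);
}

/* Number of 4K pages that one page pair makes usable (one 2M region). */
static inline uint64_t pages_per_region(void)
{
	return SZ_2M / SZ_4K;
}

/*
 * Hypothetical helper: size in bytes of a bitmap holding one bit per
 * 4K page for @mem_bytes of memory. The sketch only handles sizes that
 * are a whole number of 4K pages.
 */
static inline uint64_t pamt_bitmap_bytes(uint64_t mem_bytes)
{
	assert(mem_bytes % SZ_4K == 0);

	uint64_t pages = mem_bytes / SZ_4K;

	/* Round up to whole bytes. */
	return (pages + 7) / 8;
}
```

Under these assumptions, 1 TiB of memory needs a 32 MiB bitmap, one page
pair makes 512 4K pages usable, and HPA 0x12345678 falls into the 2M
region starting at 0x12200000.
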
The removal hint is returned by all SEAMCALLs that remove memory from a
TD:

 - TDH.MEM.SEPT.REMOVE
 - TDH.MEM.PAGE.REMOVE
 - TDH.MEM.PAGE.PROMOTE
 - TDH.MEM.PAGE.RELOCATE
 - TDH.PHYMEM.PAGE.RECLAIM

With Dynamic PAMT, TDH.MEM.PAGE.DEMOTE takes a PAMT page pair as
additional input to populate PAMT_4K on split. TDH.MEM.PAGE.PROMOTE
returns the PAMT page pair that is no longer needed.

PAMT memory is a global resource and is not tied to a specific TD. The TDX
module maintains PAMT memory in a radix tree addressed by physical
address. Each entry in the tree can be locked with a shared or an
exclusive lock. Any modification of the tree requires the exclusive lock.

Any SEAMCALL that takes an explicit HPA as an argument will walk the tree,
taking shared locks on the entries. This is required to make sure that the
page pointed to by the HPA is of a compatible type for the usage.

TDCALLs don't take PAMT locks as none of them take an HPA as an argument.

Dynamic PAMT enabling in kernel
===============================

The kernel maintains a refcount for every 2M region, with two helpers:
tdx_pamt_get() and tdx_pamt_put().

The refcount represents the number of users of the PAMT memory in the
region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.

PAMT memory gets allocated as part of TD init, VCPU init, on populating
the SEPT tree and on adding guest memory (both during TD build and via AUG
on accept).

PAMT memory is removed on reclaim of control pages and guest memory.

Populating PAMT memory on fault is tricky as we cannot allocate memory
from the context where it is needed. I introduced a pair of kvm_x86_ops to
allocate PAMT memory from a per-VCPU pool from a context where the VCPU is
still around, and to free it on failure. This flow will likely be reworked
in the next versions.

Previous attempt on Dynamic PAMT enabling
=========================================

My initial kernel enabling attempt was quite different.
I wanted to make PAMT allocation lazy: only try to add a PAMT page pair
if a SEAMCALL fails due to missing PAMT, and reclaim it based on the hint
provided by the TDX module.

The motivation was to avoid duplicating on the kernel side the PAMT memory
refcounting that the TDX module already does.

This approach is inherently more racy as we don't serialize PAMT memory
add/remove against SEAMCALLs that add/remove memory for a TD. Such
serialization would require global locking, which is a no-go.

I made this approach work, but at some point I realized that it cannot be
robust as long as we want to avoid TDX_OPERAND_BUSY loops.
TDX_OPERAND_BUSY will pop up as a result of the races I mentioned above.

I gave up on this approach and went with the current one, which uses
explicit refcounting.

Brain dumped. Let me know if anything is unclear.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov