On Mon, 2025-06-09 at 22:13 +0300, Kirill A. Shutemov wrote:
> Dynamic PAMT enabling in kernel
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The kernel maintains a refcount for every 2M region with two helpers,
> tdx_pamt_get() and tdx_pamt_put().
> 
> The refcount represents the number of users of the PAMT memory in the
> region. The kernel calls TDH.PHYMEM.PAMT.ADD on the 0->1 transition and
> TDH.PHYMEM.PAMT.REMOVE on the 1->0 transition.
> 
> The function tdx_alloc_page() allocates a new page and ensures that it
> is backed by PAMT memory. Pages allocated in this manner are ready to
> be used for a TD. The function tdx_free_page() frees the page and
> releases the PAMT memory for the 2M region if it is no longer needed.
> 
> PAMT memory gets allocated as part of TD init, VCPU init, on populating
> the SEPT tree and on adding guest memory (both during TD build and via
> AUG on accept). Splitting a 2M page into 4K pages also requires PAMT
> memory.
> 
> PAMT memory is removed on reclaim of control pages and guest memory.
> 
> Populating PAMT memory on fault and on split is tricky as the kernel
> cannot allocate memory from the context where it is needed. These code
> paths use pre-allocated PAMT memory pools.
> 
> Previous attempt at Dynamic PAMT enabling
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The initial attempt at kernel enabling was quite different. It was
> built around lazy PAMT allocation: only trying to add a PAMT page pair
> if a SEAMCALL fails due to a missing PAMT, and reclaiming it based on
> hints provided by the TDX module.
> 
> The motivation was to avoid duplicating on the kernel side the PAMT
> memory refcounting that the TDX module already does.
> 
> This approach is inherently more racy as there is no serialization of
> PAMT memory add/remove against SEAMCALLs that add/remove memory for a
> TD. Such serialization would require global locking, which is not
> feasible.
> 
> This approach worked, but at some point it became clear that it could
> not be robust as long as the kernel avoids TDX_OPERAND_BUSY loops.
> TDX_OPERAND_BUSY will occur as a result of the races mentioned above.
> 
> This approach was abandoned in favor of explicit refcounting.

On closer inspection, this new solution also has global locking. It
opportunistically checks whether there is already a refcount, but
otherwise, when a PAMT page actually has to be added, a global spinlock
is held while the PAMT add/remove SEAMCALL is made.

I guess this lock is going to get taken somewhere around once per 512 4K
private pages, but when it does it has some less ideal properties:
- Cache line bouncing of the lock between all TDs on the host
- A global exclusive lock deep inside the TDP MMU shared-lock fault path
- Heavy contention when two TDs are shutting down at the same time?

As for why not take the lock only as a backup option, like the kick+lock
solution in KVM: the problem would be losing the refcount race and
ending up with a PAMT page getting released early.

As far as TDX module locking is concerned (i.e. BUSY error codes from
PAMT add/remove), it seems this would only happen if PAMT add/remove
operate simultaneously on the same 2MB HPA region. That is completely
prevented by the refcount and global lock, but it's a bit heavyweight.
It prevents simultaneously adding totally separate 2MB regions when we
would only need to prevent simultaneous operations on the same 2MB
region.

I don't see any other reason for the global spin lock. Kirill, was that
it? Did you consider also adding a lock per 2MB region, like the
refcount? Or any other granularity of lock besides global?
Not saying global is definitely the wrong choice, but it seems arbitrary
if I got the above right.
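
To make the question concrete, here is a very rough sketch of the kind
of per-2MB-region serialization I have in mind: hash the 2M-aligned PFN
into an array of spinlocks so that only operations on the same (or
hash-colliding) 2MB region contend, while the 0<->1 transitions and the
PAMT ADD/REMOVE SEAMCALLs still happen under a lock. All the names below
(pamt_locks, pamt_lock_for(), the tdh_phymem_pamt_*() wrappers, the
refcount and preallocated-pair parameters) are invented for illustration
and are not from the actual series:

/*
 * Rough sketch only -- names and signatures are made up, not taken from
 * the patches.  Serialize PAMT ADD/REMOVE per 2M region by hashing the
 * 2M-aligned PFN into a small array of spinlocks instead of taking one
 * global lock.
 */
#include <linux/atomic.h>
#include <linux/hash.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Assumed wrappers for the TDH.PHYMEM.PAMT.ADD/REMOVE SEAMCALLs. */
extern int tdh_phymem_pamt_add(unsigned long hpa, struct page *pamt_pair);
extern int tdh_phymem_pamt_remove(unsigned long hpa);

#define PAMT_LOCK_BITS	10
#define PAMT_LOCK_COUNT	(1 << PAMT_LOCK_BITS)

static spinlock_t pamt_locks[PAMT_LOCK_COUNT];

/* Would be called once during TDX host init. */
static void __init pamt_locks_init(void)
{
	int i;

	for (i = 0; i < PAMT_LOCK_COUNT; i++)
		spin_lock_init(&pamt_locks[i]);
}

static spinlock_t *pamt_lock_for(unsigned long hpa)
{
	/* One lock per hash bucket of 2M-aligned PFNs. */
	return &pamt_locks[hash_long(hpa >> PMD_SHIFT, PAMT_LOCK_BITS)];
}

/* 0->1: only take the per-region lock when PAMT.ADD might be needed. */
static int tdx_pamt_get_sketch(unsigned long hpa, atomic_t *refcount,
			       struct page *prealloc_pair)
{
	spinlock_t *lock = pamt_lock_for(hpa);
	int ret = 0;

	/* Fast path: the 2M region is already backed by PAMT memory. */
	if (atomic_inc_not_zero(refcount))
		return 0;

	spin_lock(lock);
	if (atomic_read(refcount) == 0) {
		/* prealloc_pair stands in for the pre-allocated PAMT pool. */
		ret = tdh_phymem_pamt_add(hpa, prealloc_pair);
		if (!ret)
			atomic_set(refcount, 1);
	} else {
		atomic_inc(refcount);
	}
	spin_unlock(lock);

	return ret;
}

/*
 * 1->0: do the decrement and PAMT.REMOVE under the same per-region lock
 * so a concurrent re-ADD of this 2M region cannot race with the REMOVE.
 */
static void tdx_pamt_put_sketch(unsigned long hpa, atomic_t *refcount)
{
	spinlock_t *lock = pamt_lock_for(hpa);

	spin_lock(lock);
	if (atomic_dec_and_test(refcount))
		tdh_phymem_pamt_remove(hpa);
	spin_unlock(lock);
}

Obviously this trades one cache line for an array plus a hash, and the
bucket count is arbitrary, but it would keep unrelated TDs and unrelated
2MB regions from serializing against each other in the fault and
shutdown paths.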