"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> writes: > On Tue, 2025-06-24 at 16:30 -0700, Ackerley Tng wrote: >> I see, let's call debug checking Topic 3 then, to separate it from Topic >> 1, which is TDX indicating that it is using a page for production >> kernels. >> >> Topic 3: How should TDX indicate use of a page for debugging? >> >> I'm okay if for debugging, TDX uses anything other than refcounts for >> checking, because refcounts will interfere with conversions. > > Ok. It can be follow on work I think. > Yup I agree. >> >> Rick's other email is correct. The correct link should be >> https://lore.kernel.org/all/aFJjZFFhrMWEPjQG@xxxxxxxxxxxxxxxxxxxxxxxxx/. >> >> [INTERFERE WITH CONVERSIONS] >> >> To summarize, if TDX uses refcounts to indicate that it is using a page, >> or to indicate anything else, then we cannot easily split a page on >> private to shared conversions. >> >> Specifically, consider the case where only the x-th subpage of a huge >> folio is mapped into Secure-EPTs. When the guest requests to convert >> some subpage to shared, the huge folio has to be split for >> core-mm. Core-mm, which will use the shared page, must have split folios >> to be able to accurately and separately track refcounts for subpages. >> >> During splitting, guest_memfd would see refcount of 512 (for 2M page >> being in the filemap) + 1 (if TDX indicates that the x-th subpage is >> mapped using a refcount), but would not be able to tell that the 513th >> refcount belongs to the x-th subpage. guest_memfd can't split the huge >> folio unless it knows how to distribute the 513th refcount. >> >> One might say guest_memfd could clear all the refcounts that TDX is >> holding on the huge folio by unmapping the entire huge folio from the >> Secure-EPTs, but unmapping the entire huge folio for TDX means zeroing >> the contents and requiring guest re-acceptance. Both of these would mess >> up guest operation. 
>>
>> Hence, guest_memfd's solution is to require that users of guest_memfd
>> for private memory trust guest_memfd to maintain the pages around and
>> not take any refcounts.
>>
>> So back to Topic 1, for production kernels, is it okay that TDX does not
>> need to indicate that it is using a page, and can trust guest_memfd to
>> keep the page around for the VM?
>
> I think Yan's concern is not totally invalid. But I don't see a problem if we
> have a line of sight to adding debug checking as follow on work. That is kind of
> the path I was trying to nudge.
>
>>
>> > >
>> > > Topic 2: How to handle unmapping/splitting errors arising from TDX?
>> > >
>> > > Previously I was in favor of having unmap() return an error (Rick
>> > > suggested doing a POC, and in a more recent email Rick asked for a
>> > > diffstat), but Vishal and I talked about this and now I agree having
>> > > unmapping return an error is not a good approach for these reasons.
>> >
>> > Ok, let's close this option then.
>> >
>> > >
>> > > 1. Unmapping takes a range, and within the range there could be more
>> > >    than one unmapping error. I was previously thinking that unmap()
>> > >    could return 0 for success and the failed PFN on error. Returning a
>> > >    single PFN on error is okay-ish but if there are more errors it could
>> > >    get complicated.
>> > >
>> > >    Another error return option could be to return the folio where the
>> > >    unmapping/splitting issue happened, but that would not be
>> > >    sufficiently precise, since a folio could be larger than 4K and we
>> > >    want to track errors as precisely as we can to reduce memory loss due
>> > >    to errors.
>> > >
>> > > 2. What I think Yan has been trying to say: unmap() returning an error
>> > >    is non-standard in the kernel.
>> > >
>> > > I think (1) is the dealbreaker here and there's no need to do the
>> > > plumbing POC and diffstat.
>> > >
>> > > So I think we're all in support of indicating unmapping/splitting issues
>> > > without returning anything from unmap(), and the discussed options are
>> > >
>> > > a. Refcounts: won't work - mostly discussed in this (sub-)thread
>> > >    [3]. Using refcounts makes it impossible to distinguish between
>> > >    transient refcounts and refcounts due to errors.
>> > >
>> > > b. Page flags: won't work with/can't benefit from HVO.
>> >
>> > As above, this was for the purpose of catching bugs, not for guestmemfd to
>> > logically depend on it.
>> >
>> > >
>> > > Suggestions still in the running:
>> > >
>> > > c. Folio flags are not precise enough to indicate which page actually
>> > >    had an error, but this could be sufficient if we're willing to just
>> > >    waste the rest of the huge page on unmapping error.
>> >
>> > For a scenario of TDX module bug, it seems ok to me.
>> >
>> > >
>> > > d. Folio flags with folio splitting on error. This means that on
>> > >    unmapping/Secure EPT PTE splitting error, we have to split the
>> > >    (larger than 4K) folio to 4K, and then set a flag on the split folio.
>> > >
>> > >    The issue I see with this is that splitting pages with HVO applied
>> > >    means doing allocations, and in an error scenario there may not be
>> > >    memory left to split the pages.
>> > >
>> > > e. Some other data structure in guest_memfd, say, a linked list, and a
>> > >    function like kvm_gmem_add_error_pfn(struct page *page) that would
>> > >    look up the guest_memfd inode from the page and add the page's pfn to
>> > >    the linked list.
>> > >
>> > >    Everywhere in guest_memfd that does unmapping/splitting would then
>> > >    check this linked list to see if the unmapping/splitting
>> > >    succeeded.
>> > >
>> > >    Everywhere in guest_memfd that allocates pages will also check this
>> > >    linked list to make sure the pages are functional.
>> > >
>> > >    When guest_memfd truncates, if the page being truncated is on the
>> > >    list, retain the refcount on the page and leak that page.
>> >
>> > I think this is a fine option.
>> >
>> > >
>> > > f. Combination of c and e, something similar to HugeTLB's
>> > >    folio_set_hugetlb_hwpoison(), which sets a flag AND adds the pages in
>> > >    trouble to a linked list on the folio.
>> > >
>> > > g. Like f, but basically treat an unmapping error as hardware poisoning.
>> > >
>> > > I'm kind of inclined towards g, to just treat unmapping errors as
>> > > HWPOISON and buying into all the HWPOISON handling requirements. What do
>> > > yall think? Can a TDX unmapping error be considered as memory poisoning?
>> >
>> > What does HWPOISON bring over refcounting the page/folio so that it never
>> > returns to the page allocator?
>>
>> For Topic 2 (handling TDX unmapping errors), HWPOISON is better than
>> refcounting because refcounting interferes with conversions (see
>> [INTERFERE WITH CONVERSIONS] above).
>
> I don't know if it quite fits. I think it would be better to not pollute the
> concept if possible.
>

If there's something we know for sure doesn't fit, and that we're
overloading/polluting the concept of HWpoison, then we shouldn't
proceed, but otherwise, is it okay to go with HWpoison as a first cut? I
replied to Yan's email with reasons why I think we should give HWpoison a
try, at least for the next RFC.

>>
>> > We are bugging the TD in these cases.
>>
>> Bugging the TD does not help to prevent future conversions from being
>> interfered with.
>>
>> 1. Conversions involves unmapping, so we could actually be in a
>>    conversion, the unmapping is performed and fails, and then we try to
>>    split and enter an infinite loop since private to shared conversions
>>    assumes guest_memfd holds the only refcounts on guest_memfd memory.
>>
>> 2. The conversion ioctl is a guest_memfd ioctl, not a VM ioctl, and so
>>    there is no check that the VM is not dead.
>>    There shouldn't be any check on the VM, because shareability is a
>>    property of the memory and should be changeable independent of the
>>    associated VM.
>
> Hmm, they are both about unlinking guestmemfd from a VM lifecycle then. Is that
> a better way to put it?
>

Unmapping during conversions doesn't take memory away from a VM, it just
forces the memory to be re-faulted as shared, so unlinking memory from a
VM lifecycle isn't quite accurate, if I understand you correctly.

>>
>> > Ohhh... Is
>> > this about the code to allow gmem fds to be handed to new VMs?
>>
>> Nope, it's not related to linking. The proposed KVM_LINK_GUEST_MEMFD
>> ioctl [4] also doesn't check if the source VM is dead. There shouldn't
>> be any check on the source VM, since the memory is from guest_memfd and
>> should be independently transferable to a new VM.
>
> If a page is mapped in the old TD, and a new TD is started, re-mapping the same
> page should be prevented somehow, right?
>

Currently I'm thinking that if we go with HWpoison, the new TD will
still get the HWpoison-ed page. The new TD will get the SIGBUS when it
next faults the HWpoison-ed page.

Are you thinking that the HWpoison-ed page should be replaced with a
non-poisoned page for the new TD to run?

Or are you thinking that

* the handing over should be blocked, or
* mapping itself should be blocked, or
* faulting should be blocked?

If handing over should be blocked, could we perhaps scan for HWpoison
when doing the handover and block it there?

I guess I'm trying to do as little as possible during error discovery
(hoping to just mark HWpoison), error handling (just unmap from guest
page tables, like guest_memfd does now), and defer handling to
fault/conversion/perhaps truncation time.

> It really does seem like guestmemfd is the right place to keep the "stuck
> page" state. If guestmemfd is not tied to a VM and can be re-used, it should be
> the one to decide whether they can be mapped again.

Yup, guest_memfd should get to decide.

> Refcounting on error is
> about preventing return to the page allocator but that is not the only problem.
>

guest_memfd, or perhaps the memory_failure() handler for guest_memfd,
should prevent this return.

> I do think that these threads have gone on far too long. It's probably about
> time to move forward with something even if it's just to have something to
> discuss that doesn't require footnoting so many lore links. So how about we move
> forward with option e as a next step. Does that sound good Yan?
>

Please see my reply to Yan, I'm hoping y'all will agree to something
between option f/g instead.

> Ackerley, thank you very much for pulling together this summary.

Thank you for your reviews and suggestions!
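
For concreteness, here is a rough user-space model of the bookkeeping
described in option e. It is only a sketch, not kernel code and not a
proposed patch. Every name in it is hypothetical: the thread only
mentions a function "like kvm_gmem_add_error_pfn(struct page *page)"
that looks up the guest_memfd inode from the page, whereas this model
takes the per-inode state and a raw pfn directly. It also allocates on
the error path, which may run into the same concern raised for option d,
so a real version would probably need preallocated entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical list node: one pfn whose unmapping/splitting failed. */
struct gmem_error_pfn {
	unsigned long pfn;
	struct gmem_error_pfn *next;
};

/* Hypothetical stand-in for per-inode guest_memfd state. */
struct gmem_inode_state {
	struct gmem_error_pfn *error_pfns;	/* singly linked list head */
};

/*
 * Record a failed pfn. Duplicates are ignored.
 * Returns 0 on success, -1 if the list node could not be allocated.
 */
static int gmem_add_error_pfn(struct gmem_inode_state *st, unsigned long pfn)
{
	struct gmem_error_pfn *e;

	assert(st != NULL);

	for (e = st->error_pfns; e; e = e->next)
		if (e->pfn == pfn)
			return 0;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;

	e->pfn = pfn;
	e->next = st->error_pfns;
	st->error_pfns = e;
	return 0;
}

/* Check a single pfn, e.g. before handing a page out again. */
static bool gmem_pfn_has_error(const struct gmem_inode_state *st,
			       unsigned long pfn)
{
	const struct gmem_error_pfn *e;

	for (e = st->error_pfns; e; e = e->next)
		if (e->pfn == pfn)
			return true;
	return false;
}

/*
 * Check whether any pfn in [start, start + nr) was recorded, e.g. after
 * unmapping a range, or when truncating to decide whether to leak pages.
 */
static bool gmem_range_has_error(const struct gmem_inode_state *st,
				 unsigned long start, unsigned long nr)
{
	const struct gmem_error_pfn *e;

	for (e = st->error_pfns; e; e = e->next)
		if (e->pfn >= start && e->pfn - start < nr)
			return true;
	return false;
}
```

Tracking individual pfns rather than folios keeps the precision asked
for in point 1 of Topic 2: a failure on one 4K page of a 2M folio marks
only that page, and the range check covers the "more than one error per
unmap range" case without unmap() having to return anything.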