On Fri, May 09, 2025 at 07:20:30AM -0700, Vishal Annapurve wrote: > On Thu, May 8, 2025 at 8:22 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > > > > On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote: > > > On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > > > > > > > > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote: > > > > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > > > > > > > > > > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote: > > > > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote: > > > > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > > > > > > Sorry for the late reply, I was on leave last week. > > > > > > > > > > > > > > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote: > > > > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote: > > > > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking > > > > > > > > > > > > folio_ref_add() in the event of a removal failure. > > > > > > > > > > > > > > > > > > > > > > In my opinion, the above scheme can be deployed with this series > > > > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an > > > > > > > > > > I initially intended to add a separate patch at the end of this series to > > > > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I > > > > > > > > > > decided against it since it's not a must before guest_memfd supports in-place > > > > > > > > > > conversion. > > > > > > > > > > > > > > > > > > > > We can include it in the next version If you think it's better. > > > > > > > > > > > > > > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with > > > > > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to > > > > > > > > > hold extra refcounts in TDX stack so it would be good to clean up this > > > > > > > > > logic. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by > > > > > > > > > > > page structs, but that problem can be solved in future possibly by > > > > > > > > > > With current TDX code, all memory must be backed by a page struct. > > > > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather > > > > > > > > > > than a pfn. > > > > > > > > > > > > > > > > > > > > > notifying guest_memfd of certain ranges being in use even after > > > > > > > > > > > invalidation completes. > > > > > > > > > > A curious question: > > > > > > > > > > To support memory not backed by page structs in future, is there any counterpart > > > > > > > > > > to the page struct to hold ref count and map count? > > > > > > > > > > > > > > > > > > > > > > > > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP > > > > > > > > > [1] memory. No need to maintain refcounts/map counts for such physical > > > > > > > > > memory ranges as all users will be notified when mappings are > > > > > > > > > changed/removed. > > > > > > > > So, it's possible to map such memory in both shared and private EPT > > > > > > > > simultaneously? 
> > > > > > > > > > > > > > No, guest_memfd will still ensure that userspace can only fault in > > > > > > > shared memory regions in order to support CoCo VM usecases. > > > > > > Before guest_memfd converts a PFN from shared to private, how does it ensure > > > > > > there are no shared mappings? e.g., in [1], it uses the folio reference count > > > > > > to ensure that. > > > > > > > > > > > > Or do you believe that by eliminating the struct page, there would be no > > > > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in > > > > > > response to a guest_memfd invalidation notification? > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have > > > > > > no need to register mmu notifier. So why users like VFIO must register > > > > > > guest_memfd invalidation notification? > > > > > > > > > > VM_PFNMAP'd memory can't be long term pinned, so users of such memory > > > > > ranges will have to adopt mechanisms to get notified. I think it would > > > > Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory. > > > > > > I don't completely understand how VM_PFNMAP'd memory is used today for > > > VFIO. Maybe only MMIO regions are backed by pfnmap today and the story > > > for normal memory backed by pfnmap is yet to materialize. > > VFIO can fault in VM_PFNMAP'd memory which is not from MMIO regions. It works > > because it knows VM_PFNMAP'd memory are always pinned. > > > > Another example is udmabuf (drivers/dma-buf/udmabuf.c), it mmaps normal folios > > with VM_PFNMAP flag without registering mmu notifier because those folios are > > pinned. > > > > I might be wrongly throwing out some terminologies here then. > VM_PFNMAP flag can be set for memory backed by folios/page structs. > udmabuf seems to be working with pinned "folios" in the backend. > > The goal is to get to a stage where guest_memfd is backed by pfn > ranges unmanaged by kernel that guest_memfd owns and distributes to > userspace, KVM, IOMMU subject to shareability attributes. if the OK. So from the point of view of the rest of the kernel, those pfns are not regarded as memory. > shareability changes, the users will get notified and will have to > invalidate their mappings. guest_memfd will allow mmaping such ranges > with VM_PFNMAP flag set by default in the VMAs to indicate the need of > special handling/lack of page structs. My concern is that a failable invalidation notifier may not be ideal. Instead of relying on ref counts (or other mechanisms) to determine whether to start shareability changes, with a failable invalidation notifier, some users may fail the invalidation, and thus the shareability change, even after other users have successfully unmapped the range. Auditing whether multiple users of shared memory correctly perform unmapping is harder than auditing reference counts. > private memory backed by page structs and use a special "filemap" to > map file offsets to these private memory ranges. This step will also > need similar contract with users - > 1) memory is pinned by guest_memfd > 2) users will get invalidation notifiers on shareability changes > > I am sure there is a lot of work here and many quirks to be addressed, > let's discuss this more with better context around. A few related RFC > series are planned to be posted in the near future. Ok. Thanks for your time and discussions :) > > > > > > > > > be easy to pursue new users of guest_memfd to follow this scheme.
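To make the concern above more concrete, here is a rough sketch of what a failable invalidation path would look like (the callback shape and helper names below are hypothetical, not an existing guest_memfd API): once one user fails, earlier users have already torn down their mappings, so the range is left half-converted and the only way to audit the state is to trust every user's error handling.

/* Hypothetical illustration only -- not real guest_memfd code. */
#include <linux/list.h>
#include <linux/types.h>

struct gmem_invalidate_cb {
	struct list_head link;
	/* Each user (KVM, IOMMU, ...) unmaps [start, end) and may fail. */
	int (*invalidate)(struct gmem_invalidate_cb *cb,
			  pgoff_t start, pgoff_t end);
};

static int gmem_convert_private_to_shared(struct list_head *users,
					  pgoff_t start, pgoff_t end)
{
	struct gmem_invalidate_cb *cb;
	int ret;

	list_for_each_entry(cb, users, link) {
		ret = cb->invalidate(cb, start, end);
		if (ret)
			/*
			 * Users earlier in the list have already unmapped,
			 * but the conversion must now be aborted: the range
			 * is half torn down, which is harder to audit than
			 * a single refcount check up front.
			 */
			return ret;
	}
	/* ... flip shareability only if every user succeeded ... */
	return 0;
}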
> > > > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd > > > > > hugepage support already needs the stance of: "Guest memfd owns all > > > > > long-term refcounts on private memory" as discussed at LPC [1]. > > > > > > > > > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf > > > > > (slide 12) > > > > > > > > > > > > > > > > > Besides, how would guest_memfd handle potential unmap failures? e.g. what > > > > > > happens to prevent converting a private PFN to shared if there are errors when > > > > > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN. > > > > > > > > > > Users will have to signal such failures via the invalidation callback > > > > > results or other appropriate mechanisms. guest_memfd can relay the > > > > > failures up the call chain to the userspace. > > > > AFAIK, operations that perform actual unmapping do not allow failure, e.g. > > > > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(), > > > > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount(). > > > > > > Very likely because these operations simply don't fail. > > > > I think they are intentionally designed to be no-fail. > > > > e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup > > buffer allocated on stack in case of kmalloc() failure. > > > > > > > > > > > > That's why we rely on increasing folio ref count to reflect failure, which are > > > > due to unexpected SEAMCALL errors. > > > > > > TDX stack is adding a scenario where invalidation can fail, a cleaner > > > solution would be to propagate the result as an invalidation failure. > > Not sure if linux kernel accepts unmap failure. > > > > > Another option is to notify guest_memfd out of band to convey the > > > ranges that failed invalidation. > > Yes, this might be better. Something similar like holding folio ref count to > > let guest_memfd know that a certain PFN cannot be re-assigned. > > > > > With in-place conversion supported, even if the refcount is raised for > > > such pages, they can still get used by the host if the guest_memfd is > > > unaware that the invalidation failed. > > I thought guest_memfd should check if folio ref count is 0 (or a base count) > > before conversion, splitting or re-assignment. Otherwise, why do you care if > > TDX holds the ref count? :) > > > > Soon to be posted RFC series by Ackerley currently explicitly checks > for safe private page refcounts when folio splitting is needed and not > for every private to shared conversion. A simple solution would be for > guest_memfd to check safe page refcounts for each private to shared > conversion even if split is not required but will need to be reworked > when either of the stages discussed above land where page structs are > not around. > > > > > > > > > > > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN > > > > > > that fails to be unmapped. > > > > > > > > > > > > > > > > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@xxxxxxxxxx/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Any guest_memfd range updates will result in invalidations/updates of > > > > > > > > > userspace, guest, IOMMU or any other page tables referring to > > > > > > > > > guest_memfd backed pfns. This story will become clearer once the > > > > > > > > > support for PFN range allocator for backing guest_memfd starts getting > > > > > > > > > discussed. 
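For comparison, the refcount-based contract discussed above would look roughly like the sketch below. The helper names are made up for illustration, and the base refcount value and the exact place TDX reports the SEAMCALL error are assumptions, but it shows both halves of the scheme: TDX leaks a reference only on an unexpected removal failure, and guest_memfd checks for extra references before a private-to-shared conversion or split.

#include <linux/mm.h>
#include <linux/printk.h>

/*
 * TDX side: only leak a reference when the TDX module refuses to give
 * the page back (unexpected SEAMCALL error), instead of holding an
 * extra ref for the whole time the page is mapped in the S-EPT.
 */
static void tdx_note_unmap_failure(struct folio *folio)
{
	folio_ref_add(folio, 1);
	pr_err_once("TDX: S-EPT removal failed, folio ref leaked to block reuse\n");
}

/*
 * guest_memfd side: before converting a private folio to shared (or
 * splitting it), check that nothing beyond guest_memfd's own base
 * reference is outstanding, e.g. a ref leaked above.
 */
static bool gmem_folio_safe_to_convert(struct folio *folio,
				       unsigned int base_refs)
{
	return folio_ref_count(folio) == base_refs;
}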
> > > > > > > > Ok. It is indeed unclear right now to support such kind of memory. > > > > > > > > > > > > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory > > > > > > > > into private EPT until TDX connect. > > > > > > > > > > > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd > > > > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX > > > > > > > connect/Sev TIO, major difference would be that guest_memfd private > > > > > > > ranges will be mapped into IOMMU page tables. > > > > > > > > > > > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there > > > > > > > have been discussions on not using page structs for private memory > > > > > > > ranges altogether [1] even with hugetlb allocator, which will simplify > > > > > > > seamless merge/split story for private hugepages to support memory > > > > > > > conversion. So I think the general direction we should head towards is > > > > > > > not relying on refcounts for guest_memfd private ranges and/or page > > > > > > > structs altogether. > > > > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to > > > > > > keep all necessary info. > > > > > > > > > > > > > > > > Story will become clearer once VM_PFNMAP'd memory support starts > > > > > getting discussed. In case of guest_memfd, there is flexibility to > > > > > store metadata for physical ranges within guest_memfd just like > > > > > shareability tracking. > > > > Ok. > > > > > > > > > > > > > > > > > I think the series [2] to work better with PFNMAP'd physical memory in > > > > > > > KVM is in the very right direction of not assuming page struct backed > > > > > > > memory ranges for guest_memfd as well. > > > > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM > > > > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)". > > > > > > > > > > > > > > > > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@xxxxxxxxxxxxxx/ > > > > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@xxxxxxxxxx/ > > > > > > > > > > > > > > > And even in that scenario, the memory is only for private MMIO, so the backend > > > > > > > > driver is VFIO pci driver rather than guest_memfd. > > > > > > > > > > > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed > > > > > > > by VM_PFNMAP memory. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543 > > > > > >
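Just to illustrate the earlier question about a counterpart to struct page: if guest_memfd hands out raw pfn ranges with no page structs, it could keep its own per-range metadata (file offset -> pfn range plus shareability), for example in an xarray keyed by page index, much like the shareability tracking mentioned above. This is purely a speculative sketch of one possible shape, not anything that exists today:

/* Purely speculative: per-range metadata guest_memfd could track for
 * pfn-range backed memory that has no struct pages. */
#include <linux/xarray.h>

enum gmem_share_state {
	GMEM_PRIVATE,
	GMEM_SHARED,
};

struct gmem_range_meta {
	unsigned long		base_pfn;
	unsigned long		nr_pages;
	enum gmem_share_state	state;
};

/* One xarray per guest_memfd inode, keyed by file offset (page index). */
static int gmem_record_range(struct xarray *ranges, pgoff_t index,
			     struct gmem_range_meta *meta)
{
	return xa_err(xa_store(ranges, index, meta, GFP_KERNEL));
}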