Ackerley Tng wrote:
> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].

Trying to manage dependencies, I find that Ryan's just-released series [1]
is required to build this set.

[1] https://lore.kernel.org/all/cover.1747368092.git.afranji@xxxxxxxxxx/

Specifically this patch:

https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@xxxxxxxxxx/

defines alloc_anon_secure_inode().

Am I wrong in that?

> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>

I went digging in your git tree and then found Ryan's set. So thanks for
the git tree. :-D

However, it seems this adds another dependency which should be tracked in
David's email of dependencies?

Ira

> This patchset can be divided into two sections:
>
> (a) Patches from the beginning up to and including "KVM: selftests:
>     Update script to map shared memory from guest_memfd" are a modified
>     version of "conversion support for guest_memfd", which Fuad is
>     managing [2].
>
> (b) Patches after "KVM: selftests: Update script to map shared memory
>     from guest_memfd" till the end are patches that actually bring in 1G
>     page support for guest_memfd.
>
> These are the significant differences between (a) and [2]:
>
> + [2] uses an xarray to track shareability, but I used a maple tree
>   because for 1G pages, iterating pagewise to update shareability was
>   prohibitively slow even for testing. I was choosing from among
>   multi-index xarrays, interval trees and maple trees [3], and picked
>   maple trees because
>   + Maple trees were easier to figure out since I didn't have to
>     compute the correct multi-index order and handle edge cases if the
>     converted range wasn't a neat power of 2.
>   + Maple trees were easier to figure out as compared to updating
>     parts of a multi-index xarray.
>   + Maple trees had an easier API to use than interval trees.
> + [2] doesn't yet have a conversion ioctl, but I needed one to test 1G
>   support end-to-end.
> + (a) removes guest_memfd from participating in the LRU, which I needed
>   to get conversion selftests to work as expected, since participation
>   in the LRU was causing unexpected refcounts on folios that were
>   blocking conversions.
>
> I am sending (a) in emails as well, as opposed to just leaving it on
> GitHub, so that we can discuss by commenting inline on emails. If you'd
> like to just look at 1G page support, here are some key takeaways from
> the first section (a):
>
> + If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
>   creation, guest_memfd will
>   + Track shareability (whether an index in the inode is guest-only or
>     whether the host is allowed to fault memory at that index).
>   + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
>     will be used to provide pages for the guest.
>   + Always be used by KVM to check the private/shared status of a gfn.
> + guest_memfd now has conversion ioctls, allowing conversion to
>   private/shared.
>   + Conversion can fail if there are unexpected refcounts on any
>     folios in the range.
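
(Side note, not part of the series: to illustrate the range-based
shareability tracking described above, here is a minimal sketch using a
maple tree that stores value entries. The enum and helper names are made
up for illustration and are not taken from the patches.)

#include <linux/gfp.h>
#include <linux/maple_tree.h>
#include <linux/xarray.h>	/* xa_mk_value() / xa_to_value() */

enum shareability {
	SHAREABILITY_GUEST = 1,	/* only the guest may fault this index */
	SHAREABILITY_ALL   = 2,	/* host userspace may fault it too */
};

/* Mark indices [first, last] with a single range store. */
static int gmem_set_shareability(struct maple_tree *mt, pgoff_t first,
				 pgoff_t last, enum shareability value)
{
	return mtree_store_range(mt, first, last, xa_mk_value(value),
				 GFP_KERNEL);
}

/* Look up the shareability of one index, defaulting to guest-only. */
static enum shareability gmem_get_shareability(struct maple_tree *mt,
					       pgoff_t index)
{
	void *entry = mtree_load(mt, index);

	return entry ? xa_to_value(entry) : SHAREABILITY_GUEST;
}

The point being that converting a 1G range is then a single store
covering 262144 4K indices, rather than 262144 per-index updates, which
is what made the pagewise approach prohibitively slow.
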
> Focusing on (b) 1G page support, here's an overview:
>
> 1. A bunch of refactoring patches for HugeTLB that isolate the
>    allocation of a HugeTLB folio from other HugeTLB concepts such as
>    VMA-level reservations, and from HugeTLBfs-specific concepts, such
>    as where memory policy is stored in the VMA, or where the subpool is
>    stored on the inode.
> 2. A few patches that add a guestmem_hugetlb allocator within mm/. The
>    guestmem_hugetlb allocator is a wrapper around HugeTLB to modularize
>    the memory management functions, and to cleanly handle cleanup, so
>    that folio cleanup can happen after the guest_memfd inode (and even
>    KVM) goes away.
> 3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
> 4. Selftests for 1G page support.
>
> Here are some remaining issues/TODOs:
>
> 1. Memory error handling, such as for machine check errors, has not
>    been implemented.
> 2. I've not looked into preparedness of pages; only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in the guest_memfd
>    inode's subpool may momentarily go over the subpool limit (the
>    requested size of the inode at guest_memfd creation time), causing
>    one of the two threads to get -ENOMEM. Suggestions to solve this are
>    appreciated!
> 4. The max_usage_in_bytes statistic (cgroups v1) for guest_memfd
>    HugeTLB pages should be correct but needs testing and could be
>    wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing
>    and could be wrong.
> 6. Page cache accounting: when a HugeTLB page is split, guest_memfd
>    will incur page counts in both the NR_HUGETLB stat (counted at
>    HugeTLB allocation time) and the NR_FILE_PAGES stat (counted when
>    split pages are added to the filemap). Is this aligned with what
>    people expect?
>
> Here are some optimizations that could be explored in future series:
>
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for CoCo VMs if hardware already zeroes the
>    pages.
>
> Here's RFC v1 [4] if you're interested in the motivation behind choosing
> HugeTLB, or the history of this patch series.
>
> [1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@xxxxxxxxxx/T/
> [2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@xxxxxxxxxx/T/
> [3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@xxxxxxxxxx/T/
>
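
(Another illustrative aside, this time on item 2 of the overview: the
rough shape a modular guest_memfd allocator interface could take. None
of the names below are taken from the series; this is only a sketch of
where a guestmem_hugetlb implementation might plug in and why cleanup
can outlive the inode.)

#include <linux/types.h>

struct folio;

/* Hypothetical ops table for a pluggable guest_memfd allocator. */
struct guestmem_allocator_ops {
	/* Allocate the (possibly huge) folio backing an inode index. */
	struct folio *(*alloc_folio)(void *priv, pgoff_t index);
	/*
	 * Release a folio once its last reference is dropped. This may
	 * run after the guest_memfd inode (and even KVM) is gone, which
	 * is why freeing the backing HugeTLB page would live in the
	 * allocator rather than in guest_memfd itself.
	 */
	void (*free_folio)(void *priv, struct folio *folio);
	/* Tear down allocator-private state, e.g. a HugeTLB subpool. */
	void (*destroy)(void *priv);
};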