Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd

Ackerley Tng <ackerleytng@xxxxxxxxxx> · Fri, 16 May 2025 15:43:14 -0700

Ackerley Tng <ackerleytng@xxxxxxxxxx> writes:

> <snip>
>
> Here are some remaining issues/TODOs:
>
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
>

For people who might be testing this series with non-Coco VMs (heads up,
Patrick and Nikita!), this currently splits the folio as long as some
shareability in the huge folio is shared, which is probably unnecessary?

IIUC core-mm doesn't support mapping at 1G but from a cursory reading it
seems like the faulting function calling kvm_gmem_fault_shared() could
possibly be able to map a 1G page at 4K.

Looks like we might need another flag like
GUEST_MEMFD_FLAG_SUPPORT_CONVERSION, which will gate initialization of
the shareability maple tree/xarray.

If shareability is NULL for the entire hugepage range, then no splitting
will occur.

For Coco VMs, this should be safe, since if this flag is not set,
kvm_gmem_fault_shared() will always not be able to fault (the
shareability value will be NULL.

> Here are some optimizations that could be explored in future series:
>
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
>
> <snip>