On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 02.05.25 14:50, Jann Horn wrote:
> > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >> On 02.05.25 00:29, Nico Pache wrote:
> >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@xxxxxxxxxx> wrote:
> >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> >>>>> While scanning PMD ranges for potential collapse candidates, keep track
> >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
> >>>>> scan phase so we don't bail out early and miss potential mTHP candidates.
> >>>>>
> >>>>> After the scan is complete we will perform binary recursion on the
> >>>>> bitmap to determine which mTHP size would be most efficient to collapse
> >>>>> to. max_ptes_none will be scaled by the attempted collapse order to
> >>>>> determine how full a THP must be to be eligible.
> >>>>>
> >>>>> If an mTHP collapse is attempted, but contains swapped out or shared
> >>>>> pages, we don't perform the collapse.
> >>>> [...]
> >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>>>         vma_start_write(vma);
> >>>>>         anon_vma_lock_write(vma->anon_vma);
> >>>>>
> >>>>> -       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> >>>>> -                       address + HPAGE_PMD_SIZE);
> >>>>> +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> >>>>> +                       _address + (PAGE_SIZE << order));
> >>>>>         mmu_notifier_invalidate_range_start(&range);
> >>>>>
> >>>>>         pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >>>>> +
> >>>>>         /*
> >>>>>          * This removes any huge TLB entry from the CPU so we won't allow
> >>>>>          * huge and small TLB entries for the same virtual address to
> >>>>
> >>>> It's not visible in this diff, but we're about to do a
> >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> >>>> entire page table, meaning it tears down 2MiB of address space; and it
> >>>> assumes that the entire page table exclusively corresponds to the
> >>>> current VMA.
> >>>>
> >>>> I think you'll need to ensure that the pmdp_collapse_flush() only
> >>>> happens for full-size THP, and that mTHP only tears down individual
> >>>> PTEs in the relevant range. (That code might get a bit messy, since
> >>>> the existing THP code tears down PTEs in a detached page table, while
> >>>> mTHP would have to do it in a still-attached page table.)
> >>> Hi Jann!
> >>>
> >>> I was under the impression that this is needed to prevent GUP-fast
> >>> races (and potentially others).
> >
> > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> >
> >>> As you state here, conceptually the PMD case is, detach the PMD, do
> >>> the collapse, then reinstall the PMD (similarly to how the system
> >>> recovers from a failed PMD collapse). I tried to keep the current
> >>> locking behavior as it seemed the easiest way to get it right (and not
> >>> break anything). So I keep the PMD detaching and reinstalling for the
> >>> mTHP case too. As Hugh points out I am releasing the anon lock too
> >>> early. I will comment further on his response.
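
For what it's worth, what I had in mind for the mTHP case is roughly the
following. This is an untested, hypothetical sketch, not code from this
series; the function name is made up, and the mmu notifier calls and the
actual copy/rmap work are omitted. The idea is to clear only the PTEs in
the candidate range under the PTE lock, with the page table still
attached, instead of doing a pmdp_collapse_flush() on the whole 2MiB
region:

static void collapse_clear_mthp_ptes(struct vm_area_struct *vma, pmd_t *pmd,
                                     unsigned long start, unsigned int order)
{
        unsigned long addr, end = start + (PAGE_SIZE << order);
        pte_t *start_pte, *pte;
        spinlock_t *ptl;

        start_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
        if (!start_pte)
                return;

        for (addr = start, pte = start_pte; addr < end;
             addr += PAGE_SIZE, pte++)
                /*
                 * Clear the PTE and flush the TLB; after this, GUP-fast
                 * can no longer take new references through these PTEs,
                 * while the rest of the page table stays intact.
                 */
                ptep_clear_flush(vma, addr, pte);

        pte_unmap_unlock(start_pte, ptl);
}

That way only the range being collapsed is torn down, not the whole page
table.
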
> >
> > As I see it, you're not "keeping" the current locking behavior; you're
> > making a big implicit locking change by reusing a codepath designed
> > for PMD THP for mTHP, where the page table may not be exclusively
> > owned by one VMA.
> That is not the intention. The intention in this series (at least as we
> discussed) was to not do it across VMAs; that is considered the next
> logical step (which will be especially relevant on arm64 IMHO).

Ah, so for now this is supposed to only work for PTEs which are in a PMD
which is fully covered by the VMA? So if I make a 16KiB VMA and then try
to collapse its contents to an order-2 mTHP page, that should just not
work?
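
I.e., my reading is that, for now, the eligibility check effectively has
to boil down to something like the helper below (hypothetical name, just
spelling out my understanding, not code from the series), which a 16KiB
VMA can never satisfy no matter which mTHP order is requested:

static bool pmd_range_covered_by_vma(struct vm_area_struct *vma,
                                     unsigned long address)
{
        /* the whole 2MiB page table range must belong to this one VMA */
        unsigned long pmd_start = address & HPAGE_PMD_MASK;

        return vma->vm_start <= pmd_start &&
               pmd_start + HPAGE_PMD_SIZE <= vma->vm_end;
}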