On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 02.05.25 14:50, Jann Horn wrote:
> > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >> On 02.05.25 00:29, Nico Pache wrote:
> > >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@xxxxxxxxxx> wrote:
> > >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > >>>>> While scanning PMD ranges for potential collapse candidates, keep track
> > >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the
> > >>>>> scan phase so we dont bailout early and miss potential mTHP candidates.
> > >>>>>
> > >>>>> After the scan is complete we will perform binary recursion on the
> > >>>>> bitmap to determine which mTHP size would be most efficient to collapse
> > >>>>> to. max_ptes_none will be scaled by the attempted collapse order to
> > >>>>> determine how full a THP must be to be eligible.
> > >>>>>
> > >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared
> > >>>>> pages, we dont perform the collapse.
> > >>>> [...]
> > >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > >>>>>  	vma_start_write(vma);
> > >>>>>  	anon_vma_lock_write(vma->anon_vma);
> > >>>>>
> > >>>>> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > >>>>> -				address + HPAGE_PMD_SIZE);
> > >>>>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > >>>>> +				_address + (PAGE_SIZE << order));
> > >>>>>  	mmu_notifier_invalidate_range_start(&range);
> > >>>>>
> > >>>>>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > >>>>> +
> > >>>>>  	/*
> > >>>>>  	 * This removes any huge TLB entry from the CPU so we won't allow
> > >>>>>  	 * huge and small TLB entries for the same virtual address to
> > >>>>
> > >>>> It's not visible in this diff, but we're about to do a
> > >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> > >>>> entire page table, meaning it tears down 2MiB of address space; and it
> > >>>> assumes that the entire page table exclusively corresponds to the
> > >>>> current VMA.
> > >>>>
> > >>>> I think you'll need to ensure that the pmdp_collapse_flush() only
> > >>>> happens for full-size THP, and that mTHP only tears down individual
> > >>>> PTEs in the relevant range. (That code might get a bit messy, since
> > >>>> the existing THP code tears down PTEs in a detached page table, while
> > >>>> mTHP would have to do it in a still-attached page table.)
> > >>> Hi Jann!
> > >>>
> > >>> I was under the impression that this is needed to prevent GUP-fast
> > >>> races (and potentially others).
> > >
> > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP?
> > >
> > >>> As you state here, conceptually the PMD case is, detach the PMD, do
> > >>> the collapse, then reinstall the PMD (similarly to how the system
> > >>> recovers from a failed PMD collapse).
> > >>> I tried to keep the current
> > >>> locking behavior as it seemed the easiest way to get it right (and not
> > >>> break anything). So I keep the PMD detaching and reinstalling for the
> > >>> mTHP case too. As Hugh points out I am releasing the anon lock too
> > >>> early. I will comment further on his response.
> > >
> > > As I see it, you're not "keeping" the current locking behavior; you're
> > > making a big implicit locking change by reusing a codepath designed
> > > for PMD THP for mTHP, where the page table may not be exclusively
> > > owned by one VMA.
> >
> > That is not the intention. The intention in this series (at least as we
> > discussed) was to not do it across VMAs; that is considered the next
> > logical step (which will be especially relevant on arm64 IMHO).
>
> Ah, so for now this is supposed to only work for PTEs which are in a
> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and
> then try to collapse its contents to an order-2 mTHP page, that should
> just not work?

Correct! As I stated in my reply to Hugh, the locking conditions explode
if we drop that requirement. A simple workaround we've considered is only
collapsing if a single VMA intersects a PMD. I can make sure this is
clearer in the cover letter and in this patch.

Cheers,
-- Nico
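
For illustration only (this is not code from the series): a minimal
user-space sketch of the restriction discussed above, assuming 4KiB pages
and a 2MiB PMD as on x86-64. Every name below (struct sketch_vma,
sketch_pmd_fully_covered(), sketch_collapse_size(), the SKETCH_* constants)
is hypothetical and only models the idea that a collapse is attempted when
the PMD-sized range around the address lies entirely inside one VMA. The
size helper mirrors the PAGE_SIZE << order expression from the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed constants (4KiB pages, 2MiB PMD); the kernel derives these per arch. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PMD_ORDER  9u
#define SKETCH_PMD_SIZE   (SKETCH_PAGE_SIZE << SKETCH_PMD_ORDER)
#define SKETCH_PMD_MASK   (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical stand-in for the start/end bounds of a VMA. */
struct sketch_vma {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* Size in bytes of a collapse target of the given order. */
static uint64_t sketch_collapse_size(unsigned int order)
{
	assert(order <= SKETCH_PMD_ORDER);
	return SKETCH_PAGE_SIZE << order;
}

/* True if the PMD-sized range containing addr lies entirely inside vma. */
static bool sketch_pmd_fully_covered(const struct sketch_vma *vma, uint64_t addr)
{
	uint64_t pmd_start = addr & SKETCH_PMD_MASK;
	uint64_t pmd_end = pmd_start + SKETCH_PMD_SIZE;

	return vma->start <= pmd_start && pmd_end <= vma->end;
}
```

Under these assumptions, the 16KiB VMA from the example above is the same
size as an order-2 target (4 pages), but it does not cover its surrounding
2MiB PMD range, so the check rejects it and no collapse would be attempted.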