Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Fri, 12 Sep 2025 14:48:09 +0100

On Fri, Sep 12, 2025 at 02:17:01PM +0800, Yafang Shao wrote:
> On Thu, Sep 11, 2025 at 11:58 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> >
> > On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:
> > > Currently, THP allocation cannot be restricted to khugepaged alone while
> > > being disabled in the page fault path. This limitation exists because
> > > disabling THP allocation during page faults also prevents the execution of
> > > khugepaged_enter_vma() in that path.
> >
> > This is quite confusing, I see what you mean - you want to be able to disable
> > page fault THP but not khugepaged THP _at the point of possibly faulting in a
> > THP aligned VMA_.
> >
> > It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous
> > VMA, rather than depending on the return value specified by
> > thp_vma_allowable_order().
>
> The functions thp_vma_allowable_order(TVA_PAGEFAULT) and
> thp_vma_allowable_order(TVA_KHUGEPAGED) are functionally equivalent
> within the page fault handler; they always yield the same result.
> Consequently, their execution order is irrelevant.

It seems hard to definitely demonstrate that by checking !in_pf vs not in this
situation :) but it seems broadly true afaict.

So they differ only in that one starts khugepaged, the other tries to
establish a THP on fault via create_huge_pmd().

>
> The change reorders these two calls and, in doing so, also moves the
> call to vmf_anon_prepare(vmf). This alters the control flow:
> - before this change:  The logic checked the return value of
> vmf_anon_prepare() between the two thp_vma_allowable_order() calls.
>
>     thp_vma_allowable_order(TVA_PAGEFAULT);
>     ret = vmf_anon_prepare(vmf);
>     if (ret)
>         return ret;
>     thp_vma_allowable_order(TVA_KHUGEPAGED);

I mean it's also _only if_ the TVA_PAGEFAULT invocation succeeds that the
TVA_KHUGEPAGED one happens.

>
>  - after this change: The logic now executes both
> thp_vma_allowable_order() calls first and does not check the return
> value of vmf_anon_prepare().
>
>     thp_vma_allowable_order(TVA_KHUGEPAGED);
>     thp_vma_allowable_order(TVA_PAGEFAULT);
>     ret = vmf_anon_prepare(vmf); // Return value 'ret' is ignored.

Hm this is confusing, your code does:

+       if (pmd_none(*vmf.pmd)) {
+               if (vma_is_anonymous(vma))
+                       khugepaged_enter_vma(vma, vm_flags);
+               if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+                       ret = create_huge_pmd(&vmf);
+                       if (!(ret & VM_FAULT_FALLBACK))
+                               return ret;
+               }

So the ret is absolutely not ignored, but whether it succeeds or not, we still
invoke khugepaged_enter_vma().

Previously we would not have one this had vmf_anon_prepare() failed in
do_huge_pmd_anonymous_page().

Which I guess is what you mean?

>
> This change is safe because the return value of vmf_anon_prepare() can
> be safely ignored. This function checks for transient system-level
> conditions (e.g., memory pressure, THP availability) that might
> prevent an immediate THP allocation. It does not guarantee that a
> subsequent allocation will succeed.
>
> This behavior is consistent with the policy in hugepage_madvise(),
> where a VMA is queued for khugepaged before a definitive allocation
> check. If the system is under pressure, khugepaged will simply retry
> the allocation at a more opportune time.

OK. I do note though that the khugepaged being kicked off is at mm_struct level.

So us trying to invoke khugepaged on the mm again is about.. something having
changed that would previously have prevented us but now doesn't?

That is, a product of thp_vma_allowable_order() right?

So probably a sysfs change or similar?

But I guess it makes sense to hook in BPF whenever this is the case because this
_could_ be the point at which khugepaged enters the mm, and we want to select
the allowable order at this time.

So on basis of the two checks being effectively equivalent (on assumption this
is always the case) then the change is fairly reasonable.

Though I would put this information, that the checks are equivalent, in the
commit message so it's really clear.

>
> >
> > So I think a clearer explanation is:
> >
> >         khugepaged_enter_vma() ultimately invokes any attached BPF function with
> >         the TVA_KHUGEPAGED flag set when determining whether or not to enable
> >         khugepaged THP for a freshly faulted in VMA.
> >
> >         Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
> >         invoked by create_huge_pmd() and only when we have already checked to
> >         see if an allowable TVA_PAGEFAULT order is specified.
> >
> >         Since we might want to disallow THP on fault-in but allow it via
> >         khugepaged, we move things around so we always attempt to enter
> >         khugepaged upon fault.
>
> Thanks for the clarification.

Thanks

>
> >
> > Having said all this, I'm very confused.
> >
> > Why are we doing this?
> >
> > We only enable khugepaged _early_ when we know we're faulting in a huge PMD
> > here.
> >
> > I guess we do this because, if we are allowed to do the pagefault, maybe
> > something changed that might have previously disallowed khugepaged to run for
> > the mm.
> >
> > But now we're just checking unconditionally for... no reason?
>
> I have blamed the change history of do_huge_pmd_anonymous_page() but
> was unable to find any rationale for placing khugepaged_enter_vma()
> after the vmf_anon_prepare() check. I therefore believe this ordering
> is likely unintentional.

Right, yeah.

>
> >
> > if BPF disables page fault but not khugepaged, then surely the mm would already
> > be under be khugepaged if it could be?
>
> The behavior you describe applies to the madvise mode, not the always
> mode. To reiterate: the hugepage_madvise() function unconditionally
> adds the memory mm to the khugepaged queue, whereas the page fault
> handler employs conditional logic.

Right, so we suggest the order

>
> >
> > It's sort of immaterial if we get a pmd_none() that is not-faultable for
> > whatever reason but BPF might say is khugepaged'able, because it'd have already
> > set this.
> >
> > This is because if we just map a new VMA, we already let khugepaged have it via
> > khugepaged_enter_vma() in __mmap_new_vma() and in the merge paths.
> >
> > I mean maybe I'm missing something here :)
> >
> > >
> > > With the introduction of BPF, we can now implement THP policies based on
> > > different TVA types. This patch adjusts the logic to support this new
> > > capability.
> > >
> > > While we could also extend prtcl() to utilize this new policy, such a
> >
> > Typo: prtcl -> prctl
>
> thanks

Cheers

>
> >
> > > change would require a uAPI modification.
> >
> > Hm, in what respect? PR_SET_THP_DISABLE?
>
> Right, when can extend PR_SET_THP_DISABLE() to support this logic as well.

Yeah let's not touch that please :)

>
> --
> Regards
> Yafang

Cheers, Lorenzo