On Wed, Apr 30, 2025 at 11:00 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 30 Apr 2025, at 10:38, Yafang Shao wrote:
>
> > On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
> >>
> >>> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Yafang,
> >>>>
> >>>> We recently added a new THP entry in the MAINTAINERS file[1], do you
> >>>> mind ccing the people listed there in your next version? (I added
> >>>> them here)
> >>>>
> >>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >>>
> >>> Thanks for the reminder.
> >>> I will add the maintainers and reviewers in the next version.
> >>>
> >>>>
> >>>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>>>> In our container environment, we aim to enable THP selectively—allowing
> >>>>> specific services to use it while restricting others. This approach
> >>>>> is driven by the following considerations:
> >>>>>
> >>>>> 1. Memory Fragmentation
> >>>>>    THP can lead to increased memory fragmentation, so we want to
> >>>>>    limit its use across services.
> >>>>> 2. Performance Impact
> >>>>>    Some services see no benefit from THP, making its usage
> >>>>>    unnecessary.
> >>>>> 3. Performance Gains
> >>>>>    Certain workloads, such as machine learning services, experience
> >>>>>    significant performance improvements with THP, so we enable it
> >>>>>    for them specifically.
> >>>>>
> >>>>> Since multiple services run on a single host in a containerized
> >>>>> environment, enabling THP globally is not ideal. Previously, we set
> >>>>> THP to madvise, allowing selected services to opt in via
> >>>>> MADV_HUGEPAGE. However, this approach had a limitation:
> >>>>>
> >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>>>   third-party libraries, bypassing our restrictions.
> >>>>
> >>>> Basically, you want more precise control of THP enablement and the
> >>>> ability to override madvise() from userspace.
> >>>>
> >>>> In terms of overriding madvise(), do you have any concrete examples
> >>>> of these third-party libraries? madvise() users are supposed to know
> >>>> what they are doing, so I wonder why they are causing trouble in
> >>>> your environment.
> >>>
> >>> To my knowledge, jemalloc [0] supports THP.
> >>> Applications using jemalloc typically rely on its default
> >>> configuration rather than explicitly enabling or disabling THP. If
> >>> the system is configured with THP=madvise, these applications may
> >>> automatically leverage THP where appropriate.
> >>>
> >>> [0] https://github.com/jemalloc/jemalloc
> >>
> >> It sounds like a userspace issue. For jemalloc, if applications
> >> require it, can't you replace the jemalloc with one compiled with
> >> --disable-thp to work around the issue?
> >
> > That’s not the issue this patchset is trying to address or work
> > around. I believe we should focus on the actual problem it's meant to
> > solve.
> >
> > By the way, you might not raise this question if you were managing a
> > large fleet of servers. We're a platform provider, but we don’t
> > maintain all the packages ourselves. Users make their own choices
> > based on their specific requirements. It's not a feasible solution
> > for us to develop and maintain every package.
>
> Basically, the user wants to use THP, but as a service provider you
> think differently, so you want to override the userspace choice. Am I
> getting it right?

No—the users aren’t specifically concerned with THP. They just copied a
configuration from the internet and deployed it in the production
environment.
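To make that concrete, the pattern such a library applies internally
looks roughly like the sketch below. This is illustrative only, not
jemalloc's actual code; chunk_alloc() and the chunk size are made up
for the example. With THP=madvise, the single madvise() call buried in
the allocator is all it takes to opt the application into THP without
any change to the application itself:

/* Illustrative sketch only, not jemalloc's actual code. An allocator
 * that manages its own heap can opt every chunk into THP behind the
 * application's back (alignment handling omitted for brevity).
 */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE (2UL << 20) /* one PMD-sized (2MB) chunk */

static void *chunk_alloc(void)
{
        void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;

        /* The application never sees this call; with THP=madvise it
         * is enough to make the fault path and khugepaged back this
         * range with huge pages.
         */
        madvise(p, CHUNK_SIZE, MADV_HUGEPAGE);
        return p;
}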
>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> To address this issue, we initially hooked the __x64_sys_madvise()
> >>>>> syscall, which is error-injectable, to blacklist unwanted services.
> >>>>> While this worked, it was error-prone and ineffective for services
> >>>>> needing always mode, as modifying their code to use madvise was
> >>>>> impractical.
> >>>>>
> >>>>> To achieve finer-grained control, we introduced an fmod_ret-based
> >>>>> solution. Now, we dynamically adjust THP settings per service by
> >>>>> hooking hugepage_global_{enabled,always}() via BPF. This allows us
> >>>>> to set THP to enable or disable on a per-service basis without
> >>>>> global impact.
> >>>>
> >>>> hugepage_global_*() are whole-system knobs. How did you use them to
> >>>> achieve per-service control? In terms of per-service, does it mean
> >>>> you need a per-memcg (I assume each service has its own memcg) THP
> >>>> configuration?
> >>>
> >>> With this new BPF hook, we can manage THP behavior either per-service
> >>> or per-memcg.
> >>> In our use case, we’ve chosen memcg-based control for finer-grained
> >>> management. Below is a simplified example of our implementation:
> >>>
> >>> struct {
> >>>         __uint(type, BPF_MAP_TYPE_HASH);
> >>>         __uint(max_entries, 4096); /* usually there won't be too many cgroups */
> >>>         __type(key, u64);
> >>>         __type(value, u32);
> >>>         __uint(map_flags, BPF_F_NO_PREALLOC);
> >>> } thp_whitelist SEC(".maps");
> >>>
> >>> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> >>> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> >>> {
> >>>         struct css_set *cgroups;
> >>>         struct mm_struct *mm;
> >>>         struct cgroup *cgroup;
> >>>         struct task_struct *p;
> >>>         u64 cgrp_id;
> >>>
> >>>         if (!vma)
> >>>                 return 0;
> >>>
> >>>         mm = vma->vm_mm;
> >>>         if (!mm)
> >>>                 return 0;
> >>>
> >>>         p = mm->owner;
> >>>         cgroups = p->cgroups;
> >>>         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >>>         cgrp_id = cgroup->kn->id;
> >>>
> >>>         /* Allow the tasks in the thp_whitelist to use THP. */
> >>>         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >>>                 return 1;
> >>>         return 0;
> >>> }
> >>>
> >>> I chose not to include this in the self-tests to avoid the complexity
> >>> of setting up cgroups for testing purposes. However, in patch #4 of
> >>> this series, I've included a simpler example demonstrating task-level
> >>> control.
> >>
> >> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?
> >
> > You’ll need to modify the user-space code—and again, this likely
> > wouldn’t be a concern if you were managing a large fleet of servers.
> >
> >>
> >>> For service-level control, we could potentially utilize BPF task
> >>> local storage as an alternative approach.
> >>
> >> +cgroup people
> >>
> >> For service-level control, there was a proposal of adding cgroup-based
> >> THP control[1]. You might need a strong use case to convince people.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
> >
> > Thanks for the reference. I've reviewed the related discussion, and if
> > I understand correctly, the proposal was rejected by the maintainers.
>
> I wonder why your approach is better than the cgroup-based THP control
> proposal.

It’s more flexible, and you can still use it even without cgroups.
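As an illustration of that flexibility, the whitelist in the example
above can be driven entirely from user space. Below is a minimal,
hypothetical loader sketch; the pin path is invented, and treating the
cgroup2 directory inode number as the cgroup->kn->id key is an
assumption that holds on 64-bit kernels:

/* Hypothetical loader sketch: allow THP for one cgroup by inserting
 * its id into the pinned thp_whitelist map. Build with -lbpf.
 */
#include <bpf/bpf.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct stat st;
        uint64_t cgrp_id;
        uint32_t enable = 1;
        int map_fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <pinned-map> <cgroup2-dir>\n",
                        argv[0]);
                return 1;
        }

        /* On 64-bit kernels the cgroup2 directory inode number matches
         * cgroup->kn->id, which is the key the BPF program looks up.
         */
        if (stat(argv[2], &st))
                return 1;
        cgrp_id = st.st_ino;

        map_fd = bpf_obj_get(argv[1]); /* e.g. /sys/fs/bpf/thp_whitelist */
        if (map_fd < 0)
                return 1;

        return bpf_map_update_elem(map_fd, &cgrp_id, &enable, BPF_ANY) ? 1 : 0;
}

With this, moving a service in or out of the THP whitelist is a runtime
map update, with no application changes and no kernel reconfiguration.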
One limitation is that CONFIG_MEMCG must be enabled due to the use of
mm_struct::owner. I'm wondering if it would be feasible to decouple
mm_struct::owner from CONFIG_MEMCG. Alternatively, if there’s another
reliable way to retrieve the task_struct without relying on
mm_struct::owner, we could consider adding BPF kfuncs to support it.

--
Regards
Yafang