Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment

Yafang Shao <laoar.shao@xxxxxxxxx> · Wed, 30 Apr 2025 22:38:10 +0800

On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
>
> > On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> Hi Yafang,
> >>
> >> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> >> people there in your next version? (I added them here)
> >>
> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >
> > Thanks for your reminder.
> > I will add the maintainers and reviewers in the next version.
> >
> >>
> >> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>> In our container environment, we aim to enable THP selectively—allowing
> >>> specific services to use it while restricting others. This approach is
> >>> driven by the following considerations:
> >>>
> >>> 1. Memory Fragmentation
> >>>    THP can lead to increased memory fragmentation, so we want to limit its
> >>>    use across services.
> >>> 2. Performance Impact
> >>>    Some services see no benefit from THP, making its usage unnecessary.
> >>> 3. Performance Gains
> >>>    Certain workloads, such as machine learning services, experience
> >>>    significant performance improvements with THP, so we enable it for them
> >>>    specifically.
> >>>
> >>> Since multiple services run on a single host in a containerized environment,
> >>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> >>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> >>> approach had limitation:
> >>>
> >>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>   third-party libraries, bypassing our restrictions.
> >>
> >> Basically, you want more precise control of THP enablement and the
> >> ability of overriding madvise() from userspace.
> >>
> >> In terms of overriding madvise(), do you have any concrete example of
> >> these third-party libraries? madvise() users are supposed to know what
> >> they are doing, so I wonder why they are causing trouble in your
> >> environment.
> >
> > To my knowledge, jemalloc [0] supports THP.
> > Applications using jemalloc typically rely on its default
> > configurations rather than explicitly enabling or disabling THP. If
> > the system is configured with THP=madvise, these applications may
> > automatically leverage THP where appropriate
> >
> > [0]. https://github.com/jemalloc/jemalloc
>
> It sounds like a userspace issue. For jemalloc, if applications require
> it, can't you replace the jemalloc with a one compiled with --disable-thp
> to work around the issue?

That’s not the issue this patchset is trying to address or work
around. I believe we should focus on the actual problem it's meant to
solve.

By the way, you might not raise this question if you were managing a
large fleet of servers. We're a platform provider, but we don’t
maintain all the packages ourselves. Users make their own choices
based on their specific requirements. It's not a feasible solution for
us to develop and maintain every package.

>
> >
> >>
> >>>
> >>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> >>> which is error-injectable, to blacklist unwanted services. While this
> >>> worked, it was error-prone and ineffective for services needing always mode,
> >>> as modifying their code to use madvise was impractical.
> >>>
> >>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> >>> Now, we dynamically adjust THP settings per service by hooking
> >>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> >>> enable or disable on a per-service basis without global impact.
> >>
> >> hugepage_global_*() are whole system knobs. How did you use it to
> >> achieve per-service control? In terms of per-service, does it mean
> >> you need per-memcg group (I assume each service has its own memcg) THP
> >> configuration?
> >
> > With this new BPF hook, we can manage THP behavior either per-service
> > or per-memory.
> > In our use case, we’ve chosen memcg-based control for finer-grained
> > management. Below is a simplified example of our implementation:
> >
> > struct{
> >         __uint(type, BPF_MAP_TYPE_HASH);
> >         __uint(max_entries, 4096);      /* usually there won't too
> > many cgroups */
> >         __type(key, u64);
> >         __type(value, u32);
> >         __uint(map_flags, BPF_F_NO_PREALLOC);
> > } thp_whitelist SEC(".maps");
> >
> > SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> > int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> > {
> >         struct cgroup_subsys_state *css;
> >         struct css_set *cgroups;
> >         struct mm_struct *mm;
> >         struct cgroup *cgroup;
> >         struct cgroup *parent;
> >         struct task_struct *p;
> >         u64 cgrp_id;
> >
> >         if (!vma)
> >                 return 0;
> >
> >         mm = vma->vm_mm;
> >         if (!mm)
> >                 return 0;
> >
> >         p = mm->owner;
> >         cgroups = p->cgroups;
> >         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >         cgrp_id = cgroup->kn->id;
> >
> >         /* Allow the tasks in the thp_whiltelist to use THP. */
> >         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >             return 1;
> >         return 0;
> > }
> >
> > I chose not to include this in the self-tests to avoid the complexity
> > of setting up cgroups for testing purposes. However, in patch #4 of
> > this series, I've included a simpler example demonstrating task-level
> > control.
>
> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?

You’ll need to modify the user-space code—and again, this likely
wouldn’t be a concern if you were managing a large fleet of servers.

>
> > For service-level control, we could potentially utilize BPF task local
> > storage as an alternative approach.
>
> +cgroup people
>
> For service-level control, there was a proposal of adding cgroup based
> THP control[1]. You might need a strong use case to convince people.
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/

Thanks for the reference. I've reviewed the related discussion, and if
I understand correctly, the proposal was rejected by the maintainers.

-- 
Regards
Yafang