Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment

Zi Yan <ziy@xxxxxxxxxx> · Wed, 30 Apr 2025 09:19:41 -0400

On 29 Apr 2025, at 22:33, Yafang Shao wrote:

> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>
>> Hi Yafang,
>>
>> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
>> people there in your next version? (I added them here)
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
>
> Thanks for your reminder.
> I will add the maintainers and reviewers in the next version.
>
>>
>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
>>> In our container environment, we aim to enable THP selectively—allowing
>>> specific services to use it while restricting others. This approach is
>>> driven by the following considerations:
>>>
>>> 1. Memory Fragmentation
>>>    THP can lead to increased memory fragmentation, so we want to limit its
>>>    use across services.
>>> 2. Performance Impact
>>>    Some services see no benefit from THP, making its usage unnecessary.
>>> 3. Performance Gains
>>>    Certain workloads, such as machine learning services, experience
>>>    significant performance improvements with THP, so we enable it for them
>>>    specifically.
>>>
>>> Since multiple services run on a single host in a containerized environment,
>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
>>> approach had limitation:
>>>
>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
>>>   third-party libraries, bypassing our restrictions.
>>
>> Basically, you want more precise control of THP enablement and the
>> ability of overriding madvise() from userspace.
>>
>> In terms of overriding madvise(), do you have any concrete example of
>> these third-party libraries? madvise() users are supposed to know what
>> they are doing, so I wonder why they are causing trouble in your
>> environment.
>
> To my knowledge, jemalloc [0] supports THP.
> Applications using jemalloc typically rely on its default
> configurations rather than explicitly enabling or disabling THP. If
> the system is configured with THP=madvise, these applications may
> automatically leverage THP where appropriate
>
> [0]. https://github.com/jemalloc/jemalloc

It sounds like a userspace issue. For jemalloc, if applications require
it, can't you replace the jemalloc with a one compiled with --disable-thp
to work around the issue?

>
>>
>>>
>>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
>>> which is error-injectable, to blacklist unwanted services. While this
>>> worked, it was error-prone and ineffective for services needing always mode,
>>> as modifying their code to use madvise was impractical.
>>>
>>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
>>> Now, we dynamically adjust THP settings per service by hooking
>>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
>>> enable or disable on a per-service basis without global impact.
>>
>> hugepage_global_*() are whole system knobs. How did you use it to
>> achieve per-service control? In terms of per-service, does it mean
>> you need per-memcg group (I assume each service has its own memcg) THP
>> configuration?
>
> With this new BPF hook, we can manage THP behavior either per-service
> or per-memory.
> In our use case, we’ve chosen memcg-based control for finer-grained
> management. Below is a simplified example of our implementation:
>
> struct{
>         __uint(type, BPF_MAP_TYPE_HASH);
>         __uint(max_entries, 4096);      /* usually there won't too
> many cgroups */
>         __type(key, u64);
>         __type(value, u32);
>         __uint(map_flags, BPF_F_NO_PREALLOC);
> } thp_whitelist SEC(".maps");
>
> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> {
>         struct cgroup_subsys_state *css;
>         struct css_set *cgroups;
>         struct mm_struct *mm;
>         struct cgroup *cgroup;
>         struct cgroup *parent;
>         struct task_struct *p;
>         u64 cgrp_id;
>
>         if (!vma)
>                 return 0;
>
>         mm = vma->vm_mm;
>         if (!mm)
>                 return 0;
>
>         p = mm->owner;
>         cgroups = p->cgroups;
>         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
>         cgrp_id = cgroup->kn->id;
>
>         /* Allow the tasks in the thp_whiltelist to use THP. */
>         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
>             return 1;
>         return 0;
> }
>
> I chose not to include this in the self-tests to avoid the complexity
> of setting up cgroups for testing purposes. However, in patch #4 of
> this series, I've included a simpler example demonstrating task-level
> control.

For task-level control, why not using prctl(PR_SET_THP_DISABLE)?

> For service-level control, we could potentially utilize BPF task local
> storage as an alternative approach.

+cgroup people

For service-level control, there was a proposal of adding cgroup based
THP control[1]. You might need a strong use case to convince people.

[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/

--
Best Regards,
Yan, Zi