On Wed, Apr 30, 2025 at 11:00 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 30 Apr 2025, at 10:38, Yafang Shao wrote:
>
> > On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
> >>
> >>> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Yafang,
> >>>>
> >>>> We recently added a new THP entry in the MAINTAINERS file[1], do you
> >>>> mind ccing the people listed there in your next version? (I added
> >>>> them here)
> >>>>
> >>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >>>
> >>> Thanks for the reminder.
> >>> I will add the maintainers and reviewers in the next version.
> >>>
> >>>>
> >>>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>>>> In our container environment, we aim to enable THP selectively—allowing
> >>>>> specific services to use it while restricting others. This approach
> >>>>> is driven by the following considerations:
> >>>>>
> >>>>> 1. Memory Fragmentation
> >>>>>    THP can lead to increased memory fragmentation, so we want to
> >>>>>    limit its use across services.
> >>>>> 2. Performance Impact
> >>>>>    Some services see no benefit from THP, making its usage
> >>>>>    unnecessary.
> >>>>> 3. Performance Gains
> >>>>>    Certain workloads, such as machine learning services, experience
> >>>>>    significant performance improvements with THP, so we enable it
> >>>>>    for them specifically.
> >>>>>
> >>>>> Since multiple services run on a single host in a containerized
> >>>>> environment, enabling THP globally is not ideal. Previously, we set
> >>>>> THP to madvise, allowing selected services to opt in via
> >>>>> MADV_HUGEPAGE. However, this approach had a limitation:
> >>>>>
> >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>>>   third-party libraries, bypassing our restrictions.
> >>>>
> >>>> Basically, you want more precise control of THP enablement and the
> >>>> ability to override madvise() from userspace.
> >>>>
> >>>> In terms of overriding madvise(), do you have any concrete examples
> >>>> of these third-party libraries? madvise() users are supposed to know
> >>>> what they are doing, so I wonder why they are causing trouble in
> >>>> your environment.
> >>>
> >>> To my knowledge, jemalloc [0] supports THP.
> >>> Applications using jemalloc typically rely on its default
> >>> configuration rather than explicitly enabling or disabling THP. If
> >>> the system is configured with THP=madvise, these applications may
> >>> automatically leverage THP where appropriate.
> >>>
> >>> [0] https://github.com/jemalloc/jemalloc
> >>
> >> It sounds like a userspace issue. For jemalloc, if applications
> >> require it, can't you replace the jemalloc with one compiled with
> >> --disable-thp to work around the issue?
> >
> > That’s not the issue this patchset is trying to address or work
> > around. I believe we should focus on the actual problem it's meant to
> > solve.
> >
> > By the way, you might not raise this question if you were managing a
> > large fleet of servers. We're a platform provider, but we don’t
> > maintain all the packages ourselves. Users make their own choices
> > based on their specific requirements. It's not a feasible solution
> > for us to develop and maintain every package.
>
> Basically, the user wants to use THP, but as a service provider you
> think differently, so you want to override the userspace choice. Am I
> getting it right?

No—the users aren’t specifically concerned with THP. They just copied a
configuration from the internet and deployed it in the production
environment.
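To make that concrete, the pattern such a library applies internally
looks roughly like the sketch below. This is illustrative only, not
jemalloc's actual code; chunk_alloc() and the chunk size are made up
for the example. With THP=madvise, the single madvise() call buried in
the allocator is all it takes to opt the application into THP without
any change to the application itself:

/* Illustrative sketch only, not jemalloc's actual code. An allocator
 * that manages its own heap can opt every chunk into THP behind the
 * application's back (alignment handling omitted for brevity).
 */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE (2UL << 20) /* one PMD-sized (2MB) chunk */

static void *chunk_alloc(void)
{
        void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;

        /* The application never sees this call; with THP=madvise it
         * is enough to make the fault path and khugepaged back this
         * range with huge pages.
         */
        madvise(p, CHUNK_SIZE, MADV_HUGEPAGE);
        return p;
}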
>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> To address this issue, we initially hooked the __x64_sys_madvise()
> >>>>> syscall, which is error-injectable, to blacklist unwanted services.
> >>>>> While this worked, it was error-prone and ineffective for services
> >>>>> needing always mode, as modifying their code to use madvise was
> >>>>> impractical.
> >>>>>
> >>>>> To achieve finer-grained control, we introduced an fmod_ret-based
> >>>>> solution. Now, we dynamically adjust THP settings per service by
> >>>>> hooking hugepage_global_{enabled,always}() via BPF. This allows us
> >>>>> to set THP to enable or disable on a per-service basis without
> >>>>> global impact.
> >>>>
> >>>> hugepage_global_*() are whole-system knobs. How did you use them to
> >>>> achieve per-service control? In terms of per-service, does it mean
> >>>> you need a per-memcg (I assume each service has its own memcg) THP
> >>>> configuration?
> >>>
> >>> With this new BPF hook, we can manage THP behavior either per-service
> >>> or per-memcg.
> >>> In our use case, we’ve chosen memcg-based control for finer-grained
> >>> management. Below is a simplified example of our implementation:
> >>>
> >>> struct {
> >>>         __uint(type, BPF_MAP_TYPE_HASH);
> >>>         __uint(max_entries, 4096); /* usually there won't be too many cgroups */
> >>>         __type(key, u64);
> >>>         __type(value, u32);
> >>>         __uint(map_flags, BPF_F_NO_PREALLOC);
> >>> } thp_whitelist SEC(".maps");
> >>>
> >>> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> >>> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> >>> {
> >>>         struct css_set *cgroups;
> >>>         struct mm_struct *mm;
> >>>         struct cgroup *cgroup;
> >>>         struct task_struct *p;
> >>>         u64 cgrp_id;
> >>>
> >>>         if (!vma)
> >>>                 return 0;
> >>>
> >>>         mm = vma->vm_mm;
> >>>         if (!mm)
> >>>                 return 0;
> >>>
> >>>         p = mm->owner;
> >>>         cgroups = p->cgroups;
> >>>         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >>>         cgrp_id = cgroup->kn->id;
> >>>
> >>>         /* Allow the tasks in the thp_whitelist to use THP. */
> >>>         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >>>                 return 1;
> >>>         return 0;
> >>> }
> >>>
> >>> I chose not to include this in the self-tests to avoid the complexity
> >>> of setting up cgroups for testing purposes. However, in patch #4 of
> >>> this series, I've included a simpler example demonstrating task-level
> >>> control.
> >>
> >> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?
> >
> > You’ll need to modify the user-space code—and again, this likely
> > wouldn’t be a concern if you were managing a large fleet of servers.
> >
> >>
> >>> For service-level control, we could potentially utilize BPF task
> >>> local storage as an alternative approach.
> >>
> >> +cgroup people
> >>
> >> For service-level control, there was a proposal of adding cgroup-based
> >> THP control[1]. You might need a strong use case to convince people.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
> >
> > Thanks for the reference. I've reviewed the related discussion, and if
> > I understand correctly, the proposal was rejected by the maintainers.
>
> I wonder why your approach is better than the cgroup-based THP control
> proposal.

It’s more flexible, and you can still use it even without cgroups.
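As an illustration of that flexibility, the whitelist in the example
above can be driven entirely from user space. Below is a minimal,
hypothetical loader sketch; the pin path is invented, and treating the
cgroup2 directory inode number as the cgroup->kn->id key is an
assumption that holds on 64-bit kernels:

/* Hypothetical loader sketch: allow THP for one cgroup by inserting
 * its id into the pinned thp_whitelist map. Build with -lbpf.
 */
#include <bpf/bpf.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct stat st;
        uint64_t cgrp_id;
        uint32_t enable = 1;
        int map_fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <pinned-map> <cgroup2-dir>\n",
                        argv[0]);
                return 1;
        }

        /* On 64-bit kernels the cgroup2 directory inode number matches
         * cgroup->kn->id, which is the key the BPF program looks up.
         */
        if (stat(argv[2], &st))
                return 1;
        cgrp_id = st.st_ino;

        map_fd = bpf_obj_get(argv[1]); /* e.g. /sys/fs/bpf/thp_whitelist */
        if (map_fd < 0)
                return 1;

        return bpf_map_update_elem(map_fd, &cgrp_id, &enable, BPF_ANY) ? 1 : 0;
}

With this, moving a service in or out of the THP whitelist is a runtime
map update, with no application changes and no kernel reconfiguration.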
One limitation is that CONFIG_MEMCG must be enabled due to the use of
mm_struct::owner. I'm wondering if it would be feasible to decouple
mm_struct::owner from CONFIG_MEMCG. Alternatively, if there’s another
reliable way to retrieve the task_struct without relying on
mm_struct::owner, we could consider adding BPF kfuncs to support it.

--
Regards
Yafang