On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> Hi Yafang,
>
> We recently added a new THP entry in the MAINTAINERS file [1]. Do you
> mind ccing the people listed there in your next version? (I added them
> here.)
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589

Thanks for the reminder. I will CC the maintainers and reviewers in the
next version.

> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> > In our container environment, we aim to enable THP selectively,
> > allowing specific services to use it while restricting others. This
> > approach is driven by the following considerations:
> >
> > 1. Memory Fragmentation
> > THP can lead to increased memory fragmentation, so we want to limit
> > its use across services.
> > 2. Performance Impact
> > Some services see no benefit from THP, making its usage unnecessary.
> > 3. Performance Gains
> > Certain workloads, such as machine learning services, experience
> > significant performance improvements with THP, so we enable it for
> > them specifically.
> >
> > Since multiple services run on a single host in a containerized
> > environment, enabling THP globally is not ideal. Previously, we set
> > THP to madvise, allowing selected services to opt in via
> > MADV_HUGEPAGE. However, this approach had a limitation:
> >
> > - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > third-party libraries, bypassing our restrictions.
>
> Basically, you want more precise control of THP enablement and the
> ability to override madvise() from userspace.
>
> In terms of overriding madvise(), do you have any concrete example of
> these third-party libraries? madvise() users are supposed to know what
> they are doing, so I wonder why they are causing trouble in your
> environment.

To my knowledge, jemalloc [0] supports THP. Applications using jemalloc
typically rely on its default configuration rather than explicitly
enabling or disabling THP. If the system is configured with THP=madvise,
these applications may automatically leverage THP where appropriate.

[0] https://github.com/jemalloc/jemalloc

> >
> > To address this issue, we initially hooked the __x64_sys_madvise()
> > syscall, which is error-injectable, to blacklist unwanted services.
> > While this worked, it was error-prone and ineffective for services
> > needing always mode, as modifying their code to use madvise() was
> > impractical.
> >
> > To achieve finer-grained control, we introduced an fmod_ret-based
> > solution. Now, we dynamically adjust THP settings per service by
> > hooking hugepage_global_{enabled,always}() via BPF. This allows us to
> > enable or disable THP on a per-service basis without global impact.
>
> hugepage_global_*() are whole-system knobs. How did you use them to
> achieve per-service control? In terms of per-service, does that mean
> you need per-memcg (I assume each service has its own memcg) THP
> configuration?

With this new BPF hook, we can manage THP behavior either per-service or
per-memcg. In our use case, we've chosen memcg-based control for
finer-grained management.
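For reference, here is a hypothetical userspace loader sketch (not part
of this series; the skeleton header and program names are illustrative
assumptions) that could populate the whitelist map used by the
kernel-side program shown below. On cgroup v2, the cgroup ID the BPF
program reads (cgroup->kn->id) is exposed to userspace as the inode
number of the cgroup directory:

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <bpf/libbpf.h>
#include "thp_whitelist.skel.h" /* hypothetical generated skeleton */

int main(int argc, char **argv)
{
	struct thp_whitelist_bpf *skel;
	struct stat st;
	__u64 cgrp_id;
	__u32 val = 1;
	int err;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <cgroup-v2-path>\n", argv[0]);
		return 1;
	}

	/* The cgroup directory's inode number matches cgroup->kn->id. */
	if (stat(argv[1], &st)) {
		perror("stat");
		return 1;
	}
	cgrp_id = st.st_ino;

	skel = thp_whitelist_bpf__open_and_load();
	if (!skel)
		return 1;

	err = bpf_map__update_elem(skel->maps.thp_whitelist,
				   &cgrp_id, sizeof(cgrp_id),
				   &val, sizeof(val), BPF_ANY);
	if (!err)
		err = thp_whitelist_bpf__attach(skel);
	if (!err)
		pause(); /* keep the fmod_ret program attached */

	thp_whitelist_bpf__destroy(skel);
	return err ? 1 : 0;
}

Running it as, e.g., ./thp_loader /sys/fs/cgroup/my-service would
whitelist that service's memcg for as long as the program stays
attached.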
Below is a simplified example of our implementation:

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096); /* usually there won't be too many cgroups */
	__type(key, u64);
	__type(value, u32);
	__uint(map_flags, BPF_F_NO_PREALLOC);
} thp_whitelist SEC(".maps");

SEC("fmod_ret/mm_bpf_thp_vma_allowable")
int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
{
	struct css_set *cgroups;
	struct mm_struct *mm;
	struct cgroup *cgroup;
	struct task_struct *p;
	u64 cgrp_id;

	if (!vma)
		return 0;
	mm = vma->vm_mm;
	if (!mm)
		return 0;

	/* Resolve the memory cgroup ID of the task owning this mm. */
	p = mm->owner;
	cgroups = p->cgroups;
	cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
	cgrp_id = cgroup->kn->id;

	/* Allow the tasks in the thp_whitelist to use THP. */
	if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
		return 1;
	return 0;
}

I chose not to include this in the selftests to avoid the complexity of
setting up cgroups for testing purposes. However, in patch #4 of this
series, I've included a simpler example demonstrating task-level
control. For service-level control, we could potentially utilize BPF
task local storage as an alternative approach.

--
Regards
Yafang