On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
>
> > On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
> >>
> >> Hi Yafang,
> >>
> >> We recently added a new THP entry in the MAINTAINERS file[1], do you
> >> mind ccing the people there in your next version? (I added them here)
> >>
> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >
> > Thanks for the reminder.
> > I will add the maintainers and reviewers in the next version.
> >
> >>
> >> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>> In our container environment, we aim to enable THP selectively,
> >>> allowing specific services to use it while restricting others. This
> >>> approach is driven by the following considerations:
> >>>
> >>> 1. Memory Fragmentation
> >>>    THP can increase memory fragmentation, so we want to limit its
> >>>    use across services.
> >>> 2. Performance Impact
> >>>    Some services see no benefit from THP, making its use unnecessary.
> >>> 3. Performance Gains
> >>>    Certain workloads, such as machine learning services, see
> >>>    significant performance improvements with THP, so we enable it
> >>>    for them specifically.
> >>>
> >>> Since multiple services run on a single host in a containerized
> >>> environment, enabling THP globally is not ideal. Previously, we set
> >>> THP to madvise, allowing selected services to opt in via
> >>> MADV_HUGEPAGE. However, this approach had a limitation:
> >>>
> >>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>   third-party libraries, bypassing our restrictions.
> >>
> >> Basically, you want more precise control of THP enablement and the
> >> ability to override madvise() from userspace.
> >>
> >> In terms of overriding madvise(), do you have any concrete example of
> >> these third-party libraries? madvise() users are supposed to know what
> >> they are doing, so I wonder why they are causing trouble in your
> >> environment.
> >
> > To my knowledge, jemalloc [0] supports THP.
> > Applications using jemalloc typically rely on its default
> > configuration rather than explicitly enabling or disabling THP. If
> > the system is configured with THP=madvise, these applications may
> > automatically leverage THP where appropriate.
> >
> > [0] https://github.com/jemalloc/jemalloc
>
> It sounds like a userspace issue. For jemalloc, if applications require
> it, can't you replace jemalloc with one compiled with --disable-thp to
> work around the issue?

That's not the issue this patchset is trying to address or work around;
I believe we should focus on the actual problem it is meant to solve.

By the way, you might not raise this question if you were managing a
large fleet of servers. We're a platform provider, but we don't maintain
all the packages ourselves; users make their own choices based on their
specific requirements. It's not feasible for us to develop and maintain
every package.

>
> >
> >>
> >>>
> >>> To address this issue, we initially hooked the __x64_sys_madvise()
> >>> syscall, which is error-injectable, to blacklist unwanted services.
> >>> While this worked, it was error-prone and ineffective for services
> >>> needing always mode, as modifying their code to use madvise was
> >>> impractical.
> >>>
> >>> To achieve finer-grained control, we introduced an fmod_ret-based
> >>> solution. Now, we dynamically adjust THP settings per service by
> >>> hooking hugepage_global_{enabled,always}() via BPF. This allows us
> >>> to enable or disable THP on a per-service basis without global
> >>> impact.
> >>
> >> hugepage_global_*() are whole-system knobs. How did you use them to
> >> achieve per-service control? In terms of per-service, does it mean
> >> you need a per-memcg (I assume each service has its own memcg) THP
> >> configuration?
> >
> > With this new BPF hook, we can manage THP behavior either per-service
> > or per-memcg.
> > In our use case, we've chosen memcg-based control for finer-grained
> > management. Below is a simplified example of our implementation:
> >
> >         struct {
> >                 __uint(type, BPF_MAP_TYPE_HASH);
> >                 /* usually there won't be too many cgroups */
> >                 __uint(max_entries, 4096);
> >                 __type(key, u64);
> >                 __type(value, u32);
> >                 __uint(map_flags, BPF_F_NO_PREALLOC);
> >         } thp_whitelist SEC(".maps");
> >
> >         SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> >         int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> >         {
> >                 struct cgroup_subsys_state *css;
> >                 struct css_set *cgroups;
> >                 struct mm_struct *mm;
> >                 struct cgroup *cgroup;
> >                 struct cgroup *parent;
> >                 struct task_struct *p;
> >                 u64 cgrp_id;
> >
> >                 if (!vma)
> >                         return 0;
> >
> >                 mm = vma->vm_mm;
> >                 if (!mm)
> >                         return 0;
> >
> >                 p = mm->owner;
> >                 cgroups = p->cgroups;
> >                 cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >                 cgrp_id = cgroup->kn->id;
> >
> >                 /* Allow the tasks in the thp_whitelist to use THP. */
> >                 if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >                         return 1;
> >                 return 0;
> >         }
> >
> > I chose not to include this in the self-tests to avoid the complexity
> > of setting up cgroups for testing purposes. However, in patch #4 of
> > this series, I've included a simpler example demonstrating task-level
> > control.
>
> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?

You'll need to modify the user-space code, and again, this likely
wouldn't be a concern if you were managing a large fleet of servers.
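For reference, this is roughly what the prctl() approach requires in
every binary that should opt out (a minimal sketch, not code from this
series). Note also that PR_SET_THP_DISABLE can only opt a task out of
THP; it cannot opt a task in when the global policy is madvise or never,
so it does not cover the "enable THP for selected services" case:

        #include <stdio.h>
        #include <sys/prctl.h>

        int main(void)
        {
                /*
                 * Opt this task (and children forked later) out of THP.
                 * Every service to be excluded has to carry a change
                 * like this, which is the per-binary code modification
                 * we want to avoid across a large fleet.
                 */
                if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                        perror("prctl(PR_SET_THP_DISABLE)");

                /* ... run the actual service ... */
                return 0;
        }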
>
> > For service-level control, we could potentially utilize BPF task
> > local storage as an alternative approach.
>
> +cgroup people
>
> For service-level control, there was a proposal to add cgroup-based
> THP control[1]. You might need a strong use case to convince people.
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/

Thanks for the reference. I've reviewed the related discussion, and if
I understand correctly, that proposal was rejected by the maintainers.

-- 
Regards
Yafang