On 29 Apr 2025, at 22:33, Yafang Shao wrote: > On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote: >> >> Hi Yafang, >> >> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing >> people there in your next version? (I added them here) >> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589 > > Thanks for your reminder. > I will add the maintainers and reviewers in the next version. > >> >> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote: >>> In our container environment, we aim to enable THP selectively—allowing >>> specific services to use it while restricting others. This approach is >>> driven by the following considerations: >>> >>> 1. Memory Fragmentation >>> THP can lead to increased memory fragmentation, so we want to limit its >>> use across services. >>> 2. Performance Impact >>> Some services see no benefit from THP, making its usage unnecessary. >>> 3. Performance Gains >>> Certain workloads, such as machine learning services, experience >>> significant performance improvements with THP, so we enable it for them >>> specifically. >>> >>> Since multiple services run on a single host in a containerized environment, >>> enabling THP globally is not ideal. Previously, we set THP to madvise, >>> allowing selected services to opt in via MADV_HUGEPAGE. However, this >>> approach had limitation: >>> >>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through >>> third-party libraries, bypassing our restrictions. >> >> Basically, you want more precise control of THP enablement and the >> ability of overriding madvise() from userspace. >> >> In terms of overriding madvise(), do you have any concrete example of >> these third-party libraries? madvise() users are supposed to know what >> they are doing, so I wonder why they are causing trouble in your >> environment. > > To my knowledge, jemalloc [0] supports THP. > Applications using jemalloc typically rely on its default > configurations rather than explicitly enabling or disabling THP. If > the system is configured with THP=madvise, these applications may > automatically leverage THP where appropriate > > [0]. https://github.com/jemalloc/jemalloc It sounds like a userspace issue. For jemalloc, if applications require it, can't you replace the jemalloc with a one compiled with --disable-thp to work around the issue? > >> >>> >>> To address this issue, we initially hooked the __x64_sys_madvise() syscall, >>> which is error-injectable, to blacklist unwanted services. While this >>> worked, it was error-prone and ineffective for services needing always mode, >>> as modifying their code to use madvise was impractical. >>> >>> To achieve finer-grained control, we introduced an fmod_ret-based solution. >>> Now, we dynamically adjust THP settings per service by hooking >>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to >>> enable or disable on a per-service basis without global impact. >> >> hugepage_global_*() are whole system knobs. How did you use it to >> achieve per-service control? In terms of per-service, does it mean >> you need per-memcg group (I assume each service has its own memcg) THP >> configuration? > > With this new BPF hook, we can manage THP behavior either per-service > or per-memory. > In our use case, we’ve chosen memcg-based control for finer-grained > management. Below is a simplified example of our implementation: > > struct{ > __uint(type, BPF_MAP_TYPE_HASH); > __uint(max_entries, 4096); /* usually there won't too > many cgroups */ > __type(key, u64); > __type(value, u32); > __uint(map_flags, BPF_F_NO_PREALLOC); > } thp_whitelist SEC(".maps"); > > SEC("fmod_ret/mm_bpf_thp_vma_allowable") > int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma) > { > struct cgroup_subsys_state *css; > struct css_set *cgroups; > struct mm_struct *mm; > struct cgroup *cgroup; > struct cgroup *parent; > struct task_struct *p; > u64 cgrp_id; > > if (!vma) > return 0; > > mm = vma->vm_mm; > if (!mm) > return 0; > > p = mm->owner; > cgroups = p->cgroups; > cgroup = cgroups->subsys[memory_cgrp_id]->cgroup; > cgrp_id = cgroup->kn->id; > > /* Allow the tasks in the thp_whiltelist to use THP. */ > if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id)) > return 1; > return 0; > } > > I chose not to include this in the self-tests to avoid the complexity > of setting up cgroups for testing purposes. However, in patch #4 of > this series, I've included a simpler example demonstrating task-level > control. For task-level control, why not using prctl(PR_SET_THP_DISABLE)? > For service-level control, we could potentially utilize BPF task local > storage as an alternative approach. +cgroup people For service-level control, there was a proposal of adding cgroup based THP control[1]. You might need a strong use case to convince people. [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/ -- Best Regards, Yan, Zi