On 30 Apr 2025, at 10:38, Yafang Shao wrote:

> On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>
>> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
>>
>>> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@xxxxxxxxxx> wrote:
>>>>
>>>> Hi Yafang,
>>>>
>>>> We recently added a new THP entry in the MAINTAINERS file[1], do you mind
>>>> ccing the people listed there in your next version? (I added them here)
>>>>
>>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
>>>
>>> Thanks for the reminder.
>>> I will add the maintainers and reviewers in the next version.
>>>
>>>>
>>>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
>>>>> In our container environment, we aim to enable THP selectively—allowing
>>>>> specific services to use it while restricting others. This approach is
>>>>> driven by the following considerations:
>>>>>
>>>>> 1. Memory Fragmentation
>>>>>    THP can lead to increased memory fragmentation, so we want to limit its
>>>>>    use across services.
>>>>> 2. Performance Impact
>>>>>    Some services see no benefit from THP, making its usage unnecessary.
>>>>> 3. Performance Gains
>>>>>    Certain workloads, such as machine learning services, see significant
>>>>>    performance improvements with THP, so we enable it for them specifically.
>>>>>
>>>>> Since multiple services run on a single host in a containerized environment,
>>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
>>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
>>>>> approach had a limitation:
>>>>>
>>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
>>>>>   third-party libraries, bypassing our restrictions.
>>>>
>>>> Basically, you want more precise control of THP enablement and the
>>>> ability to override madvise() from userspace.
>>>>
>>>> In terms of overriding madvise(), do you have any concrete examples of
>>>> these third-party libraries? madvise() users are supposed to know what
>>>> they are doing, so I wonder why they are causing trouble in your
>>>> environment.
>>>
>>> To my knowledge, jemalloc [0] supports THP.
>>> Applications using jemalloc typically rely on its default
>>> configuration rather than explicitly enabling or disabling THP. If
>>> the system is configured with THP=madvise, these applications may
>>> automatically leverage THP where appropriate.
>>>
>>> [0] https://github.com/jemalloc/jemalloc
>>
>> It sounds like a userspace issue. For jemalloc, if applications require
>> it, can't you replace jemalloc with one compiled with --disable-thp
>> to work around the issue?
>
> That’s not the issue this patchset is trying to address or work
> around. I believe we should focus on the actual problem it's meant to
> solve.
>
> By the way, you might not raise this question if you were managing a
> large fleet of servers. We're a platform provider, but we don’t
> maintain all the packages ourselves. Users make their own choices
> based on their specific requirements. It's not feasible for us to
> develop and maintain every package.

Basically, the user wants to use THP, but as a service provider you think
differently, so you want to override the userspace choice. Am I getting it
right?

>
>>
>>>
>>>>
>>>>>
>>>>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
>>>>> which is error-injectable, to blacklist unwanted services. While this
>>>>> worked, it was error-prone and ineffective for services needing always mode,
>>>>> as modifying their code to use madvise was impractical.
>>>>>
>>>>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
>>>>> Now, we dynamically adjust THP settings per service by hooking
>>>>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
>>>>> enabled or disabled on a per-service basis without global impact.
>>>>
>>>> hugepage_global_*() are whole-system knobs. How did you use them to
>>>> achieve per-service control? In terms of per-service, does it mean
>>>> you need per-memcg (I assume each service has its own memcg) THP
>>>> configuration?
>>>
>>> With this new BPF hook, we can manage THP behavior either per-service
>>> or per-memcg.
>>> In our use case, we’ve chosen memcg-based control for finer-grained
>>> management. Below is a simplified example of our implementation:
>>>
>>> struct {
>>>         __uint(type, BPF_MAP_TYPE_HASH);
>>>         __uint(max_entries, 4096); /* usually there won't be too many cgroups */
>>>         __type(key, u64);
>>>         __type(value, u32);
>>>         __uint(map_flags, BPF_F_NO_PREALLOC);
>>> } thp_whitelist SEC(".maps");
>>>
>>> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
>>> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
>>> {
>>>         struct cgroup_subsys_state *css;
>>>         struct css_set *cgroups;
>>>         struct mm_struct *mm;
>>>         struct cgroup *cgroup;
>>>         struct cgroup *parent;
>>>         struct task_struct *p;
>>>         u64 cgrp_id;
>>>
>>>         if (!vma)
>>>                 return 0;
>>>
>>>         mm = vma->vm_mm;
>>>         if (!mm)
>>>                 return 0;
>>>
>>>         p = mm->owner;
>>>         cgroups = p->cgroups;
>>>         cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
>>>         cgrp_id = cgroup->kn->id;
>>>
>>>         /* Allow the tasks in the thp_whitelist to use THP. */
>>>         if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
>>>                 return 1;
>>>         return 0;
>>> }
>>>
>>> I chose not to include this in the self-tests to avoid the complexity
>>> of setting up cgroups for testing purposes. However, in patch #4 of
>>> this series, I've included a simpler example demonstrating task-level
>>> control.
>>
>> For task-level control, why not use prctl(PR_SET_THP_DISABLE)?
>
> You’ll need to modify the user-space code—and again, this likely
> wouldn’t be a concern if you were managing a large fleet of servers.
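Just to illustrate what I had in mind: the prctl() route would not require
patching every service, only controlling how it is launched. Below is a
minimal, hypothetical wrapper sketch (my own example, not part of this
patchset); per prctl(2), PR_SET_THP_DISABLE is inherited across fork() and
preserved across execve(), so the exec'ed service and its children would run
with THP disabled:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

/*
 * Hypothetical launcher: opt this process out of THP, then exec the real
 * service binary so it inherits the setting.
 */
int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <service> [args...]\n", argv[0]);
		return 1;
	}

	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0)) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return 1;
	}

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

I do realize this only covers opting services out, not forcing always-style
behavior for the ones that benefit from THP.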
>
>>
>>> For service-level control, we could potentially utilize BPF task local
>>> storage as an alternative approach.
>>
>> +cgroup people
>>
>> For service-level control, there was a proposal to add cgroup-based
>> THP control[1]. You might need a strong use case to convince people.
>>
>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@xxxxxxxxxxxxxxxxxxx/
>
> Thanks for the reference. I've reviewed the related discussion, and if
> I understand correctly, the proposal was rejected by the maintainers.

I wonder why your approach is better than the cgroup-based THP control
proposal.

--
Best Regards,
Yan, Zi