Hi Chenyu

On 7/1/25 09:36, Chen, Yu C wrote:
> Hi Libo,
>
> On 7/1/2025 4:25 PM, Libo Chen wrote:
>> Hi Chenyu,
>>
>> Thanks for the patch. See my comments below.
>>
>> On 6/25/25 03:23, Chen Yu wrote:
>>> [Problem Statement]
>>> Currently, NUMA balancing is configured system-wide.
>>> However, in some production environments, different
>>> cgroups may have varying requirements for NUMA balancing.
>>> Some cgroups are CPU-intensive, while others are
>>> memory-intensive. Some do not benefit from NUMA balancing
>>> due to the overhead associated with VMA scanning, while
>>> others prefer NUMA balancing as it helps improve memory
>>> locality. In this case, system-wide NUMA balancing is
>>> usually disabled to avoid causing regressions.
>>>
>>> [Proposal]
>>> Introduce a per-cgroup interface to enable NUMA balancing
>>> for specific cgroups. This interface is associated with
>>> the CPU subsystem, which does not support threaded subtrees,
>>> and close to CPU bandwidth control. The system administrator
>>> needs to set the NUMA balancing mode to
>>> NUMA_BALANCING_CGROUP=4 to enable this feature. When the
>>> system is in NUMA_BALANCING_CGROUP mode, NUMA balancing
>>> for all cgroups is disabled by default. After the
>>> administrator enables this feature for a specific cgroup,
>>> NUMA balancing for that cgroup is enabled.
>>>
>>> A simple example to show how to use per-cgroup NUMA balancing:
>>>
>>> Step1
>>> //switch on per cgroup Numa balancing.
>>> //All cgroup's NUMA balance is disabled by default.
>>> echo 4 > /proc/sys/kernel/numa_balancing
>>>
>>> Step2
>>> //created a cgroup named mytest, enable its NUMA balancing
>>> echo 1 > /sys/fs/cgroup/mytest/cpu.numa_load_balance
>>>
>>> [Benchmark]
>>> Tested on Xeon Sapphire Rapids, with 4 Numa nodes. Created
>>> a cgroup mytest and launched autonumabench NUMA01_THREADLOCAL.
>>> Each test runs 6 cycles.
>>>
>>> baseline:
>>> 29360.56user 16280.68system 3:33.29elapsed
>>> 29788.41user 16060.31system 3:34.38elapsed
>>> 28307.51user 17043.45system 3:33.03elapsed
>>> 29552.49user 16307.65system 3:34.20elapsed
>>> 29847.41user 15966.15system 3:34.65elapsed
>>> 29111.10user 16532.78system 3:33.19elapsed
>>>
>>> per cgroup NUMA balance:
>>> 7589.78user 16494.90system 1:53.18elapsed
>>> 7795.54user 16537.65system 1:54.11elapsed
>>> 8295.66user 16391.21system 1:55.98elapsed
>>> 7836.34user 17312.31system 1:55.71elapsed
>>> 7773.26user 16856.19system 1:54.08elapsed
>>> 7534.43user 17604.58system 1:55.01elapsed
>>>
>>> The user time has been reduced to 33% of the
>>> original, and the elapsed time has dropped to
>>> 45% of the original (lower values are better).
>>>
>>> cat /sys/fs/cgroup/mytest/memory.stat | grep numa
>>> numa_pages_migrated 10238503
>>> numa_pte_updates 24378124
>>> numa_hint_faults 16921590
>>> numa_task_migrated 253
>>> numa_task_swapped 4
>>>
>>> to-do:
>>> Per-cgroup NUMA balancing should consider the
>>> hierarchy of the cgroup. Initially, NUMA balancing
>>> is disabled for the root cgroup. A cgroup that has
>>> NUMA balancing enabled should have all its parents
>>> enabled. For example, suppose cgroup A is the parent
>>> of cgroup B; if A.numa_load_balance is 0, even if
>>> B.numa_load_balance is 1, NUMA balancing for B is
>>> disabled. This idea is derived from
>>> commit e39925734909 ("mm/memcontrol: respect
>>> zswap.writeback setting from parent cgroup too").
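
For the to-do item, a plain walk up the task_group tree looks sufficient
to me, similar in spirit to the zswap.writeback commit you reference.
A rough, untested sketch of what I mean (tg_numa_balance_effective() is
just a name I made up; it only relies on the existing tg->parent pointer
and the nlb_enabled field added by this patch):

static bool tg_numa_balance_effective(struct task_group *tg)
{
        /*
         * Honor the setting only if the group and every ancestor have
         * numa_load_balance set, so the root group acts as a global gate.
         */
        for (; tg; tg = tg->parent)
                if (!READ_ONCE(tg->nlb_enabled))
                        return false;

        return true;
}

The walk is short, but since tg_numa_balance_enabled() sits in the NUMA
hinting fault path, caching the effective value in the task_group when
cpu.numa_load_balance is written might be worth considering.
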
>>>
>>> Suggested-by: Tim Chen <tim.c.chen@xxxxxxxxx>
>>> Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>>> ---
>>> v1->v2:
>>>
>>> Add documentation in Documentation/admin-guide/sysctl/kernel.rst.
>>>
>>> Add comments in tg_numa_balance_enabled() to explain that
>>> the newly introduced NUMA balancing mode is naturally
>>> exclusive of existing NUMA balancing modes. (Tim)
>>> ---
>>>  Documentation/admin-guide/sysctl/kernel.rst |  6 ++++
>>>  include/linux/sched/sysctl.h                |  1 +
>>>  kernel/sched/core.c                         | 31 +++++++++++++++++++++
>>>  kernel/sched/fair.c                         | 28 +++++++++++++++++++
>>>  kernel/sched/sched.h                        |  3 ++
>>>  mm/mprotect.c                               |  5 ++--
>>>  6 files changed, 72 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>> index dd49a89a62d3..ff88d1153c19 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -709,6 +709,7 @@ The value to set can be the result of ORing the following:
>>>  0 NUMA_BALANCING_DISABLED
>>>  1 NUMA_BALANCING_NORMAL
>>>  2 NUMA_BALANCING_MEMORY_TIERING
>>> +4 NUMA_BALANCING_CGROUP
>>>  = =================================
>>>  Or NUMA_BALANCING_NORMAL to optimize page placement among different
>>> @@ -729,6 +730,11 @@ different types of memory (represented as different NUMA nodes) to
>>>  place the hot pages in the fast memory. This is implemented based on
>>>  unmapping and page fault too.
>>> +Or NUMA_BALANCING_CGROUP to enable the per cgroup NUMA balancing.
>>> +This new behavior can be opted-in/out on a per-cgroup basis via a new
>>> +cgroup CPU subsystem file named numa_load_balance. By default, per
>>> +cgroup NUMA balancing for each cgroup is enabled.
>>> +
>>>  numa_balancing_promote_rate_limit_MBps
>>>  ======================================
>>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>>> index 5a64582b086b..1e4d5a9ddb26 100644
>>> --- a/include/linux/sched/sysctl.h
>>> +++ b/include/linux/sched/sysctl.h
>>> @@ -22,6 +22,7 @@ enum sched_tunable_scaling {
>>>  #define NUMA_BALANCING_DISABLED         0x0
>>>  #define NUMA_BALANCING_NORMAL           0x1
>>>  #define NUMA_BALANCING_MEMORY_TIERING   0x2
>>> +#define NUMA_BALANCING_CGROUP           0x4
>>>  #ifdef CONFIG_NUMA_BALANCING
>>>  extern int sysctl_numa_balancing_mode;
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 8988d38d46a3..8e9aa59193df 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -10078,6 +10078,30 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
>>>  }
>>>  #endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +static int numa_balance_write_u64(struct cgroup_subsys_state *css,
>>> +                                  struct cftype *cftype, u64 enable)
>>> +{
>>> +        struct task_group *tg;
>>> +        bool was_enabled;
>>> +
>>> +        tg = css_tg(css);
>>> +        was_enabled = READ_ONCE(tg->nlb_enabled);
>>> +        if (was_enabled == enable)
>>> +                return 0;
>>> +
>>> +        WRITE_ONCE(tg->nlb_enabled, enable);
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +static u64 numa_balance_read_u64(struct cgroup_subsys_state *css,
>>> +                                 struct cftype *cft)
>>> +{
>>> +        return READ_ONCE(css_tg(css)->nlb_enabled);
>>> +}
>>> +#endif /* CONFIG_NUMA_BALANCING */
>>> +
>>>  static struct cftype cpu_files[] = {
>>>  #ifdef CONFIG_GROUP_SCHED_WEIGHT
>>>          {
>>> @@ -10126,6 +10150,13 @@ static struct cftype cpu_files[] = {
>>>                  .seq_show = cpu_uclamp_max_show,
>>>                  .write = cpu_uclamp_max_write,
>>>          },
>>> +#endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +        {
>>> +                .name = "numa_load_balance",
>>> +                .read_u64 = numa_balance_read_u64,
>>> +                .write_u64 = numa_balance_write_u64,
>>> +        },
>>>  #endif
>>>          { } /* terminate */
>>>  };
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7a14da5396fb..dcdee3bf9960 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3161,6 +3161,29 @@ void task_numa_free(struct task_struct *p, bool final)
>>>          }
>>>  }
>>> +/*
>>> + * Return true if the NUMA balance is allowed for
>>> + * the task in a task group.
>>> + */
>>> +static bool tg_numa_balance_enabled(struct task_struct *p)
>>> +{
>>> +        /*
>>> +         * The min/max of sysctl_numa_balancing_mode ranges
>>> +         * from SYSCTL_ONE to SYSCTL_FOUR, so it is safe to
>>> +         * only check NUMA_BALANCING_CGROUP because it is
>>> +         * impossible to have both NUMA_BALANCING_CGROUP and
>>> +         * NUMA_BALANCING_NORMAL/NUMA_BALANCING_MEMORY_TIERING
>>> +         * set.
>>> +         */
>>> +        struct task_group *tg = task_group(p);
>>> +
>>> +        if (tg && (sysctl_numa_balancing_mode & NUMA_BALANCING_CGROUP) &&
>>> +            !READ_ONCE(tg->nlb_enabled))
>>> +                return false;
>>> +
>>> +        return true;
>>> +}
>>> +
>>>  /*
>>>   * Got a PROT_NONE fault for a page on @node.
>>>   */
>>> @@ -3189,6 +3212,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>>              !cpupid_valid(last_cpupid)))
>>>                  return;
>>> +        if (!tg_numa_balance_enabled(p))
>>> +                return;
>>> +
>>
>> I think this one may be redundant when you already have it in task_numa_work(). Without the
>> scanning, there won't be any hinting page faults on that task, so neither do_numa_page() nor
>> do_huge_pmd_numa_page() will be called. Though it's a minor issue if tg_numa_balance_enabled(p)
>> is fast.
>>
>
> Previously I was thinking of the following sequence:
> 1. the NUMA balancing is enabled and task_numa_work() is invoked,
> pages are scanned and PROT_NONE is set.
> 2. cgroup NUMA balancing is disabled by the user
> 3. do_numa_page() is triggered and PROT_NONE is cleared.
> We don't want to do further task migration and
> task_numa_fault() bails out.(page migration is still
> allowed as we mainly want to control the behavior of
> the task)
>

Ah right, that makes sense. Does that fall under unlikely()? The timing
window seems quite small to me.

>> Overall this is good. But more generally, I am thinking something finer-grained, like per-task
>> numab control with numab tunnables at task-level (if possible), that will be so much more useful
>> at least for us. There are use cases for controlling numa balancing at task level as applications
>> tuned for NUMA (that don't want numab mess with their tsk/mem placements) such as databases can
>> be in the same cgroup with other untuned applications, or not in a cgroup at all. Right now we
>> have to turn off numab globally but that's not really optimal in a lot of cases. I do understand
>> your use cases for per-cgroup control, but I wonder if there is a way to nicely combine them.
>> Per-task control should cover per-cgroup control functionality-wise, but it's an inconvenient
>> interface as one has to set for all tasks of the same cgroup,
>
> OK. Michal has also suggested using the per-task interface
> (prctl()/sched_setattr()) for NUMA balancing instead of per-cgroup
> control. In theory, I think it is doable. In a cloud environment,
> users can set the attribute (enable NUMA balancing) for the first
> process of a cgroup, and later child processes will inherit this
> attribute. But yes, when the admin decides to change this attribute,
> each process of the cgroup has to be iterated.
>

An on/off button for the cgroup could be added to libcgroup or similar
for sysadmins, but inside the kernel each task's attribute will still
be set one by one. What I am more worried about is cgroups that
frequently have tasks join and leave; it will become very annoying to
set/unset the attribute at each entry and exit.
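
Just to make that concrete, the management side would have to look
something like the sketch below, re-run against cgroup.procs on every
change (set_task_numab() is a made-up stand-in for whatever
prctl()/sched_setattr() extension we would end up with):

#include <stdio.h>
#include <sys/types.h>

int set_task_numab(pid_t pid, int enable);      /* hypothetical per-task knob */

/* Apply the per-task setting to every task currently in a cgroup. */
static int set_cgroup_numab(const char *procs_path, int enable)
{
        /* e.g. procs_path = "/sys/fs/cgroup/mytest/cgroup.procs" */
        FILE *f = fopen(procs_path, "r");
        int pid;

        if (!f)
                return -1;

        while (fscanf(f, "%d", &pid) == 1)
                set_task_numab(pid, enable);    /* one call per member */

        fclose(f);
        return 0;       /* tasks that join afterwards are still missed */
}

Not a big deal for static setups, but for churn-heavy cgroups it is
exactly this set/unset dance that I would like to avoid.
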
>> I haven't thought too hard about
>> it yet, just want to bring it out and see if we can work out something together.
>>
>
> Sure, let me have a try on this per-task version and we can
> discuss/co-work on that.
>

Cool~ Looking forward to it.

> thanks,
> Chenyu
>
>> Thanks,
>> Libo
>>
>>>          /* Allocate buffer to track faults on a per-node basis */
>>>          if (unlikely(!p->numa_faults)) {
>>>                  int size = sizeof(*p->numa_faults) *
>>> @@ -3330,6 +3356,8 @@ static void task_numa_work(struct callback_head *work)
>>>          if (p->flags & PF_EXITING)
>>>                  return;
>>> +        if (!tg_numa_balance_enabled(p))
>>> +                return;
>>>          /*
>>>           * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>>>           * no page can be migrated.
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 475bb5998295..4b0dc656688e 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -485,6 +485,9 @@ struct task_group {
>>>          /* Effective clamp values used for a task group */
>>>          struct uclamp_se uclamp[UCLAMP_CNT];
>>>  #endif
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +        u64 nlb_enabled;
>>> +#endif
>>>  };
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 88608d0dc2c2..c288ffb92bfc 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -155,10 +155,11 @@ static long change_pte_range(struct mmu_gather *tlb,
>>>                  toptier = node_is_toptier(nid);
>>>                  /*
>>> -                 * Skip scanning top tier node if normal numa
>>> +                 * Skip scanning top tier node if normal and cgroup numa
>>>                   * balancing is disabled
>>>                   */
>>> -                if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
>>> +                if (!(sysctl_numa_balancing_mode &
>>> +                    (NUMA_BALANCING_CGROUP | NUMA_BALANCING_NORMAL)) &&
>>>                      toptier)
>>>                          continue;
>>>                  if (folio_use_access_time(folio))
>>