Hi Chen Yu,

On 4/30/25 03:36, Chen Yu wrote:
> On systems with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
>
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched,
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
>
> The following two new fields:
>
>   numa_task_migrated
>   numa_task_swapped
>
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>

Both stats show up in the expected places, but I notice they are also
in /proc/vmstat and are always 0 there. I think you may have to add
count_vm_numa_event() in migrate_task_to() and __migrate_swap_task(),
unless there is a way to not show both stats in /proc/vmstat.

> Introducing both pertask and permemcg NUMA balancing statistics helps
> to quickly evaluate the performance and resource usage of the target
> workload. For example, the user can first identify the container which
> has high NUMA balance activity and then narrow down to a specific task
> within that group, and tune the memory policy of that task.
> In summary, it is plausible to iterate the /proc/$pid/sched to find the
> offending task, but the introduction of per memcg tasks' Numa balancing
> aggregated activity can further help users identify the task in a
> divide-and-conquer way.
>
> Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Tested-by: Madadi Vineeth Reddy <vineethr@xxxxxxxxxxxxx>
> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
> ---
> v2->v3:
>   Remove unnecessary p->mm check because kernel threads are
>   not supported by Numa Balancing. (Libo Chen)
> v1->v2:
>   Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
>  include/linux/sched.h                   | 4 ++++
>  include/linux/vm_event_item.h           | 2 ++
>  kernel/sched/core.c                     | 7 +++++--
>  kernel/sched/debug.c                    | 4 ++++
>  mm/memcontrol.c                         | 2 ++
>  mm/vmstat.c                             | 2 ++
>  7 files changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 1a16ce68a4d7..d346f3235945 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1670,6 +1670,12 @@ The following nested keys are defined.
>  	  numa_hint_faults (npn)
>  		Number of NUMA hinting faults.
>
> +	  numa_task_migrated (npn)
> +		Number of task migration by NUMA balancing.
> +
> +	  numa_task_swapped (npn)
> +		Number of task swap by NUMA balancing.
> +
>  	  pgdemote_kswapd
>  		Number of pages demoted by kswapd.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..1c50e30b5c01 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -549,6 +549,10 @@ struct sched_statistics {
>  	u64 nr_failed_migrations_running;
>  	u64 nr_failed_migrations_hot;
>  	u64 nr_forced_migrations;
> +#ifdef CONFIG_NUMA_BALANCING
> +	u64 numa_task_migrated;
> +	u64 numa_task_swapped;
> +#endif
>

This one is more of a personal preference. I understand they show up
only if you turn on schedstats, but would it be better to put them in
sched_show_numa() so they are printed next to the other NUMA stats such
as numa_pages_migrated?

@@ -1153,6 +1153,10 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	if (p->mm)
 		P(mm->numa_scan_seq);
 
+	if (schedstat_enabled()) {
+		P_SCHEDSTAT(numa_task_migrated);
+		P_SCHEDSTAT(numa_task_swapped);
+	}
 	P(numa_pages_migrated);
 	P(numa_preferred_nid);
 	P(total_numa_faults);

Thanks,
Libo

>  	u64 nr_wakeups;
>  	u64 nr_wakeups_sync;
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 9e15a088ba38..91a3ce9a2687 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		NUMA_HINT_FAULTS,
>  		NUMA_HINT_FAULTS_LOCAL,
>  		NUMA_PAGE_MIGRATE,
> +		NUMA_TASK_MIGRATE,
> +		NUMA_TASK_SWAP,
>  #endif
>  #ifdef CONFIG_MIGRATION
>  		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba0..25a92f2abda4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3352,6 +3352,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  #ifdef CONFIG_NUMA_BALANCING
>  static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
> +	__schedstat_inc(p->stats.numa_task_swapped);
> +	count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
> +
>  	if (task_on_rq_queued(p)) {
>  		struct rq *src_rq, *dst_rq;
>  		struct rq_flags srf, drf;
> @@ -7953,8 +7956,8 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
>  	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
>  		return -EINVAL;
>
> -	/* TODO: This is not properly updating schedstats */
> -
> +	__schedstat_inc(p->stats.numa_task_migrated);
> +	count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
>  	trace_sched_move_numa(p, curr_cpu, target_cpu);
>  	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
>  }
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 56ae54e0ce6a..f971c2af7912 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>  	P_SCHEDSTAT(nr_failed_migrations_running);
>  	P_SCHEDSTAT(nr_failed_migrations_hot);
>  	P_SCHEDSTAT(nr_forced_migrations);
> +#ifdef CONFIG_NUMA_BALANCING
> +	P_SCHEDSTAT(numa_task_migrated);
> +	P_SCHEDSTAT(numa_task_swapped);
> +#endif
>  	P_SCHEDSTAT(nr_wakeups);
>  	P_SCHEDSTAT(nr_wakeups_sync);
>  	P_SCHEDSTAT(nr_wakeups_migrate);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c96c1f2b9cf5..cdaab8a957f3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
>  	NUMA_PAGE_MIGRATE,
>  	NUMA_PTE_UPDATES,
>  	NUMA_HINT_FAULTS,
> +	NUMA_TASK_MIGRATE,
> +	NUMA_TASK_SWAP,
>  #endif
>  };
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4c268ce39ff2..ed08bb384ae4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
>  	"numa_hint_faults",
>  	"numa_hint_faults_local",
>  	"numa_pages_migrated",
> +	"numa_task_migrated",
> +	"numa_task_swapped",
>  #endif
>  #ifdef CONFIG_MIGRATION
>  	"pgmigrate_success",
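
For illustration only, here is a toy userspace model of the /proc/vmstat
observation made earlier in this reply. It is not kernel code: every
toy_* name is hypothetical, and the two counter sets only stand in for
the global vmstat view and one cgroup's memory.stat view. The sketch
assumes that a memcg-only accounting call leaves the global counter
untouched, which would explain why the new names are listed in
/proc/vmstat but stay at 0.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the two new vm_event_item entries. */
enum toy_event {
	TOY_NUMA_TASK_MIGRATE,
	TOY_NUMA_TASK_SWAP,
	TOY_NR_EVENTS,
};

/* One name table is shared by both views, so both views list the names. */
const char *const toy_event_text[TOY_NR_EVENTS] = {
	"numa_task_migrated",
	"numa_task_swapped",
};

struct toy_counters {
	unsigned long count[TOY_NR_EVENTS];
};

struct toy_counters toy_global;	/* models the /proc/vmstat view */
struct toy_counters toy_memcg;	/* models one cgroup's memory.stat view */

/* Models a memcg-only accounting call: only the cgroup counter moves. */
void toy_count_memcg_event(enum toy_event ev)
{
	assert(ev < TOY_NR_EVENTS);
	toy_memcg.count[ev]++;
}

/* Models a global accounting call: only the global counter moves. */
void toy_count_vm_event(enum toy_event ev)
{
	assert(ev < TOY_NR_EVENTS);
	toy_global.count[ev]++;
}

/* Accounting as in the posted patch: memcg counter only. */
void toy_migrate_as_posted(void)
{
	toy_count_memcg_event(TOY_NUMA_TASK_MIGRATE);
}

/* Accounting with the suggested extra global call added. */
void toy_migrate_with_suggestion(void)
{
	toy_count_memcg_event(TOY_NUMA_TASK_MIGRATE);
	toy_count_vm_event(TOY_NUMA_TASK_MIGRATE);
}

/* Print one view, one "label: name value" line per event. */
void toy_show(const char *label, const struct toy_counters *c)
{
	for (int i = 0; i < TOY_NR_EVENTS; i++)
		printf("%s: %s %lu\n", label, toy_event_text[i], c->count[i]);
}
```

In this model, calling toy_migrate_as_posted() and then toy_show() on
both counter sets gives a non-zero memcg value next to a zero global
value. Only toy_migrate_with_suggestion() moves both.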