On 5/6/2025 5:57 AM, Libo Chen wrote:
On 5/5/25 14:32, Libo Chen wrote:
On 5/5/25 11:49, Libo Chen wrote:
On 5/5/25 11:27, Chen, Yu C wrote:
Hi Michal,
On 5/6/2025 1:46 AM, Michal Koutný wrote:
On Mon, May 05, 2025 at 11:03:10PM +0800, "Chen, Yu C" <yu.c.chen@xxxxxxxxx> wrote:
According to this address:

4c 8b af 50 09 00 00	mov 0x950(%rdi),%r13	<--- r13 = p->mm
49 8b bd 98 04 00 00	mov 0x498(%r13),%rdi	<--- %rdi = p->mm->owner

it seems that the task to be swapped has a NULL mm_struct.
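For context, the two loads correspond to something like the following
(a sketch of the dereference chain, not the real kernel source;
count_memcg_event_mm() looks up the memcg via mm->owner):

	struct mm_struct *mm = p->mm;           /* 0x950(%rdi) -> %r13, NULL for a kthread */
	struct task_struct *owner = mm->owner;  /* 0x498(%r13) -> %rdi, oopses when mm == NULL */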
So it's likely a kernel thread. Does it make sense to NUMA balance
those? (I naïvely think it doesn't, please correct me.) ...
I agree kernel threads are not supposed to be covered by
NUMA balancing, because currently NUMA balancing only considers
user pages via VMAs. One question below:
static void __migrate_swap_task(struct task_struct *p, int cpu)
{
	__schedstat_inc(p->stats.numa_task_swapped);
-	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
+	if (p->mm)
+		count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
... proper fix should likely guard this earlier, like the guard in
task_numa_fault() but for the other swapped task.
I see. Task swapping in task_numa_compare() is triggered
when there are no idle CPUs in task A's preferred node.
In this case, we choose a task B on A's preferred node
and swap B with A. This improves A's NUMA locality
without introducing load imbalance between nodes.
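Roughly, the flow is (a sketch of the call chain, not the actual
fair.c source):

	/*
	 * task_numa_migrate(p):
	 *   no idle CPU on p's preferred node, so
	 *   task_numa_find_cpu(): for each CPU in the preferred node
	 *     task_numa_compare(): cur = dst_rq->curr;
	 *       if swapping p and cur improves the combined NUMA score
	 *       without unbalancing load, remember cur as best_task
	 *   then migrate_swap(p, best_task)
	 */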
Hi Chenyu
There are two problems here:

1. Many kthreads are pinned, so despite all the effort in
task_numa_compare() and task_numa_find_cpu(), the swap may not end up
happening. I only see a check on the source task,
cpumask_test_cpu(cpu, env->p->cpus_ptr), but not on the dst task.

NVM, I was blind. There is a check on the dst task in task_numa_compare().

2. Assuming B is migratable, the swap can potentially make B worse,
right? I think some kthreads are quite cache-sensitive, and we swap as
if their locality doesn't matter.
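(For reference, the dst-side check is roughly this, quoting
task_numa_compare() from memory:)

	/* The swap would put cur on the source CPU, so the candidate
	 * is skipped when its affinity does not allow that.
	 */
	if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr))
		goto unlock;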
This makes sense. I wonder if it could be extended beyond kthreads.
We don't want to swap with a task B that has no explicit NUMA
preference, do we?
Ideally we probably just want to stay off kthreads; if we cannot find
any other p->mm tasks, just don't swap (?). That sounds like a
brand-new patch though.
A change as simple as this should work:
@@ -2492,7 +2492,7 @@ static bool task_numa_compare(struct task_numa_env *env,

	rcu_read_lock();
	cur = rcu_dereference(dst_rq->curr);
-	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
+	if (cur && ((cur->flags & PF_EXITING) || !cur->mm || is_idle_task(cur)))
		cur = NULL;

Libo

something like

	if (cur && ((cur->flags & PF_EXITING) ||
		    cur->numa_preferred_nid == NUMA_NO_NODE ||
		    !cur->numa_faults || is_idle_task(cur)))
		cur = NULL;

But overall it looks good to me. Would you like to post this as a
formal patch, or do you want me to fold your change into a patch set?

thanks,
Chenyu
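For illustration, folding the !cur->mm check and the NUMA-preference
conditions into a single guard might look like this (just a sketch of
a possible combined patch, not something posted in this thread):

-	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
+	if (cur && ((cur->flags & PF_EXITING) || !cur->mm ||
+		    cur->numa_preferred_nid == NUMA_NO_NODE ||
+		    !cur->numa_faults || is_idle_task(cur)))
 		cur = NULL;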
But B's NUMA node preference is not mandatory in the current
implementation IIUC, because it is mainly B's load that is considered.

hmm, that doesn't seem right, can't we choose a B that is not a
kthread from A's preferred node?

That is to say, is it legit to swap a NUMA-sensitive task A with a
non-NUMA-sensitive kernel thread B? If not, I think we can add a
kernel thread check in task swap like the guard in
task_tick_numa()/task_numa_fault().
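For reference, the existing guards look roughly like this (quoted from
memory of kernel/sched/fair.c):

	/* task_tick_numa(): kthreads are skipped up front */
	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
		return;

	/* task_numa_fault(): for example, ksmd faulting in a user's mm */
	if (!p->mm)
		return;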
thanks,
Chenyu
Michal