On 5/5/25 22:36, Chen, Yu C wrote:
> On 5/6/2025 5:57 AM, Libo Chen wrote:
>>
>> On 5/5/25 14:32, Libo Chen wrote:
>>>
>>> On 5/5/25 11:49, Libo Chen wrote:
>>>>
>>>> On 5/5/25 11:27, Chen, Yu C wrote:
>>>>> Hi Michal,
>>>>>
>>>>> On 5/6/2025 1:46 AM, Michal Koutný wrote:
>>>>>> On Mon, May 05, 2025 at 11:03:10PM +0800, "Chen, Yu C" <yu.c.chen@xxxxxxxxx> wrote:
>>>>>>> According to this address,
>>>>>>> 4c 8b af 50 09 00 00	mov 0x950(%rdi),%r13	<--- r13 = p->mm;
>>>>>>> 49 8b bd 98 04 00 00	mov 0x498(%r13),%rdi	<--- p->mm->owner
>>>>>>> It seems that the task to be swapped has a NULL mm_struct.
>>>>>>
>>>>>> So it's likely a kernel thread. Does it make sense to NUMA balance
>>>>>> those? (I naïvely think it doesn't, please correct me.) ...
>>>>>>
>>>>> I agree kernel threads are not supposed to be covered by
>>>>> NUMA balancing, because currently NUMA balancing only considers
>>>>> user pages via VMAs. And one question below:
>>>>>
>>>>>>>  static void __migrate_swap_task(struct task_struct *p, int cpu)
>>>>>>>  {
>>>>>>>  	__schedstat_inc(p->stats.numa_task_swapped);
>>>>>>> -	count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>>>>>>> +	if (p->mm)
>>>>>>> +		count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>>>>>>
>>>>>> ... proper fix should likely guard this earlier, like the guard in
>>>>>> task_numa_fault() but for the other swapped task.
>>>>>
>>>>> I see. For task swapping in task_numa_compare(),
>>>>> it is triggered when there are no idle CPUs in task A's
>>>>> preferred node.
>>>>> In this case, we choose a task B on A's preferred node,
>>>>> and swap B with A. This helps improve A's NUMA locality
>>>>> without introducing load imbalance between nodes.
>>>>>
>>> Hi Chenyu,
>>>
>>> There are two problems here:
>>> 1. Many kthreads are pinned, so despite all the efforts in task_numa_compare()
>>> and task_numa_find_cpu(), the swap may not end up happening. I only see a
>>> check on the source task, cpumask_test_cpu(cpu, env->p->cpus_ptr), but not on the dst task.
>>
>> NVM, I was blind. There is a check on the dst task in task_numa_compare().
>>
>>> 2. Assuming B is migratable, that can potentially make B worse, right? I think
>>> some kthreads are quite cache-sensitive, and we swap as if their locality doesn't
>>> matter.
>
> This makes sense. I wonder if it could be extended beyond kthreads.
> We don't want to swap a task B that has no explicit NUMA preference,
> do we?

I agree, at least that should be the default behavior.

>>> Ideally we probably just want to stay off kthreads; if we cannot find any other
>>> tasks with p->mm set, just don't swap(?). That sounds like a brand-new patch though.
>>>
>> A change as simple as this should work:
>>
>> @@ -2492,7 +2492,7 @@ static bool task_numa_compare(struct task_numa_env *env,
>>
>>  	rcu_read_lock();
>>  	cur = rcu_dereference(dst_rq->curr);
>> -	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
>> +	if (cur && ((cur->flags & PF_EXITING) || !cur->mm || is_idle_task(cur)))
>
> something like
> 	if (cur && ((cur->flags & PF_EXITING) ||
> 		    cur->numa_preferred_nid == NUMA_NO_NODE ||
> 		    !cur->numa_faults || is_idle_task(cur)))

This implicitly skips kthreads, so it probably needs a comment. Otherwise LGTM.

> But overall it looks good to me, would you like to post this as a
> formal patch, or do you want me to fold your change into a patch set?

You can fold it into one set.
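For the comment above, here is an untested sketch of what I have in mind
(the comment wording is just my suggestion, feel free to reword it when
you fold the change in):

	/*
	 * Skip the destination task if it is exiting, is the idle task,
	 * or has no NUMA preference/fault history. Kernel threads have
	 * no user mm, so NUMA balancing never scans them: their
	 * numa_preferred_nid stays NUMA_NO_NODE and numa_faults stays
	 * NULL. This check therefore implicitly keeps kthreads out of
	 * task swapping as well.
	 */
	if (cur && ((cur->flags & PF_EXITING) ||
		    cur->numa_preferred_nid == NUMA_NO_NODE ||
		    !cur->numa_faults || is_idle_task(cur)))
		cur = NULL;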
Thanks,
Libo

> thanks,
> Chenyu
>
>> 		cur = NULL;
>>
>>>
>>> Libo
>>>>> But B's NUMA node preference is not mandatory in
>>>>> the current implementation IIUC, because B's load is mainly
>>>>
>>>> hmm, that doesn't seem right, can we choose a B that
>>>> is not a kthread from A's preferred node?
>>>>
>>>>> considered. That is to say, is it legit to swap a
>>>>> NUMA-sensitive task A with a non-NUMA-sensitive kernel
>>>>> thread B? If not, I think we can add a kernel thread
>>>>> check in task swapping, like the guard in
>>>>> task_tick_numa()/task_numa_fault().
>>>>>
>>>>> thanks,
>>>>> Chenyu
>>>>>
>>>>>> Michal