On Thu, Jul 24, 2025 at 10:30 PM Alexander Krabler <Alexander.Krabler@xxxxxxxx> wrote:
>
> Hi all,
>
> some of our realtime tasks get delayed from time to time due to activity of kcompactd0.
> Out of nowhere, realtime tasks go into uninterruptible sleep for some time.
> This delay can be as much as 1.1ms, which is not acceptable for us.
>
> Our hardware is an aarch64-based SoC with 8 A72 cores, kernel is 6.12.17 with PREEMPT_RT.
> We have CONFIG_COMPACTION and CONFIG_MIGRATION enabled.
>
> Here are some snippets from ftrace:
> kcompactd0-88 [001] 13112.100041: mm_compaction_begin: zone_start=0x80000 migrate_pfn=0x80000 free_pfn=0xffe00 zone_end=0x100000, mode=sync
> ...
> kcompactd0-88 [001] 13112.159782: mm_compaction_isolate_migratepages: range=(0x85800 ~ 0x85841) nr_scanned=65 nr_taken=32
> kcompactd0-88 [001] 13112.159810: mm_compaction_isolate_freepages: range=(0xddc40 ~ 0xddc48) nr_scanned=8 nr_taken=8
> kcompactd0-88 [001] 13112.160002: irq_handler_entry: irq=11 name=arch_timer
> kcompactd0-88 [001] 13112.160012: irq_handler_exit: irq=11 ret=handled
> kcompactd0-88 [001] 13112.160121: mm_compaction_migratepages: nr_migrated=32 nr_failed=0
> kcompactd0-88 [001] 13112.160122: mm_compaction_finished: node=0 zone=DMA order=-1 ret=continue
> kcompactd0-88 [001] 13112.160185: mm_compaction_isolate_migratepages: range=(0x85841 ~ 0x85a00) nr_scanned=447 nr_taken=166
> kcompactd0-88 [001] 13112.160204: mm_compaction_isolate_freepages: range=(0xddc48 ~ 0xddd80) nr_scanned=312 nr_taken=196
> tRealtime-16499 [004] 13112.160511: sched_switch: tRealtime:16499 [25] D ==> tKRC:16479 [39]
> tRealtime-16499 [004] 13112.160512: kernel_stack: <stack trace >
> => __schedule (ffffcde843022d6c)
> => schedule (ffffcde843023464)
> => io_schedule (ffffcde8430235ec)
> => migration_entry_wait_on_locked (ffffcde8424a1ad8)
> => migration_entry_wait (ffffcde84254c400)
> => do_swap_page (ffffcde8424f7fac)
> => __handle_mm_fault (ffffcde8424f8b64)
> => handle_mm_fault (ffffcde8424f9bc0)
> => do_page_fault (ffffcde843030380)
> => do_translation_fault (ffffcde84303072c)
> => do_mem_abort (ffffcde84222f674)
> => el0_ia (ffffcde84301eb20)
> => el0t_64_sync_handler (ffffcde84301f020)
> => el0t_64_sync (ffffcde842211514)
> kcompactd0-88 [001] 13112.160557: sched_pi_setprio: comm=kcompactd0 pid=88 oldprio=39 newprio=120
> kcompactd0-88 [001] 13112.160569: sched_waking: comm=tKRC pid=16479 prio=39 target_cpu=004
> kcompactd0-88 [001] 13112.160986: sched_waking: comm=tKRC pid=16479 prio=39 target_cpu=004
> kcompactd0-88 [001] 13112.161412: sched_waking: comm=tOther pid=16520 prio=40 target_cpu=004
> kcompactd0-88 [001] 13112.161457: sched_pi_setprio: comm=kcompactd0 pid=88 oldprio=40 newprio=120
> kcompactd0-88 [001] 13112.161465: sched_waking: comm=tOther pid=16520 prio=40 target_cpu=004
> kcompactd0-88 [001] 13112.161654: sched_waking: comm=tRealtime pid=16499 prio=25 target_cpu=004
>
> In our setup kcompactd0 gets enough CPU time (on core 1), however, it seems strange that it doesn't get the priority inherited from blocked realtime tasks.
> (It does for short amounts of time, which seems to be due to the locks inside migration_entry_wait_on_locked.)
>
> Is there anything we can do here?
>
> Thanks,
> Alexander

Yes, we have (likely) seen this issue too, in a !CONFIG_PREEMPT setting.

The basic problem is that the calling thread (kcompactd, or any thread that goes into direct compaction) creates a resource that other threads may have to wait on until the migration is done: the migration PTEs.
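To illustrate the two kinds of blocking involved, here is a rough userspace analogy (purely illustrative, not kernel code; all names are made up). A waiter on a PI-aware lock lends its priority to the owner, while a waiter on a plain wait queue / condition has no owner the scheduler could boost, which is essentially the situation the faulting realtime task is in:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pi_lock;                 /* analogue of an rtmutex       */
static pthread_mutex_t plain_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  migration_done = PTHREAD_COND_INITIALIZER;
static int entries_installed;                   /* analogue of migration PTEs   */

static void *low_prio_compactor(void *arg)
{
        /* "Installs migration entries": just flips a flag.  No lock is
         * held while the work is in flight, so nobody can inherit from us. */
        pthread_mutex_lock(&plain_lock);
        entries_installed = 1;
        pthread_mutex_unlock(&plain_lock);

        /* ... long copy / remap phase would happen here ... */

        pthread_mutex_lock(&plain_lock);
        entries_installed = 0;
        pthread_cond_broadcast(&migration_done);
        pthread_mutex_unlock(&plain_lock);
        return NULL;
}

static void *rt_waiter(void *arg)
{
        /* Case 1: blocking on a PI mutex would boost its owner.           */
        pthread_mutex_lock(&pi_lock);
        pthread_mutex_unlock(&pi_lock);

        /* Case 2: waiting for the "migration entry" to go away.  The
         * condition has no owner, so the compactor keeps its low priority
         * no matter how urgent we are.                                     */
        pthread_mutex_lock(&plain_lock);
        while (entries_installed)
                pthread_cond_wait(&migration_done, &plain_lock);
        pthread_mutex_unlock(&plain_lock);
        return NULL;
}

int main(void)
{
        pthread_mutexattr_t attr;
        pthread_t a, b;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&pi_lock, &attr);

        pthread_create(&a, NULL, low_prio_compactor, NULL);
        pthread_create(&b, NULL, rt_waiter, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("done");
        return 0;
}

(Build with -lpthread. To actually see the inversion one would pin everything to one CPU, give the waiter SCHED_FIFO via pthread_setschedparam(), and add a medium-priority CPU hog: the condvar waiter has no way to speed up the low-priority thread, while a waiter on the PI mutex would boost it.)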
Since a migration PTE is not a lock that is held by the thread doing the migration, there is no priority inheritance in the realtime case, and priority inversion can happen.

This issue has always been there, but it has been made more prominent with batch migration. With batch migration, all migration PTEs are set up in the first step, followed by a TLB flush, and then the copy / new map setup is done. So the migration PTEs stick around for longer, and the chance that other threads block on them is higher.

For the !CONFIG_PREEMPT case, the cond_resched() in the loop can also cause the thread creating the migration PTEs to be descheduled while a number of migration PTEs are in place, so there is a similar chance of priority inversion.

Not sure what the right thing to do would be. Either explicitly boost the priority of a thread temporarily during migrate_pages_batch (rough sketch below), or mitigate the issue by dealing with 'busy' pages more quickly in migrate_pages_batch.

- Frank
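Purely to make the "explicitly boost" option concrete, a rough and untested sketch of what a temporary boost bracket around the batch work might look like. The use of sched_set_fifo()/sched_set_normal() and the placement are my assumptions, not a proposed patch; the actual call site, the priority to pick, and how to restore the caller's original policy are exactly the open questions:

/* Untested sketch only -- not a proposed patch.  The idea is to give the
 * thread that installs the migration PTEs an elevated priority for as
 * long as those PTEs can block other (possibly realtime) tasks. */
#include <linux/sched.h>
#include <linux/sched/rt.h>

static void migrate_boost_begin(bool *boosted)
{
	*boosted = false;
	if (!rt_task(current)) {		/* don't touch already-RT callers */
		sched_set_fifo(current);	/* priority choice is the open question */
		*boosted = true;
	}
}

static void migrate_boost_end(bool boosted)
{
	if (boosted)
		sched_set_normal(current, 0);	/* really should restore the original nice value */
}

/*
 * Intended use, around the part of migrate_pages_batch() that installs the
 * migration PTEs and does the copy / remap (arguments elided):
 *
 *	bool boosted;
 *
 *	migrate_boost_begin(&boosted);
 *	... install migration entries, flush TLB, copy, remap ...
 *	migrate_boost_end(boosted);
 */

This only papers over the missing inheritance, of course; whether a plain priority bump or giving the migration PTEs a real owner (so PI can work) is the better long-term answer is still open.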