On Thu, Jul 24, 2025 at 10:30 PM Alexander Krabler <Alexander.Krabler@xxxxxxxx> wrote:
>
> Hi all,
>
> some of our realtime tasks get delayed from time to time due to activity of kcompactd0.
> Out of nowhere, realtime tasks go into uninterruptible sleep for some time.
> This delay can be as much as 1.1ms, which is not acceptable for us.
>
> Our hardware is an aarch64-based SoC with 8 A72 cores, kernel is 6.12.17 with PREEMPT_RT.
> We have CONFIG_COMPACTION and CONFIG_MIGRATION enabled.
>
> Here are some snippets from ftrace:
> kcompactd0-88 [001] 13112.100041: mm_compaction_begin: zone_start=0x80000 migrate_pfn=0x80000 free_pfn=0xffe00 zone_end=0x100000, mode=sync
> ...
> kcompactd0-88 [001] 13112.159782: mm_compaction_isolate_migratepages: range=(0x85800 ~ 0x85841) nr_scanned=65 nr_taken=32
> kcompactd0-88 [001] 13112.159810: mm_compaction_isolate_freepages: range=(0xddc40 ~ 0xddc48) nr_scanned=8 nr_taken=8
> kcompactd0-88 [001] 13112.160002: irq_handler_entry: irq=11 name=arch_timer
> kcompactd0-88 [001] 13112.160012: irq_handler_exit: irq=11 ret=handled
> kcompactd0-88 [001] 13112.160121: mm_compaction_migratepages: nr_migrated=32 nr_failed=0
> kcompactd0-88 [001] 13112.160122: mm_compaction_finished: node=0 zone=DMA order=-1 ret=continue
> kcompactd0-88 [001] 13112.160185: mm_compaction_isolate_migratepages: range=(0x85841 ~ 0x85a00) nr_scanned=447 nr_taken=166
> kcompactd0-88 [001] 13112.160204: mm_compaction_isolate_freepages: range=(0xddc48 ~ 0xddd80) nr_scanned=312 nr_taken=196
> tRealtime-16499 [004] 13112.160511: sched_switch: tRealtime:16499 [25] D ==> tKRC:16479 [39]
> tRealtime-16499 [004] 13112.160512: kernel_stack: <stack trace >
> => __schedule (ffffcde843022d6c)
> => schedule (ffffcde843023464)
> => io_schedule (ffffcde8430235ec)
> => migration_entry_wait_on_locked (ffffcde8424a1ad8)
> => migration_entry_wait (ffffcde84254c400)
> => do_swap_page (ffffcde8424f7fac)
> => __handle_mm_fault (ffffcde8424f8b64)
> => handle_mm_fault (ffffcde8424f9bc0)
> => do_page_fault (ffffcde843030380)
> => do_translation_fault (ffffcde84303072c)
> => do_mem_abort (ffffcde84222f674)
> => el0_ia (ffffcde84301eb20)
> => el0t_64_sync_handler (ffffcde84301f020)
> => el0t_64_sync (ffffcde842211514)
> kcompactd0-88 [001] 13112.160557: sched_pi_setprio: comm=kcompactd0 pid=88 oldprio=39 newprio=120
> kcompactd0-88 [001] 13112.160569: sched_waking: comm=tKRC pid=16479 prio=39 target_cpu=004
> kcompactd0-88 [001] 13112.160986: sched_waking: comm=tKRC pid=16479 prio=39 target_cpu=004
> kcompactd0-88 [001] 13112.161412: sched_waking: comm=tOther pid=16520 prio=40 target_cpu=004
> kcompactd0-88 [001] 13112.161457: sched_pi_setprio: comm=kcompactd0 pid=88 oldprio=40 newprio=120
> kcompactd0-88 [001] 13112.161465: sched_waking: comm=tOther pid=16520 prio=40 target_cpu=004
> kcompactd0-88 [001] 13112.161654: sched_waking: comm=tRealtime pid=16499 prio=25 target_cpu=004
>
> In our setup kcompactd0 gets enough CPU time (on core 1), however, it seems strange that it doesn't get the priority inherited from blocked realtime tasks.
> (It does for short amounts of time, which seems to be due to the locks inside migration_entry_wait_on_locked.)
>
> Is there anything we can do here?
>
> Thanks,
> Alexander

Yes, we have (likely) seen this issue too, in a !CONFIG_PREEMPT setting.

The basic problem is that the calling thread (kcompactd, or any thread that goes into direct compaction) creates a resource that other threads may have to wait on until the migration is done: the migration PTEs.
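To illustrate the two kinds of blocking involved, here is a rough userspace analogy (purely illustrative, not kernel code; all names are made up). A waiter on a PI-aware lock lends its priority to the owner, while a waiter on a plain wait queue / condition has no owner the scheduler could boost, which is essentially the situation the faulting realtime task is in:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pi_lock;                 /* analogue of an rtmutex       */
static pthread_mutex_t plain_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  migration_done = PTHREAD_COND_INITIALIZER;
static int entries_installed;                   /* analogue of migration PTEs   */

static void *low_prio_compactor(void *arg)
{
        /* "Installs migration entries": just flips a flag.  No lock is
         * held while the work is in flight, so nobody can inherit from us. */
        pthread_mutex_lock(&plain_lock);
        entries_installed = 1;
        pthread_mutex_unlock(&plain_lock);

        /* ... long copy / remap phase would happen here ... */

        pthread_mutex_lock(&plain_lock);
        entries_installed = 0;
        pthread_cond_broadcast(&migration_done);
        pthread_mutex_unlock(&plain_lock);
        return NULL;
}

static void *rt_waiter(void *arg)
{
        /* Case 1: blocking on a PI mutex would boost its owner.           */
        pthread_mutex_lock(&pi_lock);
        pthread_mutex_unlock(&pi_lock);

        /* Case 2: waiting for the "migration entry" to go away.  The
         * condition has no owner, so the compactor keeps its low priority
         * no matter how urgent we are.                                     */
        pthread_mutex_lock(&plain_lock);
        while (entries_installed)
                pthread_cond_wait(&migration_done, &plain_lock);
        pthread_mutex_unlock(&plain_lock);
        return NULL;
}

int main(void)
{
        pthread_mutexattr_t attr;
        pthread_t a, b;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&pi_lock, &attr);

        pthread_create(&a, NULL, low_prio_compactor, NULL);
        pthread_create(&b, NULL, rt_waiter, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("done");
        return 0;
}

(Build with -lpthread. To actually see the inversion one would pin everything to one CPU, give the waiter SCHED_FIFO via pthread_setschedparam(), and add a medium-priority CPU hog: the condvar waiter has no way to speed up the low-priority thread, while a waiter on the PI mutex would boost it.)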
Since a migration PTE is not a lock that is held by the thread doing the migration, there is no priority inheritance in the realtime case, and priority inversion can happen.

This issue has always been there, but it has been made more prominent with batch migration. With batch migration, all migration PTEs are set up in the first step, followed by a TLB flush, and then the copy / new map setup is done. So the migration PTEs stick around for longer, and the chance that other threads block on them is higher.

For the !CONFIG_PREEMPT case, the cond_resched() in the loop can also cause the thread creating the migration PTEs to be descheduled while a number of migration PTEs are in place, so there is a similar chance of priority inversion.

Not sure what the right thing to do would be. Either explicitly boost the priority of a thread temporarily during migrate_pages_batch (rough sketch below), or mitigate the issue by dealing with 'busy' pages more quickly in migrate_pages_batch.

- Frank
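Purely to make the "explicitly boost" option concrete, a rough and untested sketch of what a temporary boost bracket around the batch work might look like. The use of sched_set_fifo()/sched_set_normal() and the placement are my assumptions, not a proposed patch; the actual call site, the priority to pick, and how to restore the caller's original policy are exactly the open questions:

/* Untested sketch only -- not a proposed patch.  The idea is to give the
 * thread that installs the migration PTEs an elevated priority for as
 * long as those PTEs can block other (possibly realtime) tasks. */
#include <linux/sched.h>
#include <linux/sched/rt.h>

static void migrate_boost_begin(bool *boosted)
{
	*boosted = false;
	if (!rt_task(current)) {		/* don't touch already-RT callers */
		sched_set_fifo(current);	/* priority choice is the open question */
		*boosted = true;
	}
}

static void migrate_boost_end(bool boosted)
{
	if (boosted)
		sched_set_normal(current, 0);	/* really should restore the original nice value */
}

/*
 * Intended use, around the part of migrate_pages_batch() that installs the
 * migration PTEs and does the copy / remap (arguments elided):
 *
 *	bool boosted;
 *
 *	migrate_boost_begin(&boosted);
 *	... install migration entries, flush TLB, copy, remap ...
 *	migrate_boost_end(boosted);
 */

This only papers over the missing inheritance, of course; whether a plain priority bump or giving the migration PTEs a real owner (so PI can work) is the better long-term answer is still open.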