Hi Zhongkun,

Thanks for a very detailed and awesome description of the problem. This is a
real issue and we at Meta face similar scenarios as well. However, I would not
go for the PF_MEMALLOC_ACFORCE approach: it is easy to abuse, it is very
manual, and it requires detecting the code paths which can cause such scenarios
and then opting them in case by case. I would prefer a dynamic or automated
approach where the kernel detects that such an issue is happening and recovers
from it. A case can be made for avoiding such scenarios in the first place, but
that might not be possible every time. Also, this is very memcg specific; I can
clearly see the same scenario happening for global reclaim as well.

I have a couple of questions below:

On Wed, Jun 18, 2025 at 07:39:56PM +0800, Zhongkun He wrote:
> # Introduction
>
> This patchset aims to introduce an approach to ensure that memory
> allocations are forced to be accounted to the memory cgroup, even if
> they exceed the cgroup's maximum limit. In such cases, the reclaim
> process is postponed until the task returns to the user.

This breaks memory.max semantics. Any reason memory.high is not used here?
Basically, instead of memory.max, use memory.high as the job limit. I would
like to know how memory.high is lacking for your use-case. Maybe we can fix
that or introduce a new form of limit. However, this is memcg specific and will
not resolve the global reclaim case.

> This is
> beneficial for users who perform over-max reclaim while holding multiple
> locks or other resources (especially resources related to file system
> writeback). If a task needs any of these resources, it would otherwise
> have to wait until the other task completes reclaim and releases the
> resources. Postponing reclaim to the return-to-user path helps avoid this issue.
>
> # Background
>
> We have been encountering a hung-task issue for a long time. Specifically,
> when a task holds the jbd2 handle

Can you explain a bit more about the jbd2 handle? Is it some global shared
lock, or a workqueue which can only run a single thread at a time? Basically,
is there a way to get the current holder/owner of the jbd2 handle
programmatically?

> and subsequently enters direct reclaim
> because it reaches the hard limit within a memory cgroup, the system may become
> blocked for a long time.
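For context on my question: my rough mental model of the jbd2 handle is the
following (simplified, not verbatim from fs/ext4 or fs/jbd2; inode, credits,
iloc and err are just illustrative locals). Please correct me where I am wrong.

	handle_t *handle;
	int err;

	/* Joins the running transaction (or waits for one to open) and
	 * bumps its count of outstanding handles (t_updates). */
	handle = ext4_journal_start(inode, EXT4_HT_INODE, credits);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/*
	 * Anything the task does here -- including a memcg-limited page
	 * cache allocation that drops into direct reclaim, as in the stack
	 * trace below -- keeps t_updates elevated.
	 */
	err = ext4_reserve_inode_write(handle, inode, &iloc);

	/* Drops t_updates; the commit can only proceed once it hits zero. */
	ext4_journal_stop(handle);

So it is not a single global lock per se, but jbd2_journal_commit_transaction()
in kjournald2 has to wait in jbd2_journal_wait_updates() until every
outstanding handle is stopped. That means one task sleeping in reclaim with an
open handle stalls the commit and, transitively, everyone else waiting on the
journal. Is that the scenario here?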
> The stack trace of the waiting thread holding the jbd2
> handle is as follows (and so many other threads are waiting on the same jbd2
> handle):
>
>  #0 __schedule at ffffffff97abc6c9
>  #1 preempt_schedule_common at ffffffff97abcdaa
>  #2 __cond_resched at ffffffff97abcddd
>  #3 shrink_active_list at ffffffff9744dca2
>  #4 shrink_lruvec at ffffffff97451407
>  #5 shrink_node at ffffffff974517c9
>  #6 do_try_to_free_pages at ffffffff97451dae
>  #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
>  #8 try_charge_memcg at ffffffff974f0ede
>  #9 charge_memcg at ffffffff974f1d0e
> #10 __mem_cgroup_charge at ffffffff974f391c
> #11 __add_to_page_cache_locked at ffffffff974313e5
> #12 add_to_page_cache_lru at ffffffff974324b2
> #13 pagecache_get_page at ffffffff974338e3
> #14 __getblk_gfp at ffffffff97556798
> #15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
> #16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
> #17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
> #18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
> #19 __ext4_new_inode at ffffffffc079cbae [ext4]
> #20 ext4_create at ffffffffc07c3e56 [ext4]
> #21 path_openat at ffffffff9751f471
> #22 do_filp_open at ffffffff97521384
> #23 do_sys_openat2 at ffffffff97508fd6
> #24 do_sys_open at ffffffff9750a65b
> #25 do_syscall_64 at ffffffff97aaed14
>
> We've obtained a coredump and dumped struct scan_control from it by using the crash tool.
>
> struct scan_control {
>     nr_to_reclaim = 32,
>     order = 0 '\000',
>     priority = 1 '\001',
>     reclaim_idx = 4 '\004',
>     gfp_mask = 17861706,       /* __GFP_NOFAIL */
>     nr_scanned = 27810,
>     nr_reclaimed = 0,
>     nr = {
>         dirty = 27797,
>         unqueued_dirty = 27797,
>         congested = 0,
>         writeback = 0,
>         immediate = 0,
>         file_taken = 27810,
>         taken = 27810
>     },
> }

What is the kernel version? Can you run scripts/gfp-translate on the gfp_mask
above? Does this kernel have a75ffa26122b ("memcg, oom: do not bypass oom
killer for dying tasks")?

> The ->nr_reclaimed is zero, meaning there is no memory we have reclaimed, because
> most of the file pages are unqueued dirty. And ->priority is 1, also meaning we
> spent a lot of time on memory reclamation.

Is there a way to get how many times this thread has looped within
try_charge_memcg()?

> Since this thread has held the jbd2
> handle, the jbd2 thread was waiting for the same jbd2 handle, which blocked
> so many other threads from writing dirty pages as well.
>
> 0 [] __schedule at ffffffff97abc6c9
> 1 [] schedule at ffffffff97abcd01
> 2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
> 3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
> 4 [] kjournald2 at ffffffffc05ad66d [jbd2]
> 5 [] kthread at ffffffff972bc4c0
> 6 [] ret_from_fork at ffffffff9720440f
>
> Furthermore, we observed that memory usage far exceeded the configured memory maximum,
> by around 38 GB.
>
> memory.max  : 134896020 pages (514 GB)
> memory.usage: 144747169 pages (552 GB)

This is unexpected, and most probably our hacks to allow overcharge (to avoid
similar situations) are causing this.

> We investigated this issue and identified the root cause:
>
> try_charge_memcg:
>     retry charge
>     charge failed
>     -> direct reclaim
>     -> mem_cgroup_oom returns true, but the selected task is in an uninterruptible state
>     -> retry charge

Oh, the oom reaper didn't help?

> In this case, we saw many tasks in the uninterruptible (D) state with a pending
> SIGKILL signal. The OOM killer selects a victim and returns success, allowing the
> current thread to retry the memory charge.
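That matches my reading of try_charge_memcg(). A simplified sketch of the loop,
paraphrased from mm/memcontrol.c (the function and symbol names are real, but
the exact flow and signatures differ across kernel versions):

	int nr_retries = MAX_RECLAIM_RETRIES;

retry:
	if (page_counter_try_charge(&memcg->memory, batch, &counter))
		goto done;			/* still under memory.max */

	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
						    gfp_mask, reclaim_options);
	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
		goto retry;

	if (nr_retries--)
		goto retry;

	/*
	 * Reclaim made no progress (everything is unqueued dirty in your
	 * scan_control dump), so we eventually reach the OOM path.
	 * mem_cgroup_oom() picks a victim and returns true, but a victim
	 * stuck in D state never exits and never frees anything ...
	 */
	if (mem_cgroup_oom(mem_over_limit, gfp_mask, order)) {
		nr_retries = MAX_RECLAIM_RETRIES; /* ... yet the retry budget resets */
		goto retry;
	}

	/* __GFP_NOFAIL never takes the failure path, so there is no way out. */

So a "successful" OOM kill of an unkillable victim resets the retry budget
every time around, and the charging task can spin here indefinitely with the
jbd2 handle held. That is why I am curious how many iterations this thread had
gone through.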
> However, the selected task cannot acknowledge
> the SIGKILL signal because it is stuck in an uninterruptible state.

The OOM reaper usually helps in such cases, but I see below why it didn't help
here.

> As a result,
> the charging task resets nr_retries and attempts to reclaim again, but the victim
> task never exits. This causes the current thread to enter a prolonged retry loop
> during direct reclaim, holding the jbd2 handle for much more time and leading to
> system-wide blocking.

Why are there so many uninterruptible (D) state tasks?

> Check the most common stack trace.
>
> crash> task_struct.__state ffff8c53a15b3080
>   __state = 2,             /* #define TASK_UNINTERRUPTIBLE 0x0002 */
>
> 0 [] __schedule at ffffffff97abc6c9
> 1 [] schedule at ffffffff97abcd01
> 2 [] schedule_preempt_disabled at ffffffff97abdf1a
> 3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
> 4 [] down_read at ffffffff97ac06b1
> 5 [] do_user_addr_fault at ffffffff9727f1e7
> 6 [] exc_page_fault at ffffffff97ab286e
> 7 [] asm_exc_page_fault at ffffffff97c00d42
>
> Check the owner of mm_struct.mmap_lock. The task below was entering memory reclaim
> holding the mmap lock, and there are 68 tasks in this memory cgroup, with 23 of them in
> the memory reclaim context.

The following thread has mmap_lock in write mode and thus the oom-reaper is not
helping. Do you see "oom_reaper: unable to reap pid..." messages in dmesg?

>  7 [] shrink_active_list at ffffffff9744dd46
>  8 [] shrink_lruvec at ffffffff97451407
>  9 [] shrink_node at ffffffff974517c9
> 10 [] do_try_to_free_pages at ffffffff97451dae
> 11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
> 12 [] try_charge_memcg at ffffffff974f0ede
> 13 [] obj_cgroup_charge_pages at ffffffff974f1dae
> 14 [] obj_cgroup_charge at ffffffff974f2fc2
> 15 [] kmem_cache_alloc at ffffffff974d054c
> 16 [] vm_area_dup at ffffffff972923f1
> 17 [] __split_vma at ffffffff97486c16
> 18 [] __do_munmap at ffffffff97486e78
> 19 [] __vm_munmap at ffffffff97487307
> 20 [] __x64_sys_munmap at ffffffff974873e7
> 21 [] do_syscall_64 at ffffffff97aaed14
>
> Many threads were entering memory reclaim in the UN state, and other threads were
> blocking on mmap_lock. Although the OOM killer selects a victim, it cannot terminate it.

Can you please confirm the above? Is the kernel able to oom-kill more
processes, or is it returning early because the current thread is dying?
However, if the cgroup has just one big process, this doesn't matter.

> The
> task holding the jbd2 handle retries the memory charge, but it fails. Reclaiming continues
> while holding the jbd2 handle. write_pages also fails while waiting for the same jbd2
> handle, causing repeated shrink failures and potentially leading to a system-wide block.
>
> ps | grep UN | wc -l
> 1463
>
> With the system having 1463 UN-state tasks, the way to break this "deadlock"-like
> situation is to let the thread holding the jbd2 handle quickly exit the memory
> reclamation process.
>
> We found that a related issue was reported and partially fixed in previous patches [1][2].
> However, those fixes only skip direct reclamation and return a failure for some cases such
> as readahead requests. As sb_getblk() is called multiple times in __ext4_get_inode_loc()
> with the NOFAIL flag, the problem still exists. And it is not feasible to simply remove
> __GFP_DIRECT_RECLAIM when holding the jbd2 handle to avoid potentially very long memory
> reclaim latency, as __GFP_NOFAIL is not supported without __GFP_DIRECT_RECLAIM.
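Going back to the oom-reaper point above, this is the part of mm/oom_kill.c I
had in mind (paraphrased and simplified, not the exact upstream code):

	static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
	{
		/* The reaper can only unmap the victim's memory if it can
		 * take mmap_lock for read ... */
		if (!mmap_read_trylock(mm))
			return false;	/* ... but the munmap path above holds it for write */

		/* ... unmap the victim's private, non-shared memory ... */

		mmap_read_unlock(mm);
		return true;
	}

	static void oom_reap_task(struct task_struct *tsk)
	{
		int attempts = 0;
		struct mm_struct *mm = tsk->signal->oom_mm;

		/* Retry a few times, then give up and log. */
		while (attempts++ < MAX_OOM_REAP_RETRIES &&
		       !oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ / 10);

		if (attempts > MAX_OOM_REAP_RETRIES)
			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
				task_pid_nr(tsk), tsk->comm);
	}

If the victim is the munmap thread itself (or shares its mm), the reaper keeps
failing the trylock and eventually gives up, which is why I asked about the
"unable to reap" messages in dmesg.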
> # Fundamentals
>
> This patchset introduces a new task flag, PF_MEMALLOC_ACFORCE, to indicate that memory
> allocations are forced to be accounted to the memory cgroup, even if they exceed the cgroup's
> maximum limit. The reclaim process is deferred until the task returns to the user, where it no
> longer holds any kernel resources needed for memory reclamation, thereby preventing priority
> inversion problems. Any user who might encounter a similar issue can use this new flag
> to allocate memory and prevent long-term latency for the entire system.

I already explained upfront why this is not the approach we want. We do see
similar situations/scenarios, but due to global/shared locks in btrfs, and I
expect any global lock or global shared resource can cause such priority
inversion situations.
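To expand on the memory.high suggestion at the top: the "charge now, reclaim on
return to userspace" behavior this patchset wants already exists for
memory.high. Roughly, paraphrased from mm/memcontrol.c and
include/linux/resume_user_mode.h (exact signatures vary by kernel version):

	/* In try_charge_memcg(): the charge itself always succeeds; going
	 * over memory.high only records the debt and punts the reclaim. */
	if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->memory.high)) {
		current->memcg_nr_pages_over_high += batch;
		set_notify_resume(current);
	}

	/* Later, on the way back to userspace (resume_user_mode_work()),
	 * with no locks, jbd2 handles or other kernel resources held,
	 * mem_cgroup_handle_over_high() reclaims the recorded excess and
	 * throttles the task if it keeps overshooting. */

So if the job limit were expressed as memory.high (with memory.max, if used at
all, as a far-away backstop), the charge in your jbd2 case would not reclaim
synchronously at all, and the handle would be released before any reclaim
happens. That is why I would like to understand where memory.high falls short
for your workload.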