The jbd2 handle, associated with filesystem metadata, can be held
during direct reclaim when a memcg limit is hit. This prevents other
tasks from writing pages, so shrinking fails because dirty pages cannot
be written back. These shrink failures may leave many tasks stuck in
the uninterruptible (D) state.

The OOM killer may select a victim and return success, allowing the
current thread to retry the memory charge. However, the selected task
cannot respond to SIGKILL because it is also stuck in the
uninterruptible state. As a result, the charging task resets
nr_retries and attempts reclaim again, but the victim never exits.
This leads to a prolonged retry loop in direct reclaim with the jbd2
handle held, significantly extending its hold time and potentially
causing a system-wide block.

A related issue has been reported and partially addressed in previous
fixes [1][2]. However, those fixes only skip direct reclaim and return
a failure in some cases, such as readahead requests. Since sb_getblk()
is called multiple times in __ext4_get_inode_loc() with the NOFAIL
flag, the problem persists.

So call memalloc_account_force_save() in start_this_handle() to charge
the pages immediately and defer direct reclaim until the task returns
to userland, by which point the jbd2 handle, a global resource, has
been released.
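
As an illustration of why one saved word can carry both scopes, here is
a minimal userspace sketch. It assumes memalloc_account_force_save()
follows the same convention as memalloc_flags_save(), i.e. it returns
only the bits it newly set. The flag values, task_flags, and the
handle_start()/handle_stop()/demo_nested_scope() helpers are
hypothetical stand-ins, not kernel code.

```c
#include <assert.h>

/* Illustrative values only; the real PF_* bits live in linux/sched.h. */
#define PF_MEMALLOC_NOFS         0x1u
#define PF_MEMALLOC_ACCOUNTFORCE 0x2u /* proposed by this series; value assumed */

/* Hypothetical stand-in for current->flags. */
static unsigned int task_flags;

/* Set the requested bits and return only those that were newly set. */
static unsigned int flags_save(unsigned int flags)
{
	unsigned int newly_set = ~task_flags & flags;

	task_flags |= flags;
	return newly_set;
}

/* Clear exactly the bits that a matching flags_save() reported. */
static void flags_restore(unsigned int saved)
{
	task_flags &= ~saved;
}

/* Models start_this_handle(): accumulate both scopes in one word. */
static unsigned int handle_start(void)
{
	unsigned int saved = flags_save(PF_MEMALLOC_NOFS);

	saved |= flags_save(PF_MEMALLOC_ACCOUNTFORCE);
	return saved;
}

/* Models stop_this_handle(): one restore drops every bit we set. */
static void handle_stop(unsigned int saved)
{
	flags_restore(saved);
}

/*
 * The caller is already inside a NOFS scope when the handle starts.
 * Returns 1 if the outer NOFS scope survives the handle's restore.
 */
static int demo_nested_scope(void)
{
	unsigned int saved;

	task_flags = PF_MEMALLOC_NOFS;
	saved = handle_start();
	assert(saved == PF_MEMALLOC_ACCOUNTFORCE);
	assert(task_flags == (PF_MEMALLOC_NOFS | PF_MEMALLOC_ACCOUNTFORCE));
	handle_stop(saved);
	return task_flags == PF_MEMALLOC_NOFS;
}
```

Because each save reports only the bits it actually set, OR-ing the two
results and restoring them once leaves any outer scope intact.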

[1]: https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@xxxxxxxxxxxxxxxxx/
[2]: https://lore.kernel.org/all/20230914150011.843330-1-willy@xxxxxxxxxxxxx/T/#u

Co-developed-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
Signed-off-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
---
 fs/jbd2/transaction.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c7867139af69..d05847301a8f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -448,6 +448,13 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 	 * going to recurse back to the fs layer.
 	 */
 	handle->saved_alloc_context = memalloc_nofs_save();
+
+	/*
+	 * Avoid blocking on the jbd2 handle in memcg direct reclaim,
+	 * which may otherwise lead to system-wide stalls.
+	 */
+	handle->saved_alloc_context |= memalloc_account_force_save();
+
 	return 0;
 }
 
@@ -733,10 +740,10 @@ static void stop_this_handle(handle_t *handle)
 
 	rwsem_release(&journal->j_trans_commit_map, _THIS_IP_);
 	/*
-	 * Scope of the GFP_NOFS context is over here and so we can restore the
-	 * original alloc context.
+	 * Scope of the GFP_NOFS and PF_MEMALLOC_ACCOUNTFORCE context
+	 * is over here and so we can restore the original alloc context.
 	 */
-	memalloc_nofs_restore(handle->saved_alloc_context);
+	memalloc_flags_restore(handle->saved_alloc_context);
 }
 
 /**
@@ -1838,7 +1845,7 @@ int jbd2_journal_stop(handle_t *handle)
 		 * Handle is already detached from the transaction so there is
 		 * nothing to do other than free the handle.
 		 */
-		memalloc_nofs_restore(handle->saved_alloc_context);
+		memalloc_flags_restore(handle->saved_alloc_context);
 		goto free_and_exit;
 	}
 	journal = transaction->t_journal;
-- 
2.39.5