Hi,
On 2025/09/11 20:22, Meir Elisha wrote:
On 09/09/2025 4:00, Yu Kuai wrote:
Hi,
On 2025/09/08 22:08, Meir Elisha wrote:
When a RAID array is recovering and sync_action is set to "frozen",
the recovery process hangs indefinitely. This occurs because
wait_event() calls in md_do_sync() were missing the MD_RECOVERY_INTR
check.
Signed-off-by: Meir Elisha <meir.elisha@xxxxxxxxxxx>
---
drivers/md/md.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1de550108756..1b14beef87fc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9475,7 +9475,8 @@ void md_do_sync(struct md_thread *thread)
)) {
/* time to update curr_resync_completed */
wait_event(mddev->recovery_wait,
- atomic_read(&mddev->recovery_active) == 0);
+ atomic_read(&mddev->recovery_active) == 0 ||
+ test_bit(MD_RECOVERY_INTR, &mddev->recovery));
mddev->curr_resync_completed = j;
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
j > mddev->resync_offset)
@@ -9581,7 +9582,8 @@ void md_do_sync(struct md_thread *thread)
* The faster the devices, the less we wait.
*/
wait_event(mddev->recovery_wait,
- !atomic_read(&mddev->recovery_active));
+ !atomic_read(&mddev->recovery_active) ||
+ test_bit(MD_RECOVERY_INTR, &mddev->recovery));
}
}
}
@@ -9592,7 +9594,8 @@ void md_do_sync(struct md_thread *thread)
* this also signals 'finished resyncing' to md_stop
*/
blk_finish_plug(&plug);
- wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
+ wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active) ||
+ test_bit(MD_RECOVERY_INTR, &mddev->recovery));
if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
This patch doesn't make sense: recovery_active should be zero once all
resync IO is done. MD_RECOVERY_INTR just tells sync_thread to stop
issuing new sync IO.
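
To illustrate the point, here is a minimal user-space sketch of the two
conditions. It is not kernel code: struct model_mddev and the model_*
helpers are hypothetical names that only mirror recovery_active and
MD_RECOVERY_INTR. It shows that setting the flag stops new IO from being
issued but does not complete the IO already in flight, so the patched
condition would return while the counter is still nonzero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; the fields only mirror the kernel ones. */
struct model_mddev {
    int recovery_active;   /* sync IOs issued but not yet completed */
    bool intr;             /* stands in for MD_RECOVERY_INTR */
};

/* Issue one sync IO unless the sync thread has been told to stop. */
static bool model_issue_sync_io(struct model_mddev *m)
{
    if (m->intr)
        return false;
    m->recovery_active++;
    return true;
}

/* Completion path: every issued IO must reach this exactly once. */
static void model_end_sync_io(struct model_mddev *m)
{
    assert(m->recovery_active > 0);
    m->recovery_active--;
}

/* Condition the existing wait_event() calls sleep on. */
static bool model_drained(const struct model_mddev *m)
{
    return m->recovery_active == 0;
}

/* Condition the patch proposes instead. */
static bool model_patched_cond(const struct model_mddev *m)
{
    return m->recovery_active == 0 || m->intr;
}
```

With two IOs issued and the flag then set, model_patched_cond() is
already true while model_drained() stays false until both completions
arrive.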
Thanks,
Kuai
Hi Kuai
Reproduced this issue:
[30511.653859] INFO: task md_vol0000001_6:9483 blocked for more than 622 seconds.
[30511.654079] Not tainted 5.14.0-503.31.1.el9_5.x86_64 #1
[30511.654321] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[30511.654550] task:md_vol0000001_6 state:D stack:0 pid:9483 tgid:9483 ppid:2 flags:0x00004000
[30511.654864] Call Trace:
[30511.655015] <TASK>
[30511.655165] __schedule+0x229/0x550
[30511.655339] ? srso_alias_return_thunk+0x5/0xfbef5
[30511.655514] schedule+0x2e/0xd0
[30511.655667] md_do_sync.cold+0x98d/0x98f
[30511.655820] ? __pfx_autoremove_wake_function+0x10/0x10
[30511.655976] ? __pfx_md_thread+0x10/0x10
[30511.656125] md_thread+0xab/0x160
[30511.656291] ? __pfx_md_thread+0x10/0x10
[30511.656432] kthread+0xe0/0x100
[30511.656577] ? __pfx_kthread+0x10/0x10
[30511.656718] ret_from_fork+0x2c/0x50
[30511.656862] </TASK>
[30511.657003] INFO: task bash:9559 blocked for more than 622 seconds.
[30511.657150] Not tainted 5.14.0-503.31.1.el9_5.x86_64 #1
[30511.657312] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[30511.657475] task:bash state:D stack:0 pid:9559 tgid:9559 ppid:6631 flags:0x00004006
[30511.657759] Call Trace:
[30511.657895] <TASK>
[30511.658026] __schedule+0x229/0x550
[30511.658160] schedule+0x2e/0xd0
[30511.658320] stop_sync_thread+0xf2/0x190
[30511.658466] ? __pfx_autoremove_wake_function+0x10/0x10
[30511.658606] action_store+0x103/0x2f0
[30511.658743] md_attr_store+0x83/0x100
[30511.658883] kernfs_fop_write_iter+0x12b/0x1c0
[30511.659026] vfs_write+0x2ce/0x410
[30511.659169] ksys_write+0x5f/0xe0
[30511.659332] do_syscall_64+0x5f/0xf0
Debugging showed that we hung in a wait_event() call in md_do_sync().
If adding MD_RECOVERY_INTR to the wait condition is a bad idea,
do you see another way we can prevent this hang?
Thanks for your feedback!
The first step is to figure out why recovery_active is not zero. Usually
it will be zero once all the resync IO is done, so you need to find out
whether the IO is stuck somewhere, or whether there is no resync IO in
flight and the counter has leaked.
You should also probably try the latest kernel first.
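
The two cases can be told apart roughly as in the sketch below. This is
only an illustration under stated assumptions: struct sync_stats and
classify_hang() are hypothetical, and dev_inflight stands for whatever
evidence you can gather that the member devices still hold sync IO (for
example the per-device inflight counts in sysfs).

```c
/* Hypothetical triage helper, not part of md. */
enum hang_cause { NOT_HUNG, IO_STUCK, COUNTER_LEAKED };

struct sync_stats {
    long recovery_active;  /* snapshot of the md counter */
    long dev_inflight;     /* sync IOs the member devices still hold (assumed input) */
};

static enum hang_cause classify_hang(const struct sync_stats *s)
{
    if (s->recovery_active == 0)
        return NOT_HUNG;        /* the wait condition is already true */
    if (s->dev_inflight > 0)
        return IO_STUCK;        /* IO exists but never completes */
    return COUNTER_LEAKED;      /* nothing in flight, counter never dropped */
}
```

If the result is COUNTER_LEAKED, the thing to look for is an error or
abort path that skips the completion accounting.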
Thanks,
Kuai