On Mon 28 Jul 2025 at 12:21, Heming Zhao <heming.zhao@xxxxxxxx>
wrote:
Commit a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for
HOT_REMOVE_DISK ioctl") introduced a regression in the
md_cluster
module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add)
Consider a 2-node cluster:
- node1 set faulty & remove command on a disk.
- node2 must correctly update the array metadata.
Before a1fd37f97808, on node1, the delay between
msg:METADATA_UPDATED
(triggered by faulty) and msg:REMOVE was sufficient for node2 to
reload the disk info (written by node1).
After a1fd37f97808, node1 no longer waits between faulty and
remove,
causing it to send msg:REMOVE while node2 is still reloading
disk info.
This often results in node2 failing to remove the faulty disk.
== how to trigger ==
set up a 2-node cluster (node1 & node2) with disks vdc & vdd.
on node1:
mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd
--assume-clean
ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd
mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc
check array status on both nodes with "mdadm -D /dev/md0".
node1 output:
Number Major Minor RaidDevice State
- 0 0 0 removed
1 254 48 1 active sync /dev/vdd
node2 output:
Number Major Minor RaidDevice State
- 0 0 0 removed
1 254 48 1 active sync /dev/vdd
0 254 32 - faulty /dev/vdc
Fixes: a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for
HOT_REMOVE_DISK ioctl")
Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
---
drivers/md/md.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0f03b21e66e4..de9f9a345ed3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9758,8 +9758,8 @@ void md_check_recovery(struct mddev
*mddev)
* remove disk.
*/
rdev_for_each_safe(rdev, tmp, mddev) {
- if (test_and_clear_bit(ClusterRemove,
&rdev->flags) &&
- rdev->raid_disk < 0)
ClusterRemove and MD_RECOVERY_NEEDED are set in removed
rdev->flags on node 2 by process_remove_disk().
After commit a1fd37f97808, node 2 could receive REMOVE in short
time after receive of
METADATA_UPDATED, MD_RECOVERY_NEEDED will be cleared by
md_check_recovery() but in that time
rdev->raid_disk could be unchanged. Then MD_RECOVERY_NEEDED will
be cleared in the end of md_check_recovery()
no more chance to meet the condition.
Reviewed-by: Su Yue <glass.su@xxxxxxxx>
+ if (rdev->raid_disk < 0 &&
+ test_and_clear_bit(ClusterRemove,
&rdev->flags))
md_kick_rdev_from_array(rdev);
}
}
@@ -10065,8 +10065,11 @@ static void check_sb_changes(struct
mddev *mddev, struct md_rdev *rdev)
/* Check for change of roles in the active devices */
rdev_for_each_safe(rdev2, tmp, mddev) {
- if (test_bit(Faulty, &rdev2->flags))
+ if (test_bit(Faulty, &rdev2->flags)) {
+ if (test_bit(ClusterRemove, &rdev2->flags))
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
continue;
+ }
/* Check if the roles changed */
role = le16_to_cpu(sb->dev_roles[rdev2->desc_nr]);