On 2025/08/13 9:59, Yu Kuai wrote:
> Hi,
>
> On 2025/08/12 17:01, Kenta Akagi wrote:
>> It is not intended for the array to fail when a metadata write with
>> MD_FAILFAST fails.
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> when md_error is called on the last device in RAID1/10,
>> the MD_BROKEN flag is set on the array.
>> Because of this, a failfast metadata write failure will
>> make the array "broken" state.
>>
>> If rdev is not Faulty even after calling md_error,
>> the rdev is the last device, and there is nothing except
>> MD_BROKEN that prevents writes to the array.
>> Therefore, by clearing MD_BROKEN, the array will not become
>> "broken" after a failfast metadata write failure.
>
> I don't understand here, I think MD_BROKEN is expected, the last
> rdev has IO error while updating metadata, the array is now broken
> and you can only read it afterwards. Allow using this broken array
> read-write might causing more severe problem like data loss.
>

Thank you for reviewing.

I think that a metadata write failure on the last rdev should not mark
the array as broken at that point, but only when the bio has the
MD_FAILFAST flag.

This is because a metadata write with MD_FAILFAST is retried after
failure as follows:

1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.

2. In md_super_wait, which is called by the function that executed
   md_super_write and waits for completion, -EAGAIN is returned
   because MD_SB_NEED_REWRITE is set.

3. The caller of md_super_wait (such as md_update_sb) receives the
   negative return value and retries md_super_write.

4. md_super_write, called again to perform the same metadata write,
   issues the write bio without MD_FAILFAST this time, because the
   rdev now has the LastDev flag set.

When a metadata write bio without MD_FAILFAST fails in super_written,
the array is truly broken, and MD_BROKEN should be set.
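The retry sequence in steps 1-4 can be modelled in userspace roughly as
follows. This is only a sketch and not kernel code: the model_* names, the
struct, and the patched/retry_fails parameters are made up for illustration.
It assumes that the first (failfast) write always fails because of a
transient outage, and that md_error on the last device leaves the rdev
non-Faulty while setting MD_BROKEN, as described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are NOT the kernel's structures. */
struct model_array {
	bool broken;          /* models MD_BROKEN in mddev->flags */
	bool sb_need_rewrite; /* models MD_SB_NEED_REWRITE in mddev->sb_flags */
	bool rdev_faulty;     /* models Faulty on the last rdev */
	bool rdev_last_dev;   /* models LastDev on the last rdev */
};

/* Models md_error() on the last device of RAID1/10: the rdev is not
 * marked Faulty, but the array is flagged broken. */
static void model_md_error(struct model_array *a)
{
	a->broken = true;
}

/* Models the failure path of super_written(). */
static void model_super_written(struct model_array *a, bool failed,
				bool failfast, bool patched)
{
	if (!failed)
		return;
	model_md_error(a);
	if (!a->rdev_faulty && failfast) {
		if (patched)
			a->broken = false; /* the proposed clear_bit(MD_BROKEN) */
		a->sb_need_rewrite = true; /* step 1 */
		a->rdev_last_dev = true;
	}
}

/* Models md_super_write(): failfast is dropped once LastDev is set (step 4). */
static void model_md_super_write(struct model_array *a, bool patched,
				 bool retry_fails)
{
	bool failfast = !a->rdev_last_dev;
	/* Assumption: the failfast attempt always fails (transient outage);
	 * the plain retry fails only if retry_fails is true. */
	bool failed = failfast ? true : retry_fails;

	model_super_written(a, failed, failfast, patched);
}

/* Models md_super_wait(): report -EAGAIN if a rewrite was requested (step 2). */
static int model_md_super_wait(struct model_array *a)
{
	if (a->sb_need_rewrite) {
		a->sb_need_rewrite = false;
		return -EAGAIN;
	}
	return 0;
}

/* Models the retry loop of a caller such as md_update_sb() (step 3).
 * Returns whether the modelled array ends up broken. */
bool model_update_sb(bool patched, bool retry_fails)
{
	struct model_array a = { false, false, false, false };
	int attempts = 0;

	do {
		attempts++;
		assert(attempts <= 2); /* one failfast try plus one plain retry */
		model_md_super_write(&a, patched, retry_fails);
	} while (model_md_super_wait(&a) < 0);

	return a.broken;
}
```

In this model, model_update_sb(false, false) leaves the array broken even
though the retry succeeded, while model_update_sb(true, false) does not.
With retry_fails set, the array ends up broken in both cases, which is the
behavior I think is correct.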

A failfast bio, for example in the case of nvme-tcp, will fail
immediately if the connection to the target is lost for a few seconds
and the device enters a reconnecting state - even though it would
recover if given a few seconds. This behavior is exactly as intended
by the design of failfast.

However, md treats a super_write operation that fails with failfast as
fatal. For example, if an initiator - that is, a machine loading the
md module - loses all connections for a few seconds, the array becomes
broken and subsequent writes are no longer possible.
This is the issue I am currently facing, and which this patch aims to
fix.

Should I add more context to the commit message?
Please advise.

Thanks,
AKAGI

> Thanks,
> Kuai
>
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@xxxxxxx>
>> ---
>>  drivers/md/md.c | 1 +
>>  drivers/md/md.h | 2 +-
>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index ac85ec73a409..3ec4abf02fa0 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>           md_error(mddev, rdev);
>>           if (!test_bit(Faulty, &rdev->flags)
>>               && (bio->bi_opf & MD_FAILFAST)) {
>> +            clear_bit(MD_BROKEN, &mddev->flags);
>>               set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>               set_bit(LastDev, &rdev->flags);
>>           }
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..2f87bcc5d834 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -332,7 +332,7 @@ struct md_cluster_operations;
>>    *                   resync lock, need to release the lock.
>>    * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
>>    *               calls to md_error() will never cause the array to
>> - *               become failed.
>> + *               become failed while fail_last_dev is not set.
>>    * @MD_HAS_PPL:  The raid array has PPL feature set.
>>    * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
>>    * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>>
>
>