On 2025/08/19 0:45, Yu Kuai wrote: > Hi, > > 在 2025/8/18 20:48, Kenta Akagi 写道: >> On 2025/08/18 11:48, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/08/18 10:05, Yu Kuai 写道: >>>> Hi, >>>> >>>> 在 2025/08/18 1:27, Kenta Akagi 写道: >>>>> A super_write IO failure with MD_FAILFAST must not cause the array >>>>> to fail. >>>>> >>>>> Because a failfast bio may fail even when the rdev is not broken, >>>>> so IO must be retried rather than failing the array when a metadata >>>>> write with MD_FAILFAST fails on the last rdev. >>>> Why just last rdev? If failfast can fail when the rdev is not broken, I >>>> feel we should retry for all the rdev. >> Thank you for reviewing. >> >> The reason this retry applies only to the last rdev is that the purpose >> of failfast is to quickly detach a faulty device and thereby minimize >> mdev IO latency on rdev failure. >> If md retries all rdevs, the Faulty handler will no longer act >> quickly enough, which will always "cause long delays" [1]. >> I believe this is not the behavior users want. >> >> [1] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/mdadm.8.in?h=main&id=34f21b7acea8afbea9348d0f421beeeedca7a136#n784 >> >>> BTW, I couldn't figure out the reason, why failfast is added for the >>> meta write. I do feel just remove this flag for metadata write will fix >>> this problem. >> By issuing metadata writes with failfast in md, it becomes possible to >> detect rdev failures quickly. >> Most applications never issue IO with the REQ_FAILFAST flag set, >> so if md issues its metadata writes without failfast, >> rdev failures would not be detected quickly. >> This would undermine the point of the md's failfast feature. >> And this would also "cause long delays" [1]. >> I believe this is also not what users want. > > Yes, this make sense. But I was thinking failfast will work on normal IO, > not metadata IO like updating superblock, which doesn't happen quite often > for user. But consider we have this behavior for such a long time, I agree > we'd better not change it. Thank you for reviewing. Sorry, I forgot to mention that in my environment bitmap is enabled, so super_write is called frequently. Also, what I said earlier was incorrect. md actively attaches MD_FAILFAST not only for metadata writes but also for other requests. Therefore, the current patch is insufficient to achieve the prevent last rdev fails by failfast. I will submit a new patch with the fix. Thanks, Akagi > >>> Thanks, >>> Kuai >>> >>>>> A metadata write with MD_FAILFAST is retried after failure as >>>>> follows: >>>>> >>>>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags. >>>>> >>>>> 2. In md_super_wait, which is called by the function that >>>>> executed md_super_write and waits for completion, >>>>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set. >>>>> >>>>> 3. The caller of md_super_wait (such as md_update_sb) >>>>> receives a negative return value and then retries md_super_write. >>>>> >>>>> 4. The md_super_write function, which is called to perform >>>>> the same metadata write, issues a write bio without MD_FAILFAST >>>>> this time. >>>>> >>>>> When a write from super_written without MD_FAILFAST fails, >>>>> the array may broken, and MD_BROKEN should be set. >>>>> >>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"), >>>>> calling md_error on the last rdev in RAID1/10 always sets >>>>> the MD_BROKEN flag on the array. >>>>> As a result, when failfast IO fails on the last rdev, the array >>>>> immediately becomes failed. >>>>> >>>>> This commit prevents MD_BROKEN from being set when a super_write with >>>>> MD_FAILFAST fails on the last rdev, ensuring that the array does >>>>> not become failed due to failfast IO failures. >>>>> >>>>> Failfast IO failures on any rdev except the last one are not retried >>>>> and are marked as Faulty immediately. This minimizes array IO latency >>>>> when an rdev fails. >>>>> >>>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10") >>>>> Signed-off-by: Kenta Akagi <k@xxxxxxx> >>>>> --- >>>>> drivers/md/md.c | 9 ++++++--- >>>>> drivers/md/md.h | 7 ++++--- >>>>> drivers/md/raid1.c | 12 ++++++++++-- >>>>> drivers/md/raid10.c | 12 ++++++++++-- >>>>> 4 files changed, 30 insertions(+), 10 deletions(-) >>>>> >>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>>>> index ac85ec73a409..61a8188849a3 100644 >>>>> --- a/drivers/md/md.c >>>>> +++ b/drivers/md/md.c >>>>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio) >>>>> if (bio->bi_status) { >>>>> pr_err("md: %s gets error=%d\n", __func__, >>>>> blk_status_to_errno(bio->bi_status)); >>>>> + if (bio->bi_opf & MD_FAILFAST) >>>>> + set_bit(FailfastIOFailure, &rdev->flags); >>>> I think it's better to retry the bio with the flag cleared, then all >>>> underlying procedures can stay the same. >> That might be a better approach. I'll check the call hierarchy and lock dependencies. > > You might need to add a new async work to resubmit this bio. > > Thanks, > Kuai > >> Thanks, >> Akagi >> >> >>>> Thanks, >>>> Kuai >>>> >>>>> md_error(mddev, rdev); >>>>> if (!test_bit(Faulty, &rdev->flags) >>>>> && (bio->bi_opf & MD_FAILFAST)) { >>>>> + pr_warn("md: %s: Metadata write will be repeated to %pg\n", >>>>> + mdname(mddev), rdev->bdev); >>>>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags); >>>>> - set_bit(LastDev, &rdev->flags); >>>>> } >>>>> } else >>>>> - clear_bit(LastDev, &rdev->flags); >>>>> + clear_bit(FailfastIOFailure, &rdev->flags); >>>>> bio_put(bio); >>>>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev, >>>>> if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) && >>>>> test_bit(FailFast, &rdev->flags) && >>>>> - !test_bit(LastDev, &rdev->flags)) >>>>> + !test_bit(FailfastIOFailure, &rdev->flags)) >>>>> bio->bi_opf |= MD_FAILFAST; >>>>> atomic_inc(&mddev->pending_writes); >>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h >>>>> index 51af29a03079..cf989aca72ad 100644 >>>>> --- a/drivers/md/md.h >>>>> +++ b/drivers/md/md.h >>>>> @@ -281,9 +281,10 @@ enum flag_bits { >>>>> * It is expects that no bad block log >>>>> * is present. >>>>> */ >>>>> - LastDev, /* Seems to be the last working dev as >>>>> - * it didn't fail, so don't use FailFast >>>>> - * any more for metadata >>>>> + FailfastIOFailure, /* A device that failled a metadata write >>>>> + * with failfast. >>>>> + * error_handler must not fail the array >>>>> + * if last device has this flag. >>>>> */ >>>>> CollisionCheck, /* >>>>> * check if there is collision between raid1 >>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >>>>> index 408c26398321..fc7195e58f80 100644 >>>>> --- a/drivers/md/raid1.c >>>>> +++ b/drivers/md/raid1.c >>>>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev) >>>>> * - recovery is interrupted. >>>>> * - &mddev->degraded is bumped. >>>>> * >>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>> - * &mddev->fail_last_dev is off. >>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write >>>>> + * failed in failfast and will be retried, so the @mddev did not fail. >>>>> + * >>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>> */ >>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>> { >>>>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) >>>>> if (test_bit(In_sync, &rdev->flags) && >>>>> (conf->raid_disks - mddev->degraded) == 1) { >>>>> + if (test_bit(FailfastIOFailure, &rdev->flags)) { >>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>> + return; >>>>> + } >>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>> if (!mddev->fail_last_dev) { >>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c >>>>> index b60c30bfb6c7..ff105a0dcd05 100644 >>>>> --- a/drivers/md/raid10.c >>>>> +++ b/drivers/md/raid10.c >>>>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore) >>>>> * - recovery is interrupted. >>>>> * - &mddev->degraded is bumped. >>>>> * >>>>> - * @rdev is marked as &Faulty excluding case when array is failed and >>>>> - * &mddev->fail_last_dev is off. >>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write >>>>> + * failed in failfast, so the @mddev did not fail. >>>>> + * >>>>> + * @rdev is marked as &Faulty excluding any cases: >>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off >>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set >>>>> */ >>>>> static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) >>>>> { >>>>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev) >>>>> spin_lock_irqsave(&conf->device_lock, flags); >>>>> if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) { >>>>> + if (test_bit(FailfastIOFailure, &rdev->flags)) { >>>>> + spin_unlock_irqrestore(&conf->device_lock, flags); >>>>> + return; >>>>> + } >>>>> set_bit(MD_BROKEN, &mddev->flags); >>>>> if (!mddev->fail_last_dev) { >>>>> >>>> . >>>> >>> >> >