Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails

Kenta Akagi <k@xxxxxxx> · Thu, 21 Aug 2025 02:09:32 +0900

On 2025/08/19 0:45, Yu Kuai wrote:
> Hi,
> 
> 在 2025/8/18 20:48, Kenta Akagi 写道:
>> On 2025/08/18 11:48, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2025/08/18 10:05, Yu Kuai 写道:
>>>> Hi,
>>>>
>>>> 在 2025/08/18 1:27, Kenta Akagi 写道:
>>>>> A super_write IO failure with MD_FAILFAST must not cause the array
>>>>> to fail.
>>>>>
>>>>> Because a failfast bio may fail even when the rdev is not broken,
>>>>> so IO must be retried rather than failing the array when a metadata
>>>>> write with MD_FAILFAST fails on the last rdev.
>>>> Why just last rdev? If failfast can fail when the rdev is not broken, I
>>>> feel we should retry for all the rdev.
>> Thank you for reviewing.
>>
>> The reason this retry applies only to the last rdev is that the purpose
>> of failfast is to quickly detach a faulty device and thereby minimize
>> mdev IO latency on rdev failure.
>> If md retries all rdevs, the Faulty handler will no longer act
>> quickly enough, which will always "cause long delays" [1].
>> I believe this is not the behavior users want.
>>
>> [1] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/mdadm.8.in?h=main&id=34f21b7acea8afbea9348d0f421beeeedca7a136#n784
>>
>>> BTW, I couldn't figure out the reason, why failfast is added for the
>>> meta write. I do feel just remove this flag for metadata write will fix
>>> this problem.
>> By issuing metadata writes with failfast in md, it becomes possible to
>> detect rdev failures quickly.
>> Most applications never issue IO with the REQ_FAILFAST flag set,
>> so if md issues its metadata writes without failfast,
>> rdev failures would not be detected quickly.
>> This would undermine the point of the md's failfast feature.
>> And this would also "cause long delays" [1].
>> I believe this is also not what users want.
> 
> Yes, this make sense. But I was thinking failfast will work on normal IO,
> not metadata IO like updating superblock, which doesn't happen quite often
> for user. But consider we have this behavior for such a long time, I agree
> we'd better not change it.

Thank you for reviewing.

Sorry, I forgot to mention that in my environment bitmap is enabled, so 
super_write is called frequently.

Also, what I said earlier was incorrect. md actively attaches MD_FAILFAST 
not only for metadata writes but also for other requests.
Therefore, the current patch is insufficient to achieve the prevent last
rdev fails by failfast.

I will submit a new patch with the fix.

Thanks,
Akagi

> 
>>> Thanks,
>>> Kuai
>>>
>>>>> A metadata write with MD_FAILFAST is retried after failure as
>>>>> follows:
>>>>>
>>>>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>>>>
>>>>> 2. In md_super_wait, which is called by the function that
>>>>> executed md_super_write and waits for completion,
>>>>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>>>>
>>>>> 3. The caller of md_super_wait (such as md_update_sb)
>>>>> receives a negative return value and then retries md_super_write.
>>>>>
>>>>> 4. The md_super_write function, which is called to perform
>>>>> the same metadata write, issues a write bio without MD_FAILFAST
>>>>> this time.
>>>>>
>>>>> When a write from super_written without MD_FAILFAST fails,
>>>>> the array may broken, and MD_BROKEN should be set.
>>>>>
>>>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>>>> calling md_error on the last rdev in RAID1/10 always sets
>>>>> the MD_BROKEN flag on the array.
>>>>> As a result, when failfast IO fails on the last rdev, the array
>>>>> immediately becomes failed.
>>>>>
>>>>> This commit prevents MD_BROKEN from being set when a super_write with
>>>>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>>>>> not become failed due to failfast IO failures.
>>>>>
>>>>> Failfast IO failures on any rdev except the last one are not retried
>>>>> and are marked as Faulty immediately. This minimizes array IO latency
>>>>> when an rdev fails.
>>>>>
>>>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>>>> Signed-off-by: Kenta Akagi <k@xxxxxxx>
>>>>> ---
>>>>>    drivers/md/md.c     |  9 ++++++---
>>>>>    drivers/md/md.h     |  7 ++++---
>>>>>    drivers/md/raid1.c  | 12 ++++++++++--
>>>>>    drivers/md/raid10.c | 12 ++++++++++--
>>>>>    4 files changed, 30 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>>> index ac85ec73a409..61a8188849a3 100644
>>>>> --- a/drivers/md/md.c
>>>>> +++ b/drivers/md/md.c
>>>>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>>>>>        if (bio->bi_status) {
>>>>>            pr_err("md: %s gets error=%d\n", __func__,
>>>>>                   blk_status_to_errno(bio->bi_status));
>>>>> +        if (bio->bi_opf & MD_FAILFAST)
>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>> I think it's better to retry the bio with the flag cleared, then all
>>>> underlying procedures can stay the same.
>> That might be a better approach. I'll check the call hierarchy and lock dependencies.
> 
> You might need to add a new async work to resubmit this bio.
> 
> Thanks,
> Kuai
> 
>> Thanks,
>> Akagi
>>
>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>>>            md_error(mddev, rdev);
>>>>>            if (!test_bit(Faulty, &rdev->flags)
>>>>>                && (bio->bi_opf & MD_FAILFAST)) {
>>>>> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>>>>> +                mdname(mddev), rdev->bdev);
>>>>>                set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>>> -            set_bit(LastDev, &rdev->flags);
>>>>>            }
>>>>>        } else
>>>>> -        clear_bit(LastDev, &rdev->flags);
>>>>> +        clear_bit(FailfastIOFailure, &rdev->flags);
>>>>>        bio_put(bio);
>>>>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>>>>>        if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>>>>>            test_bit(FailFast, &rdev->flags) &&
>>>>> -        !test_bit(LastDev, &rdev->flags))
>>>>> +        !test_bit(FailfastIOFailure, &rdev->flags))
>>>>>            bio->bi_opf |= MD_FAILFAST;
>>>>>        atomic_inc(&mddev->pending_writes);
>>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>>> index 51af29a03079..cf989aca72ad 100644
>>>>> --- a/drivers/md/md.h
>>>>> +++ b/drivers/md/md.h
>>>>> @@ -281,9 +281,10 @@ enum flag_bits {
>>>>>                     * It is expects that no bad block log
>>>>>                     * is present.
>>>>>                     */
>>>>> -    LastDev,        /* Seems to be the last working dev as
>>>>> -                 * it didn't fail, so don't use FailFast
>>>>> -                 * any more for metadata
>>>>> +    FailfastIOFailure,    /* A device that failled a metadata write
>>>>> +                 * with failfast.
>>>>> +                 * error_handler must not fail the array
>>>>> +                 * if last device has this flag.
>>>>>                     */
>>>>>        CollisionCheck,        /*
>>>>>                     * check if there is collision between raid1
>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>> index 408c26398321..fc7195e58f80 100644
>>>>> --- a/drivers/md/raid1.c
>>>>> +++ b/drivers/md/raid1.c
>>>>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>     *    - recovery is interrupted.
>>>>>     *    - &mddev->degraded is bumped.
>>>>>     *
>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>> - * &mddev->fail_last_dev is off.
>>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>>>> + * failed in failfast and will be retried, so the @mddev did not fail.
>>>>> + *
>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>     */
>>>>>    static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>    {
>>>>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>        if (test_bit(In_sync, &rdev->flags) &&
>>>>>            (conf->raid_disks - mddev->degraded) == 1) {
>>>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>> +            return;
>>>>> +        }
>>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>>            if (!mddev->fail_last_dev) {
>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>> index b60c30bfb6c7..ff105a0dcd05 100644
>>>>> --- a/drivers/md/raid10.c
>>>>> +++ b/drivers/md/raid10.c
>>>>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
>>>>>     *    - recovery is interrupted.
>>>>>     *    - &mddev->degraded is bumped.
>>>>>     *
>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>> - * &mddev->fail_last_dev is off.
>>>>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>>>>> + * failed in failfast, so the @mddev did not fail.
>>>>> + *
>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>     */
>>>>>    static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>    {
>>>>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>        spin_lock_irqsave(&conf->device_lock, flags);
>>>>>        if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
>>>>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>> +            return;
>>>>> +        }
>>>>>            set_bit(MD_BROKEN, &mddev->flags);
>>>>>            if (!mddev->fail_last_dev) {
>>>>>
>>>> .
>>>>
>>>
>>
>