Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails

Kenta Akagi <k@xxxxxxx> · Thu, 28 Aug 2025 02:31:32 +0900

On 2025/08/23 10:54, Li Nan wrote:
> 
> 
> 在 2025/8/18 1:27, Kenta Akagi 写道:
>> A super_write IO failure with MD_FAILFAST must not cause the array
>> to fail.
>>
>> Because a failfast bio may fail even when the rdev is not broken,
>> so IO must be retried rather than failing the array when a metadata
>> write with MD_FAILFAST fails on the last rdev.
>>
>> A metadata write with MD_FAILFAST is retried after failure as
>> follows:
>>
>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>
>> 2. In md_super_wait, which is called by the function that
>> executed md_super_write and waits for completion,
>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>
>> 3. The caller of md_super_wait (such as md_update_sb)
>> receives a negative return value and then retries md_super_write.
>>
>> 4. The md_super_write function, which is called to perform
>> the same metadata write, issues a write bio without MD_FAILFAST
>> this time.
>>
>> When a write from super_written without MD_FAILFAST fails,
>> the array may broken, and MD_BROKEN should be set.
>>
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> calling md_error on the last rdev in RAID1/10 always sets
>> the MD_BROKEN flag on the array.
>> As a result, when failfast IO fails on the last rdev, the array
>> immediately becomes failed.
>>
>> This commit prevents MD_BROKEN from being set when a super_write with
>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>> not become failed due to failfast IO failures.
>>
>> Failfast IO failures on any rdev except the last one are not retried
>> and are marked as Faulty immediately. This minimizes array IO latency
>> when an rdev fails.
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@xxxxxxx>
> 
> 
> [...]
> 
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>> + * failed in failfast and will be retried, so the @mddev did not fail.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>         if (test_bit(In_sync, &rdev->flags) &&
>>           (conf->raid_disks - mddev->degraded) == 1) {
>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>             if (!mddev->fail_last_dev) {
> 
> At this point, users who try to fail this rdev will get a successful return
> without Faulty flag. Should we consider it?
Hi,
Sorry for the late reply.

I will submit a v3 patch that sets the flag just before md_error and 
modifies raid{1,10}_error to use test_and_clear_bit.
Therefore, echo "faulty" to dev-*/state should always correctly 
mark the device as faulty.

Thanks,
Akagi

>