On 2025/09/02 1:48, Kenta Akagi wrote:
> Hi,

By the way, I found two more issues. Both can be reproduced by causing a
failfast IO failure using the procedure below. I wrote patches for them.
Would it make sense to send these as v4?

* raid1,raid10: IO error returned even when all writes succeed in
  narrow_write_error

When only one rdev remains, a failfast write bio is retried in
narrow_write_error() via handle_write_completed(). Even if all retries
succeed, R1BIO_Uptodate is not set, so the upper layer gets BLK_STS_IOERR.
This behavior appears to be intentional, under the assumption that there
is at least one bad sector; however, it is undesirable for failfast
retries. In addition, although less likely, even on a normal HDD or SSD,
if all split writes succeed, the upper layer would still receive
BLK_STS_IOERR despite the write having ultimately completed successfully.
Of course, this does not apply if the writes to the other rdev succeed.

I'm not exactly sure how to handle the following points:
- Can a retried write be marked R1BIO_Uptodate after one failure in the
  non-failfast case?
- Should it be marked Uptodate only for failfast?
- Should failfast retries be handled outside narrow_write_error?

For now, I attempt to set R1BIO_Uptodate whenever narrow_write_error()
succeeds and there are no bad blocks, regardless of MD_FAILFAST.

* raid10: no retry scheduled for failfast read failure

raid10_end_read_request() lacks a path to retry when a failfast IO fails.
As a result, when failfast read IOs fail on all rdevs, the upper layer
receives BLK_STS_IOERR without a retry. Looking at the two commits below,
it seems that only raid10_end_read_request lacks the failfast read retry
handling, while raid1_end_read_request has it; in RAID1 the retry works
as expected. I don't know why raid10_end_read_request lacks this, but it
is probably just a simple oversight.

- 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
- 2e52d449bcec ("md/raid1: add failfast handling for reads.")
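For reference, this is the retry branch that raid1_end_read_request()
gained in 2e52d449bcec and that raid10_end_read_request() is missing
(quoted from my tree, so the exact context lines may differ elsewhere):

    } else if (test_bit(FailFast, &rdev->flags) &&
               test_bit(R1BIO_FailFast, &r1_bio->state)) {
        /* This was a fail-fast read so we definitely
         * want to retry */
        ;
    }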
Here's how to reproduce:

# prepare nvmet/nvme-tcp and md array
sh-5.2# cat << 'EOF' > loopback-nvme.sh
> set -eu
> nqn="nqn.2025-08.io.example:nvmet-test-$1"
> back=$2
> cd /sys/kernel/config/nvmet/
> mkdir subsystems/$nqn
> echo 1 > subsystems/${nqn}/attr_allow_any_host
> mkdir subsystems/${nqn}/namespaces/1
> echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
> echo 1 > subsystems/${nqn}/namespaces/1/enable
> ports="ports/1"
> if [ ! -d $ports ]; then
>     mkdir $ports
>     cd $ports
>     echo 127.0.0.1 > addr_traddr
>     echo tcp > addr_trtype
>     echo 4420 > addr_trsvcid
>     echo ipv4 > addr_adrfam
>     cd ../../
> fi
> ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
> nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
> EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
        --failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.
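Before starting fio, it is worth confirming that the failfast flag
actually took effect on both members. A quick sysfs check (output from my
environment; the exact set of flags shown may vary by kernel version):

sh-5.2# cat /sys/block/md0/md/dev-nvme0n1/state /sys/block/md0/md/dev-nvme1n1/state
in_sync,failfast
in_sync,failfast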
# run fio
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
        --bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1

# Reproduce the issue by blocking nvme traffic during fio
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP

The following patches address the two issues:

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..ce4dff63f50a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
 static bool narrow_write_error(struct r1bio *r1_bio, int i)
 {
     struct mddev *mddev = r1_bio->mddev;
@@ -2519,13 +2521,16 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
     sector_t sector;
     int sectors;
     int sect_to_write = r1_bio->sectors;
-    bool ok = true;
+    bool write_ok = true;
+    bool setbad_ok = true;
+    bool bbl_enabled = !(rdev->badblocks.shift < 0);

-    if (rdev->badblocks.shift < 0)
-        return false;
+    if (bbl_enabled)
+        block_sectors = roundup(1 << rdev->badblocks.shift,
+                    bdev_logical_block_size(rdev->bdev) >> 9);
+    else
+        block_sectors = 1;

-    block_sectors = roundup(1 << rdev->badblocks.shift,
-                  bdev_logical_block_size(rdev->bdev) >> 9);
     sector = r1_bio->sector;
     sectors = ((sector + block_sectors) & ~(sector_t)(block_sectors - 1))
@@ -2543,18 +2554,25 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
         bio_trim(wbio, sector - r1_bio->sector, sectors);
         wbio->bi_iter.bi_sector += rdev->data_offset;

-        if (submit_bio_wait(wbio) < 0)
+        if (submit_bio_wait(wbio) < 0) {
             /* failure! */
-            ok = rdev_set_badblocks(rdev, sector,
-                        sectors, 0)
-                && ok;
+            write_ok = false;
+            if (bbl_enabled)
+                setbad_ok = rdev_set_badblocks(rdev, sector,
+                            sectors, 0)
+                    && setbad_ok;
+        }

         bio_put(wbio);
         sect_to_write -= sectors;
         sector += sectors;
         sectors = block_sectors;
     }
-    return ok;
+
+    if (!write_ok &&
+        (!setbad_ok || !bbl_enabled))
+        return false;
+    return true;
 }

 static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
@@ -2585,26 +2603,34 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
     int m, idx;
     bool fail = false;

-    for (m = 0; m < conf->raid_disks * 2 ; m++)
-        if (r1_bio->bios[m] == IO_MADE_GOOD) {
-            struct md_rdev *rdev = conf->mirrors[m].rdev;
+    for (m = 0; m < conf->raid_disks * 2 ; m++) {
+        struct md_rdev *rdev = conf->mirrors[m].rdev;
+        struct bio *bio = r1_bio->bios[m];
+        if (bio == IO_MADE_GOOD) {
             rdev_clear_badblocks(rdev,
                          r1_bio->sector,
                          r1_bio->sectors, 0);
             rdev_dec_pending(rdev, conf->mddev);
-        } else if (r1_bio->bios[m] != NULL) {
+        } else if (bio != NULL) {
             /* This drive got a write error. We need to
              * narrow down and record precise write
              * errors.
              */
             fail = true;
-            if (!narrow_write_error(r1_bio, m))
-                md_error(conf->mddev,
-                     conf->mirrors[m].rdev);
+            if (!narrow_write_error(r1_bio, m)) {
+                md_error(conf->mddev, rdev);
                 /* an I/O failed, we can't clear the bitmap */
-            rdev_dec_pending(conf->mirrors[m].rdev,
-                     conf->mddev);
+            } else if (test_bit(In_sync, &rdev->flags) &&
+                   !test_bit(Faulty, &rdev->flags) &&
+                   rdev_has_badblock(rdev,
+                             r1_bio->sector,
+                             r1_bio->sectors) == 0) {
+                set_bit(R1BIO_Uptodate, &r1_bio->state);
+            }
+            rdev_dec_pending(rdev, conf->mddev);
         }
+    }
     if (fail) {
         spin_lock_irq(&conf->device_lock);
         list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..7145daf1543b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
          * wait for the 'master' bio.
          */
         set_bit(R10BIO_Uptodate, &r10_bio->state);
+    } else if (test_bit(FailFast, &rdev->flags) &&
+           test_bit(R10BIO_FailFast, &r10_bio->state)) {
+        /* This was a fail-fast read so we definitely
+         * want to retry */
+        ;
     } else if (!raid1_should_handle_error(bio)) {
         uptodate = 1;
     } else {
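To summarize my reading of the narrow_write_error() change above (this
table is just for review convenience, not part of the patch):

write_ok | bbl_enabled | setbad_ok | return value
---------+-------------+-----------+-------------
true     | any         | any       | true  (all retried writes succeeded)
false    | true        | true      | true  (failures recorded as badblocks)
false    | true        | false     | false (recording a badblock failed)
false    | false       | n/a       | false (no badblock log; rdev must fail)

Only the first row can lead handle_write_finished() to set
R1BIO_Uptodate, since in the second row rdev_has_badblock() will report
bad blocks in the affected range.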
>
> On 2025/09/01 16:48, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/09/01 12:22, Kenta Akagi wrote:
>>>
>>>
>>> On 2025/09/01 12:22, Li Nan wrote:
>>>>
>>>>
>>>> On 2025/8/31 2:10, Kenta Akagi wrote:
>>>>>
>>>>>
>>>>> On 2025/08/30 17:48, Li Nan wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>>>>
>>>>>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>>>>> the array into a broken state.
>>>>>>>>>
>>>>>>>>> When failfast is enabled on an rdev in RAID1 or RAID10,
>>>>>>>>> the array may be flagged MD_BROKEN in the following cases:
>>>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>>>>> --- a/drivers/md/raid1.c
>>>>>>>>> +++ b/drivers/md/raid1.c
>>>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>>>>          (bio->bi_opf & MD_FAILFAST) &&
>>>>>>>>>          /* We never try FailFast to WriteMostly devices */
>>>>>>>>>          !test_bit(WriteMostly, &rdev->flags)) {
>>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>          md_error(r1_bio->mddev, rdev);
>>>>>>>>>      }
>>>>>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>>>>   * - recovery is interrupted.
>>>>>>>>>   * - &mddev->degraded is bumped.
>>>>>>>>>   *
>>>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>>>>> - * &mddev->fail_last_dev is off.
>>>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>>>>> + *
>>>>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>>>>> + * - when @rdev is the last device and the &FailfastIOFailure flag is set
>>>>>>>>>   */
>>>>>>>>>  static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>  {
>>>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>      if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>>>          (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>>>>> +                        "last device but ignoring it\n",
>>>>>>>>> +                        mdname(mddev), rdev->bdev);
>>>>>>>>> +            return;
>>>>>>>>> +        }
>>>>>>>>>          set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>>>
>>>>>>>>>          if (!mddev->fail_last_dev) {
>>>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>>>>      if (test_bit(FailFast, &rdev->flags)) {
>>>>>>>>>          /* Don't try recovering from here - just fail it
>>>>>>>>>           * ... unless it is the last working device of course */
>>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>          md_error(mddev, rdev);
>>>>>>>>>          if (test_bit(Faulty, &rdev->flags))
>>>>>>>>>              /* Don't try to read from here, but make sure
>>>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>>>>          fix_read_error(conf, r1_bio);
>>>>>>>>>          unfreeze_array(conf);
>>>>>>>>>      } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>          md_error(mddev, rdev);
>>>>>>>>>      } else {
>>>>>>>>>          r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>>>>> --- a/drivers/md/raid10.c
>>>>>>>>> +++ b/drivers/md/raid10.c
>>>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>>>>          dec_rdev = 0;
>>>>>>>>>          if (test_bit(FailFast, &rdev->flags) &&
>>>>>>>>>              (bio->bi_opf & MD_FAILFAST)) {
>>>>>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>              md_error(rdev->mddev, rdev);
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>>>>
>>>>>>>> If two write IOs go to the same rdev, MD_BROKEN may be set as below:
>>>>>>>>
>>>>>>>> IO1                              IO2
>>>>>>>> set FailfastIOFailure
>>>>>>>>                                  set FailfastIOFailure
>>>>>>>> md_error
>>>>>>>>  raid1_error
>>>>>>>>   test_and_clear FailfastIOFailure
>>>>>>>>                                  md_error
>>>>>>>>                                   raid1_error
>>>>>>>>                                   //FailfastIOFailure is cleared
>>>>>>>>                                   set MD_BROKEN
>>>>>>>>
>>>>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>>>>> setting it. It also needs to be considered in metadata writes.
>>>>>>>
>>>>>>> Thank you for reviewing.
>>>>>>>
>>>>>>> I agree, this seems to be as you described.
>>>>>>> So, should it be implemented as follows?
>>>>>>>
>>>>>>> bool old = false;
>>>>>>> do {
>>>>>>>     spin_lock_irqsave(&conf->device_lock, flags);
>>>>>>>     old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>     spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>> } while (old);
>>>>>>>
>>>>>>> However, since I am concerned about potential deadlocks,
>>>>>>> I am considering two alternative approaches:
>>>>>>>
>>>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>>>>>>
>>>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>>>>> error handling of failfast and non-failfast IOs appears to be rare.
>>>>>>> In any case, the final outcome would be the same, i.e. the array
>>>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>>>>> issues. I think the same applies to test_and_set_bit.
>>>>>>>
>>>>>>> IO1                    IO2                  IO3
>>>>>>> FailfastIOFailure      Normal IO failure    FailfastIOFailure
>>>>>>> atomic_inc
>>>>>>> md_error                                    atomic_inc
>>>>>>>  raid1_error
>>>>>>>   atomic_dec //2to1
>>>>>>>                        md_error
>>>>>>>                         raid1_error         md_error
>>>>>>>                          atomic_dec //1to0   raid1_error
>>>>>>>                                               atomic_dec //0
>>>>>>>                                               set MD_BROKEN
>>>>>>>
>>>>>>> * Alternatively, create a separate error handler,
>>>>>>>   e.g. md_error_failfast(), that clearly does not fail the array.
>>>>>>>
>>>>>>> This approach would require somewhat larger changes and may not
>>>>>>> be very elegant, but it seems to be a reliable way to ensure
>>>>>>> MD_BROKEN is never set at the wrong timing.
>>>>>>>
>>>>>>> Which of these three approaches would you consider preferable?
>>>>>>> I would appreciate your feedback.
>>>>>>>
>>>>>>> For metadata writes, I plan to clear MD_FAILFAST_SUPPORTED
>>>>>>> when the array is degraded.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Akagi
>>>>>>>
>>>>>>
>>>>>> I took a closer look at the FailFast code and found a few issues, using
>>>>>> RAID1 as an example:
>>>>>>
>>>>>> For normal read/write IO, FailFast is only triggered when another
>>>>>> disk is available, as seen in read_balance() and raid1_write_request().
>>>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>>>>
>>>>> Hi,
>>>>> Agree, I think so too.
>>>>>
>>>>>> So the FailFast path for normal read/write is not triggered in the
>>>>>> scenario you described in the cover letter.
>>>>>
>>>>> This corresponds to the case described in the commit message of PATCH v3 1/3:
>>>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>>>>> an edge case; however, it can occur if rdevs in a non-degraded
>>>>> array share the same path and that path is lost, or if a metadata
>>>>> write is triggered in a degraded array and fails due to failfast."
>>>>>
>>>>> To describe it in more detail, the flow is as follows:
>>>>>
>>>>> Prerequisites:
>>>>>
>>>>> - Both rdevs are in-sync
>>>>> - Both rdevs have in-flight MD_FAILFAST bios
>>>>> - Both rdevs depend on the same lower-level path
>>>>>   (e.g., nvme-tcp over a single Ethernet interface)
>>>>>
>>>>> Sequence:
>>>>>
>>>>> - A bio with REQ_FAILFAST_DEV fails (e.g., due to a temporary network
>>>>>   outage); in the case of nvme-tcp:
>>>>>   - The Ethernet connection is lost on the node where md is running for
>>>>>     over 5 seconds
>>>>>   - Then the connection is restored. I don't know the details of the
>>>>>     nvme-tcp implementation, but it seems that failfast IOs finish only
>>>>>     after the connection is back.
>>>>> - All failfast bios fail, and raid1_end_write_request is called.
>>>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>>>>> - md_error() on the last rdev sets MD_BROKEN on the array (fail_last_dev=1
>>>>>   is unlikely).
>>>>> - The write is retried via handle_write_finished -> narrow_write_error,
>>>>>   usually succeeding.
>>>>> - MD_BROKEN remains set, leaving the array in a state where no further
>>>>>   writes can occur.
>>>>>
>>>>
>>>> Thanks for your patient explanation. I understand. Maybe we need a separate
>>>> error-handling path for failfast. How about adding an extra parameter to
>>>> md_error()?
>>>
>>> Thank you for reviewing.
>>>
>>> I am thinking of proceeding as follows.
>>> md_error is an EXPORT_SYMBOL. I think it is undesirable to change the
>>> ABI of this function.
>>>
>>
>> It doesn't matter that it's an exported symbol; we should just keep the
>> code as simple as possible.
>>
>>> ...
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index ac85ec73a409..855cddeb0c09 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);
>>>
>>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
>>>  {
>>> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>          return;
>>> -    mddev->pers->error_handler(mddev, rdev);
>>> +    mddev->pers->error_handler(mddev, rdev, nofail);
>>>
>>> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>  }
>>> +
>>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +{
>>> +    return _md_error(mddev, rdev, false);
>>> +}
>>>  EXPORT_SYMBOL(md_error);
>>>
>>> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
>>> +{
>>> +    WARN_ON(mddev->pers->head.id != ID_RAID1 &&
>>> +            mddev->pers->head.id != ID_RAID10);
>>> +    return _md_error(mddev, rdev, true);
>>> +}
>>> +EXPORT_SYMBOL(md_error_failfast);
>>> +
>>
>> I would prefer we add a common procedure to fix this problem.
>>
>> How about a first patch to serialize all the md_error() calls, and then
>> add a new helper md_bio_failure_error(mddev, rdev, bio), called when a
>> bio fails. In this helper:
>>
>> 1) if the bio is not failfast, call md_error() and return true; otherwise:
>> 2) if the rdev contains the last data copy, return false directly, and the
>>    caller should check the return value and retry; otherwise:
>> 3) call md_error() and return true;
>
> Hi,
> I think this approach has some issues. There are cases where md_error is
> called only when MD_FAILFAST is set.
>
> One example is the processing below in raid1_end_write_request.
>
>> Then, for raid1, the callers will look like:
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 1baaf52c603c..c6d150e9f1a7 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio)
>>      if (bio->bi_status) {
>>          pr_err("md: %s gets error=%d\n", __func__,
>>                 blk_status_to_errno(bio->bi_status));
>> -        md_error(mddev, rdev);
>> -        if (!test_bit(Faulty, &rdev->flags)
>> -            && (bio->bi_opf & MD_FAILFAST)) {
>> +        if (!md_bio_failure_error(mddev, rdev, bio)) {
>>              set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>              set_bit(LastDev, &rdev->flags);
>>          }
>>
>> @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio)
>>                  set_bit(MD_RECOVERY_NEEDED, &
>>                          conf->mddev->recovery);
>>
>> -        if (test_bit(FailFast, &rdev->flags) &&
>> -            (bio->bi_opf & MD_FAILFAST) &&
>>              /* We never try FailFast to WriteMostly devices */
>> -            !test_bit(WriteMostly, &rdev->flags)) {
>> -            md_error(r1_bio->mddev, rdev);
>> -        }
>> -
>> -        /*
>> -         * When the device is faulty, it is not necessary to
>> -         * handle write error.
>> -         */
>> -        if (!test_bit(Faulty, &rdev->flags))
>> +        if (!test_bit(WriteMostly, &rdev->flags) &&
>> +            !md_bio_failure_error(mddev, rdev, bio)) {
>>              set_bit(R1BIO_WriteError, &r1_bio->state);
>> -        else {
>> +        } else {
>>              /* Finished with this branch */
>>              r1_bio->bios[mirror] = NULL;
>>              to_put = bio;
>
> In the current raid1_end_write_request implementation:
> - md_error is called only in the Failfast case.
> - Afterwards, if the rdev is not Faulty (that is, not Failfast, or
>   Failfast but the last rdev, which originally was not expected to set
>   MD_BROKEN in RAID1), R1BIO_WriteError is set.
>
> In the suggested implementation, it seems that a non-Failfast write
> failure will immediately mark the rdev as Faulty, without retries.
>
> This could be avoided by testing MD_FAILFAST before calling the
> new helper md_bio_failure_error, but I believe duplicating the
> same check in both caller and callee would be undesirable.
>
> Should we try to avoid modifying pers->error_handler?
> One possible alternative approach is as follows.
>
> - Serialize calls to md_error regardless of whether the IO is Failfast
>   or not.
> - In raid{1,10}_error:
>   - the remaining copy (rdev) is marked with the LastDev flag
>   - MD_FAILFAST_SUPPORTED is cleared to prohibit super_write from using
>     Failfast
> - super_written will simply set MD_SB_NEED_REWRITE without calling
>   md_error when the bio is MD_FAILFAST and the rdev is LastDev.
>
> After these changes, I believe it is rare for super_written to be called
> with errors on multiple rdevs due to failfast. super_write is caused by
> errors from normal failfast IO and invoked via MD_SB_CHANGE_DEVS through
> the serialized raid1_error. Since MD_FAILFAST_SUPPORTED is cleared,
> metadata writes occur without failfast.
>
> It's not exactly a common procedure, but as it doesn't add functions to
> md.c, I think this approach is preferable to adding md_error_failfast().
>
> ...
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1baaf52c603c..ba524fa96091 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio)
>      if (bio->bi_status) {
>          pr_err("md: %s gets error=%d\n", __func__,
>                 blk_status_to_errno(bio->bi_status));
> -        md_error(mddev, rdev);
> -        if (!test_bit(Faulty, &rdev->flags)
> +        if (test_bit(LastDev, &rdev->flags)
>              && (bio->bi_opf & MD_FAILFAST)) {
> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
> +                    mdname(mddev), rdev->bdev);
>              set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> -            set_bit(LastDev, &rdev->flags);
> +        } else {
> +            md_error(mddev, rdev);
>          }
> -    } else
> -        clear_bit(LastDev, &rdev->flags);
> +    }
>
>      bio_put(bio);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..a52c5277add7 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>          (bio->bi_opf & MD_FAILFAST) &&
>          /* We never try FailFast to WriteMostly devices */
>          !test_bit(WriteMostly, &rdev->flags)) {
> -        md_error(r1_bio->mddev, rdev);
> +        raid1_md_error_failfast(r1_bio->mddev, rdev);
>      }
>
> @@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>      seq_printf(seq, "]");
>  }
>
> +static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev,
> +                            bool failfast)
> +{
> +    struct r1conf *conf = mddev->private;
> +    unsigned long flags;
> +
> +    spin_lock_irqsave(&conf->new_lock_for_md_error, flags);
> +    if (failfast)
> +        set_bit(FailfastIOFailure, &rdev->flags);
> +    md_error(mddev, rdev);
> +    if (failfast)
> +        WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags));
> +    spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags);
> +}
> +
> +static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +    return _raid1_md_error(mddev, rdev, false);
> +}
> +
> +static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +    return _raid1_md_error(mddev, rdev, true);
> +}
> +
> /**
>  * raid1_error() - RAID1 error handler.
>  * @mddev: affected md device.
> @@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>
>      if (test_bit(In_sync, &rdev->flags) &&
>          (conf->raid_disks - mddev->degraded) == 1) {
> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
> +            spin_unlock_irqrestore(&conf->device_lock, flags);
> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
> +                                "last device but ignoring it\n",
> +                                mdname(mddev), rdev->bdev);
> +            return;
> +        }
>          set_bit(MD_BROKEN, &mddev->flags);
>
>          if (!mddev->fail_last_dev) {
> @@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>          }
>      }
>      set_bit(Blocked, &rdev->flags);
> -    if (test_and_clear_bit(In_sync, &rdev->flags))
> +    if (test_and_clear_bit(In_sync, &rdev->flags)) {
>          mddev->degraded++;
> +        clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> +        for (i = 0; i < conf->raid_disks; i++) {
> +            struct md_rdev *rdev2 = conf->mirrors[i].rdev;
> +            if (rdev2 && rdev != rdev2 &&
> +                test_bit(In_sync, &rdev2->flags))
> +                set_bit(LastDev, &rdev2->flags);
> +        }
> +    }
>      set_bit(Faulty, &rdev->flags);
>      spin_unlock_irqrestore(&conf->device_lock, flags);
>      /*
> @@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
>      }
>      /* need to record an error - either for the block or the device */
>      if (!rdev_set_badblocks(rdev, sector, sectors, 0))
> -        md_error(rdev->mddev, rdev);
> +        raid1_md_error(rdev->mddev, rdev);
>      return 0;
>  }
> ...
>
> Thanks,
> Akagi
>
>> @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>       */
>>
>>      bio = r1_bio->bios[r1_bio->read_disk];
>> -    bio_put(bio);
>>      r1_bio->bios[r1_bio->read_disk] = NULL;
>>
>>      rdev = conf->mirrors[r1_bio->read_disk].rdev;
>> @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>          freeze_array(conf, 1);
>>          fix_read_error(conf, r1_bio);
>>          unfreeze_array(conf);
>> -    } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>> -        md_error(mddev, rdev);
>> -    } else {
>> +    } else if (mddev->ro == 0 &&
>> +               !md_bio_failure_error(mddev, rdev, bio)) {
>>          r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>      }
>>
>> +    bio_put(bio);
>>      rdev_dec_pending(rdev, conf->mddev);
>>      sector = r1_bio->sector;
>> -    bio = r1_bio->master_bio;
>>
>>      /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
>>      r1_bio->state = 0;
>> -    raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
>> +    raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
>>      allow_barrier(conf, sector);
>>  }
>>
>>
>>> /* seq_file implementation /proc/mdstat */
>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>> index 51af29a03079..6ca1aea630ce 100644
>>> --- a/drivers/md/md.h
>>> +++ b/drivers/md/md.h
>>> @@ -758,3 +758,3 @@ struct md_personality
>>>       */
>>> -    void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
>>> +    void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>      int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
>>> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
>>>  extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>  extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>>> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
>>>  extern void md_finish_reshape(struct mddev *mddev);
>>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>>> index f1d8811a542a..8aea51227a96 100644
>>> --- a/drivers/md/raid0.c
>>> +++ b/drivers/md/raid0.c
>>> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>>>
>>> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
>>> +                        bool nofail __maybe_unused)
>>>  {
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 408c26398321..d93275899e9e 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>   * @rdev: member device to fail.
>>> + * @nofail: @mddev and @rdev must not be failed even if @rdev is the last device
>>>   *
>>> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>   *
>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>> - * &mddev->fail_last_dev is off.
>>> + * @rdev is marked as &Faulty excluding any cases:
>>> + * - when @mddev is failed and &mddev->fail_last_dev is off
>>> + * - when @rdev is the last device and @nofail is true
>>>   */
>>> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
>>> +                        bool nofail)
>>>  {
>>> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>          (conf->raid_disks - mddev->degraded) == 1) {
>>> +        if (nofail) {
>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>> +            pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
>>> +                                "last device but ignoring it\n",
>>> +                                mdname(mddev), rdev->bdev);
>>> +            return;
>>> +        }
>>>          set_bit(MD_BROKEN, &mddev->flags);
>>> ...
>>>
>>>> Kuai, do you have any suggestions?
>>>>
>>>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>>>>> not execute md_error() on the last disk.
>>>>>>
>>>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>>>>> normal read/write IO.
>>>>>
>>>>> Should I add the above sequence of events to the cover letter, or the
>>>>> commit message?
>>>>
>>>> I think we should mention this in the commit message.
>>>
>>> Understood. I will explicitly describe this in the commit message in v4.
>>>
>>> Thanks,
>>> Akagi
>>>
>>>>> Thanks,
>>>>> Akagi
>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Nan
>>>>
>>>> --
>>>> Thanks,
>>>> Nan