Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

On Mon, Jun 02, 2025 at 09:50:04PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 09:19:10AM +1000, Dave Chinner wrote:
> > > In other words, write errors in Linux are in general expected to be
> > > persistent, modulo explicit failfast requests like REQ_NOWAIT.
> > 
> > Say what? the blk_errors array defines multiple block layer errors
> > that are transient in nature - stuff like ENOSPC, ETIMEDOUT, EILSEQ,
> > ENOLINK, EBUSY - all indicate a transient, retryable error occurred
> > somewhere in the block/storage layers.
> 
> Let's use the block layer codes reported all the way up to the file
> systems and their descriptions instead of the errnos they are
> mapped to for compatibility.  The above would be in order:
> 
> [BLK_STS_NOSPC]         = { -ENOSPC,    "critical space allocation" },
> [BLK_STS_TIMEOUT]       = { -ETIMEDOUT, "timeout" },
> [BLK_STS_PROTECTION]    = { -EILSEQ,    "protection" },
> [BLK_STS_TRANSPORT]     = { -ENOLINK,   "recoverable transport" },
> [BLK_STS_DEV_RESOURCE]  = { -EBUSY,     "device resource" },
> 
> > What is permanent about dm-thinp returning ENOSPC to a write
> > request? Once the pool has been GC'd to free up space or expanded,
> > the ENOSPC error goes away.
> 
> Everything.  ENOSPC means there is no space.  There might be space in
> the non-deterministic future, but if the layer just needs to GC it must
> not report the error.

GC of thin pools requires the filesystem to be mounted so fstrim can
be run to tell the thinp device where all the free LBA regions it
can reclaim are located. If we shut down the filesystem instantly
when the pool goes ENOSPC on a metadata write, then *we can't run
fstrim* to free up unused space and hence allow that metadata write
to succeed in the future.
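
For anyone unfamiliar with the mechanism: fstrim drives that reclaim
via the FITRIM ioctl on the mounted filesystem, roughly like the
userspace sketch below (mount point and range values are purely
illustrative):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int main(void)
{
	/* ask the fs to trim every free extent, no minimum length */
	struct fstrim_range range = {
		.start	= 0,
		.len	= ULLONG_MAX,
		.minlen	= 0,
	};
	int fd = open("/mnt/scratch", O_RDONLY);	/* illustrative path */

	if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		return 1;
	}
	/* on return, range.len holds the number of bytes trimmed */
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}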

It should be obvious at this point that a filesystem shutdown on an
ENOSPC error from the block device on anything other than journal IO
is exactly the wrong thing to be doing.

> > What is permanent about an IO failing with EILSEQ because a t10
> > checksum failed due to a random bit error detected between the HBA
> > and the storage device? Retry the IO, and it goes through just fine
> > without any failures.
> 
> Normally it means your checksum was wrong.  If you have bit errors
> in the cable they will show up again, maybe not on the next I/O
> but soon.

But it's unlikely to be hit by another cosmic ray anytime soon, and
so bit errors caused by completely random environmental events
should -absolutely- be retried, because the subsequent write
attempt will succeed.

If there is a dodgy cable causing the problems, the error will
re-occur on random IOs and we'll emit write errors to the log that
monitoring software will pick up. If we are repeatedly issuing write
errors due to EILSEQ failures, then that's a sign the hardware needs
replacing.

There is no risk to filesystem integrity if write retries
succeed, and that gives the admin time to schedule downtime to
replace the dodgy hardware. That's much better behaviour than
unexpected production system failure in the middle of the night...

It is because we have robust and resilient error handling in the
filesystem that the system is able to operate correctly in these
marginal situations. Operating in marginal conditions or as hardware
is beginning to fail is necessary to keep production systems
running until corrective action can be taken by the administrators.

> > These transient error types typically only need a write retry after
> > some time period to resolve, and that's what XFS does by default.
> > What makes these sorts of errors persistent in the linux block layer
> > and hence requiring an immediate filesystem shutdown and complete
> > denial of service to the storage?
> > 
> > I ask this seriously, because you are effectively saying the linux
> > storage stack now doesn't behave the same as the model we've been
> > using for decades. What has changed, and when did it change?
> 
> Hey, you can retry.  You're unlikely to improve the situation though
> and instead just keep deferring the inevitable shutdown.

Absolutely. That's the whole point - random failures won't repeat,
and hence when they do occur we avoid a shutdown by retrying the
failed IO. This is -exactly- how robust error handling should work.

However, for IO errors that persist or where other IO errors start
to creep in, all the default behaviour is trying to do is hold the
system up in a working state until downtime can be scheduled and the
broken hardware is replaced. If integrity ends up being compromised
by a subsequent IO failure, then we will shut the filesystem down at
that point.

This is about resilience in the face of errors. Not every error is
fatal, nor does every error re-occur. There are classes of errors
known to be transient (ENOSPC), others that are permanent (ENODEV),
and others where we just don't know (EIO). If we value resiliency
and robustness, then the filesystem should be able to withstand
transient and "maybe-transient" IO failures without compromising
integrity.
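
To make that concrete, here's the shape of the decision in
illustrative code. This is not the in-tree XFS error handling, just
a sketch of the classify-then-retry policy being described:

#include <errno.h>
#include <stdbool.h>

/* Illustration only, not the in-tree XFS code: classify a writeback
 * error and decide whether a retry is warranted. */
enum wb_error_class {
	WB_ERR_TRANSIENT,	/* will almost certainly clear on retry */
	WB_ERR_UNKNOWN,		/* might clear, might not - retry a bit */
	WB_ERR_PERMANENT,	/* never coming back - fail now */
};

static enum wb_error_class classify_wb_error(int error)
{
	switch (error) {
	case -ENOSPC:		/* e.g. thinp pool needs a GC or expand */
	case -ETIMEDOUT:
	case -ENOLINK:
	case -EBUSY:
	case -EILSEQ:		/* random bit flip; a bad cable will repeat */
		return WB_ERR_TRANSIENT;
	case -ENODEV:		/* device is gone for good */
		return WB_ERR_PERMANENT;
	default:		/* EIO and anything else: no idea */
		return WB_ERR_UNKNOWN;
	}
}

static bool should_retry_write(int error, unsigned int retries,
			       unsigned int max_unknown_retries)
{
	switch (classify_wb_error(error)) {
	case WB_ERR_TRANSIENT:
		return true;
	case WB_ERR_UNKNOWN:
		return retries < max_unknown_retries;
	default:
		return false;
	}
}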

Failing to recognise that transient and "maybe-transient" errors can
generally be handled cleanly and successfully with future write
retries leads to brittle, fragile systems that fall over at the
first sign of anything going wrong. Filesystems that are targeted
at high value production systems and/or running mission critical
applications need to have resilient and robust error handling.

> > > Which also leaves me a bit puzzled what the XFS metadata retries are
> > > actually trying to solve, especially without even having a corresponding
> > > data I/O version.
> > 
> > It's always been for preventing immediate filesystem shutdown when
> > spurious transient IO errors occur below XFS. Data IO errors don't
> > cause filesystem shutdowns - errors get propagated to the
> > application - so there isn't a full system DOS potential for
> > incorrect classification of data IO errors...
> 
> Except as we see in this thread for a fairly common use case (buffered
> I/O without fsync) they don't.  And I agree with you that this is not
> how you write applications that care about data integrity - but the
> entire rest of the system and just about every common utility is
> written that way.

Yes, I know that. But there are still valid reasons for retrying
failed async data writeback IO when it triggers a spurious or
retryable IO error....

> And even applications that fsync won't see your fancy error code.  The
> only thing stored in the address_space for fsync to catch is EIO and
> ENOSPC.

The filesystem knows exactly what the IO error reported by the block
layer is before we run folio completions, so we control exactly what
we want to report as IO completion status.
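
i.e. the writeback bio completion sees bio->bi_status before any
folio completion runs, so a retryable error never has to be folded
into the mapping at all. Hand-waving sketch, not the actual iomap
code - error_is_retryable() and queue_folio_for_write_retry() are
made-up helpers:

#include <linux/bio.h>
#include <linux/pagemap.h>

/* Sketch of a writeback completion handler: the raw block layer
 * status is in our hands before folio completion runs, so we choose
 * what (if anything) gets recorded in the address_space. */
static void wb_ioend_done(struct bio *bio)
{
	int error = blk_status_to_errno(bio->bi_status);
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		struct folio *folio = fi.folio;

		if (error && error_is_retryable(error)) {
			/* hold the folio and reissue the write later;
			 * this made-up helper owns ending/restarting
			 * writeback on the folio */
			queue_folio_for_write_retry(folio);
			continue;
		}
		if (error)
			mapping_set_error(folio->mapping, error);
		folio_end_writeback(folio);
	}
	bio_put(bio);
}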

Hence the bogosities of error propagation to userspace via the
mapping are completely irrelevant to this discussion/feature because
it would be implemented below the layer that squashes the eventual
IO errno into the address space...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



