Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Sun, 1 Jun 2025 22:38:07 -0700

On Thu, May 29, 2025 at 02:36:30PM +1000, Dave Chinner wrote:
> In these situations writeback could fail for several attempts before
> the storage timed out and came back online. Then the next write
> retry would succeed, and everything would be good. Linux never gave
> us a specific IO error for this case, so we just had to retry on EIO
> and hope that the storage came back eventually.

Linux has had differentiated I/O error codes for quite a while.  But
more importantly dm-multipath doesn't just return errors to the upper
layer during failover, but is instead expected to queue the I/O up
until it either has a working path or an internal timeout has passed.
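
To illustrate what "differentiated" means in practice, here is a minimal
userspace sketch, not kernel code.  It assumes the errno values that the
block layer's status codes are commonly translated to (ENOLINK for
transport, EREMOTEIO for target, ENODATA for medium errors, and so on);
that mapping should be checked against the kernel version at hand.  The
enum and the helper name are hypothetical.

```c
#include <errno.h>

/* Hypothetical classification of a failed write, for illustration only. */
enum io_error_class {
	IOERR_NONE,		/* no error */
	IOERR_TRANSPORT,	/* path/link failure between host and device */
	IOERR_TARGET,		/* the target rejected or failed the command */
	IOERR_MEDIUM,		/* media error on the device */
	IOERR_TIMEOUT,		/* the command timed out */
	IOERR_NOSPC,		/* device or thin pool out of space */
	IOERR_GENERIC,		/* plain EIO or anything unrecognised */
};

/*
 * Map an errno (positive, or negative kernel-style) to a class.
 * The errno-to-class mapping is an assumption, as noted above.
 */
static enum io_error_class classify_write_error(int err)
{
	if (err < 0)
		err = -err;

	switch (err) {
	case 0:
		return IOERR_NONE;
	case ENOLINK:
		return IOERR_TRANSPORT;
	case EREMOTEIO:
		return IOERR_TARGET;
	case ENODATA:
		return IOERR_MEDIUM;
	case ETIMEDOUT:
		return IOERR_TIMEOUT;
	case ENOSPC:
		return IOERR_NOSPC;
	default:
		return IOERR_GENERIC;
	}
}
```

With something like this a caller can tell a transport failure apart from
a generic EIO instead of treating every failed write the same way.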

In other words, write errors in Linux are in general expected to be
persistent, modulo explicit failfast requests like REQ_NOWAIT.

Which also leaves me a bit puzzled what the XFS metadata retries are
actually trying to solve, especially without even having a corresponding
data I/O version.