On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Hello,
>
> Recently, we encountered data loss when using XFS on an HDD with bad
> blocks. After investigation, we determined that the issue was related
> to writeback errors. The details are as follows:
>
> 1. Process-A writes data to a file using buffered I/O and completes
>    without errors.
> 2. However, during the writeback of the dirtied pagecache pages, an
>    I/O error occurs, causing the data to fail to reach the disk.
> 3. Later, the pagecache pages may be reclaimed due to memory pressure,
>    since they are already clean pages.
> 4. When Process-B reads the same file, it retrieves zeroed data from
>    the bad blocks, as the original data was never successfully written
>    (IOMAP_UNWRITTEN).
>
> We reviewed the related discussion [0] and confirmed that this is a
> known writeback error issue. While using fsync() after buffered
> write() could mitigate the problem, this approach is impractical for
> our services.

Really, that's terrible application design. If you aren't checking
that data has been written successfully, then you get to keep all the
broken and/or missing data bits to yourself.

However, with that said, some history.

XFS used to keep pages that had IO errors on writeback dirty so they
would be retried at a later time and couldn't be reclaimed from memory
until they were written. This was historical behaviour from Irix,
designed to handle SAN environments where multipath fail-over could
take several minutes. In these situations writeback could fail for
several attempts before the storage timed out and came back online.
Then the next write retry would succeed, and everything would be good.
Linux never gave us a specific IO error for this case, so we just had
to retry on EIO and hope that the storage came back eventually.

This is different to traditional Linux writeback behaviour, which is
what is implemented now via iomap.
There are good reasons for this model:

- a filesystem with a dirty page that can't be written and cleaned
  cannot be unmounted.
- having large chunks of memory that cannot be cleaned and reclaimed
  has an adverse impact on system performance.
- the system can potentially hang if the page cache is dirtied beyond
  write throttling thresholds and then the device is yanked. Now none
  of the dirty memory can be cleaned, and all new writes are
  throttled....

> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback errors.
> - Option B: Mark the affected file as inaccessible if a writeback error occurs.

Go look at /sys/fs/xfs/<dev>/error/metadata/... and the configurable
error handling behaviour implemented through this interface.

Essentially, XFS metadata behaves as "retry writes forever and hang on
unmount until the write succeeds" by default, i.e. similar to the old
data IO error behaviour. The "hang on unmount" behaviour can be turned
off via /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can configure
different failure handling policies for different types of IO error.
e.g. fail-fast on -ENODEV (the device was unplugged and is never
coming back, so shut the filesystem down), retry-for-a-while on
-ENOSPC (a dm-thinp pool has run out of space, so give it some time to
be expanded before shutting down), retry-once on -EIO (to avoid random
spurious hardware failures shutting down the fs), and everything else
uses the configured default behaviour....

There's also a good reason the sysfs error hierarchy is structured the
way it is - it leaves open the option of expanding the error handling
policies to different IO types (i.e. data and metadata). It even
allows different policies for different types of data devices (e.g. RT
vs data device policies).
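To make the existing metadata-side knobs concrete, a tuning session
might look something like the sketch below. This is illustrative only:
"sda1" is a made-up device name, and the values shown are examples,
not recommendations.

```shell
# Don't hang at unmount waiting for failed metadata writes to succeed
echo 1 > /sys/fs/xfs/sda1/error/fail_at_unmount

# -ENODEV: device is gone and never coming back, so fail immediately
echo 0 > /sys/fs/xfs/sda1/error/metadata/ENODEV/max_retries

# -ENOSPC: give a dm-thinp pool some time to be expanded before failing
echo 300 > /sys/fs/xfs/sda1/error/metadata/ENOSPC/retry_timeout_seconds

# -EIO: retry once to ride out spurious hardware errors
echo 1 > /sys/fs/xfs/sda1/error/metadata/EIO/max_retries

# default class: -1 means "retry forever" (the historical behaviour)
echo -1 > /sys/fs/xfs/sda1/error/metadata/default/max_retries
```

A data-side extension of the hierarchy would presumably grow an
analogous /sys/fs/xfs/<dev>/error/data/... set of knobs.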
So, go look at how the error configuration code in XFS is handled,
consider extending that to /sys/fs/xfs/<dev>/error/data/.... to allow
different error handling policies for different types of data
writeback IO errors. Then you'll need to implement those policies
through the XFS and iomap IO paths...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx