[QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

Yafang Shao <laoar.shao@xxxxxxxxx> · Thu, 29 May 2025 10:50:01 +0800

Hello,

Recently, we encountered data loss when using XFS on an HDD with bad
blocks. After investigation, we determined that the issue was related
to writeback errors. The details are as follows:

1. Process-A writes data to a file using buffered I/O, and write()
returns without error.
2. However, during the writeback of the dirtied pagecache pages, an
I/O error occurs, causing the data to fail to reach the disk.
3. Later, the pagecache pages may be reclaimed due to memory pressure,
since they are already clean pages.
4. When Process-B reads the same file, it gets zeroed data for the
range backed by the bad blocks, as the original data was never
successfully written (IOMAP_UNWRITTEN).

We reviewed the related discussion [0] and confirmed that this is a
known writeback error issue. While using fsync() after buffered
write() could mitigate the problem, this approach is impractical for
our services.
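
For reference, the fsync()-based mitigation mentioned above looks
roughly like the sketch below. This is only an illustration, not code
from our services: write_and_sync() is a hypothetical helper name, and
the sketch assumes the application can afford a synchronous flush after
each write, which is exactly the cost we cannot accept.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path, then fsync().
 * Returns 0 on success or a negative errno value on failure.
 */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    size_t left = len;

    /* Buffered write: success here only means the data is in pagecache. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        left -= (size_t)n;
    }

    /* A writeback failure (e.g. EIO) is reported here, not by write(). */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

A caller would treat any non-zero return as "data may not be on disk"
and retry or raise an alert, instead of trusting the earlier write().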

Instead, we propose introducing configurable options to notify users
of writeback errors immediately and prevent further operations on
affected files or disks. Possible solutions include:

- Option A: Immediately shut down the filesystem upon writeback errors.
- Option B: Mark the affected file as inaccessible if a writeback error occurs.

These options could be controlled via mount options or sysfs
configurations. Both solutions would be preferable to silently
returning corrupted data, as they ensure users are aware of disk
issues and can take corrective action.

Any suggestions?

[0] https://lwn.net/Articles/724307/

-- 
Regards
Yafang