On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Hello,
>
> Recently, we encountered data loss when using XFS on an HDD with bad
> blocks. After investigation, we determined that the issue was related
> to writeback errors. The details are as follows:
>
> 1. Process-A writes data to a file using buffered I/O and completes
>    without errors.
> 2. However, during the writeback of the dirtied pagecache pages, an
>    I/O error occurs, causing the data to fail to reach the disk.
> 3. Later, the pagecache pages may be reclaimed due to memory pressure,
>    since they are already clean pages.
> 4. When Process-B reads the same file, it retrieves zeroed data from
>    the bad blocks, as the original data was never successfully written
>    (IOMAP_UNWRITTEN).
>
> We reviewed the related discussion [0] and confirmed that this is a
> known writeback error issue. While using fsync() after buffered
> write() could mitigate the problem, this approach is impractical for
> our services.

Really, that's terrible application design. If you aren't checking
that data has been written successfully, then you get to keep all the
broken and/or missing data bits to yourself.

However, with that said, some history.

XFS used to keep pages that had IO errors on writeback dirty so they
would be retried at a later time and couldn't be reclaimed from memory
until they were written. This was historical behaviour from Irix,
designed to handle SAN environments where multipath fail-over could
take several minutes. In these situations writeback could fail for
several attempts before the storage timed out and came back online.
Then the next write retry would succeed, and everything would be good.
Linux never gave us a specific IO error for this case, so we just had
to retry on EIO and hope that the storage came back eventually.

This is different to traditional Linux writeback behaviour, which is
what is implemented now via iomap.
There are good reasons for this model:

- a filesystem with a dirty page that can't be written and cleaned
  cannot be unmounted.
- having large chunks of memory that cannot be cleaned and reclaimed
  has an adverse impact on system performance.
- the system can potentially hang if the page cache is dirtied beyond
  write throttling thresholds and then the device is yanked. Now none
  of the dirty memory can be cleaned, and all new writes are
  throttled....

> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback errors.
> - Option B: Mark the affected file as inaccessible if a writeback error occurs.

Go look at /sys/fs/xfs/<dev>/error/metadata/... and the configurable
error handling behaviour implemented through this interface.

Essentially, XFS metadata behaves as "retry writes forever and hang on
unmount until the write succeeds" by default, i.e. similar to the old
data IO error behaviour. The "hang on unmount" behaviour can be turned
off via /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can configure
different failure handling policies for different types of IO error.
e.g. fail-fast on -ENODEV (the device was unplugged and is never
coming back, so shut the filesystem down), retry-for-a-while on
-ENOSPC (a dm-thinp pool has run out of space, so give it some time to
be expanded before shutting down), retry-once on -EIO (to avoid random
spurious hardware failures shutting down the fs), and everything else
uses the configured default behaviour....

There's also a good reason the sysfs error hierarchy is structured the
way it is - it leaves open the option of expanding the error handling
policies to different IO types (i.e. data and metadata). It even
allows different policies for different types of data devices (e.g. RT
vs data device policies).
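To make the existing metadata-side knobs concrete, a tuning session
might look something like the sketch below. This is illustrative only:
"sda1" is a made-up device name, and the values shown are examples,
not recommendations.

```shell
# Don't hang at unmount waiting for failed metadata writes to succeed
echo 1 > /sys/fs/xfs/sda1/error/fail_at_unmount

# -ENODEV: device is gone and never coming back, so fail immediately
echo 0 > /sys/fs/xfs/sda1/error/metadata/ENODEV/max_retries

# -ENOSPC: give a dm-thinp pool some time to be expanded before failing
echo 300 > /sys/fs/xfs/sda1/error/metadata/ENOSPC/retry_timeout_seconds

# -EIO: retry once to ride out spurious hardware errors
echo 1 > /sys/fs/xfs/sda1/error/metadata/EIO/max_retries

# default class: -1 means "retry forever" (the historical behaviour)
echo -1 > /sys/fs/xfs/sda1/error/metadata/default/max_retries
```

A data-side extension of the hierarchy would presumably grow an
analogous /sys/fs/xfs/<dev>/error/data/... set of knobs.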
So, go look at how the error configuration code in XFS is handled,
consider extending that to /sys/fs/xfs/<dev>/error/data/.... to allow
different error handling policies for different types of data
writeback IO errors. Then you'll need to implement those policies
through the XFS and iomap IO paths...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx