On Thu, May 29, 2025 at 12:36 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > Hello,
> >
> > Recently, we encountered data loss when using XFS on an HDD with bad
> > blocks. After investigation, we determined that the issue was related
> > to writeback errors. The details are as follows:
> >
> > 1. Process-A writes data to a file using buffered I/O and completes
> >    without errors.
> > 2. However, during the writeback of the dirtied pagecache pages, an
> >    I/O error occurs, causing the data to fail to reach the disk.
> > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> >    since they are already clean pages.
> > 4. When Process-B reads the same file, it retrieves zeroed data from
> >    the bad blocks, as the original data was never successfully written
> >    (IOMAP_UNWRITTEN).
> >
> > We reviewed the related discussion [0] and confirmed that this is a
> > known writeback error issue. While using fsync() after buffered
> > write() could mitigate the problem, this approach is impractical for
> > our services.
>
> Really, that's terrible application design. If you aren't checking
> that data has been written successfully, then you get to keep all
> the broken and/or missing data bits to yourself.

It’s difficult to justify this.

> However, with that said, some history.
>
> XFS used to keep pages that had IO errors on writeback dirty so they
> would be retried at a later time and couldn't be reclaimed from
> memory until they were written. This was historical behaviour from
> Irix and designed to handle SAN environments where multipath
> fail-over could take several minutes.
>
> In these situations writeback could fail for several attempts before
> the storage timed out and came back online. Then the next write
> retry would succeed, and everything would be good. Linux never gave
> us a specific IO error for this case, so we just had to retry on EIO
> and hope that the storage came back eventually.
>
> This is different to traditional Linux writeback behaviour, which is
> what is implemented now via iomap. There are good reasons for this
> model:
>
> - a filesystem with a dirty page that can't be written and cleaned
>   cannot be unmounted.
>
> - having large chunks of memory that cannot be cleaned and
>   reclaimed has adverse impact on system performance
>
> - the system can potentially hang if the page cache is dirtied
>   beyond write throttling thresholds and then the device is yanked.
>   Now none of the dirty memory can be cleaned, and all new writes
>   are throttled....

I previously considered whether we could avoid clearing PG_writeback
for these pages. To handle pagecache pages whose writeback failed more
safely, we could keep PG_writeback set on them and introduce a new
PG_write_error flag. This would explicitly mark pages that failed disk
writes, allowing the reclaim mechanism to skip them and avoid potential
deadlocks.

>
> > Instead, we propose introducing configurable options to notify users
> > of writeback errors immediately and prevent further operations on
> > affected files or disks. Possible solutions include:
> >
> > - Option A: Immediately shut down the filesystem upon writeback errors.
> > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
>
> Go look at /sys/fs/xfs/<dev>/error/metadata/... and the configurable
> error handling behaviour implemented through this interface.
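
Thanks, I'll study that interface. To make sure I'm looking at the
right knobs, below is a quick sketch that dumps the current metadata
error configuration for one filesystem. The paths and attribute names
(fail_at_unmount, plus per-class max_retries and retry_timeout_seconds
under error/metadata/) are taken from my reading of
Documentation/admin-guide/xfs.rst, so please treat them as my
assumptions rather than a definitive list:

/*
 * Minimal sketch: dump the existing XFS metadata error configuration
 * for a single filesystem. Paths follow my reading of
 * Documentation/admin-guide/xfs.rst; <dev> is the block device name
 * as it appears under /sys/fs/xfs/ (e.g. "sda1" or "dm-0").
 */
#include <stdio.h>
#include <string.h>

static void show(const char *path)
{
        char buf[64];
        FILE *f = fopen(path, "r");

        if (!f) {
                printf("%-64s <not present>\n", path);
                return;
        }
        if (fgets(buf, sizeof(buf), f)) {
                buf[strcspn(buf, "\n")] = '\0';
                printf("%-64s %s\n", path, buf);
        }
        fclose(f);
}

int main(int argc, char **argv)
{
        static const char *classes[] = { "default", "EIO", "ENOSPC", "ENODEV" };
        static const char *knobs[] = { "max_retries", "retry_timeout_seconds" };
        const char *dev = argc > 1 ? argv[1] : "sda1";  /* assumed device name */
        char path[256];
        size_t i, j;

        snprintf(path, sizeof(path), "/sys/fs/xfs/%s/error/fail_at_unmount", dev);
        show(path);

        for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
                for (j = 0; j < sizeof(knobs) / sizeof(knobs[0]); j++) {
                        snprintf(path, sizeof(path),
                                 "/sys/fs/xfs/%s/error/metadata/%s/%s",
                                 dev, classes[i], knobs[j]);
                        show(path);
                }
        }
        return 0;
}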
>
> Essentially, XFS metadata behaves as "retry writes forever and hang on
> unmount until write succeeds" by default. i.e. similar to the old
> data IO error behaviour. The "hang on unmount" behaviour can be
> turned off by /sys/fs/xfs/<dev>/error/fail_at_unmount, and we can
> configure different failure handling policies for different types
> of IO error. e.g. fail-fast on -ENODEV (e.g. device was unplugged
> and is never coming back so shut the filesystem down),
> retry-for-a-while on -ENOSPC (e.g. dm-thinp pool has run out of space,
> so give some time for the pool to be expanded before shutting down)
> and retry-once on -EIO (to avoid random spurious hardware failures
> from shutting down the fs) and everything else uses the configured
> default behaviour....

Thank you for your clear guidance and detailed explanation.

> There's also a good reason the sysfs error hierarchy is structured the
> way it is - it leaves open the option for expanding the error
> handling policies to different IO types (i.e. data and metadata). It
> even allows different policies for different types of data devices
> (e.g. RT vs data device policies).
>
> So, go look at how the error configuration code in XFS is handled,
> consider extending that to /sys/fs/xfs/<dev>/error/data/.... to
> allow different error handling policies for different types of
> data writeback IO errors.

That aligns perfectly with our expectations.

> Then you'll need to implement those policies through the XFS and
> iomap IO paths...

I will analyze how to implement this effectively.
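
One more note, coming back to the fsync() discussion at the top of the
thread: even though a blanket fsync() is impractical for most of our
services, the mitigation being discussed is essentially the
write-and-verify pattern sketched below (plain POSIX calls, a
hypothetical target path, error handling trimmed). It also illustrates
why a writer that never calls fsync() has no chance of seeing the
writeback failure:

/*
 * Minimal sketch of the write-and-verify pattern: buffered write(),
 * then fsync() so that any writeback error is reported to the writer
 * instead of being dropped once the clean pages are reclaimed.
 * The target path is just an example.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_durably(const char *path, const char *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        ssize_t ret;

        if (fd < 0)
                return -errno;

        while (len > 0) {
                ret = write(fd, buf, len);
                if (ret < 0) {
                        int err = errno;

                        close(fd);
                        return -err;
                }
                buf += ret;
                len -= ret;
        }

        /*
         * Without this step the data only sits in the page cache; a
         * later writeback failure is never reported to this process,
         * which is exactly the scenario described in the first mail.
         */
        if (fsync(fd) < 0) {
                int err = errno;

                close(fd);
                return -err;
        }

        return close(fd) < 0 ? -errno : 0;
}

int main(void)
{
        const char msg[] = "important data\n";
        int err = write_durably("/mnt/xfs/testfile", msg, sizeof(msg) - 1);

        if (err)
                fprintf(stderr, "write_durably: %s\n", strerror(-err));
        return err ? 1 : 0;
}

--
Regards
Yafang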