On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> Hello,
>
> Recently, we encountered data loss when using XFS on an HDD with bad
> blocks. After investigation, we determined that the issue was related
> to writeback errors. The details are as follows:
>
> 1. Process-A writes data to a file using buffered I/O and completes
> without errors.
> 2. However, during the writeback of the dirtied pagecache pages, an
> I/O error occurs, causing the data to fail to reach the disk.
> 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> since they are already clean pages.
> 4. When Process-B reads the same file, it retrieves zeroed data from
> the bad blocks, as the original data was never successfully written
> (IOMAP_UNWRITTEN).
>
> We reviewed the related discussion [0] and confirmed that this is a
> known writeback error issue. While using fsync() after buffered
> write() could mitigate the problem, this approach is impractical for
> our services.
>
> Instead, we propose introducing configurable options to notify users
> of writeback errors immediately and prevent further operations on
> affected files or disks. Possible solutions include:
>
> - Option A: Immediately shut down the filesystem upon writeback errors.
> - Option B: Mark the affected file as inaccessible if a writeback error occurs.
>
> These options could be controlled via mount options or sysfs
> configurations. Both solutions would be preferable to silently
> returning corrupted data, as they ensure users are aware of disk
> issues and can take corrective action.
>
> Any suggestions ?

Option C: report all those write errors (direct and buffered) to a
daemon and let it figure out what it wants to do:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21

Yes this is a long term option since it involves adding upcalls from the
pagecache/vfs into the filesystem and out through even more XFS code,
which has to go through its usual rigorous reviews.

But if there's interest then I could move up the timeline on submitting
those since I wasn't going to do much with any of that until 2026.

--D

> [0] https://lwn.net/Articles/724307/
>
> --
> Regards
> Yafang
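
For readers following the thread, the fsync()-after-write() mitigation mentioned in the quoted message can be sketched roughly as below. This is only an illustration of where a writeback error becomes visible to userspace, not code from the thread or from the linked branches; the helper name write_durably() is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): write len bytes to path and
 * force them to stable storage.  Returns 0 on success, -1 with errno set
 * on failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int saved;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			goto fail;
		}
		p += n;
		len -= (size_t)n;
	}

	/*
	 * A successful buffered write() only means the data reached the
	 * pagecache.  A writeback failure such as EIO is reported here,
	 * so the return value of fsync() must not be ignored.
	 */
	if (fsync(fd) < 0)
		goto fail;

	return close(fd);

fail:
	saved = errno;		/* keep the original error across close() */
	close(fd);
	errno = saved;
	return -1;
}
```

A caller that gets -1 with errno == EIO should treat the data as not persisted and rewrite it from its own copy, since the pagecache contents after a failed writeback cannot be relied on.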