On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > Hello,
> > > >
> > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > blocks. After investigation, we determined that the issue was related
> > > > to writeback errors. The details are as follows:
> > > >
> > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > >    without errors.
> > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > >    I/O error occurs, causing the data to fail to reach the disk.
> > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > >    since they are already clean pages.
> > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > >    the bad blocks, as the original data was never successfully written
> > > >    (IOMAP_UNWRITTEN).
> > > >
> > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > known writeback error issue. While using fsync() after buffered
> > > > write() could mitigate the problem, this approach is impractical for
> > > > our services.
> > > >
> > > > Instead, we propose introducing configurable options to notify users
> > > > of writeback errors immediately and prevent further operations on
> > > > affected files or disks. Possible solutions include:
> > > >
> > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > >
> > > > These options could be controlled via mount options or sysfs
> > > > configurations. Both solutions would be preferable to silently
> > > > returning corrupted data, as they ensure users are aware of disk
> > > > issues and can take corrective action.
> > > >
> > > > Any suggestions?
> > >
> > > Option C: report all those write errors (direct and buffered) to a
> > > daemon and let it figure out what it wants to do:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > >
> > > Yes, this is a long-term option since it involves adding upcalls from the
> >
> > I hope you don't mean actual usermodehelper upcalls here because we
> > should not add any new ones. If you just mean a way to call up from a
> > lower layer then that's obviously fine.
>
> Correct. The VFS upcalls to XFS on some event, then XFS queues the
> event data (or drops it) and waits for userspace to read the queued
> events. We're not directly invoking a helper program from deep in the
> guts, that's too wild even for me. ;)
>
> > Fwiw, have you considered building this on top of a fanotify extension
> > instead of inventing your own mechanism for this?
>
> I have, at various stages of this experiment.
>
> Originally, I was only going to export xfs-specific metadata events
> (e.g. this AG's inode btree index is bad) so that the userspace program
> (xfs_healer) could initiate a repair against the broken pieces.
>
> At the time I thought it would be fun to experiment with an anonfd file
> that emitted jsonp objects so that I could avoid the usual C struct ABI
> mess because json is easily parsed into key-value mapping objects in a
> lot of languages (that aren't C). It later turned out that formatting
> the json is rather more costly than I thought even with seq_bufs, so I
> added an alternate format that emits boring C structures.
>
> Having gone back to C structs, it would be possible (and possibly quite
> nice) to migrate to fanotify so that I don't have to maintain a bunch of
> queuing code. But that can have its own drawbacks, as Ted and I
> discovered when we discussed his patches that pushed ext4 error events
> through fanotify:
>
> For filesystem metadata events, the fine details of representing that
> metadata in a generic interface gets really messy because each
> filesystem has a different design.

Perhaps that is the wrong approach. The event just needs to tell
userspace that there is a metadata error, and the fs-specific agent
that receives the event can then pull the failure information from the
filesystem through a fs-specific ioctl interface.

i.e. the fanotify event could simply be a unique error ID, and that
gets passed back into the ioctl to retrieve the fs-specific details of
the failure (a rough sketch of that flow is appended below).

We might not even need fanotify for this - I suspect that we could use
udev events to punch error ID notifications out to userspace to trigger
a fs-specific helper to go find out what went wrong.

Keeping unprocessed failures in an internal fs queue isn't a big deal;
it's not a lot of memory, and it can be discarded on unmount. At that
point we know that userspace did not care about the failure and is not
going to be able to query about the failure in future, so we can just
throw it away.

This also allows filesystems to develop such functionality in parallel,
allowing us to find commonality and potential areas for abstraction as
the functionality is developed, rather than trying to come up with some
generic interface that needs to support all possible things we can
think of right now....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
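For illustration, a minimal userspace sketch of the error-ID-plus-ioctl
flow described above. Everything here is hypothetical: the
FS_IOC_GET_ERROR_DETAILS ioctl, struct fs_error_details, and the ioctl
number are made-up placeholders, and the error ID is assumed to arrive
via whatever notification mechanism (fanotify, udev, or otherwise) ends
up being used:

/*
 * Illustrative sketch only: FS_IOC_GET_ERROR_DETAILS and struct
 * fs_error_details are made-up names, not real kernel interfaces.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical record a filesystem might return for a queued failure. */
struct fs_error_details {
	uint64_t error_id;	/* ID carried in the notification event */
	uint32_t error_type;	/* e.g. metadata vs. data writeback error */
	uint32_t flags;
	char     desc[256];	/* fs-specific description of the failure */
};

/* Hypothetical ioctl number; a real interface would reserve one properly. */
#define FS_IOC_GET_ERROR_DETAILS _IOWR('X', 0x7f, struct fs_error_details)

int main(int argc, char **argv)
{
	struct fs_error_details details = { 0 };
	int fd, ret;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <mountpoint> <error-id>\n", argv[0]);
		return 1;
	}

	/* Open the mount point so the ioctl reaches the right filesystem. */
	fd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Hand the error ID from the notification back to the filesystem. */
	details.error_id = strtoull(argv[2], NULL, 0);

	ret = ioctl(fd, FS_IOC_GET_ERROR_DETAILS, &details);
	if (ret < 0) {
		perror("FS_IOC_GET_ERROR_DETAILS");
	} else {
		printf("error %llu: type %u: %s\n",
		       (unsigned long long)details.error_id,
		       details.error_type, details.desc);
		/* ...decide whether to repair, alert, fence the device, etc. */
	}

	close(fd);
	return ret < 0 ? 1 : 0;
}

Keeping the event itself down to a bare error ID like this is what would
let the notification transport stay generic while the fs-specific agent
pulls the messy per-filesystem details through its own interface.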