On Sun, Jun 01, 2025 at 09:02:25AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2025 at 08:38:47AM -0700, Darrick J. Wong wrote:
> > On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> > > On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > > > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > > > Hello,
> > > > >
> > > > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > > > blocks. After investigation, we determined that the issue was related
> > > > > to writeback errors. The details are as follows:
> > > > >
> > > > > 1. Process-A writes data to a file using buffered I/O and completes
> > > > > without errors.
> > > > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > > > I/O error occurs, causing the data to fail to reach the disk.
> > > > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > > > since they are already clean pages.
> > > > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > > > the bad blocks, as the original data was never successfully written
> > > > > (IOMAP_UNWRITTEN).
> > > > >
> > > > > We reviewed the related discussion [0] and confirmed that this is a
> > > > > known writeback error issue. While using fsync() after buffered
> > > > > write() could mitigate the problem, this approach is impractical for
> > > > > our services.
> > > > >
> > > > > Instead, we propose introducing configurable options to notify users
> > > > > of writeback errors immediately and prevent further operations on
> > > > > affected files or disks. Possible solutions include:
> > > > >
> > > > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > > >
> > > > > These options could be controlled via mount options or sysfs
> > > > > configurations. Both solutions would be preferable to silently
> > > > > returning corrupted data, as they ensure users are aware of disk
> > > > > issues and can take corrective action.
> > > > >
> > > > > Any suggestions?
> > > >
> > > > Option C: report all those write errors (direct and buffered) to a
> > > > daemon and let it figure out what it wants to do:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > > >
> > > > Yes this is a long term option since it involves adding upcalls from the
> > >
> > > I hope you don't mean actual usermodehelper upcalls here because we
> > > should not add any new ones. If you just mean a way to call up from a
> > > lower layer then that's obviously fine.
> >
> > Correct. The VFS upcalls to XFS on some event, then XFS queues the
> > event data (or drops it) and waits for userspace to read the queued
> > events. We're not directly invoking a helper program from deep in the
> > guts, that's too wild even for me. ;)
> >
> > > Fwiw, have you considered building this on top of a fanotify extension
> > > instead of inventing your own mechanism for this?
> >
> > I have, at various stages of this experiment.
> >
> > Originally, I was only going to export xfs-specific metadata events
> > (e.g. this AG's inode btree index is bad) so that the userspace program
> > (xfs_healer) could initiate a repair against the broken pieces.
> >
> > At the time I thought it would be fun to experiment with an anonfd file
> > that emitted jsonp objects so that I could avoid the usual C struct ABI
> > mess, because json is easily parsed into key-value mapping objects in a
> > lot of languages (that aren't C). It later turned out that formatting
> > the json is rather more costly than I thought even with seq_bufs, so I
> > added an alternate format that emits boring C structures.
> >
> > Having gone back to C structs, it would be possible (and possibly quite
> > nice) to migrate to fanotify so that I don't have to maintain a bunch
> > of queuing code. But that can have its own drawbacks, as Ted and I
> > discovered when we discussed his patches that pushed ext4 error events
> > through fanotify:
> >
> > For filesystem metadata events, the fine details of representing that
> > metadata in a generic interface get really messy because each
> > filesystem has a different design.
>
> Perhaps that is the wrong approach. The event just needs to tell
> userspace that there is a metadata error, and the fs specific agent
> that receives the event can then pull the failure information from
> the filesystem through a fs specific ioctl interface.
>
> i.e. the fanotify event could simply be a unique error ID, and that
> gets passed back into the ioctl to retrieve the fs specific details
> of the failure. We might not even need fanotify for this - I suspect
> that we could use udev events to punch error ID notifications out to
> userspace to trigger a fs specific helper to go find out what went
> wrong.

I'm not sure if you're addressing me or brauner, but I think it would
be even simpler to retain the current design where events are queued
to our special xfs anonfd and read out by userspace. Using fanotify as
a "doorbell" to go look at another fd is ... basically poll(), but far
more complicated than it ought to be. Pounding udev with events can
result in userspace burning a lot of energy walking the entire rule
chain.

> Keeping unprocessed failures in an internal fs queue isn't a big
> deal; it's not a lot of memory, and it can be discarded on unmount.
> At that point we know that userspace did not care about the
> failure and is not going to be able to query about the failure in
> future, so we can just throw it away.
>
> This also allows filesystems to develop such functionality in
> parallel, allowing us to find commonality and potential areas for
> abstraction as the functionality is developed, rather than trying to
> come up with some generic interface that needs to support all
> possible things we can think of right now....

Agreed. I don't think Ted or Jan were enthusiastic about trying to
make a generic fs metadata event descriptor either.

--D

> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
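
For reference, here is a minimal sketch of the fsync-after-buffered-write
pattern that the original report mentions as a (per-application)
mitigation. It is purely illustrative and uses only standard syscalls;
the point is that the writeback failure surfaces to the writer as an
error from fsync(), instead of being discovered (or not) by a later
reader:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success or a negative errno, including writeback errors. */
int write_and_sync(const char *path, const void *buf, size_t len)
{
	ssize_t ret;
	int err = 0;
	int fd = open(path, O_WRONLY | O_CREAT, 0644);

	if (fd < 0)
		return -errno;

	ret = write(fd, buf, len);
	if (ret < 0)
		err = -errno;
	else if ((size_t)ret != len)
		err = -EIO;	/* short write; a real caller would retry */

	/*
	 * Force writeback now so that an I/O error is reported to the
	 * writer rather than silently leaving stale or zeroed data on disk.
	 */
	if (!err && fsync(fd) < 0)
		err = -errno;

	if (close(fd) < 0 && !err)
		err = -errno;

	return err;
}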
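
And, purely to illustrate the "it's basically poll()" point above, here
is a sketch of how a monitoring daemon might drain queued events from
the special fd. How the fd is obtained and the fs_health_event layout
are made-up placeholders here, not the actual format used by the
health-monitoring branches:

#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical fixed-size event record; the real layout is fs specific. */
struct fs_health_event {
	uint32_t	type;	/* e.g. data writeback vs. metadata error */
	uint32_t	flags;
	uint64_t	ino;	/* object the error applies to, if any */
	int64_t		error;	/* negative errno captured by the fs */
};

/* @fd: the event fd, assumed nonblocking; how it is obtained is out of scope. */
static void drain_events(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	struct fs_health_event ev;

	while (poll(&pfd, 1, -1) > 0) {
		/* Drain everything that is queued, then go back to sleep. */
		while (read(fd, &ev, sizeof(ev)) == (ssize_t)sizeof(ev))
			fprintf(stderr, "ino %llu: error %lld\n",
				(unsigned long long)ev.ino,
				(long long)ev.error);
	}
}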