On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > Hello,
> > > 
> > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > blocks. After investigation, we determined that the issue was related
> > > to writeback errors. The details are as follows:
> > > 
> > > 1. Process-A writes data to a file using buffered I/O and completes
> > > without errors.
> > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > I/O error occurs, causing the data to fail to reach the disk.
> > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > since they are already clean pages.
> > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > the bad blocks, as the original data was never successfully written
> > > (IOMAP_UNWRITTEN).
> > > 
> > > We reviewed the related discussion [0] and confirmed that this is a
> > > known writeback error issue. While using fsync() after buffered
> > > write() could mitigate the problem, this approach is impractical for
> > > our services.
> > > 
> > > Instead, we propose introducing configurable options to notify users
> > > of writeback errors immediately and prevent further operations on
> > > affected files or disks. Possible solutions include:
> > > 
> > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > 
> > > These options could be controlled via mount options or sysfs
> > > configurations. Both solutions would be preferable to silently
> > > returning corrupted data, as they ensure users are aware of disk
> > > issues and can take corrective action.
> > > 
> > > Any suggestions ?
> > 
> > Option C: report all those write errors (direct and buffered) to a
> > daemon and let it figure out what it wants to do:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > 
> > Yes this is a long term option since it involves adding upcalls from the
> 
> I hope you don't mean actual usermodehelper upcalls here because we
> should not add any new ones. If you just mean a way to call up from a
> lower layer than that's obviously fine.

Correct. The VFS upcalls to XFS on some event, then XFS queues the
event data (or drops it) and waits for userspace to read the queued
events. We're not directly invoking a helper program from deep in the
guts; that's too wild even for me. ;)

> Fwiw, have you considered building this on top of a fanotify extension
> instead of inventing your own mechanism for this?

I have, at various stages of this experiment. Originally, I was only
going to export xfs-specific metadata events (e.g. this AG's inode
btree index is bad) so that the userspace program (xfs_healer) could
initiate a repair against the broken pieces. At the time I thought it
would be fun to experiment with an anonfd file that emitted jsonp
objects so that I could avoid the usual C struct ABI mess, because json
is easily parsed into key-value mapping objects in a lot of languages
(that aren't C). It later turned out that formatting the json is rather
more costly than I thought even with seq_bufs, so I added an alternate
format that emits boring C structures.
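(As a rough illustration of the cost difference -- using a made-up
event layout, not the actual ABI from the branches above: the json
flavor has to string-format every field of every event, whereas the
struct flavor is basically one bounded copy into the queue buffer.)

/* Illustrative sketch only; hypothetical event layout. */
#include <linux/seq_buf.h>
#include <linux/types.h>

struct hypothetical_health_event {
	__u64	ino;	/* affected inode */
	__u32	gen;	/* inode generation */
	__u32	flags;	/* which piece of metadata is bad */
};

static void emit_struct_event(struct seq_buf *sb,
			      const struct hypothetical_health_event *ev)
{
	/* struct flavor: copy a fixed-size record into the queue */
	seq_buf_putmem(sb, ev, sizeof(*ev));
}

static void emit_json_event(struct seq_buf *sb,
			    const struct hypothetical_health_event *ev)
{
	/* json flavor: format the same fields as text for every event */
	seq_buf_printf(sb, "{\"ino\":%llu,\"gen\":%u,\"flags\":%u}\n",
			(unsigned long long)ev->ino, ev->gen, ev->flags);
}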
Having gone back to C structs, it would be possible (and possibly quite
nice) to migrate to fanotify so that I don't have to maintain a bunch
of queuing code. But that can have its own drawbacks, as Ted and I
discovered when we discussed his patches that pushed ext4 error events
through fanotify:

For filesystem metadata events, the fine details of representing that
metadata in a generic interface get really messy because each
filesystem has a different design. To initiate a repair you need to
know a lot of specifics: which AG has a bad structure, and what
structure within that AG; or which file and what structure under that
file, etc. Ted and Jan Kara and I tried to come up with a reasonably
generic format for all that and didn't succeed; the best I could think
of is:

struct fanotify_event_info_fsmeta_error {
	struct fanotify_event_info_header hdr;
	__u32 mask;		/* bitmask of objects */
	__u32 what;		/* union decoder */
	union {
		struct {
			__u32 gno;	/* shard number if applicable */
			__u32 pad0[5];
		};
		struct {
			__u64 ino;	/* affected file */
			__u32 gen;
			__u32 pad1[3];
		};
		struct {
			__u64 diskaddr;	/* device media error */
			__u64 length;
			__u32 device;
			__u32 pad2;
		};
	};
	__u64 pad[2];		/* future expansion */
};

But now we have this gross struct with a union in the ABI, and what
happens when someone wants to add support for a filesystem with even
stranger stuff, e.g. btrfs/bcachefs? We could punt in the generic
header and do this instead:

struct fanotify_event_info_fsmeta_error {
	struct fanotify_event_info_header hdr;
	__u32 fstype;		/* same as statfs::f_type */
	unsigned char data[];	/* good luck with this */
};

and now you just open-cast a pointer to the char array to whatever
fs-specific format you want, but eeeuugh.

The other reason for sticking with an anonfd (so far) is that the
kernel side of xfs_healer is designed to maintain a soft reference to
the xfs_mount object so that the userspace program need not maintain an
open fd on the filesystem, because that prevents unmount. I aim to find
a means for the magic healer fd to gain the ability to reopen the root
directory of the filesystem so that the sysadmin running mount --move
doesn't break the healer. I think fanotify fixes the "healer pins the
mount" problem, but I don't think there's a way to do the reopening
thing.

Getting back to the question that opened this thread -- I think regular
file IO errors can be represented with a sequence of
fanotify_event_metadata -> fanotify_event_info_fid ->
fanotify_event_info_range -> fanotify_event_info_error objects in the
fanotify stream. This format, I think, is easily standardized across
filesystems and can be wired up from iomap without a lot of fuss. But I
don't know how fsnotify event blob chaining works well enough to say
for sure. :/

--D

> > pagecache/vfs into the filesystem and out through even more XFS code,
> > which has to go through its usual rigorous reviews.
> > 
> > But if there's interest then I could move up the timeline on submitting
> > those since I wasn't going to do much with any of that until 2026.
> > 
> > --D
> > 
> > > [0] https://lwn.net/Articles/724307/
> > > 
> > > -- 
> > > Regards
> > > Yafang