On Fri, May 30, 2025 at 07:17:00AM +0200, Christian Brauner wrote:
> On Wed, May 28, 2025 at 09:25:50PM -0700, Darrick J. Wong wrote:
> > On Thu, May 29, 2025 at 10:50:01AM +0800, Yafang Shao wrote:
> > > Hello,
> > > 
> > > Recently, we encountered data loss when using XFS on an HDD with bad
> > > blocks. After investigation, we determined that the issue was related
> > > to writeback errors. The details are as follows:
> > > 
> > > 1. Process-A writes data to a file using buffered I/O and completes
> > > without errors.
> > > 2. However, during the writeback of the dirtied pagecache pages, an
> > > I/O error occurs, causing the data to fail to reach the disk.
> > > 3. Later, the pagecache pages may be reclaimed due to memory pressure,
> > > since they are already clean pages.
> > > 4. When Process-B reads the same file, it retrieves zeroed data from
> > > the bad blocks, as the original data was never successfully written
> > > (IOMAP_UNWRITTEN).
> > > 
> > > We reviewed the related discussion [0] and confirmed that this is a
> > > known writeback error issue. While using fsync() after buffered
> > > write() could mitigate the problem, this approach is impractical for
> > > our services.
> > > 
> > > Instead, we propose introducing configurable options to notify users
> > > of writeback errors immediately and prevent further operations on
> > > affected files or disks. Possible solutions include:
> > > 
> > > - Option A: Immediately shut down the filesystem upon writeback errors.
> > > - Option B: Mark the affected file as inaccessible if a writeback error occurs.
> > > 
> > > These options could be controlled via mount options or sysfs
> > > configurations. Both solutions would be preferable to silently
> > > returning corrupted data, as they ensure users are aware of disk
> > > issues and can take corrective action.
> > > 
> > > Any suggestions ?
> > 
> > Option C: report all those write errors (direct and buffered) to a
> > daemon and let it figure out what it wants to do:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2025-05-21
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring-rust_2025-05-21
> > 
> > Yes this is a long term option since it involves adding upcalls from the
> 
> I hope you don't mean actual usermodehelper upcalls here because we
> should not add any new ones. If you just mean a way to call up from a
> lower layer than that's obviously fine.

Correct. The VFS upcalls to XFS on some event, then XFS queues the
event data (or drops it) and waits for userspace to read the queued
events. We're not directly invoking a helper program from deep in the
guts; that's too wild even for me. ;)

> Fwiw, have you considered building this on top of a fanotify extension
> instead of inventing your own mechanism for this?

I have, at various stages of this experiment. Originally, I was only
going to export xfs-specific metadata events (e.g. this AG's inode
btree index is bad) so that the userspace program (xfs_healer) could
initiate a repair against the broken pieces. At the time I thought it
would be fun to experiment with an anonfd file that emitted jsonp
objects so that I could avoid the usual C struct ABI mess, because json
is easily parsed into key-value mapping objects in a lot of languages
(that aren't C). It later turned out that formatting the json is rather
more costly than I thought even with seq_bufs, so I added an alternate
format that emits boring C structures.
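(As a rough illustration of the cost difference -- using a made-up
event layout, not the actual ABI from the branches above: the json
flavor has to string-format every field of every event, whereas the
struct flavor is basically one bounded copy into the queue buffer.)

/* Illustrative sketch only; hypothetical event layout. */
#include <linux/seq_buf.h>
#include <linux/types.h>

struct hypothetical_health_event {
	__u64	ino;	/* affected inode */
	__u32	gen;	/* inode generation */
	__u32	flags;	/* which piece of metadata is bad */
};

static void emit_struct_event(struct seq_buf *sb,
			      const struct hypothetical_health_event *ev)
{
	/* struct flavor: copy a fixed-size record into the queue */
	seq_buf_putmem(sb, ev, sizeof(*ev));
}

static void emit_json_event(struct seq_buf *sb,
			    const struct hypothetical_health_event *ev)
{
	/* json flavor: format the same fields as text for every event */
	seq_buf_printf(sb, "{\"ino\":%llu,\"gen\":%u,\"flags\":%u}\n",
			(unsigned long long)ev->ino, ev->gen, ev->flags);
}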
Having gone back to C structs, it would be possible (and possibly quite
nice) to migrate to fanotify so that I don't have to maintain a bunch
of queuing code. But that can have its own drawbacks, as Ted and I
discovered when we discussed his patches that pushed ext4 error events
through fanotify:

For filesystem metadata events, the fine details of representing that
metadata in a generic interface get really messy because each
filesystem has a different design. To initiate a repair you need to
know a lot of specifics: which AG has a bad structure, and what
structure within that AG; or which file and what structure under that
file, etc. Ted and Jan Kara and I tried to come up with a reasonably
generic format for all that and didn't succeed; the best I could think
of is:

struct fanotify_event_info_fsmeta_error {
	struct fanotify_event_info_header hdr;
	__u32 mask;		/* bitmask of objects */
	__u32 what;		/* union decoder */
	union {
		struct {
			__u32 gno;	/* shard number if applicable */
			__u32 pad0[5];
		};
		struct {
			__u64 ino;	/* affected file */
			__u32 gen;
			__u32 pad1[3];
		};
		struct {
			__u64 diskaddr;	/* device media error */
			__u64 length;
			__u32 device;
			__u32 pad2;
		};
	};
	__u64 pad[2];		/* future expansion */
};

But now we have this gross struct with a union in the ABI, and what
happens when someone wants to add support for a filesystem with even
stranger stuff, e.g. btrfs/bcachefs? We could punt in the generic
header and do this instead:

struct fanotify_event_info_fsmeta_error {
	struct fanotify_event_info_header hdr;
	__u32 fstype;		/* same as statfs::f_type */
	unsigned char data[];	/* good luck with this */
};

and now you just open-cast a pointer to the char array to whatever
fs-specific format you want, but eeeuugh.

The other reason for sticking with an anonfd (so far) is that the
kernel side of xfs_healer is designed to maintain a soft reference to
the xfs_mount object so that the userspace program need not maintain an
open fd on the filesystem, because that prevents unmount. I aim to find
a means for the magic healer fd to gain the ability to reopen the root
directory of the filesystem so that the sysadmin running mount --move
doesn't break the healer. I think fanotify fixes the "healer pins the
mount" problem, but I don't think there's a way to do the reopening
thing.

Getting back to the question that opened this thread -- I think regular
file IO errors can be represented with a sequence of
fanotify_event_metadata -> fanotify_event_info_fid ->
fanotify_event_info_range -> fanotify_event_info_error objects in the
fanotify stream. This format, I think, is easily standardized across
filesystems and can be wired up from iomap without a lot of fuss. But I
don't know how fsnotify event blob chaining works well enough to say
for sure. :/

--D

> > pagecache/vfs into the filesystem and out through even more XFS code,
> > which has to go through its usual rigorous reviews.
> > 
> > But if there's interest then I could move up the timeline on submitting
> > those since I wasn't going to do much with any of that until 2026.
> > 
> > --D
> > 
> > > [0] https://lwn.net/Articles/724307/
> > > 
> > > -- 
> > > Regards
> > > Yafang