Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

On Thu, Jun 05, 2025 at 12:18:24PM +1000, Dave Chinner wrote:
> > How high are the chances that you hit exactly the rare metadata
> > writeback I/O and not journal or data I/O for this odd condition
> > that requires user interaction?
> 
> 100%.
> 
> We'll hit it with both data IO and metadata IO at the same time,
> but in the vast majority of cases we won't hit ENOSPC on journal IO.
> 
> Why? Because mkfs.xfs zeros the entire log via either
> FALLOC_FL_ZERO_RANGE or writing physical zeros. Hence a thin device
> always has a fully allocated log before the filesystem is first
> mounted and so ENOSPC to journal IO should never happen unless a
> device level snapshot is taken.
> 
> i.e. the only time the journal is not fully allocated in the block device
> is immediately after a block device snapshot is taken. The log needs
> to be written entirely once before it is fully allocated again, and
> this is the only point in time we will see ENOSPC on a thinp device
> for journal IO.
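(For reference, the log zeroing Dave refers to amounts to something like
the rough sketch below.  This is not the actual mkfs.xfs code; the device
path and log geometry are made-up placeholders.  It only illustrates how
zeroing the log range forces a thin device to allocate backing space for
it before first mount.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "/dev/mapper/thinvol";	/* placeholder thin volume */
	off_t log_start = 1024ULL * 1024 * 1024;	/* made-up log offset */
	off_t log_len = 64ULL * 1024 * 1024;		/* made-up log size */
	int fd = open(dev, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Zero the log range; on dm-thin this allocates backing space
	 * for every block in the range.  mkfs falls back to writing
	 * physical zeroes if the device does not support this.
	 */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, log_start, log_len) < 0)
		perror("fallocate(FALLOC_FL_ZERO_RANGE)");

	close(fd);
	return 0;
}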

I guess that works for the very specific dm-thin case.  Not for anything
else that does actual out of place writes, though.

> > Where is this weird model, in which a storage device returns an out of
> > space error and manual user interaction using manual rather than online
> > trim is going to fix it, even documented?
> 
> I explicitly said that the filesystem needs to remain online when
> the thin pool goes ENOSPC so that fstrim (the online filesystem trim
> utility) can be run to inform the thin pool exactly where all the
> free LBA address space is so it can efficiently free up pool space.
> 
> This is a standard procedure that people automate through things
> like udev scripts that capture the dm-thin pool low/no space
> events.
> 
> You seem to be trying to create a strawman here....

I'm not.  But you seem to be very focussed on the undocumented and
in general somewhat unusual dm-thin semantics.  If that's all you care
about, fine, but say so.
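(The online trim Dave describes is, at bottom, the FITRIM ioctl that
fstrim(8) issues against each mounted filesystem.  A rough sketch of that
per-filesystem call, with a hypothetical mount point:)

#include <fcntl.h>
#include <limits.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	struct fstrim_range range = {
		.start = 0,
		.len = ULLONG_MAX,	/* trim the whole filesystem */
		.minlen = 0,
	};
	int fd = open("/mnt/scratch", O_RDONLY);	/* hypothetical mount point */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Ask the filesystem to report its free space down to the device;
	 * dm-thin can then return those blocks to the pool.
	 */
	if (ioctl(fd, FITRIM, &range) < 0)
		perror("ioctl(FITRIM)");
	else
		printf("trimmed %llu bytes\n", (unsigned long long)range.len);

	close(fd);
	return 0;
}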

> But what causes them is irrelevant - the fact is that they do occur,
> and we cannot know if an error is transient or persistent from a single
> IO context. Hence the only decision that can be made from IO completion
> context is "retry or fail this IO". We default to "retry" for
> metadata writeback because that automatically handles transient
> errors correctly.
> 
> IOWs, if it is actually broken hardware, then the fact we may retry
> individual failed IOs in a non-critical path is irrelevant. If the
> errors are persistent and/or widespread, then we will get an error
> in a critical path and shut down at that point.

In general, continuing when you have known errors is a bad idea
unless you specifically know that retrying makes them better.  When you
are on PI-enabled hardware, retrying an I/O that failed a PI check (and
that's what we are talking about here) is very unlikely to just make
things better.

> > > It is because we have robust and resilient error handling in the
> > > filesystem that the system is able to operate correctly in these
> > > marginal situations. Operating in marginal conditions or as hardware
> > > is beginning to fail is a necessary to keep production systems
> > > running until corrective action can be taken by the administrators.
> > 
> > I'd really like to see a formal writeup of your theory of robust error
> > handling where that robustness is centered around the fairly rare
> > case of metadata writeback and applications dealing with I/O errors,
> > while journal write errors and read errors lead to shutdown.
> 
> .... and there's the strawman argument, and a demand for formal
> proofs as the only way to defend against your argument.

No.  You claim that "we have robust and resilient error handling in the
filesystem".  It's pretty clear from the code and the discussion that
we do not.  If you insist that we do, I'd rather see a good proof of
that.

> I think you are being intentionally obtuse, Christoph. I wrote this
> for XFS back in *2008*:

Which, as you later state yourself, is irrelevant to this discussion.

> The point I am making is that the entire architecture of the
> current V5 on-disk format, the verification architecture and the
> scrub/online repair infrastructure was very much based on the
> storage device model that *IO errors may be transient*.

Except that, as we've clearly seen in this thread, in practice it
does not.  We have a way to retry asynchronous metadata writeback,
apparently designed to deal with an undocumented dm-thin use case,
but everything else is handwaving.
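(That retry mechanism is at least administrator-visible: the knobs live
under /sys/fs/xfs/<dev>/error/metadata/<errno>/, as described in
Documentation/admin-guide/xfs.rst.  A minimal sketch that just reads the
current ENOSPC retry count, using a placeholder "dm-0" device name:)

#include <stdio.h>

int main(void)
{
	/* "dm-0" is a placeholder; substitute the real XFS device name. */
	const char *knob =
		"/sys/fs/xfs/dm-0/error/metadata/ENOSPC/max_retries";
	char buf[32];
	FILE *f = fopen(knob, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("ENOSPC metadata writeback max_retries: %s", buf);
	fclose(f);
	return 0;
}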

> > What known transient errors do you think XFS (or any other file system)
> > actually handles properly?  Where is the contract that these errors
> > actually are transient?
> 
> Nope, I'm not going to play the "I demand that you prove the
> behaviour that has existed in XFS for over 30 years is correct"
> game, Christoph.
> 
> If you want to change the underlying IO error handling model that
> XFS has been based on since it was first designed back in the 1990s,
> then it's on you to prove to every filesystem developer that IO
> errors reported from the block layer can *never be transient*.

I'm not changing anything.  I'm just challenging your opinion that
all this has been handled forever.  And it's pretty clear that it
is not.  So I really object to you spreading these untrue claims
without anything to back them up.

Maybe you want to handle transient errors, and that's fine.  But
that is aspirational.

> Really, though, I don't know why you think that transient errors
> don't exist anymore, nor why you are demanding that I prove that
> they do when it is abundantly clear that ENOSPC from dm-thin can
> definitely be a transient error.
> 
> Perhaps you can provide some background on why you are asserting
> that there is no such thing as a transient IO error so we can all
> start from a common understanding?

Oh, there absolutely are transient I/O errors.  But in the Linux I/O
stack they are generally handled below the file system.  Look at SCSI
error handling, the NVMe retry mechanisms, or the multipath drivers.  All
of them handle transient errors in a more or less well understood and
well tested fashion.  But except for the retries of asynchronous metadata
buffer writeback in XFS, basically nothing in the commonly used file
systems handles transient errors, exactly because that is not how the
layering works.  If we want to change that, we'd better understand what
the use case for it is and how we properly test it.



