Re: [QUESTION] xfs, iomap: Handle writeback errors to prevent silent data corruption

On Tue, Jun 03, 2025 at 11:33:05PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 04, 2025 at 08:05:03AM +1000, Dave Chinner wrote:
> > > 
> > > Everything.  ENOSPC means there is no space.  There might be space in
> > > the non-determinant future, but if the layer just needs to GC it must
> > > not report the error.
> > 
> > GC of thin pools requires the filesystem to be mounted so fstrim can
> > be run to tell the thinp device where all the free LBA regions it
> > can reclaim are located. If we shut down the filesystem instantly
> > when the pool goes ENOSPC on a metadata write, then *we can't run
> > fstrim* to free up unused space and hence allow that metadata write
> > to succeed in the future.
> > 
> > It should be obvious at this point that a filesystem shutdown on an
> > ENOSPC error from the block device on anything other than journal IO
> > is exactly the wrong thing to be doing.
> 
> How high are the chances that you hit exactly the rare metadata
> writeback I/O and not journal or data I/O for this odd condition
> that requires user interaction?

100%.

We'll hit it with both data IO and metadata IO at the same time,
but in the vast majority of cases we won't hit ENOSPC on journal IO.

Why? Because mkfs.xfs zeros the entire log via either
FALLOC_FL_ZERO_RANGE or writing physical zeros. Hence a thin device
always has a fully allocated log before the filesystem is first
mounted and so ENOSPC to journal IO should never happen unless a
device level snapshot is taken.

i.e. the only time the journal is not fully allocated in the block device
is immediately after a block device snapshot is taken. The log needs
to be written entirely once before it is fully allocated again, and
this is the only point in time we will see ENOSPC on a thinp device
for journal IO.

Because the log IO is sequential, and the log is circular, there is
no write or allocation amplification here, and once the log has been
written once, further writes are simply overwriting allocated LBA
space. Hence after a short period of activity following a snapshot,
ENOSPC from journal IO is no longer a possibility. This case is the
exception rather than common behaviour.
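
As a rough illustration of the log being fully allocated at mkfs
time (an untested sketch - the vg/pool and vg/scratch names are made
up), you can watch the pool's allocated data space jump by roughly
the log size the moment mkfs.xfs runs, long before the filesystem is
ever mounted:

    # create a thin volume in an existing thin pool
    lvcreate -V 100G -T vg/pool -n scratch
    lvs -o lv_name,data_percent vg/pool    # pool Data% before mkfs
    mkfs.xfs /dev/vg/scratch               # zeroes the entire log up front
    lvs -o lv_name,data_percent vg/pool    # Data% now includes the log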

Metadata writeback is a different story altogether.

When we allocate and write back metadata for the first time (either
after mkfs, fstrim or a device snapshot) or overwrite existing
metadata after a snapshot, the metadata writeback IO will
always require device side space allocation.

Unlike the neat sequential journal IO, metadata writeback is
effectively random small write IO. This triggers worst case
allocation amplification on thinp devices, as well as worst case
write amplification in the case of COW after a snapshot. Metadata
writeback - especially overwrite after snapshot + modification - is
the worst possible write pattern for thinp devices.

It is not unusual to see dm-thin devices with a 64kB block size have
allocation and write amplification factors of 15-16 on 4kB block
size filesystems after a snapshot as every single random metadata
overwrite will now trigger a 64kB COW in the dm-thin device to break
blocks shared between snapshots.

So, yes, metadata writeback is extremely prone to triggering ENOSPC
from thin devices, whilst journal IO almost never triggers it.

> Where is this weird model where a
> storage device returns an out of space error and manual user interaction
> using manual and not online trim is going to fix it even documented?

I explicitly said that the filesystem needs to remain online when
the thin pool goes ENOSPC so that fstrim (the online filesystem trim
utility) can be run to inform the thin pool exactly where all the
free LBA address space is so it can efficiently free up pool space.

This is a standard procedure that people automate through things
like udev scripts that capture the dm-thin pool low/no space
events.
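
By way of example only (the exact hook is environment specific -
e.g. dmeventd's thin_command setting in lvm.conf, or a udev rule
matching the pool's low/no space events), the handler itself boils
down to running online trim on the mounted filesystems:

    #!/bin/sh
    # tell the thin pool where the free LBA space is so it can
    # reclaim it - run against every mounted XFS filesystem
    for mnt in $(findmnt -t xfs -n -o TARGET); do
            fstrim -v "$mnt"
    done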

You seem to be trying to create a strawman here....

> > > Normally it means your checksum was wrong.  If you have bit errors
> > > in the cable they will show up again, maybe not on the next I/O
> > > but soon.
> > 
> > But it's unlikely to be hit by another cosmic ray anytime soon, and
> > so bit errors caused by completely random environmental events
> > should -absolutely- be retried as the subsequent write retry will
> > succeed.
> >
> > If there is a dodgy cable causing the problems, the error will
> > re-occur on random IOs and we'll emit write errors to the log that
> > monitoring software will pick up. If we are repeatedly issuing write
> > errors due to EILSEQ errors, then that's a sign the hardware needs
> > replacing.
> 
> Umm, all the storage protocols do have pretty good checksums.

The strength of the checksum is irrelevant. It's what we do when
it detects a bit error that is being discussed.

> A cosmic
> ray isn't going to fail them it is something more fundamental like
> broken hardware or connections. In other words you are going to see
> this again and again pretty frequently.

I've seen plenty of one-off, unexplainable, unreproducible IO
errors because of random bit errors over the past 20+ years.

But what causes them is irrelevant - the fact is that they do occur,
and we cannot know if it is transient or persistent from a single IO
context. Hence the only decision that can be made from IO completion
context is "retry or fail this IO". We default to "retry" for
metadata writeback because that automatically handles transient
errors correctly.

IOWs, if it is actually broken hardware, then the fact we may retry
individual failed IOs in a non-critical path is irrelevant. If the
errors are persistent and/or widespread, then we will get an error
in a critical path and shut down at that point.

This means the architecture is naturally resilient against transient
write errors, regardless of their cause.  We want XFS to be resilient;
we do not want it to be brittle or fragile in environments that are
slightly less than perfect, unless that is the way the admin wants
it to behave. We give the admin the option to choose how their
filesystems respond to such errors, but we default to the most
resilient settings for everyone else.
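
Those knobs already exist - see the error handling section in
Documentation/admin-guide/xfs.rst. Roughly (device name made up,
check the documentation for the exact defaults):

    # how many times metadata writeback retries an EIO before
    # giving up (-1 = retry forever, 0 = fail immediately)
    cat /sys/fs/xfs/sda1/error/metadata/EIO/max_retries
    # give up on EIO after 5 minutes of retries instead
    echo 300 > /sys/fs/xfs/sda1/error/metadata/EIO/retry_timeout_seconds
    # and don't let stuck retries hang an unmount
    echo 1 > /sys/fs/xfs/sda1/error/fail_at_unmount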

> > There is no risk to filesystem integrity if write retries
> > succeed, and that gives the admin time to schedule downtime to
> > replace the dodgy hardware. That's much better behaviour than
> > unexpected production system failure in the middle of the night...
> > 
> > It is because we have robust and resilient error handling in the
> > filesystem that the system is able to operate correctly in these
> > marginal situations. Operating in marginal conditions or as hardware
> > is beginning to fail is necessary to keep production systems
> > running until corrective action can be taken by the administrators.
> 
> I'd really like to see a formal writeup of your theory of robust error
> handling where that robustness is centered around the fairly rare
> case of metadata writeback and applications dealing with I/O errors,
> while journal write errors and read errors lead to shutdown.

.... and there's the strawman argument, and a demand for formal
proofs as the only way to defend against your argument.

> Maybe
> I'm missing something important, but the theory does not sound valid,
> and we don't have any testing framework that actually verifies it.

I think you are being intentionally obtuse, Christoph. I wrote this
for XFS back in *2008*:

https://web.archive.org/web/20140907100223/http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption

The "exception handling" section is probably appropriate here,
but whilst the contents are not directly about this particular
discussion, the point is that we've always considered there to be
types of IO errors that are transient in nature. I will quote part
of that section:

"Furthermore, the storage subsystem plays a part in deciding how to
handle errors. The reason is that in many storage configurations I/O
errors can be transient. For example, in a SAN a broken fibre can
cause a failover to a redundant path, however the inflight I/O on
the failed path is usually timed out and an error returned. We don't want
to shut down the filesystem on such an error - we want to wait for
failover to a redundant path and then retry the I/O. If the failover
succeeds, then the I/O will succeed. Hence any robust method of
exception handling needs to consider that I/O exceptions may be
transient. "

The point I am making is that the entire architecture of the
current V5 on-disk format, the verification architecture and the
scrub/online repair infrastructure were very much based on the
storage device model that *IO errors may be transient*.

> 
> > Failing to recognise that transient and "maybe-transient" errors can
> > generally be handled cleanly and successfully with future write
> > retries leads to brittle, fragile systems that fall over at the
> > first sign of anything going wrong. Filesystems that are targeted
> > at high value production systems and/or running mission critical
> > applications need to have resilient and robust error handling.
> 
> What known transient errors do you think XFS (or any other file system)
> actually handles properly?  Where is the contract that these errors
> actually are transient?

Nope, I'm not going to play the "I demand that you prove the
behaviour that has existed in XFS for over 30 years is correct"
game, Christoph.

If you want to change the underlying IO error handling model that
XFS has been based on since it was first designed back in the 1990s,
then it's on you to prove to every filesystem developer that IO
errors reported from the block layer can *never be transient*.

Indeed, please provide us with the "contract" that says block
devices and storage devices are not allowed to expose transient IO
errors to higher layers.

Then you need to show that ENOSPC from a dm-thin device is *forever*,
and never goes away, and justify that behaviour as being in the best
interests of users despite the ease of pool expansion to make ENOSPC
go away.....
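
For completeness, an example of just how easy that expansion is
(sizes and names made up):

    # grow the thin pool by hand once it warns about low space...
    lvextend -L +64G vg/pool
    # ...or let LVM do it automatically via lvm.conf:
    #   activation/thin_pool_autoextend_threshold = 80
    #   activation/thin_pool_autoextend_percent = 20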

It is on you to prove that the existing model is wrong and needs
fixing, not for us to prove to you that the existing model is
correct.

> > > And even applications that fsync won't see you fancy error code.  The
> > > only thing stored in the address_space for fsync to catch is EIO and
> > > ENOSPC.
> > 
> > The filesystem knows exactly what the IO error reported by the block
> > layer is before we run folio completions, so we control exactly what
> > we want to report as IO completion status.
> 
> Sure, you could invent a scheme to propagate the exact error.  For
> direct I/O we even return the exact error to userspace.  But that
> means we actually have a definition of what each error means, and how
> it could be handled.  None of that exists right now.  We could do
> all this, but that assumes you actually have:
> 
>  a) a clear definition of a problem
>  b) a good way to fix that problem
>  c) good testing infrastructure to actually test it, because without
>     that all good intentions will probably cause more problems than
>     they solve
> 
> > Hence the bogosities of error propagation to userspace via the
> > mapping are completely irrelevant to this discussion/feature because
> > it would be implemented below the layer that squashes the eventual
> > IO errno into the address space...
> 
> How would you implement and test all this?  And for what use case?

I don't care, it's not my problem to solve, and I don't care if
nothing comes of it.

A fellow developer asked for advice, and I simply suggested following
an existing model we already have infrastructure for. Now you are
demanding that I prove the existing decades-old model is valid, and
then tell you how to solve the OG's problem and make it all work.

None of this is my problem, regardless of how much you try to make
it so.

Really, though, I don't know why you think that transient errors
don't exist anymore, nor why you are demanding that I prove that
they do when it is abundantly clear that ENOSPC from dm-thin can
definitely be a transient error.

Perhaps you can provide some background on why you are asserting
that there is no such thing as a transient IO error so we can all
start from a common understanding?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



