On Mon, Jun 02, 2025 at 09:50:04PM -0700, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 09:19:10AM +1000, Dave Chinner wrote:
> > > In other words, write errors in Linux are in general expected to be
> > > persistent, modulo explicit failfast requests like REQ_NOWAIT.
> >
> > Say what? the blk_errors array defines multiple block layer errors
> > that are transient in nature - stuff like ENOSPC, ETIMEDOUT, EILSEQ,
> > ENOLINK, EBUSY - all indicate a transient, retryable error occurred
> > somewhere in the block/storage layers.
>
> Let's use the block layer codes reported all the way up to the file
> systems and their descriptions instead of the errnos they are
> mapped to for compatibility. The above would be in order:
>
>    [BLK_STS_NOSPC]         = { -ENOSPC,    "critical space allocation" },
>    [BLK_STS_TIMEOUT]       = { -ETIMEDOUT, "timeout" },
>    [BLK_STS_PROTECTION]    = { -EILSEQ,    "protection" },
>    [BLK_STS_TRANSPORT]     = { -ENOLINK,   "recoverable transport" },
>    [BLK_STS_DEV_RESOURCE]  = { -EBUSY,     "device resource" },
>
> > What is permanent about dm-thinp returning ENOSPC to a write
> > request? Once the pool has been GC'd to free up space or expanded,
> > the ENOSPC error goes away.
>
> Everything. ENOSPC means there is no space. There might be space in
> the non-determinant future, but if the layer just needs to GC it must
> not report the error.

GC of thin pools requires the filesystem to be mounted so that fstrim
can be run to tell the thinp device where all the free LBA regions it
can reclaim are located.

If we shut down the filesystem instantly when the pool goes ENOSPC on
a metadata write, then *we can't run fstrim* to free up unused space
and hence allow that metadata write to succeed in the future.

It should be obvious at this point that a filesystem shutdown on an
ENOSPC error from the block device on anything other than journal IO
is exactly the wrong thing to be doing.

> > What is permanent about an IO failing with EILSEQ because a t10
> > checksum failed due to a random bit error detected between the HBA
> > and the storage device? Retry the IO, and it goes through just fine
> > without any failures.
>
> Normally it means your checksum was wrong. If you have bit errors
> in the cable they will show up again, maybe not on the next I/O
> but soon.

But the same IO is unlikely to be hit by another cosmic ray anytime
soon, so bit errors caused by completely random environmental events
should -absolutely- be retried; the subsequent write retry will
succeed.

If there is a dodgy cable causing the problems, the error will
re-occur on random IOs and we'll emit write errors to the log that
monitoring software will pick up. If we are repeatedly issuing write
errors due to EILSEQ errors, then that's a sign the hardware needs
replacing.

There is no risk to filesystem integrity if write retries succeed,
and that gives the admin time to schedule downtime to replace the
dodgy hardware. That's much better behaviour than an unexpected
production system failure in the middle of the night...

It is because we have robust and resilient error handling in the
filesystem that the system is able to operate correctly in these
marginal situations. Being able to operate in marginal conditions, or
as hardware is beginning to fail, is necessary to keep production
systems running until corrective action can be taken by the
administrators.
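To make the policy concrete, here is a rough sketch of the kind of
classify-and-retry logic I'm describing: transient errors get retried
after a delay, permanent errors fail immediately, and errors of
unknown class get a bounded number of retries before we give up and
let the caller decide whether to shut down. It is illustrative only -
not the actual XFS implementation - and all the names and limits in
it (classify_io_error, write_with_retries, submit_write, max_retries,
retry_delay_secs) are hypothetical:

        /*
         * Illustrative sketch only - this is not XFS code, and every
         * name in it is made up for the example.
         */
        #include <errno.h>
        #include <unistd.h>

        enum io_err_class {
                IO_ERR_TRANSIENT, /* e.g. ENOSPC, ETIMEDOUT, EILSEQ, ENOLINK, EBUSY */
                IO_ERR_PERMANENT, /* e.g. ENODEV - the device is gone */
                IO_ERR_UNKNOWN,   /* e.g. EIO - could go either way */
        };

        static enum io_err_class classify_io_error(int err)
        {
                switch (err) {
                case ENOSPC:
                case ETIMEDOUT:
                case EILSEQ:
                case ENOLINK:
                case EBUSY:
                        return IO_ERR_TRANSIENT;
                case ENODEV:
                        return IO_ERR_PERMANENT;
                default:
                        return IO_ERR_UNKNOWN;
                }
        }

        /*
         * submit_write() stands in for whatever actually issues the
         * IO; it returns 0 on success or a positive errno on failure.
         * Transient and unknown errors are retried up to max_retries
         * times with a delay between attempts; permanent errors are
         * returned immediately so the caller can decide whether to
         * shut down.
         */
        static int write_with_retries(int (*submit_write)(void *ctx), void *ctx,
                                      int max_retries, unsigned int retry_delay_secs)
        {
                for (int attempt = 0; ; attempt++) {
                        int err = submit_write(ctx);

                        if (!err)
                                return 0;
                        if (classify_io_error(err) == IO_ERR_PERMANENT)
                                return err;      /* retrying cannot help */
                        if (attempt >= max_retries)
                                return err;      /* give up; caller may shut down */
                        sleep(retry_delay_secs); /* back off before retrying */
                }
        }

In practice the limits and delays would be admin-tunable per error
class rather than hard-coded, but the shape of the policy is the same.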
> > These transient error types typically only need a write retry after
> > some time period to resolve, and that's what XFS does by default.
> > What makes these sorts of errors persistent in the linux block layer
> > and hence requiring an immediate filesystem shutdown and complete
> > denial of service to the storage?
> >
> > I ask this seriously, because you are effectively saying the linux
> > storage stack now doesn't behave the same as the model we've been
> > using for decades. What has changed, and when did it change?
>
> Hey, you can retry. You're unlikely to improve the situation though
> but instead just keep deferring the inevitable shutdown.

Absolutely. That's the whole point - random failures won't repeat, and
hence when they do occur we avoid a shutdown by retrying them on
failure. This is -exactly- how robust error handling should work.

However, for IO errors that persist or where other IO errors start to
creep in, all the default behaviour is trying to do is hold the system
up in a working state until downtime can be scheduled and the broken
hardware is replaced. If integrity ends up being compromised by a
subsequent IO failure, then we will shut the filesystem down at that
point.

This is about resilience in the face of errors. Not every error is
fatal, nor does every error re-occur. There are classes of errors
known to be transient (ENOSPC), others that are permanent (ENODEV),
and others that we just don't know (EIO).

If we value resiliency and robustness, then the filesystem should be
able to withstand transient and "maybe-transient" IO failures without
compromising integrity. Failing to recognise that transient and
"maybe-transient" errors can generally be handled cleanly and
successfully with future write retries leads to brittle, fragile
systems that fall over at the first sign of anything going wrong.

Filesystems that are targeted at high value production systems and/or
running mission critical applications need to have resilient and
robust error handling.

> > > Which also leaves me a bit puzzled what the XFS metadata retries are
> > > actually trying to solve, especially without even having a corresponding
> > > data I/O version.
> >
> > It's always been for preventing immediate filesystem shutdown when
> > spurious transient IO errors occur below XFS. Data IO errors don't
> > cause filesystem shutdowns - errors get propagated to the
> > application - so there isn't a full system DOS potential for
> > incorrect classification of data IO errors...
>
> Except as we see in this thread for a fairly common use case (buffered
> I/O without fsync) they don't. And I agree with you that this is not
> how you write applications that care about data integrity - but the
> entire reset of the system and just about every common utility is
> written that way.

Yes, I know that. But there are still valid reasons for retrying
failed async data writeback IO when it triggers a spurious or
retriable IO error....

> And even applications that fsync won't see you fancy error code. The
> only thing stored in the address_space for fsync to catch is EIO and
> ENOSPC.

The filesystem knows exactly what IO error the block layer reported
before we run folio completions, so we control exactly what we want to
report as the IO completion status.

Hence the bogosities of error propagation to userspace via the mapping
are completely irrelevant to this discussion/feature, because it would
be implemented below the layer that squashes the eventual IO errno
into the address space...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx