On Tue, Jun 03, 2025 at 11:33:05PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 04, 2025 at 08:05:03AM +1000, Dave Chinner wrote:
> > > Everything.  ENOSPC means there is no space.  There might be space in
> > > the non-determinant future, but if the layer just needs to GC it must
> > > not report the error.
> >
> > GC of thin pools requires the filesystem to be mounted so fstrim can
> > be run to tell the thinp device where all the free LBA regions it
> > can reclaim are located. If we shut down the filesystem instantly
> > when the pool goes ENOSPC on a metadata write, then *we can't run
> > fstrim* to free up unused space and hence allow that metadata write
> > to succeed in the future.
> >
> > It should be obvious at this point that a filesystem shutdown on an
> > ENOSPC error from the block device on anything other than journal IO
> > is exactly the wrong thing to be doing.
>
> How high are the chances that you hit exactly the rate metadata
> writeback I/O and not journal or data I/O for this odd condition
> that requires user interaction?

100%. We'll hit it with both data IO and metadata IO at the same time,
but in the vast majority of cases we won't hit ENOSPC on journal IO.

Why? Because mkfs.xfs zeros the entire log via either
FALLOC_FL_ZERO_RANGE or writing physical zeros. Hence a thin device
always has a fully allocated log before the filesystem is first
mounted, and so ENOSPC on journal IO should never happen unless a
device-level snapshot is taken.

i.e. the only time the journal is not fully allocated in the block
device is immediately after a block device snapshot is taken. The log
needs to be written entirely once before it is fully allocated again,
and this is the only point in time we will see ENOSPC on a thinp
device for journal IO.

Because the log IO is sequential, and the log is circular, there is no
write or allocation amplification here, and once the log has been
written once, further writes are simply overwriting allocated LBA
space. Hence after a short period of activity following a snapshot,
ENOSPC from journal IO is no longer a possibility. This case is the
exception rather than common behaviour.

Metadata writeback is a different story altogether. When we allocate
and write back metadata for the first time (either after mkfs, fstrim
or a device snapshot) or overwrite existing metadata after a snapshot,
the metadata writeback IO will always require device-side space
allocation.

Unlike the neat sequential journal IO, metadata writeback is
effectively random small write IO. This triggers worst-case allocation
amplification on thinp devices, as well as worst-case write
amplification in the case of COW after a snapshot. Metadata writeback
- especially overwrite after snapshot + modification - is the worst
possible write pattern for thinp devices.

It is not unusual to see dm-thin devices with a 64kB block size hit
allocation and write amplification factors of 15-16 on 4kB block size
filesystems after a snapshot, as every single random metadata
overwrite will now trigger a 64kB COW in the dm-thin device to break
blocks shared between snapshots.

So, yes, metadata writeback is extremely prone to triggering ENOSPC
from thin devices, whilst journal IO almost never triggers it.
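As a back-of-envelope illustration of where that 15-16x figure comes
from (the numbers below are just the illustrative ones from above - a
64kB dm-thin block size and 4kB filesystem blocks - not measurements):

/*
 * Back-of-envelope model of dm-thin write amplification, not a
 * measurement - just the arithmetic behind the numbers above.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long thinp_chunk = 64 * 1024;	/* dm-thin block size */
	const unsigned long fs_block = 4 * 1024;	/* fs block size */

	/*
	 * Random 4kB metadata overwrite hitting a chunk still shared
	 * with a snapshot: dm-thin has to COW the whole 64kB chunk to
	 * break sharing, even though only 4kB of it changed.
	 */
	printf("shared chunk COW amplification: %lux\n",
	       thinp_chunk / fs_block);			/* 16x */

	/*
	 * Sequential journal IO: the log was fully written/zeroed by
	 * mkfs, so log writes are plain overwrites of already allocated
	 * LBA space - no device-side allocation, no COW amplification.
	 */
	printf("journal overwrite amplification: %lux\n",
	       fs_block / fs_block);			/* 1x */

	/*
	 * Once a chunk has been COWed it is no longer shared, so later
	 * overwrites landing in the same chunk cost only 4kB - which is
	 * presumably why the observed average sits at 15-16 rather than
	 * exactly 16.
	 */
	return 0;
}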
> Where is this weird model where a storage device returns an out of
> space error and manual user interaction using manual and not online
> trim is going to fix even documented?

I explicitly said that the filesystem needs to remain online when the
thin pool goes ENOSPC so that fstrim (the online filesystem trim
utility) can be run to inform the thin pool exactly where all the free
LBA address space is so it can efficiently free up pool space.

This is a standard procedure that people automate through things like
udev scripts that capture the dm-thin pool low/no space events. You
seem to be trying to create a strawman here....

> > > Normally it means your checksum was wrong.  If you have bit errors
> > > in the cable they will show up again, maybe not on the next I/O
> > > but soon.
> >
> > But it's unlikely to be hit by another cosmic ray anytime soon, and
> > so bit errors caused by completely random environmental events
> > should -absolutely- be retried as the subsequent write retry will
> > succeed.
> >
> > If there is a dodgy cable causing the problems, the error will
> > re-occur on random IOs and we'll emit write errors to the log that
> > monitoring software will pick up. If we are repeatedly issuing write
> > errors due to EILSEQ errors, then that's a sign the hardware needs
> > replacing.
>
> Umm, all the storage protocols do have pretty good checksums.

The strength of the checksum is irrelevant. It's what we do when it
detects a bit error that is being discussed.

> A cosmic ray isn't going to fail them it is something more fundamental
> like broken hardware or connections.  In other words you are going to
> see this again and again pretty frequently.

I've seen plenty of one-off, unexplainable, unreproducible IO errors
because of random bit errors over the past 20+ years. But what causes
them is irrelevant - the fact is that they do occur, and we cannot
know whether they are transient or persistent from a single IO
context.

Hence the only decision that can be made from IO completion context is
"retry or fail this IO". We default to "retry" for metadata writeback
because that automatically handles transient errors correctly.

IOWs, if it is actually broken hardware, then the fact we may retry
individual failed IOs in a non-critical path is irrelevant. If the
errors persist and/or are widespread, then we will get an error in a
critical path and shut down at that point. This means the architecture
is naturally resilient against transient write errors, regardless of
their cause.

We want XFS to be resilient; we do not want it to be brittle or
fragile in environments that are slightly less than perfect, unless
that is the way the admin wants it to behave. We give the admin the
option to choose how their filesystems respond to such errors, but we
default to the most resilient settings for everyone else.
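To spell out that "retry or fail" decision, here's a minimal sketch of
the policy described above. It's an illustrative userspace model, not
the actual XFS buffer error handling code; the real per-error-class
tunables are exposed (on recent kernels) under
/sys/fs/xfs/<dev>/error/, as described in
Documentation/admin-guide/xfs.rst.

/*
 * Minimal model of the "retry or fail" decision made at metadata
 * writeback IO completion.  Illustrative only - not the XFS
 * implementation.  max_retries == -1 means "retry forever" (the
 * resilient default in this model), 0 means "fail immediately",
 * N > 0 means give up after N retries.
 */
#include <stdio.h>

struct error_cfg {
	int	max_retries;		/* -1 = retry forever */
	int	retry_timeout_secs;	/* 0 = no time limit */
};

enum io_action { IO_RETRY, IO_FAIL };

static enum io_action
metadata_write_error(const struct error_cfg *cfg, int retries_done,
		     int secs_since_first_error)
{
	if (cfg->max_retries == 0)
		return IO_FAIL;		/* admin chose fail-fast */
	if (cfg->max_retries > 0 && retries_done >= cfg->max_retries)
		return IO_FAIL;
	if (cfg->retry_timeout_secs > 0 &&
	    secs_since_first_error >= cfg->retry_timeout_secs)
		return IO_FAIL;
	return IO_RETRY;		/* transient errors resolve on retry */
}

int main(void)
{
	struct error_cfg resilient = { .max_retries = -1 };	/* default */
	struct error_cfg failfast  = { .max_retries = 0 };

	/* ENOSPC from a thin pool that the admin later expands/trims: */
	printf("resilient config: %s\n",
	       metadata_write_error(&resilient, 100, 600) == IO_RETRY ?
			"retry" : "fail");
	printf("fail-fast config: %s\n",
	       metadata_write_error(&failfast, 0, 0) == IO_RETRY ?
			"retry" : "fail");
	return 0;
}

Completion context only ever picks between those two outcomes;
escalation to shutdown happens separately, and only when an error is
hit in a critical path.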
> > There is no risk to filesystem integrity if write retries
> > succeed, and that gives the admin time to schedule downtime to
> > replace the dodgy hardware. That's much better behaviour than
> > unexpected production system failure in the middle of the night...
> >
> > It is because we have robust and resilient error handling in the
> > filesystem that the system is able to operate correctly in these
> > marginal situations. Operating in marginal conditions or as hardware
> > is beginning to fail is a necessary to keep production systems
> > running until corrective action can be taken by the administrators.
>
> I'd really like to see a format writeup of your theory of robust error
> handling where that robustness is centered around the fairly rare
> case of metadata writeback and applications dealing with I/O errors,
> while journal write errors and read error lead to shutdown.

.... and there's the strawman argument, along with a demand for formal
proofs as the only acceptable way to defend against it.

> Maybe I'm missing something important, but the theory does not sound
> valid, and we don't have any testing framework that actually verifies
> it.

I think you are being intentionally obtuse, Christoph. I wrote this
for XFS back in *2008*:

https://web.archive.org/web/20140907100223/http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption

The "exception handling" section is probably the most relevant here.
Whilst the contents are not directly about this particular discussion,
the point is that we've always considered there to be types of IO
errors that are transient in nature. I will quote part of that
section:

"Furthermore, the storage subsystem plays a part in deciding how to
handle errors. The reason is that in many storage configurations I/O
errors can be transient. For example, in a SAN a broken fibre can
cause a failover to a redundant path, however the inflight I/O on the
failed path is usually timed out and an error returned. We don't want
to shut down the filesystem on such an error - we want to wait for
failover to a redundant path and then retry the I/O. If the failover
succeeds, then the I/O will succeed. Hence any robust method of
exception handling needs to consider that I/O exceptions may be
transient."

The point I am making is that the entire architecture of the current
V5 on-disk format, the verification architecture and the scrub/online
repair infrastructure was very much based on the storage device model
that *IO errors may be transient*.

> > Failing to recognise that transient and "maybe-transient" errors can
> > generally be handled cleanly and successfully with future write
> > retries leads to brittle, fragile systems that fall over at the
> > first sign of anything going wrong. Filesystems that are targetted
> > at high value production systems and/or running mission critical
> > applications needs to have resilient and robust error handling.
>
> What known transient errors do you think XFS (or any other file system)
> actually handles properly?  Where is the contract that these errors
> actually are transient.

Nope, I'm not going to play the "I demand that you prove the behaviour
that has existed in XFS for over 30 years is correct" game, Christoph.

If you want to change the underlying IO error handling model that XFS
has been based on since it was first designed back in the 1990s, then
it's on you to prove to every filesystem developer that IO errors
reported from the block layer can *never be transient*.

Indeed, please provide us with the "contract" that says block devices
and storage devices are not allowed to expose transient IO errors to
higher layers. Then you need to show that ENOSPC from a dm-thin device
is *forever*, and never goes away, and justify that behaviour as being
in the best interests of users despite the ease of pool expansion to
make ENOSPC go away.....

It is on you to prove that the existing model is wrong and needs
fixing, not for us to prove to you that the existing model is correct.
> > > And even applications that fsync won't see you fancy error code.  The
> > > only thing stored in the address_space for fsync to catch is EIO and
> > > ENOSPC.
> >
> > The filesystem knows exactly what the IO error reported by the block
> > layer is before we run folio completions, so we control exactly what
> > we want to report as IO completion status.
>
> Sure, you could invent a scheme to propagate the exact error.  For
> direct I/O we even return the exact error to userspace.  But that
> means we actually have a definition of what each error means, and how
> it could be handled.  None of that exists right now.  We could do
> all this, but that assumes you actually have:
>
>  a) a clear definition of a problem
>  b) a good way to fix that problem
>  c) good testing infrastructure to actually test it, because without
>     that all good intentions will probably cause more problems than
>     they solve
>
> > Hence the bogosities of error propagation to userspace via the
> > mapping is completely irrelevant to this discussion/feature because
> > it would be implemented below the layer that squashes the eventual
> > IO errno into the address space...
>
> How would implement and test all this?  And for what use case?

I don't care; it's not my problem to solve, and I don't care if
nothing comes of it. A fellow developer asked for advice, and I simply
suggested following an existing model we already have infrastructure
for. Now you are demanding that I prove the existing decades-old model
is valid, and then tell you how to solve the OG's problem and make it
all work. None of this is my problem, regardless of how much you try
to make it so.

Really, though, I don't know why you think that transient errors don't
exist anymore, nor why you are demanding that I prove that they do
when it is abundantly clear that ENOSPC from dm-thin can definitely be
a transient error.

Perhaps you can provide some background on why you are asserting that
there is no such thing as a transient IO error so we can all start
from a common understanding?

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx