On Thu, Jun 26, 2025 at 11:56:44PM -0700, Andi Kleen wrote:
> Hi,
>
> I have a spinning disk with XFS that corrupted a sector containing some inodes.
> Reading it always gave a IO error (ENODATA).
>
> xfs_repair unfortunately couldn't handle this at all, running into this
> gem:
>
>                 if (process_inode_chunk(mp, agno, num_inos, first_ino_rec,
>                                 ino_discovery, check_dups, extra_attr_check,
>                                 &bogus))  {
>                         /* XXX - i/o error, we've got a problem */
>                         abort();
>                 }
>
> TBH I was a bit shocked that XFS repair doesn't handle IO errors.
> Surely that's a common occurrence?

Hi Andi.

This behavior is well documented in the xfs_repair man page:

"
   Disk Errors
       xfs_repair aborts on most disk I/O errors. Therefore, if you are trying
       to repair a filesystem that was damaged due to a disk drive failure,
       steps should be taken to ensure that all blocks in the filesystem are
       readable and writable before attempting to use xfs_repair to repair the
       filesystem. A possible method is using dd(8) to copy the data onto a
       good disk.
"

I don't think IO errors could be classified as a common occurrence.

>
> Anyways, what I ended up doing was to use strace to get the seek offset
> of the bad sector and then write a little python program to clear the block
> (which then likely got remapped, or simply rewritten on the medium),
> and apart from a few lost inodes everything was fine.
>
> It seems that xfs_repair should have an option to clear erroring blocks that
> it encounters? I realize that this option could be dangerous, but in many cases
> it would seem like the only way to recover.

I believe one of the problems is that xfsprogs can't really pinpoint what
happened. It could be a transient failure due to a link problem, a bad block on
disk, or whatever else. So it has been designed to bail out and let the admin
handle it.

IMO, adding an option to force 'clear errored blocks', which basically means
forcing a write() on the block so that it could possibly be relocated by the
disk's firmware, is not a good strategy. Depending on how many bad sectors are
on the disk, or the nature of the IO error, this might end up damaging the
filesystem beyond recovery, as you mentioned yourself.
So, in some cases, you either have to force the disk to relocate the block
manually or copy the data that is still good somewhere else, both achievable
with `dd`, for example.

>
> Or at a minimum print the seek offset on an error so that it can be cleared manually.
>

This seems weird. If xfs_repair bailed where you pointed, calling
process_inode_chunk(), it likely bailed from here:

	if (error) {
		do_warn(_("cannot read inode %" PRIu64 ", disk block %" PRId64 ", cnt %d\n"),
			XFS_AGINO_TO_INO(mp, agno, first_irec->ino_startnum),
			XFS_AGB_TO_DADDR(mp, agno, agbno),
			XFS_FSB_TO_BB(mp, M_IGEO(mp)->blocks_per_cluster));
		while (bp_index > 0) {
			bp_index--;
			libxfs_buf_relse(bplist[bp_index]);
		}
		free(bplist);
		return(1);
	}

process_inode_chunk() was supposed to log the inode and disk block. Perhaps the
abort() prevented the stderr buffer from being flushed. Do you still have the
whole xfs_repair output up to the point where it failed?
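
For reference, if the warning above did make it into your output, the manual
route could look roughly like the sketch below. It is only an illustration, not
something xfsprogs ships: the helper names are made up, and it assumes the
"disk block" and "cnt" values in that message are in 512-byte basic blocks
counted from the start of the filesystem's device. Overwriting the range
destroys whatever was stored there, so it is something to try only after
copying everything that is still readable elsewhere.

```python
import os

# Assumption: the "disk block" and "cnt" fields of the xfs_repair warning are
# expressed in 512-byte basic blocks relative to the start of the device.
BASIC_BLOCK_SIZE = 512


def daddr_to_byte_range(disk_block, cnt):
    """Convert a (disk block, cnt) pair into (byte offset, byte length)."""
    return disk_block * BASIC_BLOCK_SIZE, cnt * BASIC_BLOCK_SIZE


def zero_range(path, offset, length):
    """Overwrite `length` bytes at `offset` with zeros and flush them out.

    Hypothetical helper: everything in the range is lost.
    """
    zeros = bytes(length)
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < length:
            # pwrite() may write fewer bytes than requested, so loop.
            written += os.pwrite(fd, zeros[written:], offset + written)
        os.fsync(fd)
    finally:
        os.close(fd)


# Usage (device path and numbers are placeholders, not taken from this thread):
#   offset, length = daddr_to_byte_range(disk_block, cnt)
#   zero_range("/dev/<device>", offset, length)
```

Whether the drive then remaps the sector or just rewrites it in place is up to
its firmware, so re-reading the range afterwards is a reasonable sanity check
before running xfs_repair again.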