Re: ext4 metadata corruption - snapshot related?

We updated a machine to a newer kernel (6.15.2-1.el8.elrepo.x86_64), and what looks like the same bug reoccurred after some time.

The error was the following:
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932
Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 44997932

Any ideas on how this could be debugged further?

Thanks
Jean-Louis

On 12/06/2025 16:43, Jean-Louis Dupond wrote:
Hi,

We have around 200 VMs running on qemu (on an AlmaLinux 9 based hypervisor).
All those VMs were migrated from physical machines recently.

But when we enable backups on those VMs (which triggers snapshots), we notice some weird/random ext4 corruption within the VM itself.
The VM itself runs CloudLinux 8 (4.18.0-553.40.1.lve.el8.x86_64 kernel).

These are some examples of the corruption we see:
1)
kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: inode #19280823: comm lsphp: Directory block failed checksum
kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode #19280823: comm lsphp: Directory block failed checksum
kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: inode #19280820: comm lsphp: Directory block failed checksum
kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode #19280820: comm lsphp: Directory block failed checksum

2)
kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454
kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: comm lsphp: deleted inode referenced: 49422454

3)
kernel: EXT4-fs error (device sdb1): ext4_validate_block_bitmap:384: comm kworker/u240:3: bg 308: bad block bitmap checksum
kernel: EXT4-fs (sdb1): Delayed block allocation failed for inode 2513946 at logical offset 2 with max blocks 1 with error 74
kernel: EXT4-fs (sdb1): This should not happen!! Data will be lost
kernel: EXT4-fs (sdb1): Inode 2513946 (00000000265d63ca): i_reserved_data_blocks (1) not cleared!
kernel: EXT4-fs (sdb1): error count since last fsck: 1
kernel: EXT4-fs (sdb1): initial error at time 1747923211: ext4_validate_block_bitmap:384
kernel: EXT4-fs (sdb1): last error at time 1747923211: ext4_validate_block_bitmap:384
kernel: EXT4-fs (sdb1): error count since last fsck: 1
kernel: EXT4-fs (sdb1): initial error at time 1747923211: ext4_validate_block_bitmap:384
kernel: EXT4-fs (sdb1): last error at time 1747923211: ext4_validate_block_bitmap:384

4)
kernel: EXT4-fs (sdc1): error count since last fsck: 4
kernel: EXT4-fs (sdc1): initial error at time 1746616017: ext4_validate_block_bitmap:384
kernel: EXT4-fs (sdc1): last error at time 1746621676: ext4_mb_generate_buddy:808
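
For reference, the raw numbers in these messages can be decoded. Below is a minimal sketch in Python, assuming the "at time" values are seconds since the Unix epoch and that "error 74" is the Linux errno EBADMSG, which ext4 uses internally as EFSBADCRC (checksum failure). The helper names are made up for illustration.

```python
from datetime import datetime, timezone

# Linux errno values that ext4 reuses for filesystem-level failures
# (fs/ext4/ext4.h defines EFSBADCRC as EBADMSG and EFSCORRUPTED as EUCLEAN).
EXT4_ERRNO = {
    74: "EBADMSG (EFSBADCRC: bad checksum detected)",
    117: "EUCLEAN (EFSCORRUPTED: filesystem is corrupted)",
}


def decode_time(epoch_seconds):
    """Render an 'error at time N' value as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def decode_errno(code):
    """Map an error number from the log to its ext4 meaning, if known here."""
    return EXT4_ERRNO.get(code, f"errno {code} (not in this small table)")


if __name__ == "__main__":
    # Timestamps copied from examples 3 and 4 above.
    for ts in (1747923211, 1746616017, 1746621676):
        print(ts, "->", decode_time(ts))
    print("error 74 ->", decode_errno(74))
```

Having the error times in UTC makes it easier to line them up against the backup/snapshot schedule on the hypervisor.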


Now, as a test, we upgraded to a newer (backported) kernel, more specifically: 5.14.0-284.1101
And after doing some backups again, we had another error:

kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752060: comm tar: Directory block failed checksum
kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752232: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752232: comm tar: Directory block failed checksum
kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752064: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752064: comm tar: Directory block failed checksum
kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: inode #34752167: comm tar: No space for directory leaf checksum. Please run e2fsck -D.
kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: inode #34752167: comm tar: Directory block failed checksum
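
To get an overview of which devices, functions and inodes are hit, messages like the ones above could be tallied with a small script. This is only a rough sketch in Python: the regular expression is derived from the message layout seen in this mail (it is not an official format description), and `tally` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the prefix of "EXT4-fs error/warning (device X): func:line: ..."
# messages as they appear in this mail. The "inode #N: " part is optional
# because some messages (e.g. bitmap checksum errors) do not carry it.
EXT4_MSG = re.compile(
    r"EXT4-fs (?P<level>error|warning) \(device (?P<dev>\w+)\): "
    r"(?P<func>\w+):(?P<line>\d+): "
    r"(?:inode #(?P<inode>\d+): )?"
    r"comm (?P<comm>\S+): "
)


def tally(log_text):
    """Count messages per (device, function, inode, comm) tuple.

    Uses finditer over the whole text, so it does not matter whether
    several messages ended up on one line.
    """
    counts = Counter()
    for m in EXT4_MSG.finditer(log_text):
        counts[(m["dev"], m["func"], m["inode"], m["comm"])] += 1
    return counts


if __name__ == "__main__":
    # Sample text copied from examples 2 and 3 earlier in this mail.
    sample = (
        "kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: "
        "comm lsphp: deleted inode referenced: 49422454 "
        "kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode #49419787: "
        "comm lsphp: deleted inode referenced: 49422454 "
        "kernel: EXT4-fs error (device sdb1): ext4_validate_block_bitmap:384: "
        "comm kworker/u240:3: bg 308: bad block bitmap checksum"
    )
    for key, n in tally(sample).most_common():
        print(n, key)
```

Messages without the "(device ...)" part, such as the "error count since last fsck" lines, are deliberately ignored by this pattern.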


So now we are wondering what could cause this corruption.
- We have more VMs on the same kind of setup without seeing any corruption. The only differences are that those VMs run Debian, have smaller disks, and do not use quota.
- If we disable backups/snapshots, no corruption is observed
- Even if we disable the qemu-guest-agent (so no fsfreeze is executed), the corruption still occurs

We (for now at least) only see the corruption on filesystems where quota is enabled (both usrjquota and usrquota).
The filesystems are between 600GB and 2TB.
And today I noticed that, as the filesystems are resized during setup, the journal size is only 64M (could this potentially be an issue?).

The big question in this whole story is: could it be an in-guest (ext4?) bug/issue, or do we really need to look into the layer below (i.e. qemu/hypervisor)? If somebody has other ideas, feel free to share! Pointers to additional things that could help troubleshoot the issue are also welcome.

Thanks
Jean-Louis



