BUG report: an ext4 data corruption issue in nojournal mode

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Fri, 12 Sep 2025 11:28:13 +0800

Hi, all!

Gao Xiang recently discovered a data corruption issue in **nojournal**
mode. After analysis, we found that the problem is after a metadata
block is freed, it can be immediately reallocated as a data block.
However, the metadata on this block may still be in the process of being
written back, which means the new data in this block could potentially
be overwritten by the stale metadata.

When releasing a metadata block, ext4_forget() calls bforget() in
nojournal mode, which clears the dirty flag on the buffer_head. If the
metadata has not yet started to be written back at this point, there is
no issue. However, if the write-back has already begun but the I/O has
not yet completed, ext4_forget() will have no effect, and the subsequent
ext4_mb_clear_bb() will immediately return the block to the mb
allocator. This block can then be immediately reallocated, potentially
triggering a data loss issue.

This issue is somewhat related to this patch set[1] that have been
merged. Before this patch set, clean_bdev_aliases() and
clean_bdev_bh_alias() could ensure that the dirty flag of the block
device buffer was cleared and the write-back was completed before using
newly allocated blocks in most cases. However, this patch set have fixed
a similar issues in journal mode and removed this safeguard because it's
fragile and misses some corner cases[2], increasing the likelihood of
triggering this issue.

Furthermore, I found that this issue theoretically still appears to
persist even in **ordered** journal mode. In the final else branch of
jbd2_journal_forget(), if the metadata block to be released is also
undergoing a write-back, jbd2_journal_forget() will add this buffer to
the current transaction for forgetting. Once the current transaction is
committed, the block can then be reallocated. However, there is no
guarantee that the ongoing I/O will complete. Typically, the undergoing
metadata writeback I/O does not take this long to complete, but it might
be throttled by the block layer or delayed due to anomalies in some slow
I/O processes in the underlying devices. Therefore, although it is
difficult to trigger, it theoretically still exists.

Consider the fix for now. In the **ordered** journal mode, I suppose we
can add a wait_on_buffer() during the process of the freed buffer in
jbd2_journal_commit_transaction(). This should not significantly impact
performance. In **nojorunal** mode, I do not want to reintroduce
clean_bdev_aliases(). One approach is to add wait_on_buffer() in
__ext4_forget(), but I am concerned that this might impact performance.
However, it seems reasonable to wait for ongoing I/O to complete before
freeing the buffer. Otherwise, another solution is we may need to
implement an asynchronous release process that returns the block to the
buddy system only after the I/O operation has completed. However, since
the write-back is triggered by bdev, it appears to be hard to implement
this solution now. What do people think?

In the long run, I think metadata buffer and writeback should be managed
by ext4 and jbd2 themselves, rather than through bdev.

[1] https://lore.kernel.org/linux-ext4/1548830980-29482-1-git-send-email-yi.zhang@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-ext4/942a44fe-4350-2f59-9913-c47ee6ff9031@xxxxxxxxxx/

Best Regards,
Yi.