Re: BUG report: an ext4 data corruption issue in nojournal mode

Jan Kara <jack@xxxxxxx> · Fri, 12 Sep 2025 16:42:51 +0200

Hello!

On Fri 12-09-25 11:28:13, Zhang Yi wrote:
> Gao Xiang recently discovered a data corruption issue in **nojournal**
> mode. After analysis, we found that the problem is after a metadata
> block is freed, it can be immediately reallocated as a data block.
> However, the metadata on this block may still be in the process of being
> written back, which means the new data in this block could potentially
> be overwritten by the stale metadata.
> 
> When releasing a metadata block, ext4_forget() calls bforget() in
> nojournal mode, which clears the dirty flag on the buffer_head. If the
> metadata has not yet started to be written back at this point, there is
> no issue. However, if the write-back has already begun but the I/O has
> not yet completed, ext4_forget() will have no effect, and the subsequent
> ext4_mb_clear_bb() will immediately return the block to the mb
> allocator. This block can then be immediately reallocated, potentially
> triggering a data loss issue.

Yes, I agree this can be a problem.

> This issue is somewhat related to this patch set[1] that have been
> merged. Before this patch set, clean_bdev_aliases() and
> clean_bdev_bh_alias() could ensure that the dirty flag of the block
> device buffer was cleared and the write-back was completed before using
> newly allocated blocks in most cases. However, this patch set have fixed
> a similar issues in journal mode and removed this safeguard because it's
> fragile and misses some corner cases[2], increasing the likelihood of
> triggering this issue.

Right.

> Furthermore, I found that this issue theoretically still appears to
> persist even in **ordered** journal mode. In the final else branch of
> jbd2_journal_forget(), if the metadata block to be released is also
> undergoing a write-back, jbd2_journal_forget() will add this buffer to
> the current transaction for forgetting. Once the current transaction is
> committed, the block can then be reallocated. However, there is no
> guarantee that the ongoing I/O will complete. Typically, the undergoing
> metadata writeback I/O does not take this long to complete, but it might
> be throttled by the block layer or delayed due to anomalies in some slow
> I/O processes in the underlying devices. Therefore, although it is
> difficult to trigger, it theoretically still exists.

I don't think this can actually happen. For writeback to be happening on a
buffer it still has to be part of a checkpoint list of some transaction.
That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
the buffer and that's enough to make sure the buffer writeback has either
completed or not yet started. If I missed some case, please tell me.

> Consider the fix for now. In the **ordered** journal mode, I suppose we
> can add a wait_on_buffer() during the process of the freed buffer in
> jbd2_journal_commit_transaction(). This should not significantly impact
> performance. In **nojorunal** mode, I do not want to reintroduce
> clean_bdev_aliases(). One approach is to add wait_on_buffer() in
> __ext4_forget(), but I am concerned that this might impact performance.
> However, it seems reasonable to wait for ongoing I/O to complete before
> freeing the buffer.

I agree calling wait_on_buffer() before calling __bforget() is the best fix
for the problem in nojournal mode. Yes, it can slow down some cases where
we free metadata blocks that we recently modified but I think it should be
relatively rare.

> Otherwise, another solution is we may need to
> implement an asynchronous release process that returns the block to the
> buddy system only after the I/O operation has completed. However, since
> the write-back is triggered by bdev, it appears to be hard to implement
> this solution now. What do people think?

Yes, that will get rather complicated.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR