Hello! On Fri 12-09-25 11:28:13, Zhang Yi wrote: > Gao Xiang recently discovered a data corruption issue in **nojournal** > mode. After analysis, we found that the problem is after a metadata > block is freed, it can be immediately reallocated as a data block. > However, the metadata on this block may still be in the process of being > written back, which means the new data in this block could potentially > be overwritten by the stale metadata. > > When releasing a metadata block, ext4_forget() calls bforget() in > nojournal mode, which clears the dirty flag on the buffer_head. If the > metadata has not yet started to be written back at this point, there is > no issue. However, if the write-back has already begun but the I/O has > not yet completed, ext4_forget() will have no effect, and the subsequent > ext4_mb_clear_bb() will immediately return the block to the mb > allocator. This block can then be immediately reallocated, potentially > triggering a data loss issue. Yes, I agree this can be a problem. > This issue is somewhat related to this patch set[1] that have been > merged. Before this patch set, clean_bdev_aliases() and > clean_bdev_bh_alias() could ensure that the dirty flag of the block > device buffer was cleared and the write-back was completed before using > newly allocated blocks in most cases. However, this patch set have fixed > a similar issues in journal mode and removed this safeguard because it's > fragile and misses some corner cases[2], increasing the likelihood of > triggering this issue. Right. > Furthermore, I found that this issue theoretically still appears to > persist even in **ordered** journal mode. In the final else branch of > jbd2_journal_forget(), if the metadata block to be released is also > undergoing a write-back, jbd2_journal_forget() will add this buffer to > the current transaction for forgetting. Once the current transaction is > committed, the block can then be reallocated. However, there is no > guarantee that the ongoing I/O will complete. Typically, the undergoing > metadata writeback I/O does not take this long to complete, but it might > be throttled by the block layer or delayed due to anomalies in some slow > I/O processes in the underlying devices. Therefore, although it is > difficult to trigger, it theoretically still exists. I don't think this can actually happen. For writeback to be happening on a buffer it still has to be part of a checkpoint list of some transaction. That means we'll call jbd2_journal_try_remove_checkpoint() which will lock the buffer and that's enough to make sure the buffer writeback has either completed or not yet started. If I missed some case, please tell me. > Consider the fix for now. In the **ordered** journal mode, I suppose we > can add a wait_on_buffer() during the process of the freed buffer in > jbd2_journal_commit_transaction(). This should not significantly impact > performance. In **nojorunal** mode, I do not want to reintroduce > clean_bdev_aliases(). One approach is to add wait_on_buffer() in > __ext4_forget(), but I am concerned that this might impact performance. > However, it seems reasonable to wait for ongoing I/O to complete before > freeing the buffer. I agree calling wait_on_buffer() before calling __bforget() is the best fix for the problem in nojournal mode. Yes, it can slow down some cases where we free metadata blocks that we recently modified but I think it should be relatively rare. > Otherwise, another solution is we may need to > implement an asynchronous release process that returns the block to the > buddy system only after the I/O operation has completed. However, since > the write-back is triggered by bdev, it appears to be hard to implement > this solution now. What do people think? Yes, that will get rather complicated. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR