Hi, all! Gao Xiang recently discovered a data corruption issue in **nojournal** mode. After analysis, we found that the problem is after a metadata block is freed, it can be immediately reallocated as a data block. However, the metadata on this block may still be in the process of being written back, which means the new data in this block could potentially be overwritten by the stale metadata. When releasing a metadata block, ext4_forget() calls bforget() in nojournal mode, which clears the dirty flag on the buffer_head. If the metadata has not yet started to be written back at this point, there is no issue. However, if the write-back has already begun but the I/O has not yet completed, ext4_forget() will have no effect, and the subsequent ext4_mb_clear_bb() will immediately return the block to the mb allocator. This block can then be immediately reallocated, potentially triggering a data loss issue. This issue is somewhat related to this patch set[1] that have been merged. Before this patch set, clean_bdev_aliases() and clean_bdev_bh_alias() could ensure that the dirty flag of the block device buffer was cleared and the write-back was completed before using newly allocated blocks in most cases. However, this patch set have fixed a similar issues in journal mode and removed this safeguard because it's fragile and misses some corner cases[2], increasing the likelihood of triggering this issue. Furthermore, I found that this issue theoretically still appears to persist even in **ordered** journal mode. In the final else branch of jbd2_journal_forget(), if the metadata block to be released is also undergoing a write-back, jbd2_journal_forget() will add this buffer to the current transaction for forgetting. Once the current transaction is committed, the block can then be reallocated. However, there is no guarantee that the ongoing I/O will complete. Typically, the undergoing metadata writeback I/O does not take this long to complete, but it might be throttled by the block layer or delayed due to anomalies in some slow I/O processes in the underlying devices. Therefore, although it is difficult to trigger, it theoretically still exists. Consider the fix for now. In the **ordered** journal mode, I suppose we can add a wait_on_buffer() during the process of the freed buffer in jbd2_journal_commit_transaction(). This should not significantly impact performance. In **nojorunal** mode, I do not want to reintroduce clean_bdev_aliases(). One approach is to add wait_on_buffer() in __ext4_forget(), but I am concerned that this might impact performance. However, it seems reasonable to wait for ongoing I/O to complete before freeing the buffer. Otherwise, another solution is we may need to implement an asynchronous release process that returns the block to the buddy system only after the I/O operation has completed. However, since the write-back is triggered by bdev, it appears to be hard to implement this solution now. What do people think? In the long run, I think metadata buffer and writeback should be managed by ext4 and jbd2 themselves, rather than through bdev. [1] https://lore.kernel.org/linux-ext4/1548830980-29482-1-git-send-email-yi.zhang@xxxxxxxxxx/ [2] https://lore.kernel.org/linux-ext4/942a44fe-4350-2f59-9913-c47ee6ff9031@xxxxxxxxxx/ Best Regards, Yi.