Re: [PATCH -next] ext4: add an update to i_disksize in ext4_block_page_mkwrite

Sun Yongjian <sunyongjian1@xxxxxxxxxx> · Sat, 6 Sep 2025 20:27:03 +0800

On 2025/9/5 20:58, Jan Kara wrote:
On Fri 05-09-25 11:25:49, Sun Yongjian wrote:
On 2025/9/4 17:11, Jan Kara wrote:
On Mon 01-09-25 15:01:45, Sun Yongjian wrote:
On 2025/7/31 22:05, sunyongjian@xxxxxxxxxxxxxxx wrote:
Gentle ping.
From: Yongjian Sun <sunyongjian1@xxxxxxxxxx>

After running a stress test combined with fault injection,
we performed fsck -a followed by fsck -fn on the filesystem
image. During the second pass, fsck -fn reported:

Inode 131512, end of extent exceeds allowed value
	(logical block 405, physical block 1180540, len 2)

This inode was not in the orphan list. Analysis revealed the
following call chain that leads to the inconsistency:

                                ext4_da_write_end()
                                 //does not update i_disksize
                                ext4_punch_hole()
                                 //truncate folio, keep size
ext4_page_mkwrite()
    ext4_block_page_mkwrite()
     ext4_block_write_begin()
       ext4_get_block()
        //insert written extent without updating i_disksize
journal commit
echo 1 > /sys/block/xxx/device/delete

The da-write path updates i_size but does not update i_disksize. Then
ext4_punch_hole truncates the da-folio yet still leaves i_disksize
unchanged. Then ext4_page_mkwrite sees ext4_nonda_switch return 1
and takes the nodioread_nolock path. The folio about to be written
has just been punched out, and its offset sits beyond the current
i_disksize. This may result in a written extent being inserted, again
without updating i_disksize. If the journal gets committed and the
block device is then removed, we can end up with the inconsistency
that fsck reported above.

To fix this, we now check in ext4_block_page_mkwrite whether
i_disksize needs to be updated to cover the newly allocated blocks.

Signed-off-by: Yongjian Sun <sunyongjian1@xxxxxxxxxx>

OK, after the discussion with Ritesh your solution looks like the best one.
Just two nits below:

---
    fs/ext4/inode.c | 10 ++++++++++
    1 file changed, 10 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ed54c4d0f2f9..050270b265ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6666,8 +6666,18 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
    		goto out_error;
    	if (!ext4_should_journal_data(inode)) {
+		loff_t disksize = folio_pos(folio) + len;

Use an empty line between declarations and the code please.

    		block_commit_write(folio, 0, len);
    		folio_mark_dirty(folio);
+		if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
+			down_write(&EXT4_I(inode)->i_data_sem);
+			if (disksize > EXT4_I(inode)->i_disksize)
+				EXT4_I(inode)->i_disksize = disksize;
+			up_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_mark_inode_dirty(handle, inode);
+			if (ret)
+				goto out_error;
+		}

Since we don't support delalloc with data journalling, your code is correct
but I think it would be more understandable if you just moved the
i_disksize update outside of the "if (!ext4_should_journal_data(inode))"
condition.
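
To make the suggested restructuring concrete, here is a rough userspace
model of it, not kernel code. The names toy_inode, pos_t and
toy_mkwrite_commit() are hypothetical and exist only for this sketch.
Locking (i_data_sem) and the journal handle are left out; the return
value only marks the point where the real code would go on to call
ext4_mark_inode_dirty().

```c
#include <assert.h>
#include <stdbool.h>

typedef long long pos_t;	/* stand-in for loff_t */

/* Toy inode: only the two sizes that matter for this discussion. */
struct toy_inode {
	pos_t i_size;		/* in-memory file size */
	pos_t i_disksize;	/* size that would be recorded on disk */
};

/*
 * Model of the tail of ext4_block_page_mkwrite() with the i_disksize
 * update moved out of the "!ext4_should_journal_data()" branch.
 * Returns true when i_disksize was raised.
 */
static bool toy_mkwrite_commit(struct toy_inode *inode, pos_t folio_pos,
			       pos_t len, bool journal_data)
{
	pos_t disksize;

	assert(folio_pos >= 0 && len > 0);
	disksize = folio_pos + len;

	/* The data mode picks how the folio is committed, nothing more. */
	(void)journal_data;

	/* Only ever grow i_disksize, never shrink it. */
	if (disksize > inode->i_disksize) {
		inode->i_disksize = disksize;
		return true;
	}
	return false;
}
```

In this model a fault on the folio at offset 4096 with len 4096 raises
i_disksize to 8192 in either data mode, and a later fault on a folio
below that offset leaves it alone.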

    	} else {
    		ret = ext4_journal_folio_buffers(handle, folio, len);
    		if (ret)


								Honza
Thanks for the review, I will send a patch to improve this!

Yesterday on the ext4 developers call we discussed this further, and Ted
came up with a different way of addressing the issue which might be even
better. Instead of updating i_disksize in ext4_page_mkwrite(), we can
update i_disksize already during the hole punch. I.e., we can modify
ext4_update_disksize_before_punch() to always increase i_disksize to offset
+ len. That should deal with the problem as well, and we would avoid
updating i_disksize from page_mkwrite(), which is a bit of an awkward
special case.

								Honza
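
For reference, the hole-punch variant can be modelled in userspace
roughly as below. This is only a sketch of the idea, not the eventual
patch: toy_inode, pos_t and toy_update_disksize_before_punch() are
hypothetical names, locking and journalling are omitted, and clamping
the new value at i_size is an assumption of the sketch that was not
spelled out in the discussion.

```c
#include <assert.h>

typedef long long pos_t;	/* stand-in for loff_t */

/* Toy inode: only the two sizes that matter for this discussion. */
struct toy_inode {
	pos_t i_size;		/* in-memory file size */
	pos_t i_disksize;	/* size that would be recorded on disk */
};

static pos_t min_pos(pos_t a, pos_t b)
{
	return a < b ? a : b;
}

/*
 * Model of the proposed behaviour: before punching [offset, offset + len),
 * raise i_disksize so that it covers the end of the punched range.
 * ASSUMPTION: the new value is clamped at i_size.
 * Returns 1 if i_disksize was raised, 0 if it already covered the range.
 */
static int toy_update_disksize_before_punch(struct toy_inode *inode,
					    pos_t offset, pos_t len)
{
	pos_t end;

	assert(offset >= 0 && len >= 0);
	end = min_pos(offset + len, inode->i_size);

	if (end <= inode->i_disksize)
		return 0;

	inode->i_disksize = end;
	return 1;
}
```

With i_size = 12288 and i_disksize = 0, punching [4096, 8192) raises
i_disksize to 8192, so a later page fault that allocates blocks for that
folio no longer lands beyond i_disksize.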

I believe this is a more elegant approach to the matter, let's try it!