On 2025/6/20 0:21, Jan Kara wrote:
> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> During the process of writing back folios, if
>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>> inconsistency in environments where the block size is smaller than the
>> folio size.
>>
>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>> some buffers may have already be mapped. If we exit the mapping loop
>> prematurely, the folio data within the mapped range will not be written
>> back, and the file's disk size will not be updated. Once the transaction
>> that includes this range of extents is committed, this can lead to stale
>> data or filesystem inconsistency.
>>
>> Fix this by submitting the current processing partial mapped folio and
>> update the disk size to the end of the mapped range.
>>
>> Suggested-by: Jan Kara <jack@xxxxxxx>
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3a086fee7989..d0db6e3bf158 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>  	return 0;
>>  }
>>  
>> +/*
>> + * This is used to submit mapped buffers in a single folio that is not fully
>> + * mapped for various reasons, such as insufficient space or journal credits.
>> + */
>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>> +{
>> +	struct inode *inode = mpd->inode;
>> +	struct folio *folio;
>> +	int ret;
>> +
>> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>> +	if (IS_ERR(folio))
>> +		return PTR_ERR(folio);
>> +
>> +	ret = mpage_submit_folio(mpd, folio);
>> +	if (ret)
>> +		goto out;
>> +	/*
>> +	 * Update first_page to prevent this folio from being released in
>> +	 * mpage_release_unused_pages(), it should not equal to the folio
>> +	 * index.
>> +	 *
>> +	 * The first_page will be reset to the aligned folio index when this
>> +	 * folio is written again in the next round. Additionally, do not
>> +	 * update wbc->nr_to_write here, as it will be updated once the
>> +	 * entire folio has finished processing.
>> +	 */
>> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
> 
> Well, but there can be many folios between mpd->first_page and pos. And
> this way you avoid cleaning them up (unlocking them and dropping elevated
> refcount) before we restart next loop. How is this going to work?
> 

Hmm, I don't think there can be many folios between mpd->first_page and
pos. All of the fully mapped folios should be unlocked by
mpage_folio_done(), and there is no elevated refcount since we always
call folio_batch_release() once we finish processing the folios.
mpage_release_unused_pages() is used to clean up unsubmitted folios.

For example, suppose we have a 4kB block size filesystem and we find 4
order-2 folios that need to be mapped in mpage_prepare_extent_to_map().

 first_page               next_page
  |                        |
 [HHHH][HHHH][HHHH][HHHH]     H: hole    L: locked
  LLLL  LLLL  LLLL  LLLL

In the first round in mpage_map_and_submit_extent(), we map the first
two folios along with the first two pages of the third folio.
mpage_map_and_submit_buffers() then submits and unlocks the first two
folios and updates mpd->first_page to the beginning of the third folio.

             first_page   next_page
              |           |
 [WWWW][WWWW][WWHH][HHHH]     H: hole    L: locked
  UUUU  UUUU  LLLL  LLLL      W: mapped  U: unlocked

In the second round in mpage_map_and_submit_extent(), we fail to map
the remaining blocks and call mpage_submit_buffers() to submit and
unlock this partially mapped folio. Additionally, we advance
mpd->first_page past the submitted pages.

           first_page  next_page
                |       /
 [WWWW][WWWW][WWHH][HHHH]     H: hole    L: locked
  UUUU  UUUU  UUUU  LLLL      W: mapped  U: unlocked

Then, we break out to ext4_do_writepages(), which calls
mpage_release_unused_pages() to unlock the last folio.

           first_page  next_page
                |       /
 [WWWW][WWWW][WWHH][HHHH]     H: hole    L: locked
  UUUU  UUUU  UUUU  UUUU      W: mapped  U: unlocked

In the next round in ext4_do_writepages(), mpage_prepare_extent_to_map()
restarts processing at the third folio and resets mpd->first_page to its
beginning.

             first_page   next_page
              |           |
 [WWWW][WWWW][WWHH][HHHH]     H: hole    L: locked
  UUUU  UUUU  LLLL  LLLL      W: mapped  U: unlocked
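To make the index arithmetic above concrete, here is a small userspace
sketch (not kernel code: PAGE_SIZE, PAGE_SHIFT and round_up() are
redefined locally, and the folio/extent numbers are made up to match the
example) of what mpage_submit_buffers() would leave in mpd->first_page
for the partially mapped third folio:

#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)	/* 4kB pages, blocksize == pagesize */

/* local helper, same result as the kernel macro for power-of-two alignment */
static unsigned long round_up(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

int main(void)
{
	unsigned long folio_index = 8;		/* third order-2 folio covers pages 8..11 */
	unsigned long folio_pages = 4;
	unsigned long mapped_pages = 10;	/* the extent only covered pages 0..9 */

	/* pos: byte offset right after the last mapped block */
	unsigned long pos = mapped_pages << PAGE_SHIFT;

	/* what mpage_submit_buffers() stores in mpd->first_page */
	unsigned long first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;

	/* first_page lands past the two submitted pages of the folio... */
	assert(first_page == 10);
	assert(first_page != folio_index);
	/* ...but pos still falls inside that folio */
	assert(pos >> PAGE_SHIFT >= folio_index &&
	       pos >> PAGE_SHIFT < folio_index + folio_pages);

	printf("pos=%lu first_page=%lu folio->index=%lu\n",
	       pos, first_page, folio_index);
	return 0;
}

first_page ends up two pages past the folio start but still inside the
folio, which is exactly the state the WARN_ON_ONCE() in the patch
expects (first_page differs from folio->index and pos still falls within
the folio).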
> Also I don't see in this patch where mpd->first_page would get set back to
> retry writing this folio. What am I missing?
> 

We already have this; please refer to the following section of
mpage_prepare_extent_to_map():

static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
	pgoff_t index = mpd->first_page;
	...
	mpd->map.m_len = 0;
	mpd->next_page = index;
	...
	while (index <= end) {
		...
		if (mpd->map.m_len == 0)
			mpd->first_page = folio->index;
		...
	}

>> +	WARN_ON_ONCE((folio->index == mpd->first_page) ||
>> +		     !folio_contains(folio, pos >> PAGE_SHIFT));
>> +out:
>> +	folio_unlock(folio);
>> +	folio_put(folio);
>> +	return ret;
>> +}
>> +
>>  /*
>>   * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
>>   * mpd->len and submit pages underlying it for IO
>> @@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  			 */
>>  			if ((err == -ENOMEM) ||
>>  			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
>> -				if (progress)
>> +				/*
>> +				 * We may have already allocated extents for
>> +				 * some bhs inside the folio, issue the
>> +				 * corresponding data to prevent stale data.
>> +				 */
>> +				if (progress) {
>> +					if (mpage_submit_buffers(mpd, disksize))
>> +						goto invalidate_dirty_pages;
>>  					goto update_disksize;
>> +				}
>>  				return err;
>>  			}
>>  			ext4_msg(sb, KERN_CRIT,
>> @@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  			*give_up_on_write = true;
>>  			return err;
>>  		}
>> +		disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
>> +							inode->i_blkbits;
> 
> I don't think setting disksize like this is correct in case
> mpage_map_and_submit_buffers() below fails (when extent covers many folios
> and we don't succeed in writing them all). In that case we may need to keep
> disksize somewhere in the middle of the extent.
> 
> Overall I don't think we need to modify disksize handling here. It is fine
> to leave (part of) the extent dangling beyond disksize until we retry the
> writeback in these rare cases.

OK, this is indeed a rare case. Let's keep it as it is.
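Just to spell out the two disksize candidates being compared in this
hunk, here is a toy userspace calculation (not kernel code; m_lblk,
m_len and first_page are made-up values matching the walkthrough above,
with 4kB blocks and pages):

#include <stdio.h>

int main(void)
{
	unsigned int blkbits = 12, page_shift = 12;	/* 4kB blocks and pages */
	unsigned int m_lblk = 0, m_len = 10;		/* mapped blocks 0..9 */
	unsigned long first_page = 10;			/* first not-yet-written page */

	/* end of the mapped extent, in bytes (the value this hunk assigned) */
	long long disksize_from_extent = (long long)(m_lblk + m_len) << blkbits;

	/* byte offset of the first unwritten page (the existing formula) */
	long long disksize_from_first_page = (long long)first_page << page_shift;

	printf("from extent: %lld bytes, from first_page: %lld bytes\n",
	       disksize_from_extent, disksize_from_first_page);
	return 0;
}

When everything mapped is also submitted the two values coincide; when
submission stops partway through a multi-folio extent, only the
first_page-based value reflects what was actually submitted, which is
the concern raised above.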

Thanks,
Yi.

> 
>>  		progress = 1;
>>  		/*
>>  		 * Update buffer state, submit mapped pages, and get us new
>> @@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  			goto update_disksize;
>>  	} while (map->m_len);
>>  
>> +	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>>  update_disksize:
>>  	/*
>>  	 * Update on-disk size after IO is submitted. Races with
>>  	 * truncate are avoided by checking i_size under i_data_sem.
>>  	 */
>> -	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>>  	if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>>  		int err2;
>>  		loff_t i_size;
> 
> 								Honza