On 2025/6/20 23:21, Jan Kara wrote: > On Fri 20-06-25 12:57:18, Zhang Yi wrote: >> On 2025/6/20 0:21, Jan Kara wrote: >>> On Wed 11-06-25 19:16:21, Zhang Yi wrote: >>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>> >>>> During the process of writing back folios, if >>>> mpage_map_and_submit_extent() exits the extent mapping loop due to an >>>> ENOSPC or ENOMEM error, it may result in stale data or filesystem >>>> inconsistency in environments where the block size is smaller than the >>>> folio size. >>>> >>>> When mapping a discontinuous folio in mpage_map_and_submit_extent(), >>>> some buffers may have already be mapped. If we exit the mapping loop >>>> prematurely, the folio data within the mapped range will not be written >>>> back, and the file's disk size will not be updated. Once the transaction >>>> that includes this range of extents is committed, this can lead to stale >>>> data or filesystem inconsistency. >>>> >>>> Fix this by submitting the current processing partial mapped folio and >>>> update the disk size to the end of the mapped range. >>>> >>>> Suggested-by: Jan Kara <jack@xxxxxxx> >>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>> --- >>>> fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++-- >>>> 1 file changed, 48 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>> index 3a086fee7989..d0db6e3bf158 100644 >>>> --- a/fs/ext4/inode.c >>>> +++ b/fs/ext4/inode.c >>>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd) >>>> return 0; >>>> } >>>> >>>> +/* >>>> + * This is used to submit mapped buffers in a single folio that is not fully >>>> + * mapped for various reasons, such as insufficient space or journal credits. >>>> + */ >>>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos) >>>> +{ >>>> + struct inode *inode = mpd->inode; >>>> + struct folio *folio; >>>> + int ret; >>>> + >>>> + folio = filemap_get_folio(inode->i_mapping, mpd->first_page); >>>> + if (IS_ERR(folio)) >>>> + return PTR_ERR(folio); >>>> + >>>> + ret = mpage_submit_folio(mpd, folio); >>>> + if (ret) >>>> + goto out; >>>> + /* >>>> + * Update first_page to prevent this folio from being released in >>>> + * mpage_release_unused_pages(), it should not equal to the folio >>>> + * index. >>>> + * >>>> + * The first_page will be reset to the aligned folio index when this >>>> + * folio is written again in the next round. Additionally, do not >>>> + * update wbc->nr_to_write here, as it will be updated once the >>>> + * entire folio has finished processing. >>>> + */ >>>> + mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT; >>> >>> Well, but there can be many folios between mpd->first_page and pos. And >>> this way you avoid cleaning them up (unlocking them and dropping elevated >>> refcount) before we restart next loop. How is this going to work? >>> >> >> Hmm, I don't think there can be many folios between mpd->first_page and >> pos. All of the fully mapped folios should be unlocked by >> mpage_folio_done(), and there is no elevated since it always call >> folio_batch_release() once we finish processing the folios. > > Indeed. I forgot that mpage_map_one_extent() with shorten mpd->map->m_len > to the number of currently mapped blocks. > >> mpage_release_unused_pages() is used to clean up unsubmitted folios. >> >> For example, suppose we have a 4kB block size filesystem and we found 4 >> order-2 folios need to be mapped in the mpage_prepare_extent_to_map(). >> >> first_page next_page >> | | >> [HHHH][HHHH][HHHH][HHHH] H: hole L: locked >> LLLL LLLL LLLL LLLL >> >> In the first round in the mpage_map_and_submit_extent(), we mapped the >> first two folios along with the first two pages of the third folio, the >> mpage_map_and_submit_buffers() should then submit and unlock the first >> two folios, while also updating mpd->first_page to the beginning of the >> third folio. >> >> first_page next_page >> | | >> [WWWW][WWWW][WWHH][HHHH] H: hole L: locked >> UUUU UUUU LLLL LLLL W: mapped U: unlocked >> >> In the second round in the mpage_map_and_submit_extent(), we failed to >> map the blocks and call mpage_submit_buffers() to submit and unlock >> this partially mapped folio. Additionally, we increased mpd->first_page. >> >> first_page next_page >> | / >> [WWWW][WWWW][WWHH][HHHH] H: hole L: locked >> UUUU UUUU UUUU LLLL W: mapped U: unlocked > > Good. But what if we have a filesystem with 1k blocksize and order 0 > folios? I mean situation like: > > first_page next_page > | | > [HHHH][HHHH][HHHH][HHHH] H: hole L: locked > L L L L > > Now we map first two folios. > > first_page next_page > | | > [MMMM][MMMM][HHHH][HHHH] H: hole L: locked > L L > > Now mpage_map_one_extent() maps half of the folio and fails to extend the > transaction further: > > first_page next_page > | | > [MMMM][MMMM][MMHH][HHHH] H: hole L: locked > L L > > and mpage_submit_folio() now shifts mpd->first page like: > > first_page > | next_page > | | > [MMMM][MMMM][MMHH][HHHH] H: hole L: locked > L L > > and it never gets reset back? > > I suspect you thought that the failure to extend transaction in the middle > of order 0 folio should not happen because you reserve credits for full > page worth of writeback? But those credits could be exhaused by the time we > get to mapping third folio because mpage_map_one_extent() only ensures > there are credits for mapping one extent. Ooops, you are right. Sorry, it was my mistake. > > And I think reserving credits for just one extent is fine even from the > beginning (as I wrote in my comment to patch 4). We just need to handle > this partial case Yeah. > which should be possible by just leaving > mpd->first_page untouched and leave unlocking to > mpage_release_unused_pages(). I was going to use this solution, but it breaks the semantics of the mpage_release_unused_pages() and trigger BUG_ON(folio_test_writeback(folio)) in this function. I don't want to drop this BUG_ON since I think it's somewhat useful. > But I can be missing some effects, the writeback code is really complex... Indeed, I was confused by this code for a long time. Thank you a lot for patiently correcting my mistakes in my patch. > BTW long-term the code may be easier to follow if we replace > mpd->first_page and mpd->next_page with logical block based or byte based > indexing. Now when we have large order folios, page is not that important > concept for writeback anymore. > I suppose we should do this conversion now. Thanks, Yi.