Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents mapping loop

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Sat, 21 Jun 2025 12:42:11 +0800

On 2025/6/20 23:21, Jan Kara wrote:
> On Fri 20-06-25 12:57:18, Zhang Yi wrote:
>> On 2025/6/20 0:21, Jan Kara wrote:
>>> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>
>>>> During the process of writing back folios, if
>>>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>>>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>>>> inconsistency in environments where the block size is smaller than the
>>>> folio size.
>>>>
>>>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>>>> some buffers may have already be mapped. If we exit the mapping loop
>>>> prematurely, the folio data within the mapped range will not be written
>>>> back, and the file's disk size will not be updated. Once the transaction
>>>> that includes this range of extents is committed, this can lead to stale
>>>> data or filesystem inconsistency.
>>>>
>>>> Fix this by submitting the current processing partial mapped folio and
>>>> update the disk size to the end of the mapped range.
>>>>
>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>> ---
>>>>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 48 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 3a086fee7989..d0db6e3bf158 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>>>  	return 0;
>>>>  }
>>>>  
>>>> +/*
>>>> + * This is used to submit mapped buffers in a single folio that is not fully
>>>> + * mapped for various reasons, such as insufficient space or journal credits.
>>>> + */
>>>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>>>> +{
>>>> +	struct inode *inode = mpd->inode;
>>>> +	struct folio *folio;
>>>> +	int ret;
>>>> +
>>>> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>>>> +	if (IS_ERR(folio))
>>>> +		return PTR_ERR(folio);
>>>> +
>>>> +	ret = mpage_submit_folio(mpd, folio);
>>>> +	if (ret)
>>>> +		goto out;
>>>> +	/*
>>>> +	 * Update first_page to prevent this folio from being released in
>>>> +	 * mpage_release_unused_pages(), it should not equal to the folio
>>>> +	 * index.
>>>> +	 *
>>>> +	 * The first_page will be reset to the aligned folio index when this
>>>> +	 * folio is written again in the next round. Additionally, do not
>>>> +	 * update wbc->nr_to_write here, as it will be updated once the
>>>> +	 * entire folio has finished processing.
>>>> +	 */
>>>> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
>>>
>>> Well, but there can be many folios between mpd->first_page and pos. And
>>> this way you avoid cleaning them up (unlocking them and dropping elevated
>>> refcount) before we restart next loop. How is this going to work?
>>>
>>
>> Hmm, I don't think there can be many folios between mpd->first_page and
>> pos. All of the fully mapped folios should be unlocked by
>> mpage_folio_done(), and there is no elevated since it always call
>> folio_batch_release() once we finish processing the folios.
> 
> Indeed. I forgot that mpage_map_one_extent() with shorten mpd->map->m_len
> to the number of currently mapped blocks.
> 
>> mpage_release_unused_pages() is used to clean up unsubmitted folios.
>>
>> For example, suppose we have a 4kB block size filesystem and we found 4
>> order-2 folios need to be mapped in the mpage_prepare_extent_to_map().
>>
>>        first_page             next_page
>>        |                      |
>>       [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
>>        LLLL  LLLL  LLLL  LLLL
>>
>> In the first round in the mpage_map_and_submit_extent(), we mapped the
>> first two folios along with the first two pages of the third folio, the
>> mpage_map_and_submit_buffers() should then submit and unlock the first
>> two folios, while also updating mpd->first_page to the beginning of the
>> third folio.
>>
>>                   first_page  next_page
>>                   |          |
>>       [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
>>        UUUU  UUUU  LLLL  LLLL               W: mapped  U: unlocked
>>
>> In the second round in the mpage_map_and_submit_extent(), we failed to
>> map the blocks and call mpage_submit_buffers() to submit and unlock
>> this partially mapped folio. Additionally, we increased mpd->first_page.
>>
>>                      first_page next_page
>>                      |        /
>>       [WWWW][WWWW][WWHH][HHHH]              H: hole    L: locked
>>        UUUU  UUUU  UUUU  LLLL               W: mapped  U: unlocked
> 
> Good. But what if we have a filesystem with 1k blocksize and order 0
> folios? I mean situation like:
> 
>         first_page             next_page
>         |                      |
>        [HHHH][HHHH][HHHH][HHHH]              H: hole  L: locked
>         L     L     L     L
> 
> Now we map first two folios.
> 
>                    first_page  next_page
>                    |           |
>        [MMMM][MMMM][HHHH][HHHH]              H: hole  L: locked
>                     L     L
> 
> Now mpage_map_one_extent() maps half of the folio and fails to extend the
> transaction further:
> 
>                    first_page  next_page
>                    |           |
>        [MMMM][MMMM][MMHH][HHHH]              H: hole  L: locked
>                     L     L
> 
> and mpage_submit_folio() now shifts mpd->first page like:
> 
>                           first_page
>                           |    next_page
>                           |    |
>        [MMMM][MMMM][MMHH][HHHH]              H: hole  L: locked
>                     L     L
> 
> and it never gets reset back?
> 
> I suspect you thought that the failure to extend transaction in the middle
> of order 0 folio should not happen because you reserve credits for full
> page worth of writeback? But those credits could be exhaused by the time we
> get to mapping third folio because mpage_map_one_extent() only ensures
> there are credits for mapping one extent.

Ooops, you are right. Sorry, it was my mistake.

> 
> And I think reserving credits for just one extent is fine even from the
> beginning (as I wrote in my comment to patch 4). We just need to handle
> this partial case 

Yeah.

> which should be possible by just leaving
> mpd->first_page untouched and leave unlocking to
> mpage_release_unused_pages(). 

I was going to use this solution, but it breaks the semantics of the
mpage_release_unused_pages() and trigger BUG_ON(folio_test_writeback(folio))
in this function. I don't want to drop this BUG_ON since I think it's
somewhat useful.

> But I can be missing some effects, the writeback code is really complex...

Indeed, I was confused by this code for a long time. Thank you a lot for
patiently correcting my mistakes in my patch.

> BTW long-term the code may be easier to follow if we replace
> mpd->first_page and mpd->next_page with logical block based or byte based
> indexing. Now when we have large order folios, page is not that important
> concept for writeback anymore.
> 

I suppose we should do this conversion now.

Thanks,
Yi.