Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing

Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> · Wed, 6 Aug 2025 11:10:30 +0800

On 2025/8/5 21:08, Brian Foster wrote:
> On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote:
>> On 2025/7/30 21:17, Brian Foster wrote:
>>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote:
>>>> On 2025/7/18 21:48, Brian Foster wrote:
>>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote:
>>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote:
>>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote:
>>>>>>>> The only way zero range can currently process unwritten mappings
>>>>>>>> with dirty pagecache is to check whether the range is dirty before
>>>>>>>> mapping lookup and then flush when at least one underlying mapping
>>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from
>>>>>>>> racing with folio writeback and reclaim.
>>>>>>>>
>>>>>>>> Since zero range can skip ranges of unwritten mappings that are
>>>>>>>> clean in cache, this operation can be improved by allowing the
>>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In
>>>>>>>> turn, rather than flush or iterate file offsets, zero range can
>>>>>>>> iterate on folios in the batch and advance over clean or uncached
>>>>>>>> ranges in between.
>>>>>>>>
>>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to
>>>>>>>
>>>>>>> /me confused by the single quote; is this supposed to read:
>>>>>>>
>>>>>>> "...for the fs to populate..."?
>>>>>>>
>>>>>>> Either way the code changes look like a reasonable thing to do for the
>>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the
>>>>>>> mapping lock) so
>>>>>>>
>>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>
>>>>>>>
>>>>>>> --D
>>>>>>>
>>>>>>>
>>>>>>>> populate the batch at lookup time. Update the folio lookup path to
>>>>>>>> return the next folio in the batch, if provided, and advance the
>>>>>>>> iter if the folio starts beyond the current offset.
>>>>>>>>
>>>>>>>> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
>>>>>>>> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
>>>>>>>> ---
>>>>>>>>  fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++---
>>>>>>>>  fs/iomap/iter.c        |  6 +++
>>>>>>>>  include/linux/iomap.h  |  4 ++
>>>>>>>>  3 files changed, 94 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644
>>>>>>>> --- a/fs/iomap/buffered-io.c
>>>>>>>> +++ b/fs/iomap/buffered-io.c
>>>>>> [...]
>>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>>>>>>>>  	return status;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +loff_t
>>>>>>>> +iomap_fill_dirty_folios(
>>>>>>>> +	struct iomap_iter	*iter,
>>>>>>>> +	loff_t			offset,
>>>>>>>> +	loff_t			length)
>>>>>>>> +{
>>>>>>>> +	struct address_space	*mapping = iter->inode->i_mapping;
>>>>>>>> +	pgoff_t			start = offset >> PAGE_SHIFT;
>>>>>>>> +	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;
>>>>>>>> +
>>>>>>>> +	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
>>>>>>>> +	if (!iter->fbatch)
>>>>>>
>>>>>> Hi, Brian!
>>>>>>
>>>>>> I think ext4 needs to be aware of this failure after it converts to use
>>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the
>>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range.
>>>>>> This could potentially lead to a deadlock, as most calls to
>>>>>> ext4_block_zero_page_range() occur under an active journal handle.
>>>>>> Writeback operations under an active journal handle may result in circular
>>>>>> waiting within journal transactions. So please return this error code, and
>>>>>> then ext4 can interrupt zero operations to prevent deadlock.
>>>>>>
>>>>>
>>>>> Hi Yi,
>>>>>
>>>>> Thanks for looking at this.
>>>>>
>>>>> Huh.. so the reason for falling back like this here is just that this
>>>>> was considered an optional optimization, with the flush in
>>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're
>>>>> saying means that the current zero range behavior without this series is
>>>>> problematic for ext4-on-iomap..? 
>>>>
>>>> Yes.
>>>>
>>>>> If so, have you observed issues you can share details about?
>>>>
>>>> Sure.
>>>>
>>>> Before delving into the specific details of this issue, I would like
>>>> to provide some background information on the rule that ext4 cannot
>>>> wait for writeback in an active journal handle. If you are aware of
>>>> this background, please skip this paragraph. During ext4 writing back
>>>> the page cache, it may start a new journal handle to allocate blocks,
>>>> update the disksize, and convert unwritten extents after the I/O is
>>>> completed. When starting this new journal handle, if the current
>>>> running journal transaction is in the process of being submitted or
>>>> if the journal space is insufficient, it must wait for the ongoing
>>>> transaction to be completed, but the prerequisite for this is that all
>>>> currently running handles must be terminated. However, if we flush the
>>>> page cache under an active journal handle, we cannot stop it, which
>>>> may lead to a deadlock.
>>>>
>>>
>>> Ok, makes sense.
>>>
>>>> Now, the issue I have observed occurs when I attempt to use
>>>> iomap_zero_range() within ext4_block_zero_page_range(). My current
>>>> implementation are below(based on the latest fs-next).
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 28547663e4fd..1a21667f3f7c 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>>>  	return 0;
>>>>  }
>>>>
>>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset,
>>>> +			loff_t length, unsigned int flags, struct iomap *iomap,
>>>> +			struct iomap *srcmap)
>>>> +{
>>>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
>>>> +	struct ext4_map_blocks map;
>>>> +	u8 blkbits = inode->i_blkbits;
>>>> +	int ret;
>>>> +
>>>> +	ret = ext4_emergency_state(inode->i_sb);
>>>> +	if (unlikely(ret))
>>>> +		return ret;
>>>> +
>>>> +	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* Calculate the first and last logical blocks respectively. */
>>>> +	map.m_lblk = offset >> blkbits;
>>>> +	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>>> +			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>>>> +
>>>> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	/*
>>>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>>>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>>>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>>>> +	 */
>>>> +	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
>>>> +	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
>>>> +		loff_t end;
>>>> +
>>>> +		end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits,
>>>> +					      map.m_len << blkbits);
>>>> +		if ((end >> blkbits) < map.m_lblk + map.m_len)
>>>> +			map.m_len = (end >> blkbits) - map.m_lblk;
>>>> +	}
>>>> +
>>>> +	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = {
>>>> +	.iomap_begin = ext4_iomap_buffered_zero_begin,
>>>> +};
>>>>
>>>>  const struct iomap_ops ext4_iomap_buffered_write_ops = {
>>>>  	.iomap_begin = ext4_iomap_buffered_write_begin,
>>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>>>  	return err;
>>>>  }
>>>>
>>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from,
>>>> +					loff_t length)
>>>> +{
>>>> +	WARN_ON_ONCE(!inode_is_locked(inode) &&
>>>> +		     !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
>>>> +
>>>> +	return iomap_zero_range(inode, from, length, NULL,
>>>> +				&ext4_iomap_buffered_zero_ops,
>>>> +				&ext4_iomap_write_ops, NULL);
>>>> +}
>>>> +
>>>>  /*
>>>>   * ext4_block_zero_page_range() zeros out a mapping of length 'length'
>>>>   * starting from file offset 'from'.  The range to be zero'd must
>>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
>>>>  	if (IS_DAX(inode)) {
>>>>  		return dax_zero_range(inode, from, length, NULL,
>>>>  				      &ext4_iomap_ops);
>>>> +	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
>>>> +		return ext4_iomap_zero_range(inode, from, length);
>>>>  	}
>>>>  	return __ext4_block_zero_page_range(handle, mapping, from, length);
>>>>  }
>>>>
>>>> The problem is most calls to ext4_block_zero_page_range() occur under
>>>> an active journal handle, so I can reproduce the deadlock issue easily
>>>> without this series.
>>>>
>>>>>
>>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what
>>>>> the error handling would look like in ext4. Do you expect to the fail
>>>>> the higher level operation, for example? Cycle locks and retry, etc.?
>>>>
>>>> Originally, I wanted ext4_block_zero_page_range() to return a failure
>>>> to the higher level operation. However, unfortunately, after my testing
>>>> today, I discovered that even though we implement this, this series still
>>>> cannot resolve the issue. The corner case is:
>>>>
>>>> Assume we have a dirty folio covers both hole and unwritten mappings.
>>>>
>>>>    |- dirty folio  -|
>>>>    [hhhhhhhhuuuuuuuu]                h:hole, u:unwrtten
>>>>
>>>> If we punch the range of the hole, ext4_punch_hole()->
>>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio.
>>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio
>>>> since the target range is a hole. Finally, iomap_zero_range() will still
>>>> flush this whole folio and lead to deadlock during writeback the latter
>>>> half of the folio.
>>>>
>>>
>>> Hmm.. Ok. So it seems there are at least a couple ways around this
>>> particular quirk. I suspect one is that you could just call the fill
>>> helper in the hole case as well, but that's kind of a hack and not
>>> really intended use.
>>>
>>> The other way goes back to the fact that the flush for the hole case was
>>> kind of a corner case hack in the first place. The original comment for
>>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap:
>>> make zero range flush conditional on unwritten mappings") for reference
>>> to the original intent.
>>>
>>> I'd have to go back and investigate if something regresses with that
>>> taken out, but my recollection is that was something that needed proper
>>> fixing eventually anyways. I'm particularly wondering if that is no
>>> longer an issue now that pagecache_isize_extended() handles the post-eof
>>> zeroing (the caveat being we might just need to call it in some
>>> additional size extension cases besides just setattr/truncate).
>>
>> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in
>> pagecache_isize_extended() should work.
>>
> 
> Ok..
> 
>>>
>>>>>
>>>>> The reason I ask is because the folio_batch handling has come up through
>>>>> discussions on this series. My position so far has been to keep it as a
>>>>> separate allocation and to keep things simple since it is currently
>>>>> isolated to zero range, but that may change if the usage spills over to
>>>>> other operations (which seems expected at this point). I suspect that if
>>>>> a filesystem actually depends on this for correct behavior, that is
>>>>> another data point worth considering on that topic.
>>>>>
>>>>> So that has me wondering if it would be better/easier here to perhaps
>>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on
>>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer
>>>>> there instead of doing the dynamic allocation (then the fill helper
>>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on
>>>>> something like that?
>>>>>
>>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems
>>>>> to have mostly wound down. Any objection if the fix for that comes along
>>>>> as a followup patch rather than a rework of this series?
>>>>
>>>> It seems that we don't need to modify this series, we need to consider
>>>> other solutions to resolve this deadlock issue.
>>>>
>>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all
>>>> instances of ext4_block_zero_page_range() out of the running journal
>>>> handle(please see patch 19-21). But I don't think this is a good solution
>>>> since it's complex and fragile. Besides, after commit c7fc0366c6562
>>>> ("ext4: partial zero eof block on unaligned inode size extension"), you
>>>> added more invocations of ext4_zero_partial_blocks(), and the situation
>>>> has become more complicated (Althrough I think the calls in the three
>>>> write_end callbacks can be removed).
>>>>
>>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios
>>>> over unwritten mappings before zeroing partial blocks. This is because
>>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in
>>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is
>>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it
>>>> must has already been zeroed. Was I missing something? Therefore, I was
>>>> wondering if there are any ways to prevent flushing in
>>>> iomap_zero_range()? Any ideas?
>>>
>>> It's certainly possible that the quirk fixed by the flush the hole case
>>> was never a problem on ext4, if that's what you mean. Most of the
>>> testing for this was on XFS since ext4 hadn't used iomap for buffered
>>> writes.
>>>
>>> At the end of the day, the batch mechanism is intended to facilitate
>>> avoiding the flush entirely. I'm still paging things back in here.. but
>>> if we had two smallish changes to this code path to 1. eliminate the
>>> dynamic folio_batch allocation and 2. drop the flush on hole mapping
>>> case, would that address the issues with iomap zero range for ext4?
>>>
>>
>> Thank you for looking at this!
>>
>> I made a simple modification to the iomap_zero_range() function based
>> on the second solution you mentioned, then tested it using kvm-xfstests
>> these days. This solution works fine on ext4 and I don't find any other
>> risks by now. (Since my testing environment has sufficient memory, I
>> have not yet handled the case of memory allocation failure).
>>
> 
> Great, thanks for evaluating that. I've been playing around with the
> exact same change the past few days. As it turns out, this still breaks
> something with XFS. I've narrowed it down to an interaction between a
> large eof folio that fails to split on truncate due to being dirty, COW
> prealloc and insert range racing with writeback in such a way that this
> results in a mapped post-eof block on the file. Technically that isn't
> the end of the world so long as it is zeroed, but a subsequent zero
> range can warn if the folio is reclaimed and we now end up with a new
> one that starts beyond EOF, because those folios don't write back.
> 
> I have a mix of hacks that seems to address the generic/363 failure, but
> it still needs further testing and analysis to unwind my mess of various
> experiments and whatnot. ;P

OK, thanks for debugging and analyzing this.

> 
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>> 		     srcmap->type == IOMAP_UNWRITTEN)) {
>> 			s64 status;
>>
>> -			if (range_dirty) {
>> +			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
>> 				range_dirty = false;
>> 				status = iomap_zero_iter_flush_and_stale(&iter);
>> 			} else {
>>
>> Another thing I want to mention (although there are no real issues at
>> the moment, I still want to mention it) is that there appears to be
>> no consistency guarantee between the lookup of the mapping and the
>> follo_batch. For example, assume we have a file which contains two
>> dirty folio and two unwritten extents, one folio corresponds to one
>> extent. We zero out these two folios.
>>
>>     | dirty folio 1  || dirty folio 2   |
>>     [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]
>>
>> In the first call to ->iomap_begin(), we get the unwritten extent 1.
>> At the same time, another thread writes back folio 1 and clears this
>> folio, so this folio will not be added to the follo_batch. Then
>> iomap_zero_range() will still flush those two folios. When flushing
>> the second folio, there is still a risk of deadlock due to changes in
>> metadata.
>>
> 
> Hmm.. not sure I follow the example. The folio batch should include the
> folio in any case other than where it can look up, lock it and confirm
> it is clean. If the folio is clean and thus not included, the iomap
> logic should still see the empty folio batch and skip over the mapping
> if unwritten. (I want to replace this with a flag to address your memory
> allocation concern, but that is orthogonal to this logic.)
> 
> Of course this should happen under appropriate fs locks such that iomap
> either sees the folio in dirty/writeback state where the mapping is
> unwritten, or if the folio has been cleaned, the mapping is reported as
> written.
> 
> If I'm still missing something with your example above, can you
> elaborate a bit further? Thanks.
> 

Sorry for not make things clear. The race condition is the following,

zero range                            sync_file_range
iomap_zero_range() //folio 1+2
 range_dirty = filemap_range_needs_writeback()
 //range_dirty is set to 'ture'
 iomap_iter()
   ext4_iomap_buffer_zero_begin()
     ext4_map_blocks()
     //get unwritten extent 1
                                      sync_file_range() //folio 1
                                        iomap_writepages()
                                        ...
                                        iomap_finish_ioend()
                                          folio_end_writeback()
                                          //clear folio 1, and
                                          //extent 1 becomes written
     iomap_fill_dirty_folios()
     //do not add folio 1 to batch
  iomap_zero_iter_flush_and_stale()
  //!fbatch && IOMAP_UNWRITTEN && range_dirty
  //flush folio 1+2, folio 2 is still dirty, then deadlock

Besides, if the range of folio 1 is initially clean and unwritten
(folio 2 is still dirty), the flush can also be triggered without the
concurrent sync_file_range.

The problem is that we initially checked `range_dirty` only once, and
if was done for the entire zero range, instead of checking it each
iteration, and there is no fs lock can prevent a concurrent write-back.
Perhaps we need to check 'dirty_range' for each iteration?

Regards,
Yi.