Re: [PATCH] xfs: fix zoned GC data corruption due to wrong bv_offset

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Mon, 12 May 2025 10:04:00 -0700

On Mon, May 12, 2025 at 04:43:05PM +0200, Christoph Hellwig wrote:
> xfs_zone_gc_write_chunk writes out the data buffer read in earlier using
> the same bio, and currenly looks at bv_offset for the offset into the
> scratch folio for that.  But commit 26064d3e2b4d ("block: fix adding
> folio to bio") changed how bv_page and bv_offset are calculated for
> adding larger folios, breaking this fragile logic.
> 
> Switch to extracting the full physical address from the old bio_vec,
> and calculate the offset into the folio from that instead.
> 
> This fixes data corruption during garbage collection with heavy rockdsb
> workloads.  Thanks to Hans for tracking down the culprit commit during
> long bisection sessions.
> 
> Fixes: 26064d3e2b4d ("block: fix adding folio to bio")
> Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
> Reported-by: Hans Holmberg <Hans.Holmberg@xxxxxxx>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  fs/xfs/xfs_zone_gc.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_zone_gc.c b/fs/xfs/xfs_zone_gc.c
> index 8c541ca71872..a045b1dedd68 100644
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -801,7 +801,8 @@ xfs_zone_gc_write_chunk(
>  {
>  	struct xfs_zone_gc_data	*data = chunk->data;
>  	struct xfs_mount	*mp = chunk->ip->i_mount;
> -	unsigned int		folio_offset = chunk->bio.bi_io_vec->bv_offset;
> +	phys_addr_t		bvec_paddr =
> +		bvec_phys(bio_first_bvec_all(&chunk->bio));

Hmm.  I started wondering why you can't reuse chunk->scratch->offset in
the bio_add_folio_nofail call below.  I /think/ that's because
xfs_zone_gc_start_chunk increments chunk->scratch->offset after adding
the folio to the bio?

	bio_add_folio_nofail(bio, chunk->scratch->folio, chunk->len,
			chunk->scratch->offset);
	chunk->scratch->offset += chunk->len;

And then we can attach the same scratch->folio to different read bios.
Each bio gets a different offset within the folio, right?  So
xfs_zone_gc_write_chunk really does need to find the offset_in_folio
from the read bio.  And I guess that's why you use bvec_phys to figure
that out (instead of, say, wasting space recording it again in the
xfs_gc_bio), correct?

If the answers to all my questions are 'yes' then I think I grok this
sufficiently to say
Reviewed-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>

--D

>  	struct xfs_gc_bio	*split_chunk;
>  
>  	if (chunk->bio.bi_status)
> @@ -816,7 +817,7 @@ xfs_zone_gc_write_chunk(
>  
>  	bio_reset(&chunk->bio, mp->m_rtdev_targp->bt_bdev, REQ_OP_WRITE);
>  	bio_add_folio_nofail(&chunk->bio, chunk->scratch->folio, chunk->len,
> -			folio_offset);
> +			offset_in_folio(chunk->scratch->folio, bvec_paddr));
>  
>  	while ((split_chunk = xfs_zone_gc_split_write(data, chunk)))
>  		xfs_zone_gc_submit_write(data, split_chunk);
> -- 
> 2.47.2
> 
>