On Wed, Mar 19, 2025 at 08:19:19AM +0000, Hans Holmberg wrote:
> Presently we start garbage collection late - when we start running
> out of free zones to backfill max_open_zones. This is a reasonable
> default as it minimizes write amplification. The longer we wait,
> the more blocks are invalidated and reclaim costs less in terms
> of blocks to relocate.
> 
> Starting this late, however, introduces a risk of GC being
> outcompeted by user writes. If GC can't keep up, user writes will be
> forced to wait for free zones, with high tail latencies as a result.
> 
> This is not a problem under normal circumstances, but if
> fragmentation is bad and user write pressure is high (multiple
> full-throttle writers) we will "bottom out" of free zones.
> 
> To mitigate this, introduce a gc_pressure mount option that lets the
> user specify what percentage of the unused space gc should keep
> available for writing. A high value will reclaim more of the space
> occupied by unused blocks, creating a larger buffer against write
> bursts.
> 
> This comes at a cost, as write amplification is increased. To
> illustrate this using a sample workload, setting gc_pressure to 60%
> avoids high (500ms) max latencies while increasing write
> amplification by 15%.

It seems to me that this is runtime workload dependent, and so maybe
a tunable variable in /sys/fs/xfs/<dev>/.... might suit better? That
way it can be controlled by a userspace agent as the filesystem fills
and empties, rather than being fixed at mount time and never really
being optimal for a changing workload...

> Signed-off-by: Hans Holmberg <hans.holmberg@xxxxxxx>
> ---
> 
> A patch for xfsprogs documenting the option will follow (if it makes
> it beyond RFC)

New mount options should also be documented in the kernel admin guide
here -> Documentation/admin-guide/xfs.rst.

....

> 
>  fs/xfs/xfs_mount.h      |  1 +
>  fs/xfs/xfs_super.c      | 14 +++++++++++++-
>  fs/xfs/xfs_zone_alloc.c |  5 +++++
>  fs/xfs/xfs_zone_gc.c    | 16 ++++++++++++++--
>  4 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 799b84220ebb..af595024de00 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -229,6 +229,7 @@ typedef struct xfs_mount {
> 	bool			m_finobt_nores; /* no per-AG finobt resv. */
> 	bool			m_update_sb;	/* sb needs update in mount */
> 	unsigned int		m_max_open_zones;
> +	unsigned int		m_gc_pressure;

This is not explicitly initialised anywhere. If the magic "mount gets
zeroed on allocation" value of zero it gets means this feature is
turned off, there needs to be a comment somewhere explaining why it
is turned completely off rather than having a default of, say, 5%
like we have for low space thresholds in various other lowspace
allocation and reclaim algorithms....
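
For illustration, an explicit default could be as simple as the
sketch below (untested; XFS_GC_PRESSURE_DEFAULT is a made-up name,
and where in the mount path it would be assigned is a guess):

	/*
	 * Hypothetical 5% default, in line with the low space
	 * thresholds used by other allocation/reclaim algorithms.
	 */
	#define XFS_GC_PRESSURE_DEFAULT	5

	mp->m_gc_pressure = XFS_GC_PRESSURE_DEFAULT;

That way a value of zero only ever means the admin explicitly turned
the feature off.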
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -162,18 +162,30 @@ struct xfs_zone_gc_data {
> 
>  /*
>   * We aim to keep enough zones free in stock to fully use the open zone limit
> - * for data placement purposes.
> + * for data placement purposes. Additionally, the gc_pressure mount option
> + * can be set to make sure a fraction of the unused/free blocks are available
> + * for writing.
>   */
>  bool
>  xfs_zoned_need_gc(
> 	struct xfs_mount	*mp)
>  {
> +	s64			available, free;
> +
> 	if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
> 		return false;
> -	if (xfs_estimate_freecounter(mp, XC_FREE_RTAVAILABLE) <
> +
> +	available = xfs_estimate_freecounter(mp, XC_FREE_RTAVAILABLE);
> +
> +	if (available <
> 	    mp->m_groups[XG_TYPE_RTG].blocks *
> 	    (mp->m_max_open_zones - XFS_OPEN_GC_ZONES))
> 		return true;
> +
> +	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> +	if (available < div_s64(free * mp->m_gc_pressure, 100))

mult_frac(free, mp->m_gc_pressure, 100) to avoid overflow.
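
For reference, a simplified expansion of what the kernel's
mult_frac() helper does (free and pct are stand-in names here, not
the real variables): it divides first and only multiplies the small
remainder term, so no intermediate needs more range than the result:

	/*
	 * Roughly mult_frac(free, pct, 100): the quotient is
	 * multiplied after the division, and the remainder term is
	 * bounded by 100 * pct, so neither intermediate product can
	 * overflow s64 when the final result fits.
	 */
	s64 threshold = (free / 100) * pct + ((free % 100) * pct) / 100;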

Also, this is really a free space threshold, not a dynamic "pressure"
measurement...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx