On Wed, Mar 19, 2025 at 08:19:19AM +0000, Hans Holmberg wrote:
> Presently we start garbage collection late - when we start running
> out of free zones to backfill max_open_zones. This is a reasonable
> default as it minimizes write amplification. The longer we wait,
> the more blocks are invalidated and reclaim costs less in terms
> of blocks to relocate.
> 
> Starting this late, however, introduces a risk of GC being
> outcompeted by user writes. If GC can't keep up, user writes will be
> forced to wait for free zones, with high tail latencies as a result.
> 
> This is not a problem under normal circumstances, but if
> fragmentation is bad and user write pressure is high (multiple
> full-throttle writers) we will "bottom out" of free zones.
> 
> To mitigate this, introduce a gc_pressure mount option that lets the
> user specify what percentage of the unused space gc should keep
> available for writing. A high value will reclaim more of the space
> occupied by unused blocks, creating a larger buffer against write
> bursts.
> 
> This comes at a cost, as write amplification is increased. To
> illustrate this using a sample workload, setting gc_pressure to 60%
> avoids high (500ms) max latencies while increasing write
> amplification by 15%.

It seems to me that this is runtime workload dependent, and so maybe
a tunable variable in /sys/fs/xfs/<dev>/.... might suit better? That
way it can be controlled by a userspace agent as the filesystem fills
and empties, rather than being fixed at mount time and never really
being optimal for a changing workload...

> Signed-off-by: Hans Holmberg <hans.holmberg@xxxxxxx>
> ---
> 
> A patch for xfsprogs documenting the option will follow (if it makes
> it beyond RFC)

New mount options should also be documented in the kernel admin guide
here -> Documentation/admin-guide/xfs.rst.

....

> 
>  fs/xfs/xfs_mount.h      |  1 +
>  fs/xfs/xfs_super.c      | 14 +++++++++++++-
>  fs/xfs/xfs_zone_alloc.c |  5 +++++
>  fs/xfs/xfs_zone_gc.c    | 16 ++++++++++++++--
>  4 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 799b84220ebb..af595024de00 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -229,6 +229,7 @@ typedef struct xfs_mount {
> 	bool			m_finobt_nores; /* no per-AG finobt resv. */
> 	bool			m_update_sb;	/* sb needs update in mount */
> 	unsigned int		m_max_open_zones;
> +	unsigned int		m_gc_pressure;

This is not explicitly initialised anywhere. If the magic "mount gets
zeroed on allocation" value of zero it gets means this feature is
turned off, there needs to be a comment somewhere explaining why it
is turned completely off rather than having a default of, say, 5%
like we have for low space thresholds in various other lowspace
allocation and reclaim algorithms....
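
For illustration, an explicit default could be as simple as the
sketch below (untested; XFS_GC_PRESSURE_DEFAULT is a made-up name,
and where in the mount path it would be assigned is a guess):

	/*
	 * Hypothetical 5% default, in line with the low space
	 * thresholds used by other allocation/reclaim algorithms.
	 */
	#define XFS_GC_PRESSURE_DEFAULT	5

	mp->m_gc_pressure = XFS_GC_PRESSURE_DEFAULT;

That way a value of zero only ever means the admin explicitly turned
the feature off.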
> --- a/fs/xfs/xfs_zone_gc.c
> +++ b/fs/xfs/xfs_zone_gc.c
> @@ -162,18 +162,30 @@ struct xfs_zone_gc_data {
> 
>  /*
>   * We aim to keep enough zones free in stock to fully use the open zone limit
> - * for data placement purposes.
> + * for data placement purposes. Additionally, the gc_pressure mount option
> + * can be set to make sure a fraction of the unused/free blocks are available
> + * for writing.
>   */
>  bool
>  xfs_zoned_need_gc(
> 	struct xfs_mount	*mp)
>  {
> +	s64			available, free;
> +
> 	if (!xfs_group_marked(mp, XG_TYPE_RTG, XFS_RTG_RECLAIMABLE))
> 		return false;
> -	if (xfs_estimate_freecounter(mp, XC_FREE_RTAVAILABLE) <
> +
> +	available = xfs_estimate_freecounter(mp, XC_FREE_RTAVAILABLE);
> +
> +	if (available <
> 	    mp->m_groups[XG_TYPE_RTG].blocks *
> 	    (mp->m_max_open_zones - XFS_OPEN_GC_ZONES))
> 		return true;
> +
> +	free = xfs_estimate_freecounter(mp, XC_FREE_RTEXTENTS);
> +	if (available < div_s64(free * mp->m_gc_pressure, 100))

mult_frac(free, mp->m_gc_pressure, 100) to avoid overflow.
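
For reference, a simplified expansion of what the kernel's
mult_frac() helper does (free and pct are stand-in names here, not
the real variables): it divides first and only multiplies the small
remainder term, so no intermediate needs more range than the result:

	/*
	 * Roughly mult_frac(free, pct, 100): the quotient is
	 * multiplied after the division, and the remainder term is
	 * bounded by 100 * pct, so neither intermediate product can
	 * overflow s64 when the final result fits.
	 */
	s64 threshold = (free / 100) * pct + ((free % 100) * pct) / 100;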

Also, this is really a free space threshold, not a dynamic "pressure"
measurement...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx