On Wed, Apr 09, 2025 at 08:14:31AM +1000, Dave Chinner wrote: > On Thu, Apr 03, 2025 at 12:12:44PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > Actually compute the log overhead of log intent items used in reap > > operations and use that to compute the thresholds in reap.c instead of > > assuming 2048 works. Note that there have been no complaints because > > tr_itruncate has a very large logres. > > > > Cc: <stable@xxxxxxxxxxxxxxx> # v6.6 > > Fixes: 1c7ce115e52106 ("xfs: reap large AG metadata extents when possible") > > Signed-off-by: "Darrick J. Wong" <djwong@xxxxxxxxxx> /me notes this is an RFC patch and I just finished writing a more expansive patchset that breaks up the pieces and computes limits for each usecase separately. > > --- > > fs/xfs/scrub/trace.h | 29 ++++++++++++++++++++++++++ > > fs/xfs/xfs_bmap_item.h | 3 +++ > > fs/xfs/xfs_extfree_item.h | 3 +++ > > fs/xfs/xfs_log_priv.h | 13 +++++++++++ > > fs/xfs/xfs_refcount_item.h | 3 +++ > > fs/xfs/xfs_rmap_item.h | 3 +++ > > fs/xfs/scrub/reap.c | 50 +++++++++++++++++++++++++++++++++++++++----- > > fs/xfs/scrub/trace.c | 1 + > > fs/xfs/xfs_bmap_item.c | 10 +++++++++ > > fs/xfs/xfs_extfree_item.c | 10 +++++++++ > > fs/xfs/xfs_log_cil.c | 4 +--- > > fs/xfs/xfs_refcount_item.c | 10 +++++++++ > > fs/xfs/xfs_rmap_item.c | 10 +++++++++ > > 13 files changed, 140 insertions(+), 9 deletions(-) <snip> > > @@ -495,6 +499,37 @@ xreap_agextent_iter( > > return 0; > > } > > > > +/* > > + * Compute the worst case log overhead of the intent items needed to reap a > > + * single per-AG space extent. > > + */ > > +STATIC unsigned int > > +xreap_agextent_max_deferred_reaps( > > + struct xfs_scrub *sc) > > +{ > > + const unsigned int efi = xfs_efi_item_overhead(1); > > + const unsigned int rui = xfs_rui_item_overhead(1); > > These wrappers don't return some "overhead" - they return the amount of > log space the intent will consume. Can we please call them > xfs_<intent_type>_log_space()? Ok. > > + > > + /* unmapping crosslinked metadata blocks */ > > + const unsigned int t1 = rui; > > + > > + /* freeing metadata blocks */ > > + const unsigned int t2 = rui + efi; > > + > > + /* worst case of all four possible scenarios */ > > + const unsigned int per_intent = max(t1, t2); > > When is per_intent going to be different to t2? i.e. rui + efi > rui > will always be true, so we don't need to calculate it at runtime. Any decent compiler will combine this into: const unsigned int per_intent = rui + efi; > i.e. what "four scenarios" is this actually handling? I can't work > out what they are from the code or the description... The five things that xreap_agextent_iter() does. One of them (XFS_AG_RESV_AGFL) doesn't use intents so I only listed four. > > + /* > > + * tr_itruncate has enough logres to unmap two file extents; use only > > + * half the log reservation for intent items so there's space to do > > + * actual work and requeue intent items. > > + */ > > + const unsigned int ret = sc->tp->t_log_res / (2 * per_intent); > > So we are assuming only a single type of log reservation is used for > these reaping transactions? Well it's *currently* tr_itruncate, but some day we could add a bigger/separate repair transaction type. > If so, this is calculating a value that is constant for the life of > the mount and probably should be moved to all the other log > reservation calculation functions that are run at mount time and > stored with them. > > i.e. this calculation is effectively: > > return (xfs_calc_itruncate_reservation(mp, false) >> 1) / > (xfs_efi_log_space(1) + xfs_rui_log_space(1)); > > Also, I think that the algorithm used to calculate the number of > intents we can fit in such a transaction should be described in > the comment above the function, then implement the algorithm as I don't like ^^^^^^^^^^^^^^^^^^ this style of putting the comment at some distance from the actual computation code. Look at xfs_calc_link_reservation: /* * For creating a link to an inode: * the parent directory inode: inode size * the linked inode: inode size * the directory btree could split: (max depth + v2) * dir block size * the directory bmap btree could join or split: (max depth + v2) * blocksize * And the bmap_finish transaction can free some bmap blocks giving: * the agf for the ag in which the blocks live: sector size * the agfl for the ag in which the blocks live: sector size * the superblock for the free block count: sector size * the allocation btrees: 2 trees * (2 * max depth - 1) * block size */ STATIC uint xfs_calc_link_reservation( struct xfs_mount *mp) { overhead += xfs_calc_iunlink_remove_reservation(mp); t1 = xfs_calc_inode_res(mp, 2) + xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)); t2 = xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + xfs_calc_buf_res(xfs_allocfree_block_count(mp, 1), XFS_FSB_TO_B(mp, 1)); if (xfs_has_parent(mp)) { t3 = resp->tr_attrsetm.tr_logres; overhead += xfs_calc_pptr_link_overhead(); } Does t1 correspond to "For creating a link"? Does t2 correspond to "And the bmap_finish..." ? t3 isn't even in the comment. To me it'd be much clearer to write: /* Linking an unlinked inode into the directory tree */ overhead += xfs_calc_iunlink_remove_reservation(mp); /* * For creating a link to an inode: * the parent directory inode: inode size * the linked inode: inode size * the directory btree could split: (max depth + v2) * dir block size * the directory bmap btree could join or split: (max depth + v2) * blocksize */ t1 = xfs_calc_inode_res(mp, 2) + xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)); /* * And the bmap_finish transaction can free some bmap blocks giving: * the agf for the ag in which the blocks live: sector size * the agfl for the ag in which the blocks live: sector size * the superblock for the free block count: sector size * the allocation btrees: 2 trees * (2 * max depth - 1) * block size */ t2 = xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) + xfs_calc_buf_res(xfs_allocfree_block_count(mp, 1), /* * If parent pointers are enabled, save space to add a pptr * xattr and the space to log the pptr items. */ if (xfs_has_parent(mp)) { t3 = resp->tr_attrsetm.tr_logres; overhead += xfs_calc_pptr_link_overhead(); } return overhead + max3(t1, t2, t3); calc_link isn't too bad, but truncate is more difficult to figure out. > efficiently as possible (as we already do in xfs_trans_resv.c). The downside of moving it to xfs_trans_resv.c is that now you've separated the code that actually does the reaping from the code that figures out the dynamic limits based on how reaping can happen. > > + trace_xreap_agextent_max_deferred_reaps(sc->tp, per_intent, ret); > > If it is calculated at mount time, this can be emitted along with > all the other log reservations calculated at mount time. > > > + return max(1, ret); > > Why? tp->t_logres should always be much greater than the intent > size. How will this ever evaluate as zero without there being some > other serious log/transaction reservation bug elsewhere in the code? > > i.e. if we can get zero here, that needs ASSERTS and/or WARN_ON > output and a filesystem shutdown because there is something > seriously wrong occurring in the filesystem. Ok. I'll just shut down the fs then. --D > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx