On Fri, May 23, 2025 at 04:58:18PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:
> From: Baokun Li <libaokun1@xxxxxxxxxx>
>
> When ext4 allocates blocks, we used to just go through the block groups
> one by one to find a good one. But when there are tons of block groups
> (like hundreds of thousands or even millions) and not many have free space
> (meaning they're mostly full), it takes a really long time to check them
> all, and performance gets bad. So, we added the "mb_optimize_scan" mount
> option (which is on by default now). It keeps track of some group lists,
> so when we need a free block, we can just grab a likely group from the
> right list. This saves time and makes block allocation much faster.
>
> But when multiple processes or containers are doing similar things, like
> constantly allocating 8k blocks, they all try to use the same block group
> in the same list. Even just two processes doing this can cut the IOPS in
> half. For example, one container might do 300,000 IOPS, but if you run two
> at the same time, the total is only 150,000.
>
> Since we can already look at block groups in a non-linear way, the first
> and last groups in the same list are basically the same for finding a block
> right now. Therefore, add an ext4_try_lock_group() helper function to skip
> the current group when it is locked by another process, thereby avoiding
> contention with other processes. This helps ext4 make better use of having
> multiple block groups.
>
> Also, to make sure we don't skip all the groups that have free space
> when allocating blocks, we won't try to skip busy groups anymore when
> ac_criteria is CR_ANY_FREE.
>
> Performance test data follows:
>
> CPU: HUAWEI Kunpeng 920
> Memory: 480GB
> Disk: 480GB SSD SATA 3.2
> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> Observation: Average fallocate operations per container per second.
>
>                      base      patched
> mb_optimize_scan=0   3588      6755 (+88.2%)
> mb_optimize_scan=1   3588      4302 (+19.8%)

The patch looks mostly good. Same observation as before about
mb_optimize_scan=1 improving less; we can continue that discussion in my
reply to the cover letter.

That being said, I have some minor suggestions:

>
> Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
> ---
>  fs/ext4/ext4.h    | 23 ++++++++++++++---------
>  fs/ext4/mballoc.c | 14 +++++++++++---
>  2 files changed, 25 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 5a20e9cd7184..9c665a620a46 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
>  	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
>  }
>
> +static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
> +{
> +	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
> +		return false;
> +	/*
> +	 * We're able to grab the lock right away, so drop the lock
> +	 * contention counter.
> +	 */
> +	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
> +	return true;
> +}
> +
>  static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
>  {
> -	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
> -	if (spin_trylock(lock))
> -		/*
> -		 * We're able to grab the lock right away, so drop the
> -		 * lock contention counter.
> -		 */
> -		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
> -	else {
> +	if (!ext4_try_lock_group(sb, group)) {
>  		/*
>  		 * The lock is busy, so bump the contention counter,
>  		 * and then wait on the spin lock.
>  		 */
>  		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
>  				  EXT4_MAX_CONTENTION);
> -		spin_lock(lock);
> +		spin_lock(ext4_group_lock_ptr(sb, group));
>  	}
>  }
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 1e98c5be4e0a..5c13d9f8a1cc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
>  			    bb_largest_free_order_node) {
>  		if (sbi->s_mb_stats)
>  			atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
> -		if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
> +		if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
> +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {

Maybe reversing the && order to be (!spin_is_locked() && ext4_mb_good_group())
would be better?

>  			*group = iter->bb_group;
>  			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
>  			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> @@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
>  	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
>  		if (sbi->s_mb_stats)
>  			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
> -		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
> +		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
> +		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {

Same as above.

>  			grp = iter;
>  			break;
>  		}
> @@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  			if (err)
>  				goto out;
>
> -			ext4_lock_group(sb, group);
> +			/* skip busy group */
> +			if (cr >= CR_ANY_FREE) {
> +				ext4_lock_group(sb, group);
> +			} else if (!ext4_try_lock_group(sb, group)) {
> +				ext4_mb_unload_buddy(&e4b);
> +				continue;
> +			}

This in itself looks good. I am just thinking that, now that we are deciding
to skip locked groups, in the code above this one, shall we do something
like:

	if (spin_is_locked(group_lock))
		continue;

	err = ext4_mb_load_buddy(sb, group, &e4b);
	if (err)
		goto out;

	/* skip busy group */
	if (cr >= CR_ANY_FREE) {
		ext4_lock_group(sb, group);
	} else if (!ext4_try_lock_group(sb, group)) {
		ext4_mb_unload_buddy(&e4b);
		continue;
	}

With this we can even avoid loading the folio as well.

Regards,
ojaswin

>
>  			/*
>  			 * We need to check again after locking the
> --
> 2.46.1
>
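For readers outside the ext4 code, here is a minimal userspace sketch of the
try-lock-and-skip pattern being discussed, not kernel code: each thread walks
an array of "groups" and, when a group's lock is already held, moves on to the
next group instead of spinning on it. The pthread spinlocks stand in for
ext4's per-group locks, and the group/thread counts and helper names
(try_alloc_from_group and friends) are assumptions made up for the sketch,
not anything from the patch.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_GROUPS	16
#define NR_THREADS	4
#define NR_ROUNDS	100000

struct group {
	pthread_spinlock_t lock;	/* stand-in for the per-group lock */
	long free_blocks;
};

static struct group groups[NR_GROUPS];

/* Try one group without blocking; return true if a "block" was taken. */
static bool try_alloc_from_group(struct group *grp)
{
	bool got = false;

	if (pthread_spin_trylock(&grp->lock) != 0)
		return false;		/* group is busy: caller moves on */
	if (grp->free_blocks > 0) {
		grp->free_blocks--;	/* "allocate" one block */
		got = true;
	}
	pthread_spin_unlock(&grp->lock);
	return got;
}

static void *allocator(void *arg)
{
	int start = (int)(long)arg;	/* each thread starts at a different group */
	long allocated = 0;

	for (int round = 0; round < NR_ROUNDS; round++) {
		for (int i = 0; i < NR_GROUPS; i++) {
			/* Skip contended groups instead of waiting on them. */
			if (try_alloc_from_group(&groups[(start + i) % NR_GROUPS])) {
				allocated++;
				break;
			}
		}
	}
	printf("thread %d allocated %ld blocks\n", start, allocated);
	return NULL;
}

int main(void)
{
	pthread_t tids[NR_THREADS];

	for (int i = 0; i < NR_GROUPS; i++) {
		pthread_spin_init(&groups[i].lock, PTHREAD_PROCESS_PRIVATE);
		groups[i].free_blocks = 1L << 20;
	}
	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tids[i], NULL, allocator, (void *)(long)i);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(&tids[i], NULL);
	return 0;
}

Build with "cc -O2 -pthread" and run with a few threads. The only point of
the sketch is the control flow: a failed trylock costs one cheap check and
the scan moves to the next group, whereas unconditionally taking the lock
(as ext4_lock_group() does) serializes concurrent walkers on the same group,
which is the contention the patch is trying to avoid.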