Re: [PATCH 1/4] ext4: add ext4_try_lock_group() to skip busy groups

On 2025/5/28 23:05, Ojaswin Mujoo wrote:
On Fri, May 23, 2025 at 04:58:18PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:
From: Baokun Li <libaokun1@xxxxxxxxxx>

When ext4 allocates blocks, we used to just scan the block groups one by one
to find a good one. But when there are a huge number of block groups (like
hundreds of thousands or even millions) and only a few of them still have
free space (meaning most are nearly full), checking them all takes a very
long time and performance suffers. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It maintains ordered lists of block
groups, so when we need free blocks we can grab a promising group straight
from the appropriate list. This saves time and makes block allocation much
faster.

But when multiple processes or containers are doing similar work, like
constantly allocating 8k blocks, they all end up picking the same block group
from the same list. Even just two processes doing this can cut the IOPS in
half: for example, one container might do 300,000 IOPS, but running two at
the same time yields only 150,000 in total.

Since we can already scan block groups in a non-linear order, the first and
last groups in the same list are currently just as good as each other for
satisfying the allocation. Therefore, add an ext4_try_lock_group() helper
function that skips the current group when it is locked by another process,
thereby avoiding contention with other processes. This helps ext4 make better
use of having multiple block groups.

Also, to make sure we don't end up skipping every group that still has free
space, busy groups are no longer skipped once ac_criteria reaches
CR_ANY_FREE.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

                       base    patched
mb_optimize_scan=0    3588    6755 (+88.2%)
mb_optimize_scan=1    3588    4302 (+19.8%)

The patch looks mostly good. Same observation as before about
mb_optimize_scan=1 improving less; we can continue that discussion in my
reply to the cover letter. That said, I have some minor suggestions:

Thanks for the review!

Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
---
  fs/ext4/ext4.h    | 23 ++++++++++++++---------
  fs/ext4/mballoc.c | 14 +++++++++++---
  2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a20e9cd7184..9c665a620a46 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3494,23 +3494,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
  	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
  }
+static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
+{
+	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
+		return false;
+	/*
+	 * We're able to grab the lock right away, so drop the lock
+	 * contention counter.
+	 */
+	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+	return true;
+}
+
  static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
  {
-	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
-	if (spin_trylock(lock))
-		/*
-		 * We're able to grab the lock right away, so drop the
-		 * lock contention counter.
-		 */
-		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
-	else {
+	if (!ext4_try_lock_group(sb, group)) {
  		/*
  		 * The lock is busy, so bump the contention counter,
  		 * and then wait on the spin lock.
  		 */
  		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
  				  EXT4_MAX_CONTENTION);
-		spin_lock(lock);
+		spin_lock(ext4_group_lock_ptr(sb, group));
  	}
  }
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1e98c5be4e0a..5c13d9f8a1cc 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
  				    bb_largest_free_order_node) {
  			if (sbi->s_mb_stats)
  				atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
+			    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
Maybe reversing the && order to be (!spin_is_locked() && ext4_mb_good_group()) would be better?
Yeah.
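Something like this, I think (untested sketch, only the condition reordered so the cheap lock probe runs first):

			if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) &&
			    likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
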
  				*group = iter->bb_group;
  				ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
  				read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
@@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o
  	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
  		if (sbi->s_mb_stats)
  			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
+		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {

Same as above.

Okay.

  			grp = iter;
  			break;
  		}
@@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
  			if (err)
  				goto out;
-			ext4_lock_group(sb, group);
+			/* skip busy group */
+			if (cr >= CR_ANY_FREE) {
+				ext4_lock_group(sb, group);
+			} else if (!ext4_try_lock_group(sb, group)) {
+				ext4_mb_unload_buddy(&e4b);
+				continue;
+			}

This in itself looks good. I am just thinking that, now that we are deciding
to skip locked groups, in the code just above this we could do something
like:

if (spin_is_locked(group_lock))
	continue;

err = ext4_mb_load_buddy(sb, group, &e4b);
if (err)
	goto out;

/* skip busy group */
if (cr >= CR_ANY_FREE) {
	ext4_lock_group(sb, group);
} else if (!ext4_try_lock_group(sb, group)) {
	ext4_mb_unload_buddy(&e4b);
	continue;
}

With this we can avoid loading the folio as well.

I previously assumed that for busy groups the buddy was already loaded,
so reloading it would incur minimal overhead. However, I was mistaken.

After implementing this change, the proportion of time spent in
ext4_mb_load_buddy() decreased from 3.6% to 1.7%, resulting in
approximately a 2% performance improvement.

Thank you for your suggestion!
I will prevent unnecessary buddy loading in the next version.
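Roughly along the lines of your snippet, i.e. something like the following
(untested sketch; the early check is gated on cr so that CR_ANY_FREE still
waits on the lock rather than skipping):

			/*
			 * Probe the group lock before loading the buddy so we
			 * don't pay for ext4_mb_load_buddy() on a group we are
			 * about to skip anyway.
			 */
			if (cr < CR_ANY_FREE &&
			    spin_is_locked(ext4_group_lock_ptr(sb, group)))
				continue;

			err = ext4_mb_load_buddy(sb, group, &e4b);
			if (err)
				goto out;

			/* skip busy group */
			if (cr >= CR_ANY_FREE) {
				ext4_lock_group(sb, group);
			} else if (!ext4_try_lock_group(sb, group)) {
				ext4_mb_unload_buddy(&e4b);
				continue;
			}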

Cheers,
Baokun




