On Mon 23-06-25 15:32:51, Baokun Li wrote:
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (mb_last_group) when allocating.
>
> Now we only need to track s_mb_last_group and no longer need to track
> s_mb_last_start, so we don't need s_md_lock to ensure that the two are
> consistent, and we can ensure that the s_mb_last_group read is up to
> date by using smp_store_release/smp_load_acquire.
>
> Besides, s_mb_last_group only requires ext4_group_t (i.e., unsigned
> int), so unsigned long is superfluous.
>
> Performance test data follows:
>
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
>
>                    | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 4821  | 7612 (+57.8%)   | 15371 | 21647 (+40.8%)  |
> mb_optimize_scan=1 | 4784  | 7568 (+58.1%)   | 6101  | 9117 (+49.4%)   |
>
> Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
...
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5cdae3bda072..3f103919868b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

Do you really need any kind of barrier (implied by smp_store_release())
here? I mean the store to s_mb_last_group is perfectly fine to be
reordered with other accesses from the thread, isn't it? As such it
should be enough to have WRITE_ONCE() here...

>  	/*
>  	 * As we've just preallocated more space than
>  	 * user requested originally, we store allocated
> @@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  	}
>
>  	/* if stream allocation is enabled, use global goal */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		/* pairs with smp_store_release in ext4_mb_use_best_found() */
> +		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);

... and READ_ONCE() here.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
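
[Editor's note: below is a minimal sketch of the change Jan is suggesting,
i.e. plain WRITE_ONCE()/READ_ONCE() in place of the release/acquire pair.
The idea is that these accesses only need protection against load/store
tearing, not any ordering guarantee, since a slightly stale goal group is
harmless for stream allocation. Variable and function names follow the
quoted patch; this illustrates the suggestion, not the final committed
code.]

	/* In ext4_mb_use_best_found(): publish the stream allocation
	 * goal. No barrier is needed, as nothing else must be ordered
	 * against this store; WRITE_ONCE() only prevents store tearing.
	 */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		/* pairs with READ_ONCE() in ext4_mb_regular_allocator() */
		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

	/* In ext4_mb_regular_allocator(): read the global goal.
	 * READ_ONCE() only prevents load tearing; reading a slightly
	 * stale group is fine, it merely shifts where the stream
	 * allocator starts searching.
	 */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		/* pairs with WRITE_ONCE() in ext4_mb_use_best_found() */
		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);

[The design point: WRITE_ONCE()/READ_ONCE() are relaxed accesses that
guarantee the value is read and written in one piece but emit no memory
barrier, which matches what the goal-group hint needs once s_mb_last_start
is no longer tracked alongside it.]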