Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation

On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@xxxxxxxxxxxxxxx wrote:
> From: Baokun Li <libaokun1@xxxxxxxxxx>
> 
> Since servers have more and more CPUs, and we're running more containers
> on them, we've been using will-it-scale to test how well ext4 scales. The
> fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
> on 64 containers revealed significant contention in block allocation/free,
> leading to much lower aggregate fallocate OPS compared to a single
> container (see below).
> 
>    1   |    2   |    4   |    8   |   16   |   32   |   64
> -------|--------|--------|--------|--------|--------|-------
> 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588
> 
> The main bottleneck was ext4_lock_group(), which both block allocation
> and block freeing contend on. While the block group used when freeing is
> fixed and cannot be changed, the block group used for allocation is
> selectable. Consequently, the ext4_try_lock_group() helper was added so
> allocation can skip busy groups; see Patch 1 for more details.
> 
> After we fixed the ext4_lock_group bottleneck, another one showed up:
> s_md_lock. This lock protects different data when allocating and freeing
> blocks. We got rid of the s_md_lock call in block allocation by making
> stream allocation work per inode instead of globally. You can find more
> details in Patch 2.
> 
> Patches 3 and 4 are just some minor cleanups.
> 
> Performance test data follows:
> 
> CPU: HUAWEI Kunpeng 920
> Memory: 480GB
> Disk: 480GB SSD SATA 3.2
> Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
> |--------|--------|--------|--------|--------|--------|--------|--------|
> |    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
> |--------|--------|--------|--------|--------|--------|--------|--------|
> |  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
> |--------|--------|--------|--------|--------|--------|--------|--------|
> | linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
> |        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
> |--------|--------|--------|--------|--------|--------|--------|--------|
> |mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
> |ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
> |--------|--------|--------|--------|--------|--------|--------|--------|

Hey Baokun, nice improvements! The proposed changes make sense to me;
however, I suspect the performance improvements may come at the cost of a
slight increase in fragmentation, which might especially affect rotational
disks. Maybe comparing e2freefrag numbers with and without the patches
would give better insight into this.

Regardless, the performance benefits are significant and I feel it is
good to have these patches.
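
Just to confirm I'm reading the series right, my mental model of patches
1 and 2 is the rough userspace sketch below (all names made up, pthread
mutexes standing in for ext4_lock_group(), compiles with cc -pthread;
this is not the actual mballoc code): allocation trylocks candidate
groups and skips the busy ones, only blocking as a last resort, and the
stream allocation goal lives in a per-inode context so nothing global
like s_md_lock is touched on the allocation path.

/*
 * Toy model of "skip busy groups" (patch 1) and the per-inode stream
 * goal (patch 2).  All names here are made up for illustration.
 */
#include <pthread.h>
#include <stdio.h>

#define NGROUPS 8

struct toy_group {
	pthread_mutex_t lock;		/* stands in for the group lock */
	long free_blocks;
};

struct toy_inode {
	int last_group;			/* per-inode stream goal (patch 2) */
};

static struct toy_group groups[NGROUPS];

/* Patch 1 idea: prefer groups whose lock we can take without waiting;
 * only block on a lock once every candidate has proven busy. */
static int toy_alloc_block(struct toy_inode *inode)
{
	int start = inode->last_group;

	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < NGROUPS; i++) {
			int g = (start + i) % NGROUPS;

			if (pass == 0) {
				/* skip the group if someone else holds it */
				if (pthread_mutex_trylock(&groups[g].lock))
					continue;
			} else {
				/* last resort: wait for the lock */
				pthread_mutex_lock(&groups[g].lock);
			}

			if (groups[g].free_blocks > 0) {
				groups[g].free_blocks--;
				pthread_mutex_unlock(&groups[g].lock);
				/* remember the goal per inode, no global lock */
				inode->last_group = g;
				return g;
			}
			pthread_mutex_unlock(&groups[g].lock);
		}
	}
	return -1;	/* nothing free anywhere */
}

int main(void)
{
	struct toy_inode inode = { .last_group = 0 };

	for (int i = 0; i < NGROUPS; i++) {
		pthread_mutex_init(&groups[i].lock, NULL);
		groups[i].free_blocks = 4;
	}

	printf("allocated from group %d\n", toy_alloc_block(&inode));
	return 0;
}

If that matches your intent, it would also fit the numbers above: the
more concurrent writers there are, the more often a trylock saves us
from sleeping on a contended group.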

I'll give my reviews individually as I'm still going through patch 2.
However, I wanted to check on a couple of things:

1. I believe you ran these in docker. Do you have any scripts etc. open
   sourced that I could use to run some benchmarks on my end (and also to
   understand your test setup)?

2. I notice we are getting much less throughput with mb_optimize_scan, and
   I wonder why that is the case. Do you have some data on that? Are your
   tests starting on an empty FS? Maybe in that case the linear scan works
   a bit better since almost all groups are empty. If so, what are the
   numbers like when we start with a fragmented FS?

   - Or maybe the lazyinit thread has not yet initialized all the buddies,
   which means we have fewer BGs in the freefrag list or the order lists
   used by the faster CRs. Hence, if those are locked, we fall back to
   CR_GOAL_LEN_SLOW more often. To check whether this is the case, one
   hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something along those
   lines) before the benchmark, which forces initialization of all the
   group buddies, thus populating all the lists used by mb_optimize_scan.
   Maybe we can check if this gives better results.

3. Also, how much IO are we doing here? Are we filling the whole FS?

Regards,
ojaswin



