https://bugzilla.kernel.org/show_bug.cgi?id=217965

mingyu.he (mingyu.he@xxxxxxxxxx) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mingyu.he@xxxxxxxxxx

--- Comment #72 from mingyu.he (mingyu.he@xxxxxxxxxx) ---

I found the root cause of this problem. In the earlier discussion we never identified an exact root cause, but Ojaswin Mujoo provided an effective fallback solution. Although this is by now somewhat of an "old problem", I think it is still worth replying with the root cause, since it disturbed a lot of users back then. I will describe the root cause first and then give an exact reproducer program for testing.

** Root Cause **

For newbies like me, it helps to know this first: RAID divides the logical blocks into many intervals, and the length of one interval is the stripe. You can find it with `tune2fs -l /dev/sdX` or in the `mount` output. The layout looks like this:

[stripe][stripe][stripe][stripe][stripe][stripe]
[ group size  ][ group size  ][ group size  ]

The function ext4_mb_scan_aligned() tries to find the start of a RAID interval inside a group and then computes the remaining block length of that group. If the remaining length (which must be free and contiguous) is not enough for one stripe, it returns, and the allocator moves on to the next group.

[ stripe ] [ stripe ]
           | here
[  group size  ]

The core problem is how that next group is chosen. The code of ext4_mb_choose_next_group() changed with the series '[PATCH v2 00/12] multiblock allocator improvements':
https://lore.kernel.org/all/cover.1685449706.git.ojaswin@xxxxxxxxxxxxx/

The author changed the fragment-order RB tree into lists for better performance. However, ext4_mb_find_good_group_avg_frag_lists() now returns the same group every time, so the upper layer keeps passing the same group to ext4_mb_scan_aligned(), and the aligned scan fails forever.
As a result, the loop variable 'i' (in ext4_mb_regular_allocator) only stops once it reaches ngroups. Here are the stats I collected at ext4_mb_scan_aligned() -> mb_find_extent(). I set the stripe to 30000 to make the problem easier to reproduce.

```
Attaching 5 probes...
Tracing ext4_mb_regular_allocator... Hit Ctrl-C to stop.
find_ex, tid=22280, group_id=175542, block(i)=9744, needed(stripe)=30000, ret(max)=23024
find_ex, tid=22280, group_id=175543, block(i)=6976, needed(stripe)=30000, ret(max)=25792
find_ex, tid=22280, group_id=175544, block(i)=4208, needed(stripe)=30000, ret(max)=28560
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
... countless identical lines
```

The first few lines come from the linear scan; the later ones are the avg_frag_list selections. Also, in a multi-threaded environment the chooser will probably hand every thread the same group, and since ext4_mb_regular_allocator() needs to take that group's spin lock, this inflames CPU usage. (Thanks to Baokun Li, who optimized this spin lock in the recent series 'ext4: better scalability for ext4 block allocation'.)

In the 5.15 kernel, the RB tree works fine.
```
find_ex, tid=17292, group_id=603069, block(i)=25008, needed(strip)=30000, ret(max)=7760
...linear search
find_ex, tid=17292, group_id=603072, block(i)=16704, needed(strip)=30000, ret(max)=16064
find_ex, tid=17292, group_id=278911, block(i)=24352, needed(strip)=30000, ret(max)=8416
find_ex, tid=17292, group_id=279167, block(i)=5744, needed(strip)=30000, ret(max)=27024
...tree search
find_ex, tid=17292, group_id=280447, block(i)=2704, needed(strip)=30000, ret(max)=30064
```

The latest version (6.17.0-rc2) also works fine (Baokun Li made some optimizations for the allocator and the frag-order lists). Note that to observe the new data structure at work, I commented out the fallback logic.

```
find_ex, tid=31612, group_id=1423, block(i)=9984, needed(stripe)=32752, ret(max)=22784
find_ex, tid=31612, group_id=1425, block(i)=9952, needed(stripe)=32752, ret(max)=22816
find_ex, tid=31612, group_id=1426, block(i)=9936, needed(stripe)=32752, ret(max)=22832
find_ex, tid=31612, group_id=1427, block(i)=9920, needed(stripe)=32752, ret(max)=22848
find_ex, tid=31612, group_id=1428, block(i)=9904, needed(stripe)=32752, ret(max)=22864
find_ex, tid=31612, group_id=1429, block(i)=9888, needed(stripe)=32752, ret(max)=22880
find_ex, tid=31612, group_id=1430, block(i)=9872, needed(stripe)=32752, ret(max)=22896
find_ex, tid=31612, group_id=1431, block(i)=9856, needed(stripe)=32752, ret(max)=22912
find_ex, tid=31612, group_id=1432, block(i)=9840, needed(stripe)=32752, ret(max)=22928
find_ex, tid=31612, group_id=1433, block(i)=9824, needed(stripe)=32752, ret(max)=22944
find_ex, tid=31612, group_id=1434, block(i)=9808, needed(stripe)=32752, ret(max)=22960
find_ex, tid=31612, group_id=1435, block(i)=9792, needed(stripe)=32752, ret(max)=22976
```

** Reproduce Method **

On a kernel before 6.8 (or with the fallback logic deleted), with ext4 on RAID:

# The higher the stripe, the higher the probability.
# But keep it below 32768 (the default number of blocks in a group), or the request length gets cut.
mount -o remount,stripe=35000 /dev/sdX

# Request an allocation of 35000 blocks; it needs to be stripe-aligned.
./test_C_program 35000

The C program (I removed the error checking to keep it simple):

```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define BLOCK_SIZE 4096

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <request blocks>\n", argv[0]);
        return 1;
    }

    int blocks = atoi(argv[1]);
    off_t FILE_SIZE = (off_t)blocks * BLOCK_SIZE;

    /* fallocate + write issues one big allocation request that the
     * allocator tries to satisfy stripe-aligned */
    int fd = open("source_file.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    fallocate(fd, 0, 0, FILE_SIZE);

    char *data = malloc(FILE_SIZE);
    memset(data, 'A', FILE_SIZE);
    write(fd, data, FILE_SIZE);

    rename("source_file.txt", "target_file.txt");
    printf("rename succeeded\n");

    free(data);
    close(fd);
    return 0;
}
```

If your stripe is relatively small, here is the method from carlos@xxxxxxxxxxxxxx:
https://marc.info/?l=linux-raid&m=170327844709957&w=2

That method may not be exact, but it is very simple for reproducing with a small stripe. On my machine, however, I had to run a parallel version to reproduce:

mkdir 1 2 3 4 5
xzcat linux-6.16.tar.xz | tar x -C ./1 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./2 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./3 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./4 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./5 -f - &

Wait for kworker to flush the dirty pages. You can use `top` to see a kworker at 100% CPU for a long time.

Best Regards,
Mingyu He

-- 
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.