https://bugzilla.kernel.org/show_bug.cgi?id=217965

mingyu.he (mingyu.he@xxxxxxxxxx) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mingyu.he@xxxxxxxxxx

--- Comment #72 from mingyu.he (mingyu.he@xxxxxxxxxx) ---

I found the root cause of this problem. In the earlier discussion we never identified an exact root cause, but Ojaswin Mujoo provided an effective fallback solution. Although this is by now somewhat of an "old problem", I think it is still worth replying with the root cause, since it disturbed a lot of users back then. I will describe the root cause first and then give an exact reproducer program for testing.

** Root Cause **

For newbies like me, it helps to know this first: RAID divides the logical blocks into many intervals, and the length of one interval is the stripe. You can find it with `tune2fs -l /dev/sdX` or in the `mount` output. The layout looks like this:

[stripe][stripe][stripe][stripe][stripe][stripe]
[ group size  ][ group size  ][ group size  ]

The function ext4_mb_scan_aligned() tries to find the start of a RAID interval inside a group and then computes the remaining block length of that group. If the remaining length (which must be free and contiguous) is not enough for one stripe, it returns, and the allocator moves on to the next group.

[ stripe ] [ stripe ]
           | here
[  group size  ]

The core problem is how that next group is chosen. The code of ext4_mb_choose_next_group() changed with the series '[PATCH v2 00/12] multiblock allocator improvements':
https://lore.kernel.org/all/cover.1685449706.git.ojaswin@xxxxxxxxxxxxx/

The author changed the fragment-order RB tree into lists for better performance. However, ext4_mb_find_good_group_avg_frag_lists() now returns the same group every time, so the upper layer keeps passing the same group to ext4_mb_scan_aligned(), and the aligned scan fails forever.
As a result, the loop variable 'i' (in ext4_mb_regular_allocator) only stops once it reaches ngroups. Here are the stats I collected at ext4_mb_scan_aligned() -> mb_find_extent(). I set the stripe to 30000 to make the problem easier to reproduce.

```
Attaching 5 probes...
Tracing ext4_mb_regular_allocator... Hit Ctrl-C to stop.
find_ex, tid=22280, group_id=175542, block(i)=9744, needed(stripe)=30000, ret(max)=23024
find_ex, tid=22280, group_id=175543, block(i)=6976, needed(stripe)=30000, ret(max)=25792
find_ex, tid=22280, group_id=175544, block(i)=4208, needed(stripe)=30000, ret(max)=28560
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
find_ex, tid=22280, group_id=127, block(i)=8464, needed(stripe)=30000, ret(max)=24304
... countless identical lines
```

The first few lines come from the linear scan; the later ones are the avg_frag_list selections. Also, in a multi-threaded environment the chooser will probably hand every thread the same group, and since ext4_mb_regular_allocator() needs to take that group's spin lock, this inflames CPU usage. (Thanks to Baokun Li, who optimized this spin lock in the recent series 'ext4: better scalability for ext4 block allocation'.)

In the 5.15 kernel, the RB tree works fine.
```
find_ex, tid=17292, group_id=603069, block(i)=25008, needed(strip)=30000, ret(max)=7760
...linear search
find_ex, tid=17292, group_id=603072, block(i)=16704, needed(strip)=30000, ret(max)=16064
find_ex, tid=17292, group_id=278911, block(i)=24352, needed(strip)=30000, ret(max)=8416
find_ex, tid=17292, group_id=279167, block(i)=5744, needed(strip)=30000, ret(max)=27024
...tree search
find_ex, tid=17292, group_id=280447, block(i)=2704, needed(strip)=30000, ret(max)=30064
```

The latest version (6.17.0-rc2) also works fine (Baokun Li made some optimizations for the allocator and the frag-order lists). Note that to observe the new data structure at work, I commented out the fallback logic.

```
find_ex, tid=31612, group_id=1423, block(i)=9984, needed(stripe)=32752, ret(max)=22784
find_ex, tid=31612, group_id=1425, block(i)=9952, needed(stripe)=32752, ret(max)=22816
find_ex, tid=31612, group_id=1426, block(i)=9936, needed(stripe)=32752, ret(max)=22832
find_ex, tid=31612, group_id=1427, block(i)=9920, needed(stripe)=32752, ret(max)=22848
find_ex, tid=31612, group_id=1428, block(i)=9904, needed(stripe)=32752, ret(max)=22864
find_ex, tid=31612, group_id=1429, block(i)=9888, needed(stripe)=32752, ret(max)=22880
find_ex, tid=31612, group_id=1430, block(i)=9872, needed(stripe)=32752, ret(max)=22896
find_ex, tid=31612, group_id=1431, block(i)=9856, needed(stripe)=32752, ret(max)=22912
find_ex, tid=31612, group_id=1432, block(i)=9840, needed(stripe)=32752, ret(max)=22928
find_ex, tid=31612, group_id=1433, block(i)=9824, needed(stripe)=32752, ret(max)=22944
find_ex, tid=31612, group_id=1434, block(i)=9808, needed(stripe)=32752, ret(max)=22960
find_ex, tid=31612, group_id=1435, block(i)=9792, needed(stripe)=32752, ret(max)=22976
```

** Reproduce Method **

On a kernel before 6.8 (or with the fallback logic deleted), with ext4 on RAID:

# The higher the stripe, the higher the probability.
# But keep it below 32768 (the default number of blocks in a group), or the request length gets cut.
mount -o remount,stripe=35000 /dev/sdX

# Request an allocation of 35000 blocks; it needs to be stripe-aligned.
./test_C_program 35000

The C program (I removed the error checking to keep it simple):

```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define BLOCK_SIZE 4096

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <request blocks>\n", argv[0]);
        return 1;
    }

    int blocks = atoi(argv[1]);
    off_t FILE_SIZE = (off_t)blocks * BLOCK_SIZE;

    /* fallocate + write issues one big allocation request that the
     * allocator tries to satisfy stripe-aligned */
    int fd = open("source_file.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    fallocate(fd, 0, 0, FILE_SIZE);

    char *data = malloc(FILE_SIZE);
    memset(data, 'A', FILE_SIZE);
    write(fd, data, FILE_SIZE);

    rename("source_file.txt", "target_file.txt");
    printf("rename succeeded\n");

    free(data);
    close(fd);
    return 0;
}
```

If your stripe is relatively small, here is the method from carlos@xxxxxxxxxxxxxx:
https://marc.info/?l=linux-raid&m=170327844709957&w=2

That method may not be exact, but it is very simple for reproducing with a small stripe. On my machine, however, I had to run a parallel version to reproduce:

mkdir 1 2 3 4 5
xzcat linux-6.16.tar.xz | tar x -C ./1 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./2 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./3 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./4 -f - &
xzcat linux-6.16.tar.xz | tar x -C ./5 -f - &

Wait for kworker to flush the dirty pages. You can use `top` to see a kworker at 100% CPU for a long time.

Best Regards,
Mingyu He

-- 
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.