On 2025/6/28 2:31, Jan Kara wrote:
> On Mon 23-06-25 15:32:52, Baokun Li wrote:
>> When allocating data blocks, if the first try (goal allocation) fails and
>> stream allocation is on, it tries a global goal starting from the last
>> group we used (s_mb_last_group). This helps cluster large files together
>> to reduce free space fragmentation, and the data block contiguity also
>> accelerates write-back to disk.
>>
>> However, when multiple processes allocate blocks, having just one global
>> goal means they all fight over the same group. This drastically lowers
>> the chances of extents merging and leads to much worse file fragmentation.
>>
>> To mitigate this multi-process contention, we now employ multiple global
>> goals, with the number of goals being the CPU count rounded up to the
>> nearest power of 2. To ensure a consistent goal for each inode, we select
>> the corresponding goal by taking the inode number modulo the total number
>> of goals.
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                    | Kunpeng 920 / 512GB -P80 | AMD 9654 / 1536GB -P96 |
>> Disk: 960GB SSD    |--------------------------|------------------------|
>>                    | base  |     patched      | base  |    patched     |
>> -------------------|-------|------------------|-------|----------------|
>> mb_optimize_scan=0 | 7612  | 19699 (+158%)    | 21647 | 53093 (+145%)  |
>> mb_optimize_scan=1 | 7568  | 9862 (+30.3%)    | 9117  | 14401 (+57.9%) |
>> Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
...
>> +/*
>> + * Number of mb last groups
>> + */
>> +#ifdef CONFIG_SMP
>> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
>> +#else
>> +#define MB_LAST_GROUPS 1
>> +#endif
>> +
> I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> distribution kernels (it is just a theoretical maximum for the number of
> CPUs the kernel can support)
nr_cpu_ids is generally equal to num_possible_cpus(). Only when
CONFIG_FORCE_NR_CPUS is enabled is nr_cpu_ids set to NR_CPUS, which
represents the maximum number of supported CPUs.
> which seems like far too much for small
> filesystems with say 100 block groups.
That does make sense.
> I'd rather pick the array size like:
>
>   min(num_possible_cpus(), sbi->s_groups_count/4)
>
> to
> a) don't have too many slots so we still concentrate big allocations in
>    somewhat limited area of the filesystem (a quarter of block groups here).
> b) have at most one slot per CPU the machine hardware can in principle
>    support.
>
> 								Honza
You're right, we should consider the number of block groups when setting
the number of global goals.

However, a server's rootfs can often be quite small, perhaps only tens of
GBs, while the machine has many CPUs. In such cases, sbi->s_groups_count / 4
might still limit the filesystem's scalability. Furthermore, after LBS
support lands, the number of block groups will decrease sharply.

How about we directly use sbi->s_groups_count instead (making the effective
count min(num_possible_cpus(), sbi->s_groups_count))? This would also avoid
zero values.
Cheers,
Baokun