Based on the parallel writeback testing on XFS [1] and prior discussions,
we believe that the features and architecture of a filesystem must be
considered to optimize parallel writeback performance. We introduce a
filesystem interface to control the assignment of inodes to writeback
contexts, based on the following insights:

- Following Dave's earlier suggestion [2], filesystems should determine
  both the number of writeback contexts and how inodes are assigned to
  them. We therefore provide an interface for filesystems to customize
  their inode assignment strategy for writeback.

- Instead of dynamically changing the number of writeback contexts at
  runtime, we let each filesystem decide at initialization how many
  contexts it requires, and push inodes only to those designated
  contexts.

To implement this, we made the following changes (a rough sketch of how
the pieces could fit together is appended at the end of this mail):

- Introduce get_inode_wb_ctx_idx() in super_operations, called from
  fetch_bdi_writeback_ctx(), allowing a filesystem to provide the
  writeback context index for an inode. This generic interface can be
  extended to all filesystems.

- Implement the XFS adaptation. To address contention during delayed
  allocation, all inodes from the same allocation group are bound to a
  dedicated writeback context.

Rerunning the test from [1], we obtained the results below. Our approach
achieves performance similar to nr_wb_ctx = 4, but shows no further
improvement. The perf data we collected indicates that lock contention
during delayed allocation remains unresolved.

System config:
Number of CPUs = 8
System RAM = 4G
XFS number of AGs = 4
NVMe SSD of 20GB (emulated via QEMU)

Result:

Default:
Parallel Writeback (nr_wb_ctx = 1) : 16.4MiB/s
Parallel Writeback (nr_wb_ctx = 2) : 32.3MiB/s
Parallel Writeback (nr_wb_ctx = 3) : 39.0MiB/s
Parallel Writeback (nr_wb_ctx = 4) : 47.3MiB/s
Parallel Writeback (nr_wb_ctx = 5) : 45.7MiB/s
Parallel Writeback (nr_wb_ctx = 6) : 46.0MiB/s
Parallel Writeback (nr_wb_ctx = 7) : 42.7MiB/s
Parallel Writeback (nr_wb_ctx = 8) : 40.8MiB/s

After optimization (4 AGs utilized):
Parallel Writeback (nr_wb_ctx = 8) : 47.1MiB/s (4 active contexts)

These results lead to the following questions:

1. How can we design workloads that better expose the lock contention
   of delayed allocation?

2. Given the lack of performance improvement, is there an oversight or
   misunderstanding in the implementation of the XFS interface, or is
   there some other performance bottleneck?

[1] https://lore.kernel.org/linux-fsdevel/CALYkqXpOBb1Ak2kEKWbO2Kc5NaGwb4XsX1q4eEaNWmO_4SQq9w@xxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/Z5qw_1BOqiFum5Dn@xxxxxxxxxxxxxxxxxxx/

wangyufei (2):
  writeback: add support for filesystems to affine inodes to specific writeback ctx
  xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback

 fs/xfs/xfs_super.c          | 14 ++++++++++++++
 include/linux/backing-dev.h |  3 +++
 include/linux/fs.h          |  1 +
 3 files changed, 18 insertions(+)

-- 
2.34.1
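
To make the proposal concrete, here is a minimal sketch of the two pieces
described above. It is illustrative only: the prototype of
->get_inode_wb_ctx_idx(), the return type of fetch_bdi_writeback_ctx(),
and the bdi fields nr_wb_ctx / wb_ctx_arr are assumptions standing in for
whatever the parallel writeback series actually defines; only XFS_I(),
XFS_INO_TO_AGNO(), sb_agcount and inode_to_bdi() are existing kernel
definitions.

/* include/linux/backing-dev.h (sketch): let the filesystem pick the context */
static inline struct bdi_writeback_ctx *
fetch_bdi_writeback_ctx(struct inode *inode)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	long idx = 0;

	if (inode->i_sb->s_op->get_inode_wb_ctx_idx)
		idx = inode->i_sb->s_op->get_inode_wb_ctx_idx(inode,
							      bdi->nr_wb_ctx);
	return bdi->wb_ctx_arr[idx];
}

/* fs/xfs/xfs_super.c (sketch): bind all inodes of an AG to one context */
static long xfs_get_inode_wb_ctx_idx(struct inode *inode, long nr_wb_ctx)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);

	/*
	 * Every inode of an allocation group maps to the same writeback
	 * context, so delalloc conversion for one AG is issued from a
	 * single context; never exceed the number of contexts the bdi
	 * provides.
	 */
	return agno % min_t(long, mp->m_sb.sb_agcount, nr_wb_ctx);
}

With 4 AGs and nr_wb_ctx = 8, such a mapping would leave 4 contexts
active, which matches the "4 active contexts" result reported above.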