Based on the parallel writeback testing on XFS [1] and prior discussions,
we believe that the features and architecture of a filesystem must be
considered to optimize parallel writeback performance. We introduce a
filesystem interface to control the assignment of inodes to writeback
contexts, based on the following insights:

- Following Dave's earlier suggestion [2], filesystems should determine
  both the number of writeback contexts and how inodes are assigned to
  them. We therefore provide an interface for filesystems to customize
  their inode assignment strategy for writeback.

- Instead of dynamically changing the number of writeback contexts at
  runtime, we let each filesystem decide at initialization how many
  contexts it requires, and push inodes only to those designated
  contexts.

To implement this, we made the following changes (a rough sketch of how
the pieces could fit together is appended at the end of this mail):

- Introduce get_inode_wb_ctx_idx() in super_operations, called from
  fetch_bdi_writeback_ctx(), allowing a filesystem to provide the
  writeback context index for an inode. This generic interface can be
  extended to all filesystems.

- Implement the XFS adaptation. To address contention during delayed
  allocation, all inodes from the same allocation group are bound to a
  dedicated writeback context.

Rerunning the test from [1], we obtained the results below. Our approach
achieves performance similar to nr_wb_ctx = 4, but shows no further
improvement. The perf data we collected indicates that lock contention
during delayed allocation remains unresolved.

System config:
Number of CPUs = 8
System RAM = 4G
XFS number of AGs = 4
NVMe SSD of 20GB (emulated via QEMU)

Result:

Default:
Parallel Writeback (nr_wb_ctx = 1) : 16.4MiB/s
Parallel Writeback (nr_wb_ctx = 2) : 32.3MiB/s
Parallel Writeback (nr_wb_ctx = 3) : 39.0MiB/s
Parallel Writeback (nr_wb_ctx = 4) : 47.3MiB/s
Parallel Writeback (nr_wb_ctx = 5) : 45.7MiB/s
Parallel Writeback (nr_wb_ctx = 6) : 46.0MiB/s
Parallel Writeback (nr_wb_ctx = 7) : 42.7MiB/s
Parallel Writeback (nr_wb_ctx = 8) : 40.8MiB/s

After optimization (4 AGs utilized):
Parallel Writeback (nr_wb_ctx = 8) : 47.1MiB/s (4 active contexts)

These results lead to the following questions:

1. How can we design workloads that better expose the lock contention
   of delayed allocation?

2. Given the lack of performance improvement, is there an oversight or
   misunderstanding in the implementation of the XFS interface, or is
   there some other performance bottleneck?

[1] https://lore.kernel.org/linux-fsdevel/CALYkqXpOBb1Ak2kEKWbO2Kc5NaGwb4XsX1q4eEaNWmO_4SQq9w@xxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/Z5qw_1BOqiFum5Dn@xxxxxxxxxxxxxxxxxxx/

wangyufei (2):
  writeback: add support for filesystems to affine inodes to specific writeback ctx
  xfs: implement get_inode_wb_ctx_idx() for per-AG parallel writeback

 fs/xfs/xfs_super.c          | 14 ++++++++++++++
 include/linux/backing-dev.h |  3 +++
 include/linux/fs.h          |  1 +
 3 files changed, 18 insertions(+)

-- 
2.34.1
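
To make the proposal concrete, here is a minimal sketch of the two pieces
described above. It is illustrative only: the prototype of
->get_inode_wb_ctx_idx(), the return type of fetch_bdi_writeback_ctx(),
and the bdi fields nr_wb_ctx / wb_ctx_arr are assumptions standing in for
whatever the parallel writeback series actually defines; only XFS_I(),
XFS_INO_TO_AGNO(), sb_agcount and inode_to_bdi() are existing kernel
definitions.

/* include/linux/backing-dev.h (sketch): let the filesystem pick the context */
static inline struct bdi_writeback_ctx *
fetch_bdi_writeback_ctx(struct inode *inode)
{
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	long idx = 0;

	if (inode->i_sb->s_op->get_inode_wb_ctx_idx)
		idx = inode->i_sb->s_op->get_inode_wb_ctx_idx(inode,
							      bdi->nr_wb_ctx);
	return bdi->wb_ctx_arr[idx];
}

/* fs/xfs/xfs_super.c (sketch): bind all inodes of an AG to one context */
static long xfs_get_inode_wb_ctx_idx(struct inode *inode, long nr_wb_ctx)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);

	/*
	 * Every inode of an allocation group maps to the same writeback
	 * context, so delalloc conversion for one AG is issued from a
	 * single context; never exceed the number of contexts the bdi
	 * provides.
	 */
	return agno % min_t(long, mp->m_sb.sb_agcount, nr_wb_ctx);
}

With 4 AGs and nr_wb_ctx = 8, such a mapping would leave 4 contexts
active, which matches the "4 active contexts" result reported above.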