On Tue, Sep 09, 2025 at 04:44:04PM +0200, Jan Kara wrote:
> With the lazytime mount option enabled we can be switching many dirty inodes
> on cgroup exit to the parent cgroup. The numbers observed in practice
> when a systemd slice of a large cron job exits can easily reach hundreds
> of thousands or millions. The logic in inode_do_switch_wbs() which sorts
> the inode into the appropriate place in the b_dirty list of the target wb
> however has linear complexity in the number of dirty inodes, so the overall
> time complexity of switching all the inodes is quadratic, leading to
> workers being pegged for hours consuming 100% of the CPU while switching
> inodes to the parent wb.
>
> Simple reproducer of the issue:
> FILES=10000
> # Filesystem mounted with lazytime mount option
> MNT=/mnt/
> echo "Creating files and switching timestamps"
> for (( j = 0; j < 50; j ++ )); do
> mkdir $MNT/dir$j
> for (( i = 0; i < $FILES; i++ )); do
> echo "foo" >$MNT/dir$j/file$i
> done
> touch -a -t 202501010000 $MNT/dir$j/file*
> done
> wait
> echo "Syncing and flushing"
> sync
> echo 3 >/proc/sys/vm/drop_caches
>
> echo "Reading all files from a cgroup"
> mkdir /sys/fs/cgroup/unified/mycg1 || exit
> echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
> for (( j = 0; j < 50; j ++ )); do
> cat /mnt/dir$j/file* >/dev/null &
> done
> wait
> echo "Switching wbs"
> # Now rmdir the cgroup after the script exits
>
> We need to maintain b_dirty list ordering to keep writeback happy, so
> instead of sorting the inode into the appropriate place just append it at
> the end of the list and clobber dirtied_time_when. This may result in inode
> writeback starting later after a cgroup switch; however, cgroup switches are
> rare so it shouldn't matter much. Since the cgroup had write access to
> the inode, there are no practical concerns about possible DoS issues.
>
> Signed-off-by: Jan Kara <jack@xxxxxxx>

Acked-by: Tejun Heo <tj@xxxxxxxxxx>

Thanks.

--
tejun
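For readers less familiar with the writeback lists, the following is a small
user-space sketch (plain C, not kernel code; toy_inode, toy_list and the
timings are purely illustrative, not the actual patch) of the complexity
argument above: keeping the target list sorted costs a list walk per switched
inode, so a bulk cgroup-exit switch is quadratic, while appending at the tail
with the timestamp clobbered to "now" preserves ordering in O(1) per inode.

/*
 * Toy model of the behaviour described above, not kernel code.
 * sorted_insert() mimics the old per-inode sorted placement in the
 * target wb's b_dirty list; tail_append() mimics the fix.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct toy_inode {
	unsigned long dirtied_when;   /* stand-in for the inode timestamp */
	struct toy_inode *next;
};

struct toy_list {
	struct toy_inode *head;
	struct toy_inode *tail;
};

/* Old scheme: walk the list to keep it sorted by dirtied_when (oldest first). */
static void sorted_insert(struct toy_list *list, struct toy_inode *inode)
{
	struct toy_inode **pos = &list->head;

	while (*pos && (*pos)->dirtied_when <= inode->dirtied_when)
		pos = &(*pos)->next;
	inode->next = *pos;
	*pos = inode;
	if (!inode->next)
		list->tail = inode;
}

/* New scheme: clobber the timestamp to "now" and append in O(1). */
static void tail_append(struct toy_list *list, struct toy_inode *inode,
			unsigned long now)
{
	inode->dirtied_when = now;   /* ordering by dirtied_when still holds */
	inode->next = NULL;
	if (list->tail)
		list->tail->next = inode;
	else
		list->head = inode;
	list->tail = inode;
}

static double bench(int n, int sorted)
{
	struct toy_list list = { NULL, NULL };
	struct toy_inode *inodes = calloc(n, sizeof(*inodes));
	clock_t start = clock();
	int i;

	for (i = 0; i < n; i++) {
		/* random old timestamps, as left behind by lazytime updates */
		inodes[i].dirtied_when = rand();
		if (sorted)
			sorted_insert(&list, &inodes[i]);
		else
			tail_append(&list, &inodes[i], (unsigned long)n + i);
	}
	free(inodes);
	return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
	/* already visibly slow for the sorted variant; scale up to see it blow up */
	int n = 50000;

	printf("switching %d inodes, sorted insert: %.2fs\n", n, bench(n, 1));
	printf("switching %d inodes, tail append:   %.2fs\n", n, bench(n, 0));
	return 0;
}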