Hello Tejun,

On Tue 09-09-25 06:52:27, Tejun Heo wrote:
> On Tue, Sep 09, 2025 at 04:44:02PM +0200, Jan Kara wrote:
> > There can be multiple inode switch works that are trying to switch
> > inodes to / from the same wb. This can happen in particular if some
> > cgroup exits which owns many (thousands of) inodes and we need to
> > switch them all. In this case several inode_switch_wbs_work_fn()
> > instances will just be spinning on the same wb->list_lock while only
> > one of them makes forward progress. This wastes CPU cycles and quickly
> > leads to softlockup reports and an unusable system.
> >
> > Instead of running several inode_switch_wbs_work_fn() instances in
> > parallel switching to the same wb and contending on wb->list_lock, run
> > just one instance and let the other isw items switching to this wb
> > queue behind the one being processed.
> >
> > Signed-off-by: Jan Kara <jack@xxxxxxx>
> ...
> > +static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +{
> > +        struct inode_switch_wbs_context *isw =
> > +                container_of(to_rcu_work(work),
> > +                             struct inode_switch_wbs_context, work);
> > +        struct bdi_writeback *new_wb = isw->new_wb;
> > +        bool switch_running;
> > +
> > +        spin_lock_irq(&new_wb->work_lock);
> > +        switch_running = !list_empty(&new_wb->switch_wbs_ctxs);
> > +        list_add_tail(&isw->list, &new_wb->switch_wbs_ctxs);
> > +        spin_unlock_irq(&new_wb->work_lock);
> > +
> > +        /*
> > +         * Let's leave the real work for the running worker since we'd
> > +         * just contend with it on wb->list_lock anyway.
> > +         */
> > +        if (switch_running)
> > +                return;
> > +
> > +        /* OK, we will be doing the switching work */
> > +        wb_get(new_wb);
> > +        spin_lock_irq(&new_wb->work_lock);
> > +        while (!list_empty(&new_wb->switch_wbs_ctxs)) {
> > +                isw = list_first_entry(&new_wb->switch_wbs_ctxs,
> > +                                struct inode_switch_wbs_context, list);
> > +                spin_unlock_irq(&new_wb->work_lock);
> > +                process_inode_switch_wbs_work(isw);
> > +                spin_lock_irq(&new_wb->work_lock);
> > +                list_del(&isw->list);
> > +                kfree(isw);
> > +        }
> > +        spin_unlock_irq(&new_wb->work_lock);
> > +        wb_put(new_wb);
> > +}
>
> Would it be easier to achieve the same effect if we just reduced
> @max_active when creating inode_switch_wbs? If we update
> cgroup_writeback_init() to use the following instead:
>
>         isw_wq = alloc_workqueue("inode_switch_wbs", WQ_UNBOUND, 1);
>
> Wouldn't that achieve the same thing? Note the addition of WQ_UNBOUND
> isn't strictly necessary but we're in the process of defaulting to
> unbound workqueues, so might as well update it together. I can't think
> of any reason why this would require per-cpu behavior.

Well, reducing @max_active to 1 will certainly deal with the list_lock
contention as well. But I didn't want to do that because, on a busy
container system, I assume there can be switching happening between
different pairs of cgroups. With the approach in this patch, switches with
different target cgroups can still run in parallel. I don't have any
real-world data to back that assumption, so if you think this parallelism
isn't really needed and we are fine with at most one switch happening in
the system at a time, switching max_active to 1 is certainly simple enough.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
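
[As an illustration of the scheme under discussion, here is a minimal
userspace sketch of the queue-behind-the-running-worker pattern from the
patch. Everything in it is a hypothetical stand-in, not kernel code: a
pthread mutex plays the role of wb->work_lock, the linked list the role of
wb->switch_wbs_ctxs, and do_switch_work() the role of
process_inode_switch_wbs_work(); the names struct target, struct item and
submit() are invented for the sketch. Build with "cc -pthread".

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct item {
                struct item *next;
                int id;
        };

        struct target {
                pthread_mutex_t lock;     /* stands in for wb->work_lock */
                struct item *head, *tail; /* stands in for switch_wbs_ctxs */
        };

        static void do_switch_work(struct item *it)
        {
                /* The real work (taking wb->list_lock, moving inodes)
                 * would go here. */
                printf("switching item %d\n", it->id);
        }

        static void submit(struct target *t, struct item *it)
        {
                int drainer;

                pthread_mutex_lock(&t->lock);
                /* Whoever finds the list empty becomes the drainer. */
                drainer = (t->head == NULL);
                it->next = NULL;
                if (t->tail)
                        t->tail->next = it;
                else
                        t->head = it;
                t->tail = it;
                pthread_mutex_unlock(&t->lock);

                /* A drainer is already running; it will pick our item up. */
                if (!drainer)
                        return;

                /* (The kernel version also pins new_wb with wb_get() /
                 * wb_put() around this loop.) */
                pthread_mutex_lock(&t->lock);
                while (t->head != NULL) {
                        it = t->head;
                        pthread_mutex_unlock(&t->lock);
                        do_switch_work(it);
                        pthread_mutex_lock(&t->lock);
                        /*
                         * Dequeue only after processing: the entry being
                         * worked on keeps the list non-empty, so concurrent
                         * submitters see a running drainer and return early
                         * instead of contending.
                         */
                        t->head = it->next;
                        if (t->head == NULL)
                                t->tail = NULL;
                        free(it);
                }
                pthread_mutex_unlock(&t->lock);
        }

        int main(void)
        {
                struct target t = { .lock = PTHREAD_MUTEX_INITIALIZER };
                int i;

                for (i = 0; i < 3; i++) {
                        struct item *it = calloc(1, sizeof(*it));

                        if (!it)
                                return 1;
                        it->id = i;
                        submit(&t, it);
                }
                return 0;
        }

The property Jan is defending is visible in the sketch: each struct target
has its own drainer, so submissions against two different targets can be
processed concurrently, whereas max_active = 1 on isw_wq would serialize
every switch in the system behind a single worker.]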