On 6/30/25 11:54 AM, Hannes Reinecke wrote: > On 6/30/25 07:21, Nilay Shroff wrote: >> Move scheduler tags (sched_tags) allocation and deallocation outside >> both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues. >> This change breaks the dependency chain from the percpu allocator lock >> to the elevator lock, helping to prevent potential deadlocks, as >> observed in the reported lockdep splat[1]. >> >> This commit introduces batch allocation and deallocation helpers for >> sched_tags, which are now used from within __blk_mq_update_nr_hw_queues >> routine while iterating through the tagset. >> >> With this change, all sched_tags memory management is handled entirely >> outside the ->elevator_lock and the ->freeze_lock context, thereby >> eliminating the lock dependency that could otherwise manifest during >> nr_hw_queues updates. >> >> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@xxxxxxxxxxxxx/ >> >> Reported-by: Stefan Haberland <sth@xxxxxxxxxxxxx> >> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@xxxxxxxxxxxxx/ >> Signed-off-by: Nilay Shroff <nilay@xxxxxxxxxxxxx> >> --- >> block/blk-mq-sched.c | 63 ++++++++++++++++++++++++++++++++++++++++++++ >> block/blk-mq-sched.h | 4 +++ >> block/blk-mq.c | 11 +++++++- >> block/blk.h | 2 +- >> block/elevator.c | 4 +-- >> 5 files changed, 80 insertions(+), 4 deletions(-) >> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >> index 2d6d1ebdd8fb..da802df34a8c 100644 >> --- a/block/blk-mq-sched.c >> +++ b/block/blk-mq-sched.c >> @@ -427,6 +427,30 @@ void blk_mq_free_sched_tags(struct elevator_tags *et, >> kfree(et); >> } >> +void blk_mq_free_sched_tags_batch(struct xarray *et_table, >> + struct blk_mq_tag_set *set) >> +{ >> + struct request_queue *q; >> + struct elevator_tags *et; >> + >> + lockdep_assert_held_write(&set->update_nr_hwq_lock); >> + >> + list_for_each_entry(q, &set->tag_list, tag_set_list) { >> + /* >> + * Accessing q->elevator without holding q->elevator_lock is >> + * safe because we're holding here set->update_nr_hwq_lock in >> + * the writer context. So, scheduler update/switch code (which >> + * acquires the same lock but in the reader context) can't run >> + * concurrently. >> + */ >> + if (q->elevator) { >> + et = xa_load(et_table, q->id); >> + if (et) >> + blk_mq_free_sched_tags(et, set); >> + } >> + } >> +} >> + > Odd. Do we allow empty sched_tags? > Why isn't this an error (or at least a warning)? > We don't allow empty sched_tags and xa_load is not supposed to return an empty sched_tags. It's just a paranoia guard that we evaluate whether @et is non-NULL. As this function returns nothing we don't return any error code to the caller here. But yes, we may add a warning here if that seems reasonable. I'd add a 'WARN_ON(!et)' and update it in the next patch. Thanks, --Nilay