On 4/9/25 5:16 PM, Ming Lei wrote:
>>>>> Not sure it is easy, ->tag_list_lock is exactly for protecting the
>>>>> list of "set->tag_list".
>>>>>
>>>> Please see this, here nvme_quiesce_io_queues doesn't require
>>>> ->tag_list_lock:
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 777db89fdaa7..002d2fd20e0c 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
>>>>  {
>>>>  	if (!ctrl->tagset)
>>>>  		return;
>>>> -	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>>> -		blk_mq_quiesce_tagset(ctrl->tagset);
>>>> -	else
>>>> -		blk_mq_wait_quiesce_done(ctrl->tagset);
>>>> +	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
>>>> +		struct nvme_ns *ns;
>>>> +		int srcu_idx;
>>>> +
>>>> +		srcu_idx = srcu_read_lock(&ctrl->srcu);
>>>> +		list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>>> +				srcu_read_lock_held(&ctrl->srcu)) {
>>>> +			if (!blk_queue_skip_tagset_quiesce(ns->queue))
>>>> +				blk_mq_quiesce_queue_nowait(ns->queue);
>>>> +		}
>>>> +		srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>>> +	}
>>>> +	blk_mq_wait_quiesce_done(ctrl->tagset);
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
>>>>
>>>> Here we iterate through ctrl->namespaces instead of relying on tag_list
>>>> and so we don't need to acquire ->tag_list_lock.
>>>
>>> How can you make sure all NSs are covered in this way? RCU/SRCU can't
>>> provide such kind of guarantee.
>>>
>> Why is that so? In fact, nvme_wait_freeze also iterates through
>> the same ctrl->namespaces to freeze the queue.
>
> It depends on whether nvme error handling needs to cover a newly arriving
> NS; suppose it doesn't care, then you can change to srcu and bypass
> ->tag_list_lock.
>

Yes, a newly arriving NS may not be live yet when we iterate through
ctrl->namespaces, so we don't need to bother about it yet.

>>
>>>>
>>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> So all queues should be frozen first before calling
>>>>>>> blk_mq_update_nr_hw_queues; fortunately that is what nvme is doing.
>>>>>>>
>>>>>>>
>>>>>>>> If yes then it means that we should be able to grab ->elevator_lock
>>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues, and so
>>>>>>>> the locking order in each code path should be:
>>>>>>>>
>>>>>>>> __blk_mq_update_nr_hw_queues
>>>>>>>>   ->elevator_lock
>>>>>>>>     ->freeze_lock
>>>>>>>
>>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
>>>>>>> just makes things worse. Why can't we disable elevator switching
>>>>>>> while updating nr_hw_queues?
>>>>>>>
>>>>>> I couldn't quite understand this, as we already disable the elevator
>>>>>> before updating the sw to hw queue mapping in
>>>>>> __blk_mq_update_nr_hw_queues(). Once the mapping is successful we
>>>>>> switch the elevator back.
>>>>>
>>>>> Yes, but the user may still switch the elevator from none to others
>>>>> during that period, right?
>>>>>
>>>> Yes correct, that's possible. So your suggestion is to disable elevator
>>>> updates while we're running __blk_mq_update_nr_hw_queues? That way the
>>>> user couldn't update the elevator through sysfs (elv_iosched_store)
>>>> while we update nr_hw_queues? If this is true then how would it help
>>>> solve the lockdep splat?
>>>
>>> Then why do you think a per-set lock can solve the lockdep splat?
>>>
>>> __blk_mq_update_nr_hw_queues is the only place where tagset-wide queues
>>> are involved wrt. switching elevator. If elevator switching is not
>>> allowed once __blk_mq_update_nr_hw_queues() has started, why do we need
>>> a per-set lock?
>>>
>> Yes, if elevator switching is not allowed then we probably don't need a
>> per-set lock. However my question was: if we were to disallow elevator
>> switching while __blk_mq_update_nr_hw_queues is running, how would we
>> implement it?
>
> It can be done easily by tag_set->srcu.

Ok, great if that's possible!
But I'm not sure how it could be done in this case. I think both
__blk_mq_update_nr_hw_queues and elv_iosched_store run in the
writer/updater context, so you may still need a lock? Can you please send
across an (informal) patch with your idea?

>
>> Do we need to synchronize with ->tag_list_lock? Or in other words,
>> would elv_iosched_store now depend on ->tag_list_lock?
>
> ->tag_list_lock isn't involved.
>
>>
>> On another note, if we choose to make ->elevator_lock per-set then
>> our locking sequence in blk_mq_update_nr_hw_queues() would be,
>
> There is also add/del disk vs. updating nr_hw_queues; do you want to
> add the per-set lock in the add/del disk path too?

Ideally no, we don't need to acquire ->elevator_lock in this path. Please
see below.

>>
>> blk_mq_update_nr_hw_queues
>>   -> tag_list_lock
>>     -> elevator_lock
>>       -> freeze_lock
>
> Actually the freeze lock is already held for nvme before calling
> blk_mq_update_nr_hw_queues, and it is reasonable to assume the queue is
> frozen for updating nr_hw_queues, so the above order may not match the
> existing code.
>
> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
>

I think we should consider (maybe in a different patch) updating
nvme_quiesce_io_queues and nvme_unquiesce_io_queues to remove their
dependency on ->tag_list_lock.

>>
>> elv_iosched_store
>>   -> elevator_lock
>>     -> freeze_lock
>
> I understand that the per-set elevator_lock is just for avoiding the
> nested elevator lock class acquire? If we needn't consider nvme or
> blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.
>
> It is actually easy to sync elevator store vs. updating nr_hw_queues.
>
>>
>> So now ->freeze_lock should not depend on ->elevator_lock, and that
>> should help avoid a few of the recent lockdep splats reported with
>> fs_reclaim. What do you think?
>
> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many
> fs_reclaim related splats.
>
> However, in del_gendisk(), freeze_lock is still held before calling
> elevator_exit() and blk_unregister_queue(), and that looks not easy to
> reorder.

Yes agreed. However, elevator_exit() called from del_gendisk(), or
elv_unregister_queue() called from blk_unregister_queue(), runs after we
unregister the queue. And if the queue has already been unregistered when
we invoke elevator_exit or del_gendisk, then ideally we don't need to
acquire ->elevator_lock. The same is true for elevator_exit() called from
add_disk_fwnode(). So IMO, we should update these paths to avoid acquiring
->elevator_lock.

Thanks,
--Nilay