On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/9/25 5:16 PM, Ming Lei wrote:
> >>>>> Not sure it is easy; ->tag_list_lock is exactly for protecting the
> >>>>> list of "set->tag_list".
> >>>>>
> >>>> Please see this: here nvme_quiesce_io_queues() doesn't require ->tag_list_lock:
> >>>>
> >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>>> index 777db89fdaa7..002d2fd20e0c 100644
> >>>> --- a/drivers/nvme/host/core.c
> >>>> +++ b/drivers/nvme/host/core.c
> >>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
> >>>>  {
> >>>>  	if (!ctrl->tagset)
> >>>>  		return;
> >>>> -	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> >>>> -		blk_mq_quiesce_tagset(ctrl->tagset);
> >>>> -	else
> >>>> -		blk_mq_wait_quiesce_done(ctrl->tagset);
> >>>> +	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
> >>>> +		struct nvme_ns *ns;
> >>>> +		int srcu_idx;
> >>>> +
> >>>> +		srcu_idx = srcu_read_lock(&ctrl->srcu);
> >>>> +		list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> >>>> +				srcu_read_lock_held(&ctrl->srcu)) {
> >>>> +			if (!blk_queue_skip_tagset_quiesce(ns->queue))
> >>>> +				blk_mq_quiesce_queue_nowait(ns->queue);
> >>>> +		}
> >>>> +		srcu_read_unlock(&ctrl->srcu, srcu_idx);
> >>>> +	}
> >>>> +	blk_mq_wait_quiesce_done(ctrl->tagset);
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
> >>>>
> >>>> Here we iterate through ctrl->namespaces instead of relying on tag_list,
> >>>> and so we don't need to acquire ->tag_list_lock.
> >>>
> >>> How can you make sure all NSs are covered in this way? RCU/SRCU can't
> >>> provide that kind of guarantee.
> >>>
> >> Why is that so? In fact, nvme_wait_freeze() also iterates through
> >> the same ctrl->namespaces to freeze the queues.
> > 
> > It depends on whether nvme error handling needs to cover a newly arrived NS;
> > suppose it doesn't care, then you can change to srcu and bypass
> > ->tag_list_lock.
> > 
> Yes, a newly arrived NS may not be live yet when we iterate through
> ctrl->namespaces, so we don't need to bother about it yet.
> 
> >>
> >>>>
> >>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues();
> >>>>>>> fortunately that is what nvme is doing.
> >>>>>>>
> >>>>>>>
> >>>>>>>> If yes, then it means that we should be able to grab ->elevator_lock
> >>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues(), and so the
> >>>>>>>> locking order in each code path should be:
> >>>>>>>>
> >>>>>>>> __blk_mq_update_nr_hw_queues
> >>>>>>>>   ->elevator_lock
> >>>>>>>>   ->freeze_lock
> >>>>>>>
> >>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> >>>>>>> just makes things worse. Why can't we disable elevator switching while
> >>>>>>> updating nr_hw_queues?
> >>>>>>>
> >>>>>> I couldn't quite understand this, as we already disable the elevator
> >>>>>> before updating the sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> >>>>>> Once the mapping is successful we switch the elevator back.
> >>>>>
> >>>>> Yes, but the user may still switch the elevator from none to another during
> >>>>> that period, right?
> >>>>>
> >>>> Yes correct, that's possible. So your suggestion was to disable elevator
> >>>> updates while we're running __blk_mq_update_nr_hw_queues()? That way the
> >>>> user couldn't update the elevator through sysfs (elv_iosched_store()) while
> >>>> we update nr_hw_queues? If so, then how would that help solve the lockdep
> >>>> splat?
> >>>
> >>> Then why do you think a per-set lock can solve the lockdep splat?
> >>>
> >>> __blk_mq_update_nr_hw_queues() is the only place where tagset-wide queues
> >>> are involved wrt. switching elevator. If elevator switching is not allowed
> >>> while __blk_mq_update_nr_hw_queues() is running, why do we need a per-set
> >>> lock?
> >>>
> >> Yes, if elevator switching is not allowed then we probably don't need a
> >> per-set lock. However, my question was: if we were to disallow elevator
> >> switching while __blk_mq_update_nr_hw_queues() is running, how would we
> >> implement it?
> > 
> > It can be done easily with tag_set->srcu.
> 
> OK, great if that's possible! But I'm not sure how it could be done in this
> case. I think both __blk_mq_update_nr_hw_queues() and elv_iosched_store()
> run in writer/updater context, so you may still need a lock? Can you
> please send across an (informal) patch with your idea?

Please see the attached patch, which treats elv_iosched_store() as a reader.

> > 
> >> Do we need to synchronize with ->tag_list_lock? Or in other words, would
> >> elv_iosched_store() now depend on ->tag_list_lock?
> > 
> > ->tag_list_lock isn't involved.
> > 
> >>
> >> On another note, if we choose to make ->elevator_lock per-set then
> >> our locking sequence in blk_mq_update_nr_hw_queues() would be,
> > 
> > There is also add/del disk vs. updating nr_hw_queues; do you want to
> > add the per-set lock in the add/del disk path too?
> 
> Ideally no, we don't need to acquire ->elevator_lock in this path.
> Please see below.

blk_mq_update_nr_hw_queues() can come at any time, and there is still a race
between blk_mq_update_nr_hw_queues() and add/del disk.

> 
> >>
> >> blk_mq_update_nr_hw_queues
> >>   -> tag_list_lock
> >>   -> elevator_lock
> >>   -> freeze_lock
> > 
> > Actually the freeze lock is already held for nvme before calling
> > blk_mq_update_nr_hw_queues(), and it is reasonable to assume the queue is
> > frozen while updating nr_hw_queues, so the above order may not match the
> > existing code.
> > 
> > Do we need to consider nvme or blk_mq_update_nr_hw_queues() now?
> 
> I think we should consider (maybe in a different patch) updating
> nvme_quiesce_io_queues() and nvme_unquiesce_io_queues() to remove
> their dependency on ->tag_list_lock.
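For illustration, the ordering proposed above can be sketched with plain
mutexes. This is a hypothetical userspace model only: the real kernel locks
are not plain mutexes, and the function names below are stand-ins for the two
paths discussed. The point is that every multi-lock path acquires its locks
consistently with one global order (each path takes a contiguous suffix or the
whole of it), which is what keeps lockdep quiet.

```c
#include <pthread.h>

/*
 * Hypothetical sketch of the lock order proposed in this thread:
 * tag_list_lock -> elevator_lock -> freeze_lock.  Not kernel code.
 */
static pthread_mutex_t tag_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t elevator_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t freeze_lock   = PTHREAD_MUTEX_INITIALIZER;

static int order[3], pos; /* records acquisition order for inspection */

/* blk_mq_update_nr_hw_queues()-like path: takes all three, in order. */
static void update_nr_hw_queues(void)
{
	pthread_mutex_lock(&tag_list_lock); order[pos++] = 1;
	pthread_mutex_lock(&elevator_lock); order[pos++] = 2;
	pthread_mutex_lock(&freeze_lock);   order[pos++] = 3;
	/* ... remap sw to hw queues ... */
	pthread_mutex_unlock(&freeze_lock);
	pthread_mutex_unlock(&elevator_lock);
	pthread_mutex_unlock(&tag_list_lock);
}

/*
 * elv_iosched_store()-like path: takes a suffix of the same order, so it
 * can never invert against the path above.
 */
static void switch_elevator(void)
{
	pthread_mutex_lock(&elevator_lock);
	pthread_mutex_lock(&freeze_lock);
	/* ... swap the scheduler ... */
	pthread_mutex_unlock(&freeze_lock);
	pthread_mutex_unlock(&elevator_lock);
}
```

As the reply below notes, this order is exactly what breaks once a driver such
as nvme has already frozen its queues before calling into the update path,
since the freeze side is then acquired first.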
If we need to take nvme into account, the above lock order doesn't work,
because nvme freezes queues before calling blk_mq_update_nr_hw_queues(), and
the elevator lock still depends on the freeze lock. If it needn't be
considered, the per-set lock becomes necessary.

> 
> >>
> >> elv_iosched_store
> >>   -> elevator_lock
> >>   -> freeze_lock
> > 
> > I understand that the per-set elevator_lock is just for avoiding the
> > nested elevator lock class acquire? If we needn't consider nvme or
> > blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.
> > 
> > It is actually easy to sync elevator store vs. updating nr_hw_queues.
> > 
> >>
> >> So now ->freeze_lock should not depend on ->elevator_lock, and that shall
> >> help avoid a few of the recent lockdep splats reported with fs_reclaim.
> >> What do you think?
> > 
> > Yes, reordering ->freeze_lock and ->elevator_lock may avoid many
> > fs_reclaim-related splats.
> > 
> > However, in del_gendisk(), freeze_lock is still held before calling
> > elevator_exit() and blk_unregister_queue(), and that looks not easy to
> > reorder.
> 
> Yes, agreed. However, elevator_exit() called from del_gendisk(), or
> elv_unregister_queue() called from blk_unregister_queue(), runs after we
> unregister the queue. And if the queue has already been unregistered when
> invoking elevator_exit() or del_gendisk(), then ideally we don't need to
> acquire ->elevator_lock. The same is true for elevator_exit() called from
> add_disk_fwnode(). So IMO, we should update these paths to avoid acquiring
> ->elevator_lock.

As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is a
host-wide event, so either a lock or srcu sync is needed.

Thanks,
Ming