On 4/9/25 5:16 PM, Ming Lei wrote:
>>>>> Not sure it is easy, ->tag_list_lock is exactly for protecting the
>>>>> list of "set->tag_list".
>>>>>
>>>> Please see this, here nvme_quiesce_io_queues doesn't require
>>>> ->tag_list_lock:
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 777db89fdaa7..002d2fd20e0c 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
>>>>  {
>>>>  	if (!ctrl->tagset)
>>>>  		return;
>>>> -	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>>> -		blk_mq_quiesce_tagset(ctrl->tagset);
>>>> -	else
>>>> -		blk_mq_wait_quiesce_done(ctrl->tagset);
>>>> +	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
>>>> +		struct nvme_ns *ns;
>>>> +		int srcu_idx;
>>>> +
>>>> +		srcu_idx = srcu_read_lock(&ctrl->srcu);
>>>> +		list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>>> +				srcu_read_lock_held(&ctrl->srcu)) {
>>>> +			if (!blk_queue_skip_tagset_quiesce(ns->queue))
>>>> +				blk_mq_quiesce_queue_nowait(ns->queue);
>>>> +		}
>>>> +		srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>>> +	}
>>>> +	blk_mq_wait_quiesce_done(ctrl->tagset);
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
>>>>
>>>> Here we iterate through ctrl->namespaces instead of relying on tag_list
>>>> and so we don't need to acquire ->tag_list_lock.
>>>
>>> How can you make sure all NSs are covered in this way? RCU/SRCU can't
>>> provide such kind of guarantee.
>>>
>> Why is that so? In fact, nvme_wait_freeze also iterates through
>> the same ctrl->namespaces to freeze the queue.
>
> It depends on whether nvme error handling needs to cover a newly arriving
> NS; suppose it doesn't care, then you can change to srcu and bypass
> ->tag_list_lock.
>

Yes, a newly arriving NS may not be live yet when we iterate through
ctrl->namespaces, so we don't need to bother about it yet.

>>
>>>>
>>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> So all queues should be frozen first before calling
>>>>>>> blk_mq_update_nr_hw_queues; fortunately that is what nvme is doing.
>>>>>>>
>>>>>>>
>>>>>>>> If yes then it means that we should be able to grab ->elevator_lock
>>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues, and so
>>>>>>>> the locking order in each code path should be:
>>>>>>>>
>>>>>>>> __blk_mq_update_nr_hw_queues
>>>>>>>>   ->elevator_lock
>>>>>>>>     ->freeze_lock
>>>>>>>
>>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
>>>>>>> just makes things worse. Why can't we disable elevator switching
>>>>>>> while updating nr_hw_queues?
>>>>>>>
>>>>>> I couldn't quite understand this, as we already disable the elevator
>>>>>> before updating the sw to hw queue mapping in
>>>>>> __blk_mq_update_nr_hw_queues(). Once the mapping is successful we
>>>>>> switch the elevator back.
>>>>>
>>>>> Yes, but the user may still switch the elevator from none to others
>>>>> during that period, right?
>>>>>
>>>> Yes correct, that's possible. So your suggestion is to disable elevator
>>>> updates while we're running __blk_mq_update_nr_hw_queues? That way the
>>>> user couldn't update the elevator through sysfs (elv_iosched_store)
>>>> while we update nr_hw_queues? If this is true then how would it help
>>>> solve the lockdep splat?
>>>
>>> Then why do you think a per-set lock can solve the lockdep splat?
>>>
>>> __blk_mq_update_nr_hw_queues is the only place where tagset-wide queues
>>> are involved wrt. switching elevator. If elevator switching is not
>>> allowed once __blk_mq_update_nr_hw_queues() has started, why do we need
>>> a per-set lock?
>>>
>> Yes, if elevator switching is not allowed then we probably don't need a
>> per-set lock. However my question was: if we were to disallow elevator
>> switching while __blk_mq_update_nr_hw_queues is running, how would we
>> implement it?
>
> It can be done easily by tag_set->srcu.

Ok, great if that's possible!
But I'm not sure how it could be done in this case. I think both
__blk_mq_update_nr_hw_queues and elv_iosched_store run in the
writer/updater context, so you may still need a lock? Can you please send
across an (informal) patch with your idea?

>
>> Do we need to synchronize with ->tag_list_lock? Or in other words,
>> would elv_iosched_store now depend on ->tag_list_lock?
>
> ->tag_list_lock isn't involved.
>
>>
>> On another note, if we choose to make ->elevator_lock per-set then
>> our locking sequence in blk_mq_update_nr_hw_queues() would be,
>
> There is also add/del disk vs. updating nr_hw_queues; do you want to
> add the per-set lock in the add/del disk path too?

Ideally no, we don't need to acquire ->elevator_lock in this path. Please
see below.

>>
>> blk_mq_update_nr_hw_queues
>>   -> tag_list_lock
>>     -> elevator_lock
>>       -> freeze_lock
>
> Actually the freeze lock is already held for nvme before calling
> blk_mq_update_nr_hw_queues, and it is reasonable to assume the queue is
> frozen for updating nr_hw_queues, so the above order may not match the
> existing code.
>
> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
>

I think we should consider (maybe in a different patch) updating
nvme_quiesce_io_queues and nvme_unquiesce_io_queues to remove their
dependency on ->tag_list_lock.

>>
>> elv_iosched_store
>>   -> elevator_lock
>>     -> freeze_lock
>
> I understand that the per-set elevator_lock is just for avoiding the
> nested elevator lock class acquire? If we needn't consider nvme or
> blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.
>
> It is actually easy to sync elevator store vs. updating nr_hw_queues.
>
>>
>> So now ->freeze_lock should not depend on ->elevator_lock, and that
>> should help avoid a few of the recent lockdep splats reported with
>> fs_reclaim. What do you think?
>
> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many
> fs_reclaim related splats.
>
> However, in del_gendisk(), freeze_lock is still held before calling
> elevator_exit() and blk_unregister_queue(), and that looks not easy to
> reorder.

Yes agreed. However, elevator_exit() called from del_gendisk(), or
elv_unregister_queue() called from blk_unregister_queue(), runs after we
unregister the queue. And if the queue has already been unregistered when
we invoke elevator_exit or del_gendisk, then ideally we don't need to
acquire ->elevator_lock. The same is true for elevator_exit() called from
add_disk_fwnode(). So IMO, we should update these paths to avoid acquiring
->elevator_lock.

Thanks,
--Nilay