On Wed, Apr 09, 2025 at 07:16:03PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/9/25 5:16 PM, Ming Lei wrote:
> >>>>> Not sure it is easy; ->tag_list_lock is exactly for protecting the
> >>>>> list of "set->tag_list".
> >>>>>
> >>>> Please see this: here nvme_quiesce_io_queues() doesn't require ->tag_list_lock:
> >>>>
> >>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>>> index 777db89fdaa7..002d2fd20e0c 100644
> >>>> --- a/drivers/nvme/host/core.c
> >>>> +++ b/drivers/nvme/host/core.c
> >>>> @@ -5010,10 +5010,19 @@ void nvme_quiesce_io_queues(struct nvme_ctrl *ctrl)
> >>>>  {
> >>>>  	if (!ctrl->tagset)
> >>>>  		return;
> >>>> -	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> >>>> -		blk_mq_quiesce_tagset(ctrl->tagset);
> >>>> -	else
> >>>> -		blk_mq_wait_quiesce_done(ctrl->tagset);
> >>>> +	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags)) {
> >>>> +		struct nvme_ns *ns;
> >>>> +		int srcu_idx;
> >>>> +
> >>>> +		srcu_idx = srcu_read_lock(&ctrl->srcu);
> >>>> +		list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> >>>> +				srcu_read_lock_held(&ctrl->srcu)) {
> >>>> +			if (!blk_queue_skip_tagset_quiesce(ns->queue))
> >>>> +				blk_mq_quiesce_queue_nowait(ns->queue);
> >>>> +		}
> >>>> +		srcu_read_unlock(&ctrl->srcu, srcu_idx);
> >>>> +	}
> >>>> +	blk_mq_wait_quiesce_done(ctrl->tagset);
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(nvme_quiesce_io_queues);
> >>>>
> >>>> Here we iterate through ctrl->namespaces instead of relying on tag_list,
> >>>> and so we don't need to acquire ->tag_list_lock.
> >>>
> >>> How can you make sure all NSs are covered in this way? RCU/SRCU can't
> >>> provide that kind of guarantee.
> >>>
> >> Why is that so? In fact, nvme_wait_freeze() also iterates through
> >> the same ctrl->namespaces to freeze the queues.
> > 
> > It depends on whether nvme error handling needs to cover a newly arrived NS;
> > suppose it doesn't care, then you can change to srcu and bypass
> > ->tag_list_lock.
> > 
> Yes, a newly arrived NS may not be live yet when we iterate through
> ctrl->namespaces, so we don't need to bother about it yet.
> 
> >>
> >>>>
> >>>>> And the same list is iterated in blk_mq_update_nr_hw_queues() too.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> So all queues should be frozen first before calling blk_mq_update_nr_hw_queues();
> >>>>>>> fortunately that is what nvme is doing.
> >>>>>>>
> >>>>>>>
> >>>>>>>> If yes, then it means that we should be able to grab ->elevator_lock
> >>>>>>>> before freezing the queue in __blk_mq_update_nr_hw_queues(), and so the
> >>>>>>>> locking order in each code path should be:
> >>>>>>>>
> >>>>>>>> __blk_mq_update_nr_hw_queues
> >>>>>>>>   ->elevator_lock
> >>>>>>>>   ->freeze_lock
> >>>>>>>
> >>>>>>> Now tagset->elevator_lock depends on set->tag_list_lock, and this way
> >>>>>>> just makes things worse. Why can't we disable elevator switching while
> >>>>>>> updating nr_hw_queues?
> >>>>>>>
> >>>>>> I couldn't quite understand this, as we already disable the elevator
> >>>>>> before updating the sw to hw queue mapping in __blk_mq_update_nr_hw_queues().
> >>>>>> Once the mapping is successful we switch the elevator back.
> >>>>>
> >>>>> Yes, but the user may still switch the elevator from none to another during
> >>>>> that period, right?
> >>>>>
> >>>> Yes correct, that's possible. So your suggestion was to disable elevator
> >>>> updates while we're running __blk_mq_update_nr_hw_queues()? That way the
> >>>> user couldn't update the elevator through sysfs (elv_iosched_store()) while
> >>>> we update nr_hw_queues? If so, then how would that help solve the lockdep
> >>>> splat?
> >>>
> >>> Then why do you think a per-set lock can solve the lockdep splat?
> >>>
> >>> __blk_mq_update_nr_hw_queues() is the only place where tagset-wide queues
> >>> are involved wrt. switching elevator. If elevator switching is not allowed
> >>> while __blk_mq_update_nr_hw_queues() is running, why do we need a per-set
> >>> lock?
> >>>
> >> Yes, if elevator switching is not allowed then we probably don't need a
> >> per-set lock. However, my question was: if we were to disallow elevator
> >> switching while __blk_mq_update_nr_hw_queues() is running, how would we
> >> implement it?
> > 
> > It can be done easily with tag_set->srcu.
> 
> OK, great if that's possible! But I'm not sure how it could be done in this
> case. I think both __blk_mq_update_nr_hw_queues() and elv_iosched_store()
> run in writer/updater context, so you may still need a lock? Can you
> please send across an (informal) patch with your idea?

Please see the attached patch, which treats elv_iosched_store() as a reader.

> > 
> >> Do we need to synchronize with ->tag_list_lock? Or in other words, would
> >> elv_iosched_store() now depend on ->tag_list_lock?
> > 
> > ->tag_list_lock isn't involved.
> > 
> >>
> >> On another note, if we choose to make ->elevator_lock per-set then
> >> our locking sequence in blk_mq_update_nr_hw_queues() would be,
> > 
> > There is also add/del disk vs. updating nr_hw_queues; do you want to
> > add the per-set lock in the add/del disk path too?
> 
> Ideally no, we don't need to acquire ->elevator_lock in this path.
> Please see below.

blk_mq_update_nr_hw_queues() can come at any time, and there is still a race
between blk_mq_update_nr_hw_queues() and add/del disk.

> 
> >>
> >> blk_mq_update_nr_hw_queues
> >>   -> tag_list_lock
> >>   -> elevator_lock
> >>   -> freeze_lock
> > 
> > Actually the freeze lock is already held for nvme before calling
> > blk_mq_update_nr_hw_queues(), and it is reasonable to assume the queue is
> > frozen while updating nr_hw_queues, so the above order may not match the
> > existing code.
> > 
> > Do we need to consider nvme or blk_mq_update_nr_hw_queues() now?
> 
> I think we should consider (maybe in a different patch) updating
> nvme_quiesce_io_queues() and nvme_unquiesce_io_queues() to remove
> their dependency on ->tag_list_lock.
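For illustration, the ordering proposed above can be sketched with plain
mutexes. This is a hypothetical userspace model only: the real kernel locks
are not plain mutexes, and the function names below are stand-ins for the two
paths discussed. The point is that every multi-lock path acquires its locks
consistently with one global order (each path takes a contiguous suffix or the
whole of it), which is what keeps lockdep quiet.

```c
#include <pthread.h>

/*
 * Hypothetical sketch of the lock order proposed in this thread:
 * tag_list_lock -> elevator_lock -> freeze_lock.  Not kernel code.
 */
static pthread_mutex_t tag_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t elevator_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t freeze_lock   = PTHREAD_MUTEX_INITIALIZER;

static int order[3], pos; /* records acquisition order for inspection */

/* blk_mq_update_nr_hw_queues()-like path: takes all three, in order. */
static void update_nr_hw_queues(void)
{
	pthread_mutex_lock(&tag_list_lock); order[pos++] = 1;
	pthread_mutex_lock(&elevator_lock); order[pos++] = 2;
	pthread_mutex_lock(&freeze_lock);   order[pos++] = 3;
	/* ... remap sw to hw queues ... */
	pthread_mutex_unlock(&freeze_lock);
	pthread_mutex_unlock(&elevator_lock);
	pthread_mutex_unlock(&tag_list_lock);
}

/*
 * elv_iosched_store()-like path: takes a suffix of the same order, so it
 * can never invert against the path above.
 */
static void switch_elevator(void)
{
	pthread_mutex_lock(&elevator_lock);
	pthread_mutex_lock(&freeze_lock);
	/* ... swap the scheduler ... */
	pthread_mutex_unlock(&freeze_lock);
	pthread_mutex_unlock(&elevator_lock);
}
```

As the reply below notes, this order is exactly what breaks once a driver such
as nvme has already frozen its queues before calling into the update path,
since the freeze side is then acquired first.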
If we need to take nvme into account, the above lock order doesn't work,
because nvme freezes queues before calling blk_mq_update_nr_hw_queues(), and
the elevator lock still depends on the freeze lock. If it needn't be
considered, the per-set lock becomes necessary.

> 
> >>
> >> elv_iosched_store
> >>   -> elevator_lock
> >>   -> freeze_lock
> > 
> > I understand that the per-set elevator_lock is just for avoiding the
> > nested elevator lock class acquire? If we needn't consider nvme or
> > blk_mq_update_nr_hw_queues(), this per-set lock may not be needed.
> > 
> > It is actually easy to sync elevator store vs. updating nr_hw_queues.
> > 
> >>
> >> So now ->freeze_lock should not depend on ->elevator_lock, and that shall
> >> help avoid a few of the recent lockdep splats reported with fs_reclaim.
> >> What do you think?
> > 
> > Yes, reordering ->freeze_lock and ->elevator_lock may avoid many
> > fs_reclaim-related splats.
> > 
> > However, in del_gendisk(), freeze_lock is still held before calling
> > elevator_exit() and blk_unregister_queue(), and that looks not easy to
> > reorder.
> 
> Yes, agreed. However, elevator_exit() called from del_gendisk(), or
> elv_unregister_queue() called from blk_unregister_queue(), runs after we
> unregister the queue. And if the queue has already been unregistered when
> invoking elevator_exit() or del_gendisk(), then ideally we don't need to
> acquire ->elevator_lock. The same is true for elevator_exit() called from
> add_disk_fwnode(). So IMO, we should update these paths to avoid acquiring
> ->elevator_lock.

As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is a
host-wide event, so either a lock or srcu sync is needed.

Thanks,
Ming