On Tue, Apr 15, 2025 at 06:00:47PM +0530, Nilay Shroff wrote: > > > On 4/14/25 6:52 AM, Ming Lei wrote: > > On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote: > >> > >> > >> On 4/10/25 7:00 PM, Ming Lei wrote: > >>> /* > >>> * Use the default elevator settings. If the chosen elevator initialization > >>> * fails, fall back to the "none" elevator (no elevator). > >>> */ > >>> -void elevator_init_mq(struct request_queue *q) > >>> +void elevator_set_default(struct request_queue *q) > >>> { > >>> - struct elevator_type *e; > >>> - unsigned int memflags; > >>> + struct elev_change_ctx ctx = { }; > >>> int err; > >>> > >>> - WARN_ON_ONCE(blk_queue_registered(q)); > >>> - > >>> - if (unlikely(q->elevator)) > >>> + if (!queue_is_mq(q)) > >>> return; > >>> > >>> - e = elevator_get_default(q); > >>> - if (!e) > >>> + ctx.name = use_default_elevator(q) ? "mq-deadline" : "none"; > >>> + if (!q->elevator && !strcmp(ctx.name, "none")) > >>> return; > >>> + err = elevator_change(q, &ctx); > >>> + if (err < 0) > >>> + pr_warn("\"%s\" set elevator failed %d, " > >>> + "falling back to \"none\"\n", ctx.name, err); > >>> +} > >>> > >> If we fail to set the evator to default (mq-deadline) while registering queue, > >> because nr_hw_queue update is simultaneously running then we may end up setting > >> the queue elevator to none and that's not correct. Isn't it? > > > > It still works with none. > > > > I think it isn't one big deal. And if it is really one issue in future, we can > > set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set > > default sched for us. > > > >> > >>> +void elevator_set_none(struct request_queue *q) > >>> +{ > >>> + struct elev_change_ctx ctx = { > >>> + .name = "none", > >>> + .uevent = 1, > >>> + }; > >>> + int err; > >>> > >>> - blk_mq_unfreeze_queue(q, memflags); > >>> + if (!queue_is_mq(q)) > >>> + return; > >>> > >>> - if (err) { > >>> - pr_warn("\"%s\" elevator initialization failed, " > >>> - "falling back to \"none\"\n", e->elevator_name); > >>> - } > >>> + if (!q->elevator) > >>> + return; > >>> > >>> - elevator_put(e); > >>> + err = elevator_change(q, &ctx); > >>> + if (err < 0) > >>> + pr_warn("%s: set none elevator failed %d\n", __func__, err); > >>> } > >>> > >> Here as well if we fail to disable/exit elevator while deleting disk > >> because nr_hw_queue update is simultaneously running then we may > >> leak elevator resource? > > > > When blk_mq_update_nr_hw_queues() observes that queue is dying, it > > forces to change elevator to none, so there isn't elevator leak issue. > > > Yes if we get into blk_mq_update_nr_hw_queues after dying flag is set. > But what if blk_mq_update_nr_hw_queues doesn't see dying flag and starts > running __elevator_change. However later we set dying flag from del_gendisk > and starts running elevator_set_none simultaneously on another cpu? > In this case elevator_set_none would fail to set the elevator to "none" as > blk_mq_update_nr_hw_queues is running on another cpu. Isn't it? > > >> > >>> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk, > >>> if (disk->major == BLOCK_EXT_MAJOR) > >>> blk_free_ext_minor(disk->first_minor); > >>> out_exit_elevator: > >>> - if (disk->queue->elevator) { > >>> - mutex_lock(&disk->queue->elevator_lock); > >>> - elevator_exit(disk->queue); > >>> - mutex_unlock(&disk->queue->elevator_lock); > >>> - } > >>> + elevator_set_none(disk->queue); > >> Same comment as above here as well but this is in add_disk code path. > > > > We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for > > !blk_queue_registered() > > > Here as well there's a thin race window possible assuming add_disk fails > after we registered queue. Assuming nr_hw_queue update starts running > and it sees queue is registered however on another cpu add_disk fails > just after registering queue. So in this case still it might be possible > that elevator_set_none might fail to set elevator to "none" just because > nr_hw_queue update is running on another cpu. What do you think? Yeah. It isn't hard to solve, but just don't want to make the whole implementation too complicated. Another way is to prevent add_disk & del_disk from happening during updating nr_hw_queues, and this way is reasonable too because both blk_mq debugfs & sysfs registering depends on nr_hw_queues. Meantime we can retry add_disk/del_disk until updating_nr_hw_queues are finished, and one waitqueue can be added, so the wait can be: add_disk(): while (true) { srcu_read_lock() if (set->is_updating_nr_hw_queus) { srcu_read_unlock(); goto wait; } __add_disk(); srcu_read_unlock() break; wait: wait_event(set->wq, !set->is_updating_nr_hw_queus); } Thanks, Ming