On 6/18/25 8:36 AM, Ming Lei wrote:
>> +int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> +		unsigned int nr_hw_queues,
>> +		struct elevator_tags ***elv_tags)
>
> We seldom see triple indirect pointers in linux kernel, :-)

Yeah, let's see if we can avoid it by using xarray as you suggested.

>
>> +{
>> +	unsigned long idx = 0, count = 0;
>> +	struct request_queue *q;
>> +	struct elevator_tags **tags;
>> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>> +
>> +	lockdep_assert_held_write(&set->update_nr_hwq_lock);
>> +	/*
>> +	 * Accessing q->elevator without holding q->elevator_lock is safe
>> +	 * because we're holding here set->update_nr_hwq_lock in the writer
>> +	 * context. So, scheduler update/switch code (which acquires the same
>> +	 * lock but in the reader context) can't run concurrently.
>> +	 */
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		if (q->elevator)
>> +			count++;
>> +	}
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	tags = kcalloc(count, sizeof(struct elevator_tags *), gfp);
>> +	if (!tags)
>> +		return -ENOMEM;
>> +
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		if (q->elevator) {
>> +			tags[idx] = __blk_mq_alloc_sched_tags(set, nr_hw_queues,
>> +					q->id);
>> +			if (!tags[idx])
>> +				goto out;
>> +
>> +			idx++;
>> +		}
>> +	}
>> +
>> +	*elv_tags = tags;
>> +	return count;
>> +out:
>> +	blk_mq_free_sched_tags(set, tags);
>> +	return -ENOMEM;
>> +}
>
> I believe lots of code can be saved if you take xarray to store the
> allocated 'struct elevator_tags', and use q->id as the key.
>

I think using xarray is a nice idea! I will rewrite this code using
xarray in the next revision.

> Also the handling for updating_nr_hw_queues has new batch alloc & free logic
> and should be done in standalone patch. Per my experience, bug is easier to
> hide in the big patch, which is more hard to review.
>

Sure, I'll split the patch further in the next revision.
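As a rough illustration of the xarray idea (a kernel-tree sketch only, not compilable standalone; the `elv_tags` xarray member on the tag set and the single-entry free helper `__blk_mq_free_sched_tag()` are assumptions about the next revision, not the posted patch), the triple pointer and the count/index bookkeeping could collapse into lookups keyed by q->id:

```c
/* Sketch: assumes 'struct xarray elv_tags' embedded in blk_mq_tag_set */
static int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
				   unsigned int nr_hw_queues)
{
	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
	struct request_queue *q;
	struct elevator_tags *et;

	lockdep_assert_held_write(&set->update_nr_hwq_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		if (!q->elevator)
			continue;
		et = __blk_mq_alloc_sched_tags(set, nr_hw_queues, q->id);
		if (!et)
			goto out_free;
		/* q->id is the key; no separate count or idx needed */
		if (xa_insert(&set->elv_tags, q->id, et, gfp)) {
			__blk_mq_free_sched_tag(et); /* hypothetical helper */
			goto out_free;
		}
	}
	return 0;
out_free:
	/* would walk the xarray with xa_for_each() and xa_erase() */
	blk_mq_free_sched_tags(set);
	return -ENOMEM;
}
```

The later lookup during the switch then becomes a plain `xa_load(&set->elv_tags, q->id)`, which is why the `***elv_tags` out-parameter and the caller-side array can disappear.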
>> @@ -676,11 +690,16 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>>  	blk_mq_cancel_work_sync(q);
>>  	mutex_lock(&q->elevator_lock);
>>  	if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
>> -		ret = elevator_switch(q, ctx);
>> +		ret = elevator_switch(q, ctx, tags);
>
> tags may be passed via 'ctx'.

Hmm, OK, we can do that. I will add @tags in @ctx.

>
>>  	mutex_unlock(&q->elevator_lock);
>>  	blk_mq_unfreeze_queue(q, memflags);
>>  	if (!ret)
>>  		ret = elevator_change_done(q, ctx);
>> +	/*
>> +	 * Free sched tags if it's allocated but we couldn't switch elevator.
>> +	 */
>> +	if (tags && !ctx->new)
>> +		__blk_mq_free_sched_tags(set, tags);
>
> Maybe you can let elevator_change_done() cover failure handling, and
> move the above two line into elevator_change_done().

Yes, precisely, I thought about that, but then realized that
elevator_change_done() may not always be called. For instance, if
elevator_switch() fails and hence returns a non-zero value,
elevator_change_done() would be skipped. But even in the
elevator_switch() failure case, we'd have allocated the sched tags and
so we still have to free them.

>
>> diff --git a/block/elevator.h b/block/elevator.h
>> index a4de5f9ad790..0b92121005cf 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -23,6 +23,17 @@ enum elv_merge {
>>  struct blk_mq_alloc_data;
>>  struct blk_mq_hw_ctx;
>>
>> +/* Holding context data for changing elevator */
>> +struct elv_change_ctx {
>> +	const char *name;
>> +	bool no_uevent;
>> +
>> +	/* for unregistering old elevator */
>> +	struct elevator_queue *old;
>> +	/* for registering new elevator */
>> +	struct elevator_queue *new;
>> +};
>> +
>
> 'elv_change_ctx' is only used in block/elevator.c, not sure why you
> move it to the header.
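Putting the two agreed points together (a sketch only; the `et` field name and the reworked flow are assumptions about the next revision, not code from the posted patch):

```c
/* Sketch: tags travel in the change context instead of as an argument */
struct elv_change_ctx {
	const char *name;
	bool no_uevent;
	struct elevator_queue *old;
	struct elevator_queue *new;
	struct elevator_tags *et;	/* assumed field name */
};

	/* ... inside elevator_change(), after the switch attempt ... */
	if (!ret)
		ret = elevator_change_done(q, ctx);
	/*
	 * elevator_change_done() is skipped when elevator_switch() fails,
	 * so freeing the pre-allocated tags has to stay here rather than
	 * move into elevator_change_done().
	 */
	if (ctx->et && !ctx->new)
		__blk_mq_free_sched_tags(set, ctx->et);
```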
Oh yes, in a previous version of this patch it was used outside of
elevator.c, so I moved it into a common header; later in this patch I
missed moving it back into elevator.c, as the only user of
'elv_change_ctx' now lives in elevator.c. I will fix this in the next
revision.

>>  struct elevator_mq_ops {
>>  	int (*init_sched)(struct request_queue *, struct elevator_queue *);
>>  	void (*exit_sched)(struct elevator_queue *);
>> @@ -107,12 +118,27 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq);
>>  void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
>>  struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
>>
>> +struct elevator_tags {
>> +	/* num. of hardware queues for which tags are allocated */
>> +	unsigned int nr_hw_queues;
>> +	/* depth used while allocating tags */
>> +	unsigned int nr_requests;
>> +	/* request queue id (q->id) used during elevator_tags_lookup() */
>> +	int id;
>> +	union {
>> +		/* tags shared across all queues */
>> +		struct blk_mq_tags *shared_tags;
>> +		/* array of blk_mq_tags pointers, one per hardware queue. */
>> +		struct blk_mq_tags **tags;
>> +	} u;
>> +};
>
> I feel it is simpler to just save shared tags in the 1st item
> of the array, then you can store 'struct blk_mq_tags' in flex array of
> 'struct elevator_tags', save one extra allocation & many failure handling
> code.
>

Yes, sounds good! I will send out another patchset with the above
suggested changes.

Thanks,
--Nilay