On 6/18/25 8:36 AM, Ming Lei wrote:
>> +int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> +		unsigned int nr_hw_queues,
>> +		struct elevator_tags ***elv_tags)
>
> We seldom see triple indirect pointers in linux kernel, :-)

Yeah, let's see if we can avoid it by using xarray as you suggested.

>
>> +{
>> +	unsigned long idx = 0, count = 0;
>> +	struct request_queue *q;
>> +	struct elevator_tags **tags;
>> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>> +
>> +	lockdep_assert_held_write(&set->update_nr_hwq_lock);
>> +	/*
>> +	 * Accessing q->elevator without holding q->elevator_lock is safe
>> +	 * because we're holding here set->update_nr_hwq_lock in the writer
>> +	 * context. So, scheduler update/switch code (which acquires the same
>> +	 * lock but in the reader context) can't run concurrently.
>> +	 */
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		if (q->elevator)
>> +			count++;
>> +	}
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	tags = kcalloc(count, sizeof(struct elevator_tags *), gfp);
>> +	if (!tags)
>> +		return -ENOMEM;
>> +
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		if (q->elevator) {
>> +			tags[idx] = __blk_mq_alloc_sched_tags(set, nr_hw_queues,
>> +					q->id);
>> +			if (!tags[idx])
>> +				goto out;
>> +
>> +			idx++;
>> +		}
>> +	}
>> +
>> +	*elv_tags = tags;
>> +	return count;
>> +out:
>> +	blk_mq_free_sched_tags(set, tags);
>> +	return -ENOMEM;
>> +}
>
> I believe lots of code can be saved if you take xarray to store the
> allocated 'struct elevator_tags', and use q->id as the key.
>

I think using xarray is a nice idea! I will rewrite this code using
xarray in the next revision.

> Also the handling for updating_nr_hw_queues has new batch alloc & free logic
> and should be done in standalone patch. Per my experience, bug is easier to
> hide in the big patch, which is more hard to review.
>

Sure, I'll split the patch further in the next revision.
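As a rough illustration of the xarray idea (a kernel-tree sketch only, not compilable standalone; the `elv_tags` xarray member on the tag set and the single-entry free helper `__blk_mq_free_sched_tag()` are assumptions about the next revision, not the posted patch), the triple pointer and the count/index bookkeeping could collapse into lookups keyed by q->id:

```c
/* Sketch: assumes 'struct xarray elv_tags' embedded in blk_mq_tag_set */
static int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
				   unsigned int nr_hw_queues)
{
	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
	struct request_queue *q;
	struct elevator_tags *et;

	lockdep_assert_held_write(&set->update_nr_hwq_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		if (!q->elevator)
			continue;
		et = __blk_mq_alloc_sched_tags(set, nr_hw_queues, q->id);
		if (!et)
			goto out_free;
		/* q->id is the key; no separate count or idx needed */
		if (xa_insert(&set->elv_tags, q->id, et, gfp)) {
			__blk_mq_free_sched_tag(et); /* hypothetical helper */
			goto out_free;
		}
	}
	return 0;
out_free:
	/* would walk the xarray with xa_for_each() and xa_erase() */
	blk_mq_free_sched_tags(set);
	return -ENOMEM;
}
```

The later lookup during the switch then becomes a plain `xa_load(&set->elv_tags, q->id)`, which is why the `***elv_tags` out-parameter and the caller-side array can disappear.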
>> @@ -676,11 +690,16 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>>  	blk_mq_cancel_work_sync(q);
>>  	mutex_lock(&q->elevator_lock);
>>  	if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
>> -		ret = elevator_switch(q, ctx);
>> +		ret = elevator_switch(q, ctx, tags);
>
> tags may be passed via 'ctx'.

Hmm, OK, we can do that. I will add @tags in @ctx.

>
>>  	mutex_unlock(&q->elevator_lock);
>>  	blk_mq_unfreeze_queue(q, memflags);
>>  	if (!ret)
>>  		ret = elevator_change_done(q, ctx);
>> +	/*
>> +	 * Free sched tags if it's allocated but we couldn't switch elevator.
>> +	 */
>> +	if (tags && !ctx->new)
>> +		__blk_mq_free_sched_tags(set, tags);
>
> Maybe you can let elevator_change_done() cover failure handling, and
> move the above two line into elevator_change_done().

Yes, precisely, I thought about that, but then realized that
elevator_change_done() may not always be called. For instance, if
elevator_switch() fails and hence returns a non-zero value,
elevator_change_done() would be skipped. But even in the
elevator_switch() failure case, we'd have allocated the sched tags and
so we still have to free them.

>
>> diff --git a/block/elevator.h b/block/elevator.h
>> index a4de5f9ad790..0b92121005cf 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -23,6 +23,17 @@ enum elv_merge {
>>  struct blk_mq_alloc_data;
>>  struct blk_mq_hw_ctx;
>>
>> +/* Holding context data for changing elevator */
>> +struct elv_change_ctx {
>> +	const char *name;
>> +	bool no_uevent;
>> +
>> +	/* for unregistering old elevator */
>> +	struct elevator_queue *old;
>> +	/* for registering new elevator */
>> +	struct elevator_queue *new;
>> +};
>> +
>
> 'elv_change_ctx' is only used in block/elevator.c, not sure why you
> move it to the header.
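Putting the two agreed points together (a sketch only; the `et` field name and the reworked flow are assumptions about the next revision, not code from the posted patch):

```c
/* Sketch: tags travel in the change context instead of as an argument */
struct elv_change_ctx {
	const char *name;
	bool no_uevent;
	struct elevator_queue *old;
	struct elevator_queue *new;
	struct elevator_tags *et;	/* assumed field name */
};

	/* ... inside elevator_change(), after the switch attempt ... */
	if (!ret)
		ret = elevator_change_done(q, ctx);
	/*
	 * elevator_change_done() is skipped when elevator_switch() fails,
	 * so freeing the pre-allocated tags has to stay here rather than
	 * move into elevator_change_done().
	 */
	if (ctx->et && !ctx->new)
		__blk_mq_free_sched_tags(set, ctx->et);
```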
Oh yes, in a previous version of this patch it was used outside of
elevator.c, so I moved it into a common header; later in this patch I
missed moving it back into elevator.c, as the only user of
'elv_change_ctx' now lives in elevator.c. I will fix this in the next
revision.

>>  struct elevator_mq_ops {
>>  	int (*init_sched)(struct request_queue *, struct elevator_queue *);
>>  	void (*exit_sched)(struct elevator_queue *);
>> @@ -107,12 +118,27 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq);
>>  void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
>>  struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
>>
>> +struct elevator_tags {
>> +	/* num. of hardware queues for which tags are allocated */
>> +	unsigned int nr_hw_queues;
>> +	/* depth used while allocating tags */
>> +	unsigned int nr_requests;
>> +	/* request queue id (q->id) used during elevator_tags_lookup() */
>> +	int id;
>> +	union {
>> +		/* tags shared across all queues */
>> +		struct blk_mq_tags *shared_tags;
>> +		/* array of blk_mq_tags pointers, one per hardware queue. */
>> +		struct blk_mq_tags **tags;
>> +	} u;
>> +};
>
> I feel it is simpler to just save shared tags in the 1st item
> of the array, then you can store 'struct blk_mq_tags' in flex array of
> 'struct elevator_tags', save one extra allocation & many failure handling
> code.
>

Yes, sounds good! I will send out another patchset with the above
suggested changes.

Thanks,
--Nilay