Re: [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock

Nilay Shroff <nilay@xxxxxxxxxxxxx> · Mon, 23 Jun 2025 15:03:27 +0530



On 6/23/25 11:40 AM, Christoph Hellwig wrote:
> On Mon, Jun 16, 2025 at 11:02:26PM +0530, Nilay Shroff wrote:
>> +static void blk_mq_init_sched_tags(struct request_queue *q,
>> +				   struct blk_mq_hw_ctx *hctx,
>> +				   unsigned int hctx_idx,
>> +				   struct elevator_queue *eq)
>>  {
>>  	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
>>  		hctx->sched_tags = q->sched_shared_tags;
>> +		return;
>>  	}
>>  
>> +	hctx->sched_tags = eq->tags->u.tags[hctx_idx];
>>  }
> 
> Given how trivial this function now is, please inline it in the only
> caller.  That should also allow moving the blk_mq_is_shared_tags
> shared out of the loop over all hw contexts, and merge it with the
> check right next to it.
> 
okay, will do it the next patchset.

>> +static void blk_mq_init_sched_shared_tags(struct request_queue *queue,
>> +		struct elevator_queue *eq)
>>  {
>> +	queue->sched_shared_tags = eq->tags->u.shared_tags;
>>  	blk_mq_tag_update_sched_shared_tags(queue);
>>  }
> 
> This helper can also just be open coded in the caller now.
Yes agreed...

> 
>> +	if (blk_mq_is_shared_tags(set->flags)) {
>> +		if (tags->u.shared_tags) {
>> +			blk_mq_free_rqs(set, tags->u.shared_tags,
>> +					BLK_MQ_NO_HCTX_IDX);
>> +			blk_mq_free_rq_map(tags->u.shared_tags);
>> +		}
>> +		goto out;
>> +	}
>> +
>> +	if (!tags->u.tags)
>> +		goto out;
>> +
>> +	for (i = 0; i < tags->nr_hw_queues; i++) {
>> +		if (tags->u.tags[i]) {
>> +			blk_mq_free_rqs(set, tags->u.tags[i], i);
>> +			blk_mq_free_rq_map(tags->u.tags[i]);
>> +		}
>> +	}
> 
> Maybe restructucture this a bit:
> 
> 	if (blk_mq_is_shared_tags(set->flags)) {
> 		..
> 	} else if (tags->u.tags) {
> 	}
> 
> 	kfree(tags);
> 
> to have a simpler flow and remove the need for the "goto out"?
> 
Okay but as you know I am rewriting this logic using Xarray as
Ming pointed in last email. So this code will be restructured and
I will ensure that your above comment will be addressed in new 
logic.

>> +	tags = kcalloc(1, sizeof(struct elevator_tags), gfp);
> 
> This can use plain kzalloc.
> 
>> +	if (blk_mq_is_shared_tags(set->flags)) {
>> +
>> +		tags->u.shared_tags = blk_mq_alloc_map_and_rqs(set,
> 
> The empty line above is a bit odd.
Yes I would address this in next patchset.

> 
>> +					BLK_MQ_NO_HCTX_IDX,
>> +					MAX_SCHED_RQ);
>> +		if (!tags->u.shared_tags)
>> +			goto out;
>> +
>> +		return tags;
>> +	}
>> +
>> +	tags->u.tags = kcalloc(nr_hw_queues, sizeof(struct blk_mq_tags *), gfp);
>> +	if (!tags->u.tags)
>> +		goto out;
>> +
>> +	tags->nr_hw_queues = nr_hw_queues;
>> +	for (i = 0; i < nr_hw_queues; i++) {
>> +		tags->u.tags[i] = blk_mq_alloc_map_and_rqs(set, i,
>> +				tags->nr_requests);
>> +		if (!tags->u.tags[i])
>> +			goto out;
>> +	}
>> +
>> +	return tags;
>> +
>> +out:
>> +	__blk_mq_free_sched_tags(set, tags);
> 
> Is __blk_mq_free_sched_tags really the right thing here vs just unwinding
> what this function did?
> 
Hmm, __blk_mq_free_sched_tags actually releases everything which is allocated
in this function (so there won't be any leak), however, it really _dose_not_ 
free resources in the reverse order. It just loops over starting from the first 
nr_hw_queue index and releases whatsoever allocated. But I think you're right, 
we could probably have more structured way of unwinding stuff here instead 
of calling __blk_mq_free_sched_tags. I will handle this in the next patch.

>> +	/*
>> +	 * Accessing q->elevator without holding q->elevator_lock is safe
>> +	 * because we're holding here set->update_nr_hwq_lock in the writer
>> +	 * context. So, scheduler update/switch code (which acquires the same
>> +	 * lock but in the reader context) can't run concurrently.
>> +	 */
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		if (q->elevator)
>> +			count++;
>> +	}
> 
> Maybe add a helper for this code and the comment?
Yeah but as I mentioned above with Xarray code being added, we may 
not need the above loop. So lets see, if needed I will add a macro
and add proper comment in next patch.

> 
>> -	lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
>> +	lockdep_assert_held(&set->update_nr_hwq_lock);
>> +
>> +	if (strncmp(ctx->name, "none", 4)) {
> 
> This is a check for not having an elevator so far, right?  Wouldn't
> !q->elevator be the more obvious check for that?  Or am I missing
> something why that's not safe here?
> 
This code runs in the context of an elevator switch, not as part of an
nr_hw_queues update. Hence, at this point, q->elevator has not yet been
updated to the new elevator we’re switching to, so accessing q->elevator
here would be incorrect. Since we've already stored the name of the target
elevator in ctx->name, we use that instead of referencing q->elevator here.

>> diff --git a/block/elevator.h b/block/elevator.h
>> index a4de5f9ad790..0b92121005cf 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -23,6 +23,17 @@ enum elv_merge {
>>  struct blk_mq_alloc_data;
>>  struct blk_mq_hw_ctx;
>>  
>> +/* Holding context data for changing elevator */
>> +struct elv_change_ctx {
>> +	const char *name;
>> +	bool no_uevent;
>> +
>> +	/* for unregistering old elevator */
>> +	struct elevator_queue *old;
>> +	/* for registering new elevator */
>> +	struct elevator_queue *new;
>> +};
> 
> No need to move this, it is still only used in elevator.c.
> 
Yes agreed. I'm moved it back to its original place in 
elevator.c

>>  extern struct elevator_queue *elevator_alloc(struct request_queue *,
>> -					struct elevator_type *);
>> +		struct elevator_type *, struct elevator_tags *);
> 
> Drop the extern while you're at it.
> 
Yeah sure.

Thanks,
--Nilay