Re: [PATCHv2] block: fix lock dependency between percpu alloc lock and elevator lock

Nilay Shroff <nilay@xxxxxxxxxxxxx> · Wed, 11 Jun 2025 11:11:20 +0530

On 6/9/25 7:43 AM, Ming Lei wrote:
> On Thu, Jun 05, 2025 at 11:22:17AM +0530, Nilay Shroff wrote:
>>
>>
>> On 6/3/25 2:07 PM, Ming Lei wrote:
>>> On Wed, May 28, 2025 at 06:03:58PM +0530, Nilay Shroff wrote:
>>
>>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>>> index 0cb1e9873aab..51d406da4abf 100644
>>>> --- a/block/bfq-iosched.c
>>>> +++ b/block/bfq-iosched.c
>>>> @@ -7232,17 +7232,12 @@ static void bfq_init_root_group(struct bfq_group *root_group,
>>>>  	root_group->sched_data.bfq_class_idle_last_service = jiffies;
>>>>  }
>>>>  
>>>> -static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
>>>> +static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
>>>>  {
>>>>  	struct bfq_data *bfqd;
>>>> -	struct elevator_queue *eq;
>>>>  	unsigned int i;
>>>>  	struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
>>>>  
>>>> -	eq = elevator_alloc(q, e);
>>>> -	if (!eq)
>>>> -		return -ENOMEM;
>>>> -
>>>
>>> Please make the above elevator interface change be one standalone patch if possible,
>>> which may help review much.
>>
>> Yeah you're right, in fact I considered to make elevator interface changes into 
>> a separate patch however the changes are so much tightly coupled with block layer
>> change that I couldn't make it independent. As you may see here I replaced second
>> argument of ->init_sched from "struct elevator_type *" to "struct elevator_queue *".
>> And as you might have noticed we allocate the "struct elevator_queue *" very early
>> while switching the elevator and that's passed down to this ->init_sched function
>> through elv_change_ctx. So making elevator interface change a standalone patch 
>> is not possible.
> 
> You can make a patch to move elevator_queue allocation into
> blk_mq_init_sched() first, then pass it to all ->init_sched()
> implementation first. And you can avoid to pass 'elevator_queue *'
> from 'struct elv_change_ctx'.
> 
Hmm okay, so we'd allocate elevator_queue, while we're holding ->elevator_lock 
and ->freeze_lock, from within blk_mq_init_sched(). As you mentioned it's safe
because elevator_queue dynamic allocation uses a plain kmalloc, I'd change this
in the next patch.

>>
>>>>  
>>>> +int blk_mq_alloc_elevator_and_sched_tags(struct request_queue *q,
>>>> +		struct elv_change_ctx *ctx)
>>>> +{
>>>> +	unsigned long i;
>>>> +	struct elevator_queue *elevator;
>>>> +	struct elevator_type *e;
>>>> +	struct blk_mq_tag_set *set = q->tag_set;
>>>> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>>>> +
>>>> +	if (strncmp(ctx->new.name, "none", 4)) {
>>>
>>> You can return early if new name is "none", then all indent in the code
>>> block can be avoided.
>> Yeah agreed, will do it in the next patch.
>>
>>>
>>>> +
>>>> +		e = elevator_find_get(ctx->new.name);
>>>> +		if (!e)
>>>> +			return -EINVAL;
>>>> +
>>>> +		elevator = ctx->new.elevator = elevator_alloc(q, e);
>>>
>>> You needn't to allocate elevator queue here, then the big elv_change_ctx change
>>> may be avoided by passing sched tags directly.
>>
>> Hmm, okay so you recommend keeping allocation of elevator queue in ->init_sched
>> method as it's today?
> 
> I meant you can move elevator queue allocation into blk_mq_init_sched(), then
> you just need to pass allocated sched tag via `elv_change_ctx` or function
> parameter.

Yeah got it! Will do it in the next patch.

> 
>> If so then yes it could be done but then as I see the elvator 
>> queue is being freed in elevator_change_done() after we release ->elevator_lock
>> So I thought it's also quite possible in theory to allocate elevator queue outside
>> of ->elevator_lock. Moreover, we shall not require holding ->elevator_lock or 
>> ->freeze_lock while allocating elevator queue, and thus probably avoid holding
>> block layer locks while perfroming dynamic allocation, isn't it?
>>
>>>
>>> Also it may be fragile to allocate elevator queue here without holding elevator
>>> lock. You may have to document this way is safe by allocating elevator
>>> queue with specified elevator type lockless.
>>>
>> Yes I'd document it. And looking through this code path it appears safe because
>> we first get reference to the elevator type before assiging elevator type to
>> the elevator queue in elevator_alloc(). Do you see any potential issue with 
>> this, if we were to allocate elevator queue lockless here?
> 
> The correctness depends that our current implementation disables 
> updating nr_hw_queues by holding set->update_nr_hwq_lock, otherwise you
> have to double check nr_hw_queues after grabbing elevator_lock, so this
> point might need to document.
> 
Yep! In fact you might have seen I commented about it in elv_update_nr_hw_queues() 
just before we call blk_mq_alloc_elevator_and_sched_tags(). But now anyways, as
discussed above, we're moving elevator_queue allocation under blk_mq_init_sched(),
we may not need to document this.

>>
>>>> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
>>>> index 1326526bb733..6c0f1936b81c 100644
>>>> --- a/block/blk-mq-sched.h
>>>> +++ b/block/blk-mq-sched.h
>>>> @@ -7,6 +7,26 @@
>>>>  
>>>>  #define MAX_SCHED_RQ (16 * BLKDEV_DEFAULT_RQ)
>>>>  
>>>> +/* Holding context data for changing elevator */
>>>> +struct elv_change_ctx {
>>>> +	/* for unregistering/freeing old elevator */
>>>> +	struct {
>>>> +		struct elevator_queue *elevator;
>>>> +	} old;
>>>> +
>>>> +	/* for registering/allocating new elevator */
>>>> +	struct {
>>>> +		const char *name;
>>>> +		bool no_uevent;
>>>> +		unsigned long nr_requests;
>>>> +		struct elevator_queue *elevator;
>>>> +		int inited;
>>>> +	} new;
>>>> +
>>>> +	/* elevator switch status */
>>>> +	int status;
>>>> +};
>>>> +
>>>
>>> As I mentioned, 'elv_change_ctx' becomes more complicated, which may be
>>> avoided.
>> Yes I agreed, however as I mentioned above if we choose to allocate 
>> elevator queue before we acquire ->elevator_lock or ->freeze_lock
> 
> Allocating elevator queue is just plain kmalloc(), which can be done
> under ->elevator_lock or ->freeze_lock, since we have called
> memalloc_noio_save().
> 
Yes agreed!

>> then we may need to keep this updated elv_change_ctx. So it depends
>> on whether we choose to allocate elevator queue befor acquiring locks or
>> keeping allocation under ->init_sched.
>>
>>> Queue is still frozen, so the dependency between freeze lock and
>>> the global percpu allocation lock isn't cut completely.
>>>
>> Yes correct and so in the commit message I mentioned that this depedency is
>> currenlty cut only while calling elevator switch from elv_iosched_store context
>> through sysfs. We do need to still cut it in the conetxt of elevator switch 
>> being called from blk_mq_update_nr_hw_queues context or called while unregistering 
>> queue from del_gendisk. And this I thought should be handled in a seperate patch.
>> Anyways, I would explicitly call this out in the commit message.
> 
> OK, looks fine, I'd suggest to address them all in single patchset, because
> it is same problem, and helps us to review with single context.
> 
Alright, I'd make a patchset and handle all cases.

Thanks,
--Nilay