Re: [PATCHv2] block: fix lock dependency between percpu alloc lock and elevator lock

Ming Lei <ming.lei@xxxxxxxxxx> · Mon, 9 Jun 2025 10:13:56 +0800

On Thu, Jun 05, 2025 at 11:22:17AM +0530, Nilay Shroff wrote:
> 
> 
> On 6/3/25 2:07 PM, Ming Lei wrote:
> > On Wed, May 28, 2025 at 06:03:58PM +0530, Nilay Shroff wrote:
> 
> >> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> >> index 0cb1e9873aab..51d406da4abf 100644
> >> --- a/block/bfq-iosched.c
> >> +++ b/block/bfq-iosched.c
> >> @@ -7232,17 +7232,12 @@ static void bfq_init_root_group(struct bfq_group *root_group,
> >>  	root_group->sched_data.bfq_class_idle_last_service = jiffies;
> >>  }
> >>  
> >> -static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
> >> +static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
> >>  {
> >>  	struct bfq_data *bfqd;
> >> -	struct elevator_queue *eq;
> >>  	unsigned int i;
> >>  	struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
> >>  
> >> -	eq = elevator_alloc(q, e);
> >> -	if (!eq)
> >> -		return -ENOMEM;
> >> -
> > 
> > Please make the above elevator interface change be one standalone patch if possible,
> > which may help review much.
> 
> Yeah you're right, in fact I considered to make elevator interface changes into 
> a separate patch however the changes are so much tightly coupled with block layer
> change that I couldn't make it independent. As you may see here I replaced second
> argument of ->init_sched from "struct elevator_type *" to "struct elevator_queue *".
> And as you might have noticed we allocate the "struct elevator_queue *" very early
> while switching the elevator and that's passed down to this ->init_sched function
> through elv_change_ctx. So making elevator interface change a standalone patch 
> is not possible.

You can make a patch to move elevator_queue allocation into
blk_mq_init_sched() first, then pass it to all ->init_sched()
implementation first. And you can avoid to pass 'elevator_queue *'
from 'struct elv_change_ctx'.

> 
> >>  
> >> +int blk_mq_alloc_elevator_and_sched_tags(struct request_queue *q,
> >> +		struct elv_change_ctx *ctx)
> >> +{
> >> +	unsigned long i;
> >> +	struct elevator_queue *elevator;
> >> +	struct elevator_type *e;
> >> +	struct blk_mq_tag_set *set = q->tag_set;
> >> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> >> +
> >> +	if (strncmp(ctx->new.name, "none", 4)) {
> > 
> > You can return early if new name is "none", then all indent in the code
> > block can be avoided.
> Yeah agreed, will do it in the next patch.
> 
> > 
> >> +
> >> +		e = elevator_find_get(ctx->new.name);
> >> +		if (!e)
> >> +			return -EINVAL;
> >> +
> >> +		elevator = ctx->new.elevator = elevator_alloc(q, e);
> > 
> > You needn't to allocate elevator queue here, then the big elv_change_ctx change
> > may be avoided by passing sched tags directly.
> 
> Hmm, okay so you recommend keeping allocation of elevator queue in ->init_sched
> method as it's today?

I meant you can move elevator queue allocation into blk_mq_init_sched(), then
you just need to pass allocated sched tag via `elv_change_ctx` or function
parameter.

> If so then yes it could be done but then as I see the elvator 
> queue is being freed in elevator_change_done() after we release ->elevator_lock
> So I thought it's also quite possible in theory to allocate elevator queue outside
> of ->elevator_lock. Moreover, we shall not require holding ->elevator_lock or 
> ->freeze_lock while allocating elevator queue, and thus probably avoid holding
> block layer locks while perfroming dynamic allocation, isn't it?
> 
> > 
> > Also it may be fragile to allocate elevator queue here without holding elevator
> > lock. You may have to document this way is safe by allocating elevator
> > queue with specified elevator type lockless.
> > 
> Yes I'd document it. And looking through this code path it appears safe because
> we first get reference to the elevator type before assiging elevator type to
> the elevator queue in elevator_alloc(). Do you see any potential issue with 
> this, if we were to allocate elevator queue lockless here?

The correctness depends that our current implementation disables 
updating nr_hw_queues by holding set->update_nr_hwq_lock, otherwise you
have to double check nr_hw_queues after grabbing elevator_lock, so this
point might need to document.

> 
> >> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> >> index 1326526bb733..6c0f1936b81c 100644
> >> --- a/block/blk-mq-sched.h
> >> +++ b/block/blk-mq-sched.h
> >> @@ -7,6 +7,26 @@
> >>  
> >>  #define MAX_SCHED_RQ (16 * BLKDEV_DEFAULT_RQ)
> >>  
> >> +/* Holding context data for changing elevator */
> >> +struct elv_change_ctx {
> >> +	/* for unregistering/freeing old elevator */
> >> +	struct {
> >> +		struct elevator_queue *elevator;
> >> +	} old;
> >> +
> >> +	/* for registering/allocating new elevator */
> >> +	struct {
> >> +		const char *name;
> >> +		bool no_uevent;
> >> +		unsigned long nr_requests;
> >> +		struct elevator_queue *elevator;
> >> +		int inited;
> >> +	} new;
> >> +
> >> +	/* elevator switch status */
> >> +	int status;
> >> +};
> >> +
> > 
> > As I mentioned, 'elv_change_ctx' becomes more complicated, which may be
> > avoided.
> Yes I agreed, however as I mentioned above if we choose to allocate 
> elevator queue before we acquire ->elevator_lock or ->freeze_lock

Allocating elevator queue is just plain kmalloc(), which can be done
under ->elevator_lock or ->freeze_lock, since we have called
memalloc_noio_save().

> then we may need to keep this updated elv_change_ctx. So it depends
> on whether we choose to allocate elevator queue befor acquiring locks or
> keeping allocation under ->init_sched.
> 
> > Queue is still frozen, so the dependency between freeze lock and
> > the global percpu allocation lock isn't cut completely.
> > 
> Yes correct and so in the commit message I mentioned that this depedency is
> currenlty cut only while calling elevator switch from elv_iosched_store context
> through sysfs. We do need to still cut it in the conetxt of elevator switch 
> being called from blk_mq_update_nr_hw_queues context or called while unregistering 
> queue from del_gendisk. And this I thought should be handled in a seperate patch.
> Anyways, I would explicitly call this out in the commit message.

OK, looks fine, I'd suggest to address them all in single patchset, because
it is same problem, and helps us to review with single context.

Thanks,
Ming