On 4/10/25 7:40 AM, Ming Lei wrote:
>
> If new NS needn't be considered, the issue is easier to deal with when
> updating nr_hw_queues, such as:
>
> - disable the scheduler in the 1st iteration over tag_list
>
> - update nr_hw_queues in the 2nd iteration over tag_list
>
> - restore the scheduler in the 3rd iteration over tag_list
>
> ->tag_list_lock is acquired & released in each iteration.
>
> Then per-set lock isn't needed any more.
>
Hmm still thinking...
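Just to make sure I'm reading the three-pass idea correctly, is it roughly
something like the below (untested sketch; the three per-queue helpers are
made-up placeholders, not existing functions)?

static void blk_mq_update_nr_hw_queues_3pass(struct blk_mq_tag_set *set,
					     int nr_hw_queues)
{
	struct request_queue *q;

	/* pass 1: detach the schedulers so pass 2 runs without any elevator */
	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		disable_and_remember_elevator(q);	/* placeholder */
	mutex_unlock(&set->tag_list_lock);

	/* pass 2: freeze and resize the hw queues, no ->elevator_lock needed */
	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		resize_hw_queues(set, q, nr_hw_queues);	/* placeholder */
	mutex_unlock(&set->tag_list_lock);

	/* pass 3: switch the remembered schedulers back */
	mutex_lock(&set->tag_list_lock);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		restore_elevator(q);			/* placeholder */
	mutex_unlock(&set->tag_list_lock);
}

If that's the idea, then ->tag_list_lock is indeed acquired and released per
pass and the per-set lock wouldn't be needed.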
>>>
>>> Please see the attached patch which treats elv_iosched_store() as a reader.
>>>
>> I looked through the patch and it looks good. However, we still have an issue
>> in code paths where we acquire ->elevator_lock without first freezing
>> the queue. In this case, if we try to allocate memory using GFP_KERNEL,
>> then fs_reclaim would still trigger a lockdep splat. For this case
>> the lock order would be,
>>
>> elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)
>
> That is why I say ->elevator_lock is used too much; in particular it is
> relied on for dealing with kobject & sysfs, which shouldn't be needed in this way.
>
>>
>> In fact, I tested your patch and ran into the above splat. I've
>> attached the lockdep splat for your reference. So IMO, we should instead
>> try to make the locking order such that ->freeze_lock never depends on
>> ->elevator_lock. One way to make that possible seems to be making
>> ->elevator_lock per-set.
>
> The attached patch is just for avoiding the race between add/del disk vs.
> updating nr_hw_queues, and it should also have covered the race between
> adding/exiting a request queue vs. updating nr_hw_queues.
>
Okay understood.
> If re-order can be done, that is definitely good, but...
Yeah, so I tried re-ordering the locks so that we first grab q->elevator_lock
and then ->freeze_lock. As we know, there's a challenge with re-arranging
the lock order in __blk_mq_update_nr_hw_queues, but I managed to rectify
it: I added one more tag_list iteration where we first acquire
->elevator_lock, and then in the next tag_list iteration we acquire
->freeze_lock. I have also updated this lock order at the other call sites.
Then, as we have already discussed, there's another challenge re-arranging
the lock order in del_gendisk; however, I managed to mitigate that by moving
elevator_exit and elv_unregister_queue (from blk_unregister_queue) to just
after we delete the queue tag set (in other words, when we remove the queue
from the tag list) in del_gendisk. That means we can now safely invoke
elv_unregister_queue and elevator_exit from del_gendisk without needing to
acquire ->elevator_lock.
For reference, I attached an (informal) patch. I know this patch would not
fix all splats, but at least we would stop observing the splat related to
the ->freeze_lock dependency on ->elevator_lock.
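In short, with the attached patch the locking in __blk_mq_update_nr_hw_queues()
boils down to the following (a trimmed-down sketch of the diff below;
queue-map update, fallback and tag handling omitted):

	struct request_queue *q;
	unsigned int memflags;

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		mutex_lock(&q->elevator_lock);		/* 1) elevator lock first */

	memflags = memalloc_noio_save();
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue_nomemsave(q);	/* 2) then freeze */

	/* ... blk_mq_update_queue_map() / blk_mq_realloc_hw_ctxs() ... */

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue_nomemrestore(q);	/* 3) unfreeze ... */
	memalloc_noio_restore(memflags);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		mutex_unlock(&q->elevator_lock);	/* 4) ... elevator lock last */

So ->freeze_lock always nests inside ->elevator_lock, never the other way
around.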
>
>>
>>>>>
>>>>> Actually the freeze lock is already held for nvme before calling
>>>>> blk_mq_update_nr_hw_queues, and it is reasonable to assume the queue is
>>>>> frozen for updating nr_hw_queues, so the above order may not match
>>>>> the existing code.
>>>>>
>>>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
>>>>>
>>>> I think we should consider (maybe in a different patch) updating
>>>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and removing
>>>> their dependency on ->tag_list_lock.
>>>
>>> If we need to take nvme into account, the above lock order doesn't work,
>>> because nvme freezes the queue before calling blk_mq_update_nr_hw_queues(),
>>> and the elevator lock still depends on the freeze lock.
>>>
>>> If it needn't be considered, the per-set lock becomes necessary.
>> IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues,
>> we should also update the nvme pci, rdma and tcp drivers so that those
>> drivers neither freeze the queue prior to invoking blk_mq_update_nr_hw_queues
>> nor unfreeze it after blk_mq_update_nr_hw_queues returns. I see that
>> nvme loop and fc don't freeze the queue prior to invoking blk_mq_update_nr_hw_queues.
>
> This way you cause a new deadlock or new trouble if you rely on the freeze in
> blk_mq_update_nr_hw_queues():
>
> ->tag_list_lock
>     freeze_lock
>
> If a timeout or io failure happens during the above freeze_lock, the timeout
> handler cannot move on because a new blk_mq_update_nr_hw_queues() can't
> grab the lock.
>
> Either it deadlocks or the device has to be removed.
>
With this newly attached patch, do you still foresee this issue? Yes, this patch
doesn't change anything in the nvme driver, but later we may update the nvme
driver so that it doesn't require freezing the queue before invoking
blk_mq_update_nr_hw_queues. I think this is required so that we follow the same
lock order in all call paths wrt ->elevator_lock and ->freeze_lock.
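To illustrate the pattern I have in mind (simplified; this is not the exact
nvme call chain, just the shape of it):

	/* today (simplified): the driver freezes the queues first ... */
	nvme_start_freeze(ctrl);
	nvme_wait_freeze(ctrl);
	/* ... so ->elevator_lock ends up taken under the freeze here */
	blk_mq_update_nr_hw_queues(ctrl->tagset, nr);
	nvme_unfreeze(ctrl);

	/* proposed: let blk_mq_update_nr_hw_queues() do its own freezing, so
	 * every path takes ->elevator_lock before ->freeze_lock
	 */
	blk_mq_update_nr_hw_queues(ctrl->tagset, nr);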
>>> As I mentioned, blk_mq_update_nr_hw_queues() can still come, which is a
>>> host-wide event, so either a lock or srcu sync is needed.
>> Yes agreed, I see that blk_mq_update_nr_hw_queues can run in parallel
>> with del_gendisk or blk_unregister_queue.
>
> Then the per-set lock is required in both add/del disk, so how do we re-order
> freeze_lock & elevator_lock in del_gendisk()?
>
> And there is the same risk as the one in blk_mq_update_nr_hw_queues().
>
Yes, please see above where I explained how we could potentially avoid the lock
order issue in del_gendisk.
>
> Another way is to make sure that ->elevator_lock isn't required for dealing
> with the kobject/debugfs things, which needs refactoring of the elevator code.
>
> For example, ->elevator_lock is grabbed in the debugfs handler, and removing
> sched debugfs actually needs to drain readers, which is a deadlock too.
>
I do agree. But I think let's first focus on cutting the dependency of
->freeze_lock on ->elevator_lock. Once we get past that, we can address the
others.
Commit ffa1e7ada456 ("block: Make request_queue lockdep splats show up
earlier") has now opened up a Pandora's box of lockdep splats :)
Anyway, it's good that we can now catch these issues early on. In general,
I feel this change will now surface early any case where ->freeze_lock
depends on other locks.
Thanks,
--Nilay
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5905f277057b..3a4107f3550a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -845,9 +845,9 @@ unsigned long __must_check blkg_conf_open_bdev_frozen(struct blkg_conf_ctx *ctx)
*/
mutex_unlock(&ctx->bdev->bd_queue->rq_qos_mutex);
- memflags = blk_mq_freeze_queue(ctx->bdev->bd_queue);
mutex_lock(&ctx->bdev->bd_queue->elevator_lock);
mutex_lock(&ctx->bdev->bd_queue->rq_qos_mutex);
+ memflags = blk_mq_freeze_queue(ctx->bdev->bd_queue);
return memflags;
}
@@ -1017,9 +1017,9 @@ void blkg_conf_exit_frozen(struct blkg_conf_ctx *ctx, unsigned long memflags)
if (ctx->bdev) {
struct request_queue *q = ctx->bdev->bd_queue;
+ blk_mq_unfreeze_queue(q, memflags);
blkg_conf_exit(ctx);
mutex_unlock(&q->elevator_lock);
- blk_mq_unfreeze_queue(q, memflags);
}
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..e3ace6b5a1e1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4094,8 +4094,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
- mutex_lock(&q->elevator_lock);
-
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
hctx->nr_ctx = 0;
@@ -4200,8 +4198,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
-
- mutex_unlock(&q->elevator_lock);
}
/*
@@ -4505,16 +4501,9 @@ static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
}
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
- struct request_queue *q, bool lock)
+ struct request_queue *q)
{
- if (lock) {
- /* protect against switching io scheduler */
- mutex_lock(&q->elevator_lock);
- __blk_mq_realloc_hw_ctxs(set, q);
- mutex_unlock(&q->elevator_lock);
- } else {
- __blk_mq_realloc_hw_ctxs(set, q);
- }
+ __blk_mq_realloc_hw_ctxs(set, q);
/* unregister cpuhp callbacks for exited hctxs */
blk_mq_remove_hw_queues_cpuhp(q);
@@ -4546,7 +4535,8 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
xa_init(&q->hctx_table);
- blk_mq_realloc_hw_ctxs(set, q, false);
+ blk_mq_realloc_hw_ctxs(set, q);
+
if (!q->nr_hw_queues)
goto err_hctxs;
@@ -4564,7 +4554,11 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
blk_mq_add_queue_tag_set(set, q);
+
+ mutex_lock(&q->elevator_lock);
blk_mq_map_swqueue(q);
+ mutex_unlock(&q->elevator_lock);
+
return 0;
err_hctxs:
@@ -4947,12 +4941,9 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
if (!qe)
return false;
- /* Accessing q->elevator needs protection from ->elevator_lock. */
- mutex_lock(&q->elevator_lock);
-
if (!q->elevator) {
kfree(qe);
- goto unlock;
+ return true;
}
INIT_LIST_HEAD(&qe->node);
@@ -4962,8 +4953,6 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
__elevator_get(qe->type);
list_add(&qe->node, head);
elevator_disable(q);
-unlock:
- mutex_unlock(&q->elevator_lock);
return true;
}
@@ -4993,11 +4982,9 @@ static void blk_mq_elv_switch_back(struct list_head *head,
list_del(&qe->node);
kfree(qe);
- mutex_lock(&q->elevator_lock);
elevator_switch(q, t);
/* drop the reference acquired in blk_mq_elv_switch_none */
elevator_put(t);
- mutex_unlock(&q->elevator_lock);
}
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
@@ -5018,6 +5005,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
return;
+ list_for_each_entry(q, &set->tag_list, tag_set_list)
+ mutex_lock(&q->elevator_lock);
+
memflags = memalloc_noio_save();
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue_nomemsave(q);
@@ -5042,7 +5032,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
fallback:
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
- blk_mq_realloc_hw_ctxs(set, q, true);
+ blk_mq_realloc_hw_ctxs(set, q);
if (q->nr_hw_queues != set->nr_hw_queues) {
int i = prev_nr_hw_queues;
@@ -5072,6 +5062,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_unfreeze_queue_nomemrestore(q);
memalloc_noio_restore(memflags);
+ list_for_each_entry(q, &set->tag_list, tag_set_list)
+ mutex_unlock(&q->elevator_lock);
+
/* Free the excess tags when nr_hw_queues shrink. */
for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
__blk_mq_free_map_and_rqs(set, i);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a2882751f0d2..8a58db7d7254 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -76,16 +76,16 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
if (ret < 0)
return ret;
- memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
+ memflags = blk_mq_freeze_queue(q);
if (nr < BLKDEV_MIN_RQ)
nr = BLKDEV_MIN_RQ;
err = blk_mq_update_nr_requests(disk->queue, nr);
if (err)
ret = err;
- mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+ mutex_unlock(&q->elevator_lock);
return ret;
}
@@ -949,10 +949,6 @@ void blk_unregister_queue(struct gendisk *disk)
blk_mq_sysfs_unregister(disk);
blk_crypto_sysfs_unregister(disk);
- mutex_lock(&q->elevator_lock);
- elv_unregister_queue(q);
- mutex_unlock(&q->elevator_lock);
-
mutex_lock(&q->sysfs_lock);
disk_unregister_independent_access_ranges(disk);
mutex_unlock(&q->sysfs_lock);
diff --git a/block/elevator.c b/block/elevator.c
index b4d08026b02c..370c4e39fe90 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -481,8 +481,6 @@ void elv_unregister_queue(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
- lockdep_assert_held(&q->elevator_lock);
-
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
kobject_uevent(&e->kobj, KOBJ_REMOVE);
kobject_del(&e->kobj);
@@ -731,13 +729,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
elv_iosched_load_module(name);
- memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
+ memflags = blk_mq_freeze_queue(q);
ret = elevator_change(q, name);
if (!ret)
ret = count;
- mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+ mutex_unlock(&q->elevator_lock);
return ret;
}
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..597bd94b9442 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -744,11 +744,6 @@ void del_gendisk(struct gendisk *disk)
blk_mq_cancel_work_sync(q);
blk_mq_quiesce_queue(q);
- if (q->elevator) {
- mutex_lock(&q->elevator_lock);
- elevator_exit(q);
- mutex_unlock(&q->elevator_lock);
- }
rq_qos_exit(q);
blk_mq_unquiesce_queue(q);
@@ -761,6 +756,10 @@ void del_gendisk(struct gendisk *disk)
else if (queue_is_mq(q))
blk_mq_exit_queue(q);
+ if (q->elevator) {
+ elv_unregister_queue(q);
+ elevator_exit(q);
+ }
if (start_drain)
blk_unfreeze_release_lock(q);
}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cc23035148b4..963077812856 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2840,7 +2840,12 @@ static const struct nvme_core_quirk_entry core_quirks[] = {
.quirks = NVME_QUIRK_DELAY_BEFORE_CHK_RDY |
NVME_QUIRK_NO_DEEPEST_PS |
NVME_QUIRK_IGNORE_DEV_SUBNQN,
- }
+ },
+ {
+ .vid = 0x144d,
+ .fr = "DNA8DNA8",
+ .quirks = NVME_QUIRK_SINGLE_PORT,
+ },
};
/* match is null-terminated but idstr is space-padded. */
@@ -3338,6 +3343,9 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->quirks |= core_quirks[i].quirks;
}
+ if (ctrl->quirks & NVME_QUIRK_SINGLE_PORT)
+ id->cmic = 0;
+
ret = nvme_init_subsystem(ctrl, id);
if (ret)
goto out_free;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 51e078642127..c3fc818a25de 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -178,6 +178,11 @@ enum nvme_quirks {
* Align dma pool segment size to 512 bytes
*/
NVME_QUIRK_DMAPOOL_ALIGN_512 = (1 << 22),
+
+ /*
+ * Assume a single-port disk even if it advertises multi-port
+ */
+ NVME_QUIRK_SINGLE_PORT = (1 << 23),
};
/*