On 4/9/25 7:38 PM, Ming Lei wrote:
>>>>> __blk_mq_update_nr_hw_queues is the only chance for tagset wide queues
>>>>> involved wrt. switching elevator. If elevator switching is not allowed
>>>>> when __blk_mq_update_nr_hw_queues() is started, why do we need per-set
>>>>> lock?
>>>>>
>>>> Yes if elevator switch is not allowed then we probably don't need per-set lock.
>>>> However my question was if we were to not allow elevator switch while
>>>> __blk_mq_update_nr_hw_queues is running then how would we implement it?
>>>
>>> It can be done easily by tag_set->srcu.
>> Ok great if that's possible! But I'm not sure how it could be done in this
>> case. I think both __blk_mq_update_nr_hw_queues and elv_iosched_store
>> run in the writer/updater context. So you may still need lock? Can you
>> please send across a (informal) patch with your idea ?
>
> Please see the attached patch which treats elv_iosched_store() as reader.
>
I looked through the patch and it looks good. However, we still have an
issue in code paths where we acquire ->elevator_lock without first
freezing the queue. In that case, if we then allocate memory with
GFP_KERNEL, fs_reclaim would still trigger a lockdep splat, because the
lock order becomes,
elevator_lock -> fs_reclaim(GFP_KERNEL) -> &q->q_usage_counter(io)
In fact, I tested your patch and ran into exactly that splat; I've
attached the lockdep report for your reference. So IMO, we should
instead arrange the locking order so that ->freeze_lock never depends
on ->elevator_lock. One way to make that possible is to make
->elevator_lock per-set.
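
To make the cycle concrete, here is a minimal sketch (mine, not taken
from your attached patch) of the two orderings involved; the function
names are made up, only q->elevator_lock and the _nomemsave/_nomemrestore
freeze helpers are the real ones:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/slab.h>

/*
 * Order taken in elv_iosched_store(): freeze first, then take the lock,
 * which records q_usage_counter(io) -> elevator_lock.
 */
static void freeze_then_lock(struct request_queue *q)
{
	blk_mq_freeze_queue_nomemsave(q);
	mutex_lock(&q->elevator_lock);
	/* ... switch elevator ... */
	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue_nomemrestore(q);
}

/*
 * Order in paths that take the lock without freezing and then allocate
 * with GFP_KERNEL: elevator_lock -> fs_reclaim -> q_usage_counter(io),
 * which closes the cycle against the ordering above.
 */
static void lock_then_reclaim(struct request_queue *q)
{
	mutex_lock(&q->elevator_lock);
	kfree(kzalloc(64, GFP_KERNEL));	/* may enter fs_reclaim */
	mutex_unlock(&q->elevator_lock);
}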
>>>
>>> Actually freeze lock is already held for nvme before calling
>>> blk_mq_update_nr_hw_queues, and it is reasonable to suppose queue
>>> frozen for updating nr_hw_queues, so the above order may not match
>>> with the existed code.
>>>
>>> Do we need to consider nvme or blk_mq_update_nr_hw_queues now?
>>>
>> I think we should consider (may be in different patch) updating
>> nvme_quiesce_io_queues and nvme_unquiesce_io_queues and remove
>> its dependency on ->tag_list_lock.
>
> If we need to take nvme into account, the above lock order doesn't work,
> because nvme freezes queue before calling blk_mq_update_nr_hw_queues(),
> and elevator lock still depends on freeze lock.
>
> If it needn't to be considered, per-set lock becomes necessary.
IMO, in addition to nvme_quiesce_io_queues and nvme_unquiesce_io_queues,
we should also update the nvme pci, rdma and tcp drivers so that they
neither freeze the queue before invoking blk_mq_update_nr_hw_queues nor
unfreeze it after blk_mq_update_nr_hw_queues returns. I see that nvme
loop and fc already don't freeze the queue before invoking
blk_mq_update_nr_hw_queues.
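
Something like the following is what I have in mind (just my sketch of
the idea, not actual driver code; the function names are made up and the
surrounding reset/teardown logic is omitted), relying on the fact that
blk_mq_update_nr_hw_queues() already freezes every queue in the set
internally:

#include <linux/blk-mq.h>
#include "nvme.h"

/* Today (pci/rdma/tcp, heavily simplified): the driver freezes first. */
static void reset_path_today(struct nvme_ctrl *ctrl)
{
	nvme_wait_freeze(ctrl);		/* queues were frozen earlier */
	blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
	nvme_unfreeze(ctrl);
}

/*
 * Proposed (what loop/fc effectively do already): no driver-side freeze,
 * blk_mq_update_nr_hw_queues() freezes/unfreezes the queues itself.
 */
static void reset_path_proposed(struct nvme_ctrl *ctrl)
{
	blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
}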
>>>>
>>>> So now ->freeze_lock should not depend on ->elevator_lock and that shall
>>>> help avoid few of the recent lockdep splats reported with fs_reclaim.
>>>> What do you think?
>>>
>>> Yes, reordering ->freeze_lock and ->elevator_lock may avoid many fs_reclaim
>>> related splat.
>>>
>>> However, in del_gendisk(), freeze_lock is still held before calling
>>> elevator_exit() and blk_unregister_queue(), and looks not easy to reorder.
>>
>> Yes agreed, however elevator_exit() called from del_gendisk() or
>> elv_unregister_queue() called from blk_unregister_queue() are called
>> after we unregister the queue. And if queue has been already unregistered
>> while invoking elevator_exit or del_gendisk then ideally we don't need to
>> acquire ->elevator_lock. The same is true for elevator_exit() called
>> from add_disk_fwnode(). So IMO, we should update these paths to avoid
>> acquiring ->elevator_lock.
>
> As I mentioned, blk_mq_update_nr_hw_queues() still can come, which is one
> host wide event, so either lock or srcu sync is needed.
Yes, agreed. I see that blk_mq_update_nr_hw_queues can run in parallel
with del_gendisk or blk_unregister_queue, so the host-wide
synchronization is still needed there.
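For that host-wide synchronization, I imagine the srcu idea looking
roughly like this (a generic illustration only, not your attached patch;
all names here are made up):

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(example_set_srcu);
static bool example_updating_nr_hw_queues;

/*
 * elv_iosched_store() side: run as an SRCU reader and bail out if an
 * nr_hw_queues update is in progress.
 */
static int example_switch_elevator(struct request_queue *q)
{
	int idx, ret = 0;

	idx = srcu_read_lock(&example_set_srcu);
	if (READ_ONCE(example_updating_nr_hw_queues)) {
		ret = -EBUSY;
		goto out;
	}
	/* ... perform the elevator switch for this queue ... */
out:
	srcu_read_unlock(&example_set_srcu, idx);
	return ret;
}

/*
 * __blk_mq_update_nr_hw_queues() side: flag the update and wait for all
 * in-flight elevator switches to drain before reshaping the hctxs.
 */
static void example_update_nr_hw_queues(void)
{
	WRITE_ONCE(example_updating_nr_hw_queues, true);
	synchronize_srcu(&example_set_srcu);
	/* ... update nr_hw_queues, switch/restore elevators ... */
	WRITE_ONCE(example_updating_nr_hw_queues, false);
}
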
Thanks,
--Nilay
[ 399.112145] ======================================================
[ 399.112153] WARNING: possible circular locking dependency detected
[ 399.112160] 6.15.0-rc1+ #159 Not tainted
[ 399.112167] ------------------------------------------------------
[ 399.112174] bash/5891 is trying to acquire lock:
[ 399.112181] c0000000bc2334f8 (&q->elevator_lock){+.+.}-{4:4}, at: elv_iosched_store+0x11c/0x5d4
[ 399.112205]
[ 399.112205] but task is already holding lock:
[ 399.112211] c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
[ 399.112231]
[ 399.112231] which lock already depends on the new lock.
[ 399.112231]
[ 399.112239]
[ 399.112239] the existing dependency chain (in reverse order) is:
[ 399.112246]
[ 399.112246] -> #2 (&q->q_usage_counter(io)#20){++++}-{0:0}:
[ 399.112260] blk_alloc_queue+0x3a8/0x3e4
[ 399.112268] blk_mq_alloc_queue+0x88/0x11c
[ 399.112278] __blk_mq_alloc_disk+0x34/0xd8
[ 399.112290] null_add_dev+0x3c8/0x914 [null_blk]
[ 399.112306] null_init+0x1e0/0x4bc [null_blk]
[ 399.112319] do_one_initcall+0x8c/0x4b8
[ 399.112340] do_init_module+0x7c/0x2c4
[ 399.112396] init_module_from_file+0xb4/0x108
[ 399.112405] idempotent_init_module+0x26c/0x368
[ 399.112414] sys_finit_module+0x98/0x150
[ 399.112422] system_call_exception+0x134/0x360
[ 399.112430] system_call_vectored_common+0x15c/0x2ec
[ 399.112441]
[ 399.112441] -> #1 (fs_reclaim){+.+.}-{0:0}:
[ 399.112453] fs_reclaim_acquire+0xe4/0x120
[ 399.112461] kmem_cache_alloc_noprof+0x74/0x570
[ 399.112469] __kernfs_new_node+0x98/0x37c
[ 399.112479] kernfs_new_node+0x94/0xe4
[ 399.112490] kernfs_create_dir_ns+0x44/0x120
[ 399.112501] sysfs_create_dir_ns+0x94/0x160
[ 399.112512] kobject_add_internal+0xf4/0x3c8
[ 399.112524] kobject_add+0x70/0x10c
[ 399.112533] elv_register_queue+0x70/0x14c
[ 399.112562] blk_register_queue+0x1d8/0x2bc
[ 399.112575] add_disk_fwnode+0x3b4/0x5d0
[ 399.112588] sd_probe+0x3bc/0x5b4 [sd_mod]
[ 399.112601] really_probe+0x104/0x4c4
[ 399.112613] __driver_probe_device+0xb8/0x200
[ 399.112623] driver_probe_device+0x54/0x128
[ 399.112632] __driver_attach_async_helper+0x7c/0x150
[ 399.112641] async_run_entry_fn+0x60/0x1bc
[ 399.112665] process_one_work+0x2ac/0x7e4
[ 399.112678] worker_thread+0x238/0x460
[ 399.112691] kthread+0x158/0x188
[ 399.112700] start_kernel_thread+0x14/0x18
[ 399.112722]
[ 399.112722] -> #0 (&q->elevator_lock){+.+.}-{4:4}:
[ 399.112737] __lock_acquire+0x1bec/0x2bc8
[ 399.112747] lock_acquire+0x140/0x430
[ 399.112760] __mutex_lock+0xf0/0xb00
[ 399.112769] elv_iosched_store+0x11c/0x5d4
[ 399.112781] queue_attr_store+0x12c/0x164
[ 399.112792] sysfs_kf_write+0x74/0xc4
[ 399.112804] kernfs_fop_write_iter+0x1a8/0x2a4
[ 399.112816] vfs_write+0x410/0x584
[ 399.112829] ksys_write+0x84/0x140
[ 399.112841] system_call_exception+0x134/0x360
[ 399.112851] system_call_vectored_common+0x15c/0x2ec
[ 399.112863]
[ 399.112863] other info that might help us debug this:
[ 399.112863]
[ 399.112874] Chain exists of:
[ 399.112874] &q->elevator_lock --> fs_reclaim --> &q->q_usage_counter(io)#20
[ 399.112874]
[ 399.112896] Possible unsafe locking scenario:
[ 399.112896]
[ 399.112905] CPU0 CPU1
[ 399.112928] ---- ----
[ 399.112936] lock(&q->q_usage_counter(io)#20);
[ 399.112954] lock(fs_reclaim);
[ 399.112967] lock(&q->q_usage_counter(io)#20);
[ 399.112987] lock(&q->elevator_lock);
[ 399.112998]
[ 399.112998] *** DEADLOCK ***
[ 399.112998]
[ 399.113010] 5 locks held by bash/5891:
[ 399.113017] #0: c000000009f87400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
[ 399.113042] #1: c0000000097cfa88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x164/0x2a4
[ 399.113064] #2: c0000000ba4634f8 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x170/0x2a4
[ 399.113083] #3: c0000000bc232fd8 (&q->q_usage_counter(io)#20){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
[ 399.113102] #4: c0000000bc233010 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
[ 399.113120]
[ 399.113120] stack backtrace:
[ 399.113126] CPU: 24 UID: 0 PID: 5891 Comm: bash Kdump: loaded Not tainted 6.15.0-rc1+ #159 VOLUNTARY
[ 399.113130] Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
[ 399.113132] Call Trace:
[ 399.113133] [c000000082bd7580] [c0000000011b69f8] dump_stack_lvl+0x108/0x18c (unreliable)
[ 399.113141] [c000000082bd75b0] [c00000000022512c] print_circular_bug+0x448/0x604
[ 399.113146] [c000000082bd7660] [c000000000225534] check_noncircular+0x24c/0x26c
[ 399.113150] [c000000082bd7730] [c00000000022b998] __lock_acquire+0x1bec/0x2bc8
[ 399.113155] [c000000082bd7860] [c000000000228d20] lock_acquire+0x140/0x430
[ 399.113159] [c000000082bd7960] [c0000000011f76e8] __mutex_lock+0xf0/0xb00
[ 399.113163] [c000000082bd7a90] [c00000000090a558] elv_iosched_store+0x11c/0x5d4
[ 399.113169] [c000000082bd7b50] [c000000000912d20] queue_attr_store+0x12c/0x164
[ 399.113174] [c000000082bd7c60] [c0000000007d9114] sysfs_kf_write+0x74/0xc4
[ 399.113179] [c000000082bd7ca0] [c0000000007d5b60] kernfs_fop_write_iter+0x1a8/0x2a4
[ 399.113185] [c000000082bd7cf0] [c0000000006b386c] vfs_write+0x410/0x584
[ 399.113190] [c000000082bd7dc0] [c0000000006b3d18] ksys_write+0x84/0x140
[ 399.113196] [c000000082bd7e10] [c000000000031814] system_call_exception+0x134/0x360
[ 399.113199] [c000000082bd7e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
[ 399.113205] --- interrupt: 3000 at 0x7fffb113b034
[ 399.113208] NIP: 00007fffb113b034 LR: 00007fffb113b034 CTR: 0000000000000000
[ 399.113211] REGS: c000000082bd7e80 TRAP: 3000 Not tainted (6.15.0-rc1+)
[ 399.113213] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44422408 XER: 00000000
[ 399.113223] IRQMASK: 0
[ 399.113223] GPR00: 0000000000000004 00007fffe78cf020 00000001114d7e00 0000000000000001
[ 399.113223] GPR04: 000000014ce03ae0 000000000000000c 0000000000000010 0000000000000001
[ 399.113223] GPR08: 000000000000000b 0000000000000000 0000000000000000 0000000000000000
[ 399.113223] GPR12: 0000000000000000 00007fffb139ab60 000000014cef4ed0 00000001114d87b8
[ 399.113223] GPR16: 00000001114d94d8 0000000020000000 0000000000000000 00000001113e9070
[ 399.113223] GPR20: 000000011147beb8 00007fffe78cf1c4 000000011147f8a0 00000001114d89bc
[ 399.113223] GPR24: 00000001114d8a50 0000000000000000 000000014ce03ae0 000000000000000c
[ 399.113223] GPR28: 000000000000000c 00007fffb12418e0 000000014ce03ae0 000000000000000c
[ 399.113256] NIP [00007fffb113b034] 0x7fffb113b034
[ 399.113258] LR [00007fffb113b034] 0x7fffb113b034
[ 399.113260] --- interrupt: 3000