On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
>
> #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> #5 (&eq->sysfs_lock){+.+.}-{4:4}:
> #4 (&q->elevator_lock){+.+.}-{4:4}:
> #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> #2 (fs_reclaim){+.+.}-{0:0}:
> #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> #0 (&q->debugfs_mutex){+.+.}-{4:4}:
>
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
>
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>

While testing this patch on my machine using blktests, I stumbled upon the
lockdep splat shown below (I could consistently recreate it):

run blktests block/005 at 2025-04-28 06:57:51

======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc2+ #174 Not tainted
------------------------------------------------------
check/8088 is trying to acquire lock:
c0000000a0c03538 (&disk->rqos_state_mutex){+.+.}-{4:4}, at: wbt_disable_default+0x9c/0x118

but task is already holding lock:
c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&q->elevator_lock){+.+.}-{4:4}:
       __mutex_lock+0x128/0xdd8
       elevator_change+0x94/0x214
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

-> #2 (&q->q_usage_counter(io)#23){++++}-{0:0}:
       blk_alloc_queue+0x46c/0x4bc
       blk_mq_alloc_queue+0xc0/0x160
       __blk_mq_alloc_disk+0x34/0x128
       nvme_alloc_ns+0x140/0x1804 [nvme_core]
       nvme_scan_ns+0x42c/0x564 [nvme_core]
       async_run_entry_fn+0x9c/0x30c
       process_one_work+0x514/0xd38
       worker_thread+0x390/0x6dc
       kthread+0x230/0x278
       start_kernel_thread+0x14/0x18

-> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire+0x114/0x150
       __kmalloc_cache_noprof+0x70/0x5c0
       wbt_init+0x64/0x2fc
       wbt_enable_default+0x140/0x15c
       elevator_change_done+0x314/0x3a8
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

-> #0 (&disk->rqos_state_mutex){+.+.}-{4:4}:
       __lock_acquire+0x1b5c/0x29f8
       lock_acquire+0x23c/0x3f8
       __mutex_lock+0x128/0xdd8
       wbt_disable_default+0x9c/0x118
       bfq_init_queue+0x7b0/0x8c0
       blk_mq_init_sched+0x29c/0x3a8
       __elevator_change+0x3a4/0x8a4
       elevator_change+0x1a4/0x214
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

other info that might help us debug this:

Chain exists of:
  &disk->rqos_state_mutex --> &q->q_usage_counter(io)#23 --> &q->elevator_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->elevator_lock);
                               lock(&q->q_usage_counter(io)#23);
                               lock(&q->elevator_lock);
  lock(&disk->rqos_state_mutex);

 *** DEADLOCK ***

7 locks held by check/8088:
 #0: c0000000873f2400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xa0/0x180
 #1: c00000008c10c088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1e0/0x384
 #2: c000000085239248 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1f8/0x384
 #3: c0000000f801c190 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: elv_iosched_store+0x13c/0x1f4
 #4: c00000005b8f6718 (&q->q_usage_counter(io)#23){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
 #5: c00000005b8f6750 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
 #6: c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214

stack backtrace:
CPU: 26 UID: 0 PID: 8088 Comm: check Kdump: loaded Not tainted 6.15.0-rc2+ #174 VOLUNTARY
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000d7497240] [c0000000017b9888] dump_stack_lvl+0x100/0x184 (unreliable)
[c0000000d7497270] [c0000000002b546c] print_circular_bug+0x448/0x604
[c0000000d7497320] [c0000000002b5874] check_noncircular+0x24c/0x26c
[c0000000d74973f0] [c0000000002bbb78] __lock_acquire+0x1b5c/0x29f8
[c0000000d7497520] [c0000000002b915c] lock_acquire+0x23c/0x3f8
[c0000000d7497620] [c00000000181277c] __mutex_lock+0x128/0xdd8
[c0000000d7497780] [c000000000c73bf8] wbt_disable_default+0x9c/0x118
[c0000000d74977c0] [c000000000c4c2c0] bfq_init_queue+0x7b0/0x8c0
[c0000000d7497890] [c000000000bff634] blk_mq_init_sched+0x29c/0x3a8
[c0000000d7497910] [c000000000bc2a18] __elevator_change+0x3a4/0x8a4
[c0000000d74979b0] [c000000000bc30bc] elevator_change+0x1a4/0x214
[c0000000d7497a00] [c000000000bc427c] elv_iosched_store+0x14c/0x1f4
[c0000000d7497ae0] [c000000000bd07ec] queue_attr_store+0x194/0x1d0
[c0000000d7497c00] [c000000000a40f00] sysfs_kf_write+0xbc/0x110
[c0000000d7497c50] [c000000000a3cc4c] kernfs_fop_write_iter+0x264/0x384
[c0000000d7497cb0] [c0000000008bb9bc] vfs_write+0x5b0/0x77c
[c0000000d7497d90] [c0000000008bbf88] ksys_write+0xa0/0x180
[c0000000d7497df0] [c000000000039f70] system_call_exception+0x1b0/0x4f0
[c0000000d7497e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffa413b034
NIP:  00007fffa413b034 LR: 00007fffa413b034 CTR: 0000000000000000
REGS: c0000000d7497e80 TRAP: 3000   Not tainted  (6.15.0-rc2+)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44422408  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007ffffd011260 000000010dfa7e00 0000000000000001
GPR04: 000000011c30b720 0000000000000004 0000000000000010 0000000000000001
GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffa43fab60 000000011c3adbc0 000000010dfa87b8
GPR16: 000000010dfa94d8 0000000020000000 0000000000000000 000000010deb9070
GPR20: 000000010df4beb8 00007ffffd011404 000000010df4f8a0 000000010dfa89bc
GPR24: 000000010dfa8a50 0000000000000000 000000011c30b720 0000000000000004
GPR28: 0000000000000004 00007fffa42418e0 000000011c30b720 0000000000000004
NIP [00007fffa413b034] 0x7fffa413b034
LR [00007fffa413b034] 0x7fffa413b034
--- interrupt: 3000
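
To restate the cycle as I read the chains: the elevator-switch path now takes
disk->rqos_state_mutex while holding q->elevator_lock (bfq_init_queue() ->
wbt_disable_default()), while the enable path allocates in wbt_init() under
disk->rqos_state_mutex, and via fs_reclaim -> q_usage_counter(io) that reaches
back to q->elevator_lock. Below is a minimal userspace sketch of that inverted
ordering, with pthread mutexes standing in for the kernel locks and the
transitive fs_reclaim/q_usage_counter(io) leg modelled as a direct acquisition;
nothing in it is actual kernel code, the names just mirror the report:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t elevator_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t rqos_state_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for elevator_change() -> bfq_init_queue() -> wbt_disable_default():
 * rqos_state_mutex is acquired while elevator_lock is already held. */
static void *switch_to_bfq(void *unused)
{
	pthread_mutex_lock(&elevator_lock);
	pthread_mutex_lock(&rqos_state_mutex);
	pthread_mutex_unlock(&rqos_state_mutex);
	pthread_mutex_unlock(&elevator_lock);
	return NULL;
}

/* Stand-in for wbt_enable_default(): the wbt_init() allocation under
 * rqos_state_mutex drags in fs_reclaim -> q_usage_counter(io), which in
 * turn depends on elevator_lock; collapsed here to taking elevator_lock
 * directly. */
static void *enable_wbt(void *unused)
{
	pthread_mutex_lock(&rqos_state_mutex);
	pthread_mutex_lock(&elevator_lock);
	pthread_mutex_unlock(&elevator_lock);
	pthread_mutex_unlock(&rqos_state_mutex);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, switch_to_bfq, NULL);
	pthread_create(&b, NULL, enable_wbt, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* If both threads interleave after taking their first lock, neither
	 * join returns and the program hangs -- the ABBA cycle above. */
	printf("lucky timing, no deadlock this run\n");
	return 0;
}

(Build with cc -pthread. The toy only hangs on unlucky timing, but lockdep
flags the kernel ordering on first observation, which matches how reliably
block/005 reproduces the splat here.)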