Hi all,

I ran the latest blktests (git hash: 283923df5bee) with the v6.15 kernel and observed the 6 failures listed below. Compared with the previous report for the v6.15-rc1 kernel [1], 2 failures are no longer observed (the rxe driver test hang and nvme/037), and 4 new failures are observed (nvme/023, nvme/061 hang and failure, nvme/063 failure).

[1] https://lore.kernel.org/linux-block/x2gnkogq46h66r2fctksnu4yu4wpndkopawbsudq6vqbcgjszu@fjrowpmrran5/

List of failures
================
#1: nvme/023
#2: nvme/041 (fc transport)
#3: nvme/061 hang (rdma transport, siw driver)
#4: nvme/061 failure (fc transport)
#5: nvme/063 failure (tcp transport)
#6: q_usage_counter WARN during system boot

Failure description
===================

#1: nvme/023

When libnvme is version 1.13 or later and is built with liburing, the nvme-cli command "nvme smart-log" fails for namespace block devices. This makes the test case nvme/023 fail [2] (see the example command after the failure descriptions). A fix in libnvme is expected.

[2] https://lore.kernel.org/linux-nvme/32c3e9ef-ab3c-40b5-989a-7aa323f5d611@flourine.local/T/#m6519ce3e641e7011231d955d9002d1078510e3ee

#2: nvme/041 (fc transport)

The test case nvme/041 fails for the fc transport. Refer to the report for the v6.12 kernel [3].

[3] https://lore.kernel.org/linux-nvme/6crydkodszx5vq4ieox3jjpwkxtu7mhbohypy24awlo5w7f4k6@to3dcng24rd4/

#3: nvme/061 hang (rdma transport, siw driver)

The new test case nvme/061 revealed a bug in the RDMA core which causes a KASAN slab-use-after-free of cm_id_private work objects. A fix patch is queued for v6.16-rcX [4].

[4] https://lore.kernel.org/linux-rdma/20250510101036.1756439-1-shinichiro.kawasaki@xxxxxxx/

#4: nvme/061 failure (fc transport)

The test case nvme/061 sometimes fails due to a WARN [5]. Just before the WARN, the kernel reported "refcount_t: underflow; use-after-free." This failure can be recreated in a stable manner by repeating the test case 10 times or so (see the sketch after the failure descriptions). I also tried the v6.15-rcX kernels: with the v6.15-rc1 kernel, the test case always failed, with a different symptom. With the v6.15-rc2 kernel, the test case passed in most runs, but sometimes failed with the same symptom as v6.15. I guess the nvme-fc changes in v6.15-rc2 fixed most of the refcounting issues, but a rare refcounting failure scenario is still left.

#5: nvme/063 failure (tcp transport)

The new test case nvme/063 triggers a WARN in blk_mq_unquiesce_queue() and a KASAN slab-use-after-free in blk_mq_queue_tag_busy_iter() [6]. Some debug effort was made, but it still needs further work.

[6] https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/

#6: q_usage_counter WARN during system boot

This is not a blktests failure, but I observe it on the test systems for blktests. During the system boot process, a lockdep WARN related to q_usage_counter is reported. Refer to the report for the v6.15-rc1 kernel [1].
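For #1, this is the kind of invocation that fails; a minimal sketch, assuming a namespace block device exists at the example path /dev/nvme0n1:

  # With libnvme 1.13 or later built with liburing, smart-log fails when
  # the target is a namespace block device (example device path below).
  nvme smart-log /dev/nvme0n1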
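For #4, a minimal sketch of the repetition I mean, assuming a blktests checkout in the current directory with the fc transport selected in its config (e.g. NVMET_TRTYPES="fc", or nvme_trtype="fc" on older blktests versions):

  # Repeat nvme/061 until blktests reports a failure; about 10 iterations
  # were enough to hit the refcount WARN on my test systems.
  for i in $(seq 1 10); do
      ./check nvme/061 || break
  done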
[5] dmesg at nvme/061 failure

[65984.926261] [ T26143] run blktests nvme/061 at 2025-05-29 14:38:34
[65984.980383] [ T26188] loop0: detected capacity change from 0 to 2097152
[65984.995441] [ T26191] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[65985.050303] [ T23244] nvme nvme1: NVME-FC{0}: create association : host wwpn 0x20001100aa000001 rport wwpn 0x20001100ab000001: NQN "blktests-subsystem-1"
[65985.052545] [ T23343] (NULL device *): {0:0} Association created
[65985.053586] [ T25919] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[65985.059926] [ T23244] nvme nvme1: NVME-FC{0}: controller connect complete
[65985.061770] [ T26214] nvme nvme1: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[65985.125347] [ T23936] nvme nvme2: NVME-FC{1}: create association : host wwpn 0x20001100aa000001 rport wwpn 0x20001100ab000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
[65985.128362] [ T4511] (NULL device *): {0:1} Association created
[65985.130389] [ T23342] nvmet: Created discovery controller 2 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:3a8a427d-68a5-4129-8b0f-1a53fd94be80.
[65985.133718] [ T23936] nvme nvme2: NVME-FC{1}: controller connect complete
[65985.134599] [ T26217] nvme nvme2: NVME-FC{1}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:3a8a427d-68a5-4129-8b0f-1a53fd94be80
[65985.139708] [ T26217] nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[65985.153785] [ T4511] (NULL device *): {0:1} Association deleted
[65985.164940] [ T4511] (NULL device *): {0:1} Association freed
[65985.166099] [ T25142] (NULL device *): Disconnect LS failed: No Association
[65986.133054] [ T4511] nvme nvme1: NVME-FC{0}: io failed due to lldd error -107
[65986.133073] [ T25919] nvme nvme1: NVME-FC{0}: io failed due to lldd error -107
[65986.133502] [ T23343] nvme nvme1: NVME-FC{0}: io failed due to lldd error -107
[65986.133519] [ T23936] nvme nvme1: NVME-FC{0}: transport association event: transport detected io error
[65986.133524] [ T23936] nvme nvme1: NVME-FC{0}: resetting controller
[65986.133530] [ T23936] nvme nvme1: NVME-FC{0}: io failed due to lldd error -107
[65986.133546] [ T15792] block nvme1n1: no usable path - requeuing I/O
[65986.133576] [ T26241] block nvme1n1: no usable path - requeuing I/O
[65986.133925] [ T1217] block nvme1n1: no usable path - requeuing I/O
[65986.145862] [ T23342] (NULL device *): {0:0} Association deleted
[65986.160121] [ T4511] nvme nvme1: NVME-FC{0}: create association : host wwpn 0x20001100aa000001 rport wwpn 0x20001100ab000001: NQN "blktests-subsystem-1"
[65986.162170] [ T4511] (NULL device *): queue 0 connect admin queue failed (-111).
[65986.163062] [ T4511] nvme nvme1: NVME-FC{0}: reset: Reconnect attempt failed (-111)
[65986.163065] [ T4511] nvme nvme1: NVME-FC{0}: Reconnect attempt in 1 seconds
[65986.189933] [ T23342] (NULL device *): {0:0} Association freed
[65986.190779] [ T15160] (NULL device *): Disconnect LS failed: No Association
[65986.191973] [ T23342] ------------[ cut here ]------------
[65986.192759] [ T23342] refcount_t: underflow; use-after-free.
[65986.193537] [ T23342] WARNING: CPU: 3 PID: 23342 at lib/refcount.c:28 refcount_warn_saturate+0xee/0x150
[65986.194436] [ T23342] Modules linked in: nvme_fcloop nvmet_fc nvmet nvme_fc nvme_fabrics chacha_generic chacha20poly1305 tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet netfs parport_pc parport i2c_piix4 i2c_smbus e1000 pcspkr fuse loop dm_multipath nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci zram bochs drm_client_lib drm_shmem_helper drm_kms_helper xfs nvme drm sym53c8xx scsi_transport_spi nvme_core nvme_keyring serio_raw nvme_auth floppy ata_generic pata_acpi qemu_fw_cfg [last unloaded: nvmet]
[65986.200276] [ T23342] CPU: 3 UID: 0 PID: 23342 Comm: kworker/u16:5 Tainted: G B 6.15.0+ #41 PREEMPT(voluntary)
[65986.201617] [ T23342] Tainted: [B]=BAD_PAGE
[65986.202522] [ T23342] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[65986.203723] [ T23342] Workqueue: nvmet-wq nvmet_fc_delete_assoc_work [nvmet_fc]
[65986.204774] [ T23342] RIP: 0010:refcount_warn_saturate+0xee/0x150
[65986.205754] [ T23342] Code: 24 27 3f 03 01 e8 b2 e1 cd fe 0f 0b eb 91 80 3d 13 27 3f 03 00 75 88 48 c7 c7 a0 e8 3c 87 c6 05 03 27 3f 03 01 e8 92 e1 cd fe <0f> 0b e9 6e ff ff ff 80 3d f3 26 3f 03 00 0f 85 61 ff ff ff 48 c7
[65986.208055] [ T23342] RSP: 0018:ffff88811cf37c28 EFLAGS: 00010296
[65986.209072] [ T23342] RAX: 0000000000000000 RBX: ffff888106198440 RCX: 0000000000000000
[65986.210118] [ T23342] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
[65986.211162] [ T23342] RBP: 0000000000000003 R08: 0000000000000001 R09: ffffed1075c35981
[65986.212215] [ T23342] R10: ffff8883ae1acc0b R11: fffffffffffd4e60 R12: ffff888109d62938
[65986.213268] [ T23342] R13: ffff888106198440 R14: ffff88812cc3883c R15: ffff888106198448
[65986.214361] [ T23342] FS: 0000000000000000(0000) GS:ffff8884245bd000(0000) knlGS:0000000000000000
[65986.215467] [ T23342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[65986.216458] [ T23342] CR2: 00007f66ec449c00 CR3: 000000012ffcc000 CR4: 00000000000006f0
[65986.217479] [ T23342] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[65986.218476] [ T23342] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[65986.219437] [ T23342] Call Trace:
[65986.220202] [ T23342] <TASK>
[65986.220942] [ T23342] nvmet_fc_delete_assoc_work+0xf1/0x2d0 [nvmet_fc]
[65986.221821] [ T23342] process_one_work+0x84f/0x1460
[65986.222663] [ T23342] ? __pfx_process_one_work+0x10/0x10
[65986.223481] [ T23342] ? assign_work+0x16c/0x240
[65986.224301] [ T23342] worker_thread+0x5ef/0xfd0
[65986.225094] [ T23342] ? __kthread_parkme+0xb4/0x200
[65986.225930] [ T23342] ? __pfx_worker_thread+0x10/0x10
[65986.226722] [ T23342] kthread+0x3b0/0x770
[65986.227494] [ T23342] ? __pfx_kthread+0x10/0x10
[65986.228324] [ T23342] ? rcu_is_watching+0x11/0xb0
[65986.229152] [ T23342] ? _raw_spin_unlock_irq+0x24/0x50
[65986.229970] [ T23342] ? rcu_is_watching+0x11/0xb0
[65986.230747] [ T23342] ? __pfx_kthread+0x10/0x10
[65986.231527] [ T23342] ret_from_fork+0x30/0x70
[65986.232295] [ T23342] ? __pfx_kthread+0x10/0x10
[65986.233081] [ T23342] ret_from_fork_asm+0x1a/0x30
[65986.233863] [ T23342] </TASK>
[65986.234571] [ T23342] irq event stamp: 0
[65986.235279] [ T23342] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[65986.236195] [ T23342] hardirqs last disabled at (0): [<ffffffff844f4e98>] copy_process+0x1f08/0x87c0
[65986.237174] [ T23342] softirqs last enabled at (0): [<ffffffff844f4efd>] copy_process+0x1f6d/0x87c0
[65986.238085] [ T23342] softirqs last disabled at (0): [<0000000000000000>] 0x0
[65986.238945] [ T23342] ---[ end trace 0000000000000000 ]---
[65986.243357] [ T26143] nvme nvme1: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
[65986.258391] [ T26255] nvme_fc: nvme_fc_create_ctrl: nn-0x10001100ab000001:pn-0x20001100ab000001 - nn-0x10001100aa000001:pn-0x20001100aa000001 combination not found
...