Processes occasionally get stuck in uninterruptible sleep (D state) on i3, i3en, and im4gn AWS EC2 instances under heavy IO load, forcing a reboot to recover. This has been observed on at least the 5.15.0-1083-aws and 6.8.0-1028-aws kernels. The local NVMe drives that come with the instances are assembled into a software RAID0 array with mdadm. AWS support says there is no issue with the underlying infrastructure, so I suspect there is a problem in the block layer of the kernel losing track of an IO completion somehow and wedging the system. There was a recent CVE that sounded like it might be the problem, but it appears not to be, as the impacted instances' kernel versions are listed as "fixed" (https://ubuntu.com/security/CVE-2024-50082). If there is any more information I can provide beyond what is included below, please let me know.

The call traces of the stuck processes that contain blk_* calls always include wbt_wait and look like this:

[163867.923558] __schedule+0x2cd/0x890
[163867.923559] ? blk_flush_plug_list+0xe3/0x110
[163867.923562] schedule+0x69/0x110
[163867.923563] io_schedule+0x16/0x40
[163867.923565] ? wbt_cleanup_cb+0x20/0x20
[163867.923568] rq_qos_wait+0xd0/0x170
[163867.923570] ? __wbt_done+0x40/0x40
[163867.923571] ? sysv68_partition+0x280/0x280
[163867.923572] ? wbt_cleanup_cb+0x20/0x20
[163867.923574] wbt_wait+0x96/0xc0
[163867.923576] __rq_qos_throttle+0x28/0x40
[163867.923577] blk_mq_submit_bio+0xfb/0x600

Or this:

[111397.863857] __schedule+0x27c/0x680
[111397.863862] ? __pfx_wbt_inflight_cb+0x10/0x10
[111397.863864] schedule+0x2c/0xf0
[111397.863867] io_schedule+0x46/0x80
[111397.863869] rq_qos_wait+0xc1/0x160
[111397.863872] ? __pfx_wbt_cleanup_cb+0x10/0x10
[111397.863873] ? __pfx_rq_qos_wake_function+0x10/0x10
[111397.863876] ? __pfx_wbt_inflight_cb+0x10/0x10
[111397.863880] wbt_wait+0xb3/0x100
[111397.863883] __rq_qos_throttle+0x28/0x40
[111397.863885] blk_mq_submit_bio+0x151/0x740

The nvme smart-log and error-log show nothing interesting on any of the drives, and the mdadm status all looks fine. Nothing interesting in dmesg other than the traces from the D state processes. None of the CPUs are doing work, but the load average is high because of CPU IO wait. Additionally, iostat shows no activity. In each case there are a handful of IOs that are in flight and never return.
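Since wbt_wait shows up in every one of these traces, here is a quick sketch of the non-destructive checks that expose the writeback throttling state on an affected device (nvme2n1 is just one of the devices above used as an example; none of this output is included here):

# cat /sys/block/nvme2n1/queue/wbt_lat_usec    <- wbt latency target in usec (0 would mean wbt is disabled)
# cat /sys/block/nvme2n1/queue/scheduler       <- active IO scheduler on the queue
# echo w > /proc/sysrq-trigger                 <- dump all blocked (D state) task stacks to dmesg (requires sysrq to be enabled)

The inflight counts and diskstats snapshots below show the handful of IOs that never complete.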
# cat /sys/block/nvme3n1/inflight
       2       10

Snapshot of /proc/diskstats taken twice to show the stuck IOs in column 12 (IOs currently in progress):

# cat /proc/diskstats   (non-nvme/md volumes redacted)
259       0 nvme1n1 105729 10821 4012670 22218 17309 13374 623448 1836 5 38184 24054 0 0 0 0 0 0
259       2 nvme2n1 104012 12803 4017102 163554751 17263 13207 610776 163522410 4 163004500 327077161 0 0 0 0 0 0
259       1 nvme3n1 102827 7381 3877790 20710 16198 13096 577168 2373 12 37892 23084 0 0 0 0 0 0
259       3 nvme0n1 135418 55608 6630967 31629 33645 23603 834048 3958 0 38220 35587 0 0 0 0 0 0
  9     127 md127 534397 0 18536329 461386 147709 0 2645600 957536 22 47436 1418923 0 0 0 0 0 0

# cat /proc/diskstats
259       0 nvme1n1 105729 10821 4012670 22218 17309 13374 623448 1836 5 38184 24054 0 0 0 0 0 0
259       2 nvme2n1 104012 12803 4017102 163554751 17263 13207 610776 163522410 4 163004500 327077161 0 0 0 0 0 0
259       1 nvme3n1 102828 7381 3877790 20710 16198 13096 577168 2373 12 37896 23084 0 0 0 0 0 0
259       3 nvme0n1 135428 55608 6631044 31630 33645 23603 834048 3958 0 38232 35588 0 0 0 0 0 0
  9     127 md127 534406 0 18536398 461390 147709 0 2645600 957536 22 47444 1418927 0 0 0 0 0 0

/proc/meminfo shows roughly 5GB of dirty pages that are stuck and roughly 270MB stuck in writeback. The Writeback number does not move over time, whereas the Dirty number fluctuates slightly (probably because of activity on the root volume).

# cat /proc/meminfo
MemTotal:       251504596 kB
MemFree:        224189644 kB
MemAvailable:   242048200 kB
Buffers:          1627368 kB
Cached:          16798708 kB
SwapCached:             0 kB
Active:           4911976 kB
Inactive:        19563000 kB
Active(anon):        2872 kB
Inactive(anon):   6055720 kB
Active(file):     4909104 kB
Inactive(file):  13507280 kB
Unevictable:        29328 kB
Mlocked:            19952 kB
SwapTotal:              0 kB
SwapFree:               0 kB
Dirty:            5008372 kB
Writeback:         277224 kB
AnonPages:        6078340 kB
Mapped:            741700 kB
Shmem:               1192 kB
KReclaimable:     1640636 kB
Slab:             1964256 kB
SReclaimable:     1640636 kB
SUnreclaim:        323620 kB
KernelStack:        30768 kB
PageTables:         63536 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    125752296 kB
Committed_AS:     8918348 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        74008 kB
VmallocChunk:           0 kB
Percpu:            152064 kB
HardwareCorrupted:      0 kB
AnonHugePages:          0 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:          0 kB
FilePmdMapped:          0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:       415744 kB
DirectMap2M:     20555776 kB
DirectMap1G:    236978176 kB

I also traced wbt-timer, which resulted in the following being repeated with the latency number ever increasing:

<redacted> [005] ..s.. 166350.485548: wbt_lat: 259:1: latency 3144231651us
<redacted> [005] ..s.. 166350.485549: wbt_timer: 259:1: status=4, step=4, inflight=10
<idle>-0   [016] ..s.. 166350.493550: wbt_lat: 259:0: latency 166251680941us
<idle>-0   [016] ..s.. 166350.493551: wbt_timer: 259:0: status=4, step=4, inflight=3
<idle>-0   [021] ..s.. 166350.529555: wbt_lat: 259:2: latency 166255836643us
<idle>-0   [021] ..s.. 166350.529557: wbt_timer: 259:2: status=4, step=4, inflight=0

There is no block activity on the nvme drives or the md device.

Appreciate any help,

Michael Marod
michael@xxxxxxxxxxxxxxxx
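P.S. For anyone who wants to reproduce the wbt trace above, this is roughly how the wbt tracepoints can be enabled through tracefs (paths assume tracefs is mounted at /sys/kernel/tracing; on older setups it lives under /sys/kernel/debug/tracing):

# echo 1 > /sys/kernel/tracing/events/wbt/wbt_lat/enable
# echo 1 > /sys/kernel/tracing/events/wbt/wbt_timer/enable
# cat /sys/kernel/tracing/trace_pipe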