Processes occasionally get stuck in uninterruptible sleep (D state) on i3, i3en, and im4gn AWS EC2 instances under heavy IO load, forcing a reboot to recover. This has been observed on at least the 5.15.0-1083-aws and 6.8.0-1028-aws kernels. The local NVMe drives that come with the instances are assembled into a software RAID0 array with mdadm. AWS support says there is no issue with the underlying infrastructure, so I suspect there is a problem in the block layer of the kernel losing track of an IO completion somehow and wedging the system. There was a recent CVE that sounded like it might be the problem, but it appears not to be, as the impacted instances' kernel versions are listed as "fixed" (https://ubuntu.com/security/CVE-2024-50082). If there is any more information I can provide beyond what is included below, please let me know.

The call traces of the stuck processes that contain blk_* calls always include wbt_wait and look like this:

[163867.923558] __schedule+0x2cd/0x890
[163867.923559] ? blk_flush_plug_list+0xe3/0x110
[163867.923562] schedule+0x69/0x110
[163867.923563] io_schedule+0x16/0x40
[163867.923565] ? wbt_cleanup_cb+0x20/0x20
[163867.923568] rq_qos_wait+0xd0/0x170
[163867.923570] ? __wbt_done+0x40/0x40
[163867.923571] ? sysv68_partition+0x280/0x280
[163867.923572] ? wbt_cleanup_cb+0x20/0x20
[163867.923574] wbt_wait+0x96/0xc0
[163867.923576] __rq_qos_throttle+0x28/0x40
[163867.923577] blk_mq_submit_bio+0xfb/0x600

Or this:

[111397.863857] __schedule+0x27c/0x680
[111397.863862] ? __pfx_wbt_inflight_cb+0x10/0x10
[111397.863864] schedule+0x2c/0xf0
[111397.863867] io_schedule+0x46/0x80
[111397.863869] rq_qos_wait+0xc1/0x160
[111397.863872] ? __pfx_wbt_cleanup_cb+0x10/0x10
[111397.863873] ? __pfx_rq_qos_wake_function+0x10/0x10
[111397.863876] ? __pfx_wbt_inflight_cb+0x10/0x10
[111397.863880] wbt_wait+0xb3/0x100
[111397.863883] __rq_qos_throttle+0x28/0x40
[111397.863885] blk_mq_submit_bio+0x151/0x740

The nvme smart-log and error-log show nothing interesting on any of the drives, and the mdadm status all looks fine. Nothing interesting in dmesg other than the traces from the D state processes. None of the CPUs are doing work, but the load average is high because of CPU IO wait. Additionally, iostat shows no activity. In each case there are a handful of IOs that are in flight and never return.
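Since wbt_wait shows up in every one of these traces, here is a quick sketch of the non-destructive checks that expose the writeback throttling state on an affected device (nvme2n1 is just one of the devices above used as an example; none of this output is included here):

# cat /sys/block/nvme2n1/queue/wbt_lat_usec    <- wbt latency target in usec (0 would mean wbt is disabled)
# cat /sys/block/nvme2n1/queue/scheduler       <- active IO scheduler on the queue
# echo w > /proc/sysrq-trigger                 <- dump all blocked (D state) task stacks to dmesg (requires sysrq to be enabled)

The inflight counts and diskstats snapshots below show the handful of IOs that never complete.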
# cat /sys/block/nvme3n1/inflight
       2       10

Snapshot of /proc/diskstats taken twice to show the stuck IOs in column 12 (IOs currently in progress):

# cat /proc/diskstats   (non-nvme/md volumes redacted)
259       0 nvme1n1 105729 10821 4012670 22218 17309 13374 623448 1836 5 38184 24054 0 0 0 0 0 0
259       2 nvme2n1 104012 12803 4017102 163554751 17263 13207 610776 163522410 4 163004500 327077161 0 0 0 0 0 0
259       1 nvme3n1 102827 7381 3877790 20710 16198 13096 577168 2373 12 37892 23084 0 0 0 0 0 0
259       3 nvme0n1 135418 55608 6630967 31629 33645 23603 834048 3958 0 38220 35587 0 0 0 0 0 0
  9     127 md127 534397 0 18536329 461386 147709 0 2645600 957536 22 47436 1418923 0 0 0 0 0 0

# cat /proc/diskstats
259       0 nvme1n1 105729 10821 4012670 22218 17309 13374 623448 1836 5 38184 24054 0 0 0 0 0 0
259       2 nvme2n1 104012 12803 4017102 163554751 17263 13207 610776 163522410 4 163004500 327077161 0 0 0 0 0 0
259       1 nvme3n1 102828 7381 3877790 20710 16198 13096 577168 2373 12 37896 23084 0 0 0 0 0 0
259       3 nvme0n1 135428 55608 6631044 31630 33645 23603 834048 3958 0 38232 35588 0 0 0 0 0 0
  9     127 md127 534406 0 18536398 461390 147709 0 2645600 957536 22 47444 1418927 0 0 0 0 0 0

/proc/meminfo shows roughly 5GB of dirty pages that are stuck and roughly 270MB stuck in writeback. The Writeback number does not move over time, whereas the Dirty number fluctuates slightly (probably because of activity on the root volume).

# cat /proc/meminfo
MemTotal:       251504596 kB
MemFree:        224189644 kB
MemAvailable:   242048200 kB
Buffers:          1627368 kB
Cached:          16798708 kB
SwapCached:             0 kB
Active:           4911976 kB
Inactive:        19563000 kB
Active(anon):        2872 kB
Inactive(anon):   6055720 kB
Active(file):     4909104 kB
Inactive(file):  13507280 kB
Unevictable:        29328 kB
Mlocked:            19952 kB
SwapTotal:              0 kB
SwapFree:               0 kB
Dirty:            5008372 kB
Writeback:         277224 kB
AnonPages:        6078340 kB
Mapped:            741700 kB
Shmem:               1192 kB
KReclaimable:     1640636 kB
Slab:             1964256 kB
SReclaimable:     1640636 kB
SUnreclaim:        323620 kB
KernelStack:        30768 kB
PageTables:         63536 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    125752296 kB
Committed_AS:     8918348 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        74008 kB
VmallocChunk:           0 kB
Percpu:            152064 kB
HardwareCorrupted:      0 kB
AnonHugePages:          0 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:          0 kB
FilePmdMapped:          0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:       415744 kB
DirectMap2M:     20555776 kB
DirectMap1G:    236978176 kB

I also traced wbt-timer, which resulted in the following being repeated with the latency number ever increasing:

<redacted> [005] ..s.. 166350.485548: wbt_lat: 259:1: latency 3144231651us
<redacted> [005] ..s.. 166350.485549: wbt_timer: 259:1: status=4, step=4, inflight=10
<idle>-0   [016] ..s.. 166350.493550: wbt_lat: 259:0: latency 166251680941us
<idle>-0   [016] ..s.. 166350.493551: wbt_timer: 259:0: status=4, step=4, inflight=3
<idle>-0   [021] ..s.. 166350.529555: wbt_lat: 259:2: latency 166255836643us
<idle>-0   [021] ..s.. 166350.529557: wbt_timer: 259:2: status=4, step=4, inflight=0

There is no block activity on the nvme drives or the md device.

Appreciate any help,

Michael Marod
michael@xxxxxxxxxxxxxxxx
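P.S. For anyone who wants to reproduce the wbt trace above, this is roughly how the wbt tracepoints can be enabled through tracefs (paths assume tracefs is mounted at /sys/kernel/tracing; on older setups it lives under /sys/kernel/debug/tracing):

# echo 1 > /sys/kernel/tracing/events/wbt/wbt_lat/enable
# echo 1 > /sys/kernel/tracing/events/wbt/wbt_timer/enable
# cat /sys/kernel/tracing/trace_pipe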