Max Kellermann <max.kellermann@xxxxxxxxx> wrote:

> your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about
> the waitqueue used") has given me serious headaches; it has caused
> outages in our web hosting clusters (yet again - all Linux versions
> since 6.9 had serious netfs regressions). Your patch was backported to
> 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore
> the bugs it has caused will be "available" to all Linux stable users.
>
> The problem we had is that writing to certain files never finishes. It
> looks like it has to do with the cachefiles subrequest never reporting
> completion. (We use Ceph with cachefiles)
>
> I have tried applying the fixes in this pull request, which sounded
> promising, but the problem is still there. The only thing that helps
> is reverting 2b1424cd131c completely - everything is fine with 6.15.5
> plus the revert.
>
> What do you need from me in order to analyze the bug?

As a start, can you turn on:

	echo 65536 >/sys/kernel/debug/tracing/buffer_size_kb
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_failure/enable

If you keep an eye on /proc/fs/netfs/requests you should be able to see any
tasks in there that get stuck.  If one gets stuck, then:

	echo 0 > /sys/kernel/debug/tracing/events/enable

to stop further tracing.

Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of
the stuck request.  If you can try grepping the trace log for that:

	grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace

that should hopefully let me see how things progressed on that call.

David
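
For convenience, the tracing steps above could be wrapped in a few shell functions, roughly as sketched below. This is only an illustration and not something from the thread: the function names and the TRACING variable are made up here, and the paths and event names are just the ones quoted in the commands above. It assumes debugfs is mounted at /sys/kernel/debug and that the functions are run as root.

```shell
#!/bin/sh
# Sketch only: wraps the manual steps from the mail. TRACING and the
# function names are hypothetical helpers, not part of any kernel tooling.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# Enlarge the trace buffer and enable the four netfs trace events.
netfs_trace_start() {
    echo 65536 > "$TRACING/buffer_size_kb" || return 1
    for ev in netfs_read netfs_rreq netfs_sreq netfs_failure; do
        echo 1 > "$TRACING/events/netfs/$ev/enable" || return 1
    done
}

# Stop further tracing once a request has been seen to get stuck.
netfs_trace_stop() {
    echo 0 > "$TRACING/events/enable"
}

# Print the trace lines for one request, given the 8-digit hex debug ID
# read from /proc/fs/netfs/requests.
netfs_trace_request() {
    grep "R=$1" "$TRACING/trace"
}
```

After sourcing the file, one would call netfs_trace_start, watch /proc/fs/netfs/requests until a request gets stuck, then call netfs_trace_stop followed by netfs_trace_request with the debug ID of the stuck request.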