Re: [PATCH 00/13] netfs, cifs: Fixes to retry-related code

David Howells <dhowells@xxxxxxxxxx> · Wed, 09 Jul 2025 14:01:37 +0100

Max Kellermann <max.kellermann@xxxxxxxxx> wrote:

> your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about
> the waitqueue used") has given me serious headaches; it has caused
> outages in our web hosting clusters (yet again - all Linux versions
> since 6.9 had serious netfs regressions). Your patch was backported to
> 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore
> the bugs it has caused will be "available" to all Linux stable users.
> 
> The problem we had is that writing to certain files never finishes. It
> looks like it has to do with the cachefiles subrequest never reporting
> completion. (We use Ceph with cachefiles)
> 
> I have tried applying the fixes in this pull request, which sounded
> promising, but the problem is still there. The only thing that helps
> is reverting 2b1424cd131c completely - everything is fine with 6.15.5
> plus the revert.
> 
> What do you need from me in order to analyze the bug?

As a start, can you enlarge the trace buffer and turn on these tracepoints:

echo 65536 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_failure/enable
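
If it is easier to script, the settings above could be wrapped up roughly as
follows.  This is only a sketch: the TRACEFS variable and the function name
are made up for illustration, and it has to be run as root.

```shell
#!/bin/sh
# TRACEFS is a hypothetical override; it defaults to the path used above.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

enable_netfs_tracing() {
    # Enlarge the trace buffer so early events are less likely to be lost.
    echo 65536 > "$TRACEFS/buffer_size_kb" || return 1

    # Enable the same four netfs tracepoints as listed above.
    for ev in netfs_read netfs_rreq netfs_sreq netfs_failure; do
        echo 1 > "$TRACEFS/events/netfs/$ev/enable" || return 1
    done
}
```

Call enable_netfs_tracing before reproducing the hang.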

If you keep an eye on /proc/fs/netfs/requests you should be able to see any
requests in there that get stuck.  If one gets stuck, then:

echo 0 > /sys/kernel/debug/tracing/events/enable

to stop further tracing.
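
One possible way to spot a stuck request is to take two snapshots of
/proc/fs/netfs/requests some time apart and print the lines that did not
change.  The sketch below assumes nothing about the column layout, so any
header line will show up as well, and a request that is merely slow may also
be listed.  The function name is hypothetical.

```shell
#!/bin/sh
# Print the lines of the second snapshot that also appear, unchanged,
# in the first snapshot.
unchanged_requests() {
    # -F: fixed strings, -x: match whole lines, -f: read patterns from file
    grep -Fxf "$1" "$2"
}

# Example use (run as root):
#   cat /proc/fs/netfs/requests > /tmp/reqs.1
#   sleep 30
#   cat /proc/fs/netfs/requests > /tmp/reqs.2
#   unchanged_requests /tmp/reqs.1 /tmp/reqs.2
```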

Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of
the stuck request.  Can you then grep the trace log for that ID:

grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace

that should hopefully let me see how things progressed on that call.
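
If there are several stuck requests, the grep can be wrapped in a small
helper along these lines.  Again this is just a sketch; TRACE_FILE and
request_trace are names invented here, and the only check is that the
argument looks like an 8-digit hex ID.

```shell
#!/bin/sh
# TRACE_FILE is a hypothetical override; it defaults to the trace log above.
TRACE_FILE="${TRACE_FILE:-/sys/kernel/debug/tracing/trace}"

request_trace() {
    id=$1
    # Accept exactly eight hex digits, matching the R=<8-digit-hex-id> form.
    case "$id" in
        [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]) ;;
        *) echo "usage: request_trace <8-digit-hex-id>" >&2; return 2 ;;
    esac
    # Print every trace line that mentions this request.
    grep -F "R=$id" "$TRACE_FILE"
}
```

For example, "request_trace 0000abcd > stuck-0000abcd.txt" would save the
lines for one request to a file that can be attached to a reply (the ID here
is made up).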

David