On Fri, Jul 11, 2025 at 5:10 PM David Howells <dhowells@xxxxxxxxxx> wrote:
>
> The netfs copy-to-cache that is used by Ceph with local caching sets up a
> new request to write data just read to the cache. The request is started
> and then left to look after itself whilst the app continues. The request
> gets notified by the backing fs upon completion of the async DIO write, but
> then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't
> set - but the app isn't waiting there, and so the request just hangs.
>
> Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the
> notification from the backing filesystem to put the collection onto a work
> queue instead.

Thanks David, you can add me as Tested-by if you want.

I can't test the other patch for the next two weeks (vacation). When
I'm back, I'll install both fixes on some heavily loaded production
machines - our clusters always shake out the worst in every piece of
code they run!