[Apologies for so many words...]

Hi,

I wanted to get this on all the NFS and NFSD maintainers' radar ASAP.
I realize the timing is not great given how late we are in the 6.16
release cycle (v6.16-rc7). But I feel it prudent to make it clear that
the LOCALIO changes that went upstream during the 6.16 merge window
are unstable under load. So this week we'll need to make a call on how
to handle this for v6.16 final.

And just FYI: I unfortunately don't have time this week to assist with
developing/testing a smaller fix to solve this situation. The window
for extensive testing (by myself and others at Hammerspace) was late
last week. At this point, given we are short on time, reverting is the
sane thing to do.

Also, the 6.16-rc7 release's LOCALIO changes put it on something of an
island relative to the enterprise production kernels I am helping to
maintain (the RHEL10 kernel and Oracle's OCI kernel, which is actually
an Ubuntu kernel, both have NFS LOCALIO that is 6.14 based).

All that said:

The past few weeks I had to assist with an HPC benchmarking effort
that generates heavy load using the "MLPerf" benchmark suite. Testing
was done on 10 enterprise-grade NVMe storage systems (each with 48
CPUs and 8 NVMe devices) that depend on LOCALIO to "just work _well_"
to achieve a favorable score. Unfortunately LOCALIO didn't, so I got
to reverting.

I started with this partial revert patch, but it wasn't enough (it just
made the problem harder to hit). Labeling this previous revert proposal
as "RFC" rather than "URGENT" was a mistake:
https://lore.kernel.org/linux-nfs/aG0pJXVtApZ9C5vy@xxxxxxxxxx/
(which is very similar to patch 2 in this series)

It wasn't until I did a full revert of 6.16's LOCALIO changes that
LOCALIO stopped having resource leaks (nfsd_file in particular). Those
leaks prevented proper NFSD shutdown and made it impossible to unload
nfsd.ko (which I had to do a lot while developing other NFS and NFSD
changes that were unrelated to LOCALIO).

Neil, I value the work you did to try to address the lingering
complaints about RCU related compiler errors in LOCALIO (but when you
posted your changes months ago I didn't have time to review, and then
they went upstream; so I assumed they were ready and made sure to
include them in Hammerspace's more recent kernels so that I could gain
"production" confidence in the changes even though I still hadn't had
time to review them properly.. ugh).

Glad "we" did this heavy load testing, because otherwise we'd be
oblivious to the LOCALIO changes merged for 6.16 causing a regression.

(I'm sending this later on my Sunday evening in the hopes that you
being in Australia enables us to not lose a day of communication on
this situation).

Patch 2 gets into how simple it is to trigger the nfsd_file leaks
resulting from running fio followed by NFSD shutdown and nfsd.ko
module removal.

Regards,
Mike

Mike Snitzer (9):
  Revert "NFSD: Clean up kdoc for nfsd_open_local_fh()"
  Revert "nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointer"
  Revert "nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()"
  Revert "nfs_localio: duplicate nfs_close_local_fh()"
  Revert "nfs_localio: simplify interface to nfsd for getting nfsd_file"
  Revert "nfs_localio: always hold nfsd net ref with nfsd_file ref"
  Revert "nfs_localio: use cmpxchg() to install new nfs_file_localio"
  nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
  nfs/localio: add localio_async_probe modparm

 fs/nfs/localio.c           | 64 ++++++++++++++++--------
 fs/nfs_common/nfslocalio.c | 99 +++++++++++++-------------------------
 fs/nfsd/filecache.c        | 34 ++-----------
 fs/nfsd/filecache.h        |  3 +-
 fs/nfsd/localio.c          | 44 ++---------------
 include/linux/nfslocalio.h | 26 +++++-----
 6 files changed, 100 insertions(+), 170 deletions(-)

-- 
2.44.0