Re: Memory reclaim and high nfsd usage

Mike Snitzer <snitzer@xxxxxxxxxx> · Tue, 1 Apr 2025 13:40:10 -0400

On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> Hi,
> 
> Our fileserver is currently running 6.12.13 with the following 3
> patches (from nfsd-testing) applied to it:
> 
> - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> - skip-sending-CB_RECALL_ANY
> - fix-cb_getattr_status-fix
> 
> Frequently the load on the system goes up and top shows a lot of
> kswapd and kcompactd threads next to nfsd threads. During these periods
> (which can last for hours), users complain about very slow NFS access.
> We have approx 260 systems connecting to this server and the number of
> nfs client states (from the states files in the clients directory) is
> around 200000.

Are any of these clients connecting to the server from the same host?
The only reason I ask is that I fixed a recursion deadlock that
manifested in testing when memory was very low and LOCALIO was used to
loopback mount on the same host.  See:

ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
https://git.kernel.org/linus/ce6d9c1c2b5cc785

(I suspect you aren't using NFS loopback mounts at all, otherwise your
report would include breadcrumbs like the ones I mentioned in my commit,
e.g. "task kcompactd0:58 blocked for more than 4435 seconds").

> When I look at our monitoring logs, the system has frequent direct
> reclaim stalls (allocstall_movable, and some allocstall_normal) and
> pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> is about 50. So it seems the system is out of memory and is constantly
> trying to free pages? If I understand it correctly the system hits a
> threshold which makes it scan for pages to free, frees some pages and
> when it stops it very quickly hits the low watermark again?
> 
> But the system has over 150G of memory dedicated to cache, and
> slab_reclaim is only about 16G. Why is the system not dropping more
> caches to free memory instead of constantly looking to free memory? Is
> there a tunable that we can set so the system will prefer to drop
> caches and increase memory usage for other nfsd related things? Any
> tips on how to debug where the memory pressure is coming from, or why
> the system decides to keep the pages used for cache instead of freeing
> some of those?

All good questions, to which I don't have immediate answers (but
others may).
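
In the meantime, one way to see how fast the reclaim counters you
mention are moving is to diff two /proc/vmstat snapshots. This is only
a rough sketch: the function name vmstat_delta and the snapshot file
names are made up for illustration, and pgscan_direct and compact_stall
are counters I added on top of the ones in your report.

```sh
#!/bin/sh
# Sketch: print how much selected /proc/vmstat counters grew between
# two snapshots. vmstat_delta is a hypothetical helper name.
#
# Usage:
#   cat /proc/vmstat > before; sleep 60; cat /proc/vmstat > after
#   vmstat_delta before after
vmstat_delta() {
    awk '
        BEGIN {
            # Counters from the report, plus two related ones (assumption).
            want["allocstall_normal"] = 1
            want["allocstall_movable"] = 1
            want["pgscan_kswapd"] = 1
            want["pgscan_direct"] = 1
            want["kswapd_low_wmark_hit_quickly"] = 1
            want["compact_stall"] = 1
        }
        # First file: remember the starting value of each wanted counter.
        NR == FNR { if ($1 in want) before[$1] = $2; next }
        # Second file: print the increase since the first snapshot.
        ($1 in want) && ($1 in before) { printf "%s %d\n", $1, $2 - before[$1] }
    ' "$1" "$2"
}
```

A steadily growing allocstall_* or pgscan_direct delta during a slow
period would at least confirm that tasks are entering direct reclaim
at that time, rather than kswapd alone doing the work.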

Just FYI: there is a slow-start development TODO to leverage 6.14's
"DONTCACHE" support (particularly in nfsd, but client might benefit
some too) to avoid nfsd writeback stalls due to memory being
fragmented and reclaim having to work too hard (in concert with
kcompactd) to find adequate pages.
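
If you want a quick look at whether fragmentation is a factor,
/proc/buddyinfo lists the number of free blocks per order for each
zone. A minimal sketch follows; buddy_summary is a hypothetical helper
name and the default cutoff of order 4 is an arbitrary choice of mine,
not a kernel threshold.

```sh
#!/bin/sh
# Sketch: summarize /proc/buddyinfo per zone.
# buddy_summary [FILE] [MIN_ORDER]
#   FILE       defaults to /proc/buddyinfo
#   MIN_ORDER  defaults to 4 (arbitrary cutoff for "higher-order" blocks)
buddy_summary() {
    awk -v min="${2:-4}" '
        # Lines look like: Node 0, zone   Normal   c0 c1 c2 ...
        # where cN is the number of free blocks of order N.
        $1 == "Node" && $3 == "zone" {
            total = 0; high = 0
            for (i = 5; i <= NF; i++) {
                order = i - 5
                pages = $i * 2 ^ order      # a block of order N is 2^N pages
                total += pages
                if (order >= min) high += pages
            }
            node = $2; sub(/,$/, "", node)  # strip the trailing comma
            printf "node %s zone %s free_pages %d pages_in_order_ge_%d %d\n",
                   node, $4, total, min, high
        }
    ' "${1:-/proc/buddyinfo}"
}
```

Plenty of free pages overall but very few in the higher orders would
be consistent with the fragmentation picture described above.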

> I've run a perf record for 10s and the top 4 of the events seem to be:
> 
> 1. 54% is swapper in intel_idle_ibrs
> 2. 12% is swapper in intel_idle
> 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> 4. 5% is kswapd0 in __list_del_entry_valid_or_report

10s is pretty short... might consider a longer sample and then use the
perf.data to generate a flamegraph, e.g.:

- Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
  you will likely need to install some missing deps, e.g.:
  yum install perl-open.noarch
- export FLAME=/root/git/FlameGraph
- perf record -F 99 -a -g sleep 120
  - this will generate a perf.data output file.

Once you have perf.data output, generate a flamegraph file (named
perf.svg) using these 2 commands:
perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
$FLAME/flamegraph.pl out.perf-folded > perf.svg

Open the perf.svg image with your favorite image viewer (a web browser
works well).

I just find flamegraph way more useful than 'perf report' ranked
ordering.

> Are there any known memory management changes related to NFS that have
> been introduced that could explain this behavior? What steps can I
> take to debug the root cause of this? Looking at iftop there isn't
> much going on regarding throughput. The top 3 NFS4 server operations
> are sequence (9563/s), putfh (9032/s) and getattr (7150/s).

You'd likely do well to expand the audience to include MM too (now cc'd).