Hi,

On Tue, Apr 1, 2025 at 10:07 PM Rik Theys <rik.theys@xxxxxxxxx> wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 9:31 PM Rik Theys <rik.theys@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@xxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > > Hi,
> > > >
> > > > Our fileserver is currently running 6.12.13 with the following 3
> > > > patches (from nfsd-testing) applied to it:
> > > >
> > > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > > - skip-sending-CB_RECALL_ANY
> > > > - fix-cb_getattr_status-fix
> > > >
> > > > Frequently the load on the system goes up and top shows a lot of
> > > > kswapd and kcompactd threads next to nfsd threads. During these
> > > > periods (which can last for hours), users complain about very slow
> > > > NFS access. We have approx 260 systems connecting to this server
> > > > and the number of nfs client states (from the states files in the
> > > > clients directory) is around 200000.
> > >
> > > Are any of these clients connecting to a server from the same host?
> > > Only reason I ask is I fixed a recursion deadlock that manifested in
> > > testing when memory was very low and LOCALIO used to loopback mount on
> > > the same host. See:
> > >
> > > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> > > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> > >
> > > (I suspect you aren't using NFS loopback mounts at all otherwise your
> > > report would indicate breadcrumbs like I mentioned in my commit,
> > > e.g. "task kcompactd0:58 blocked for more than 4435 seconds").
> >
> > Normally the server does not NFS mount itself. We also don't have any
> > "blocked task" messages reported in dmesg.
> >
> > > > When I look at our monitoring logs, the system has frequent direct
> > > > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > > > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > > > is about 50. So it seems the system is out of memory and is
> > > > constantly trying to free pages? If I understand it correctly, the
> > > > system hits a threshold which makes it scan for pages to free, frees
> > > > some pages, and when it stops it very quickly hits the low watermark
> > > > again?
> > > >
> > > > But the system has over 150G of memory dedicated to cache, and
> > > > slab_reclaim is only about 16G. Why is the system not dropping more
> > > > caches to free memory instead of constantly looking to free memory?
> > > > Is there a tunable that we can set so the system will prefer to drop
> > > > caches and increase memory usage for other nfsd related things? Any
> > > > tips on how to debug where the memory pressure is coming from, or
> > > > why the system decides to keep the pages used for cache instead of
> > > > freeing some of those?

Could this be related to
https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-6.12.y&id=e21ce310556ec40b5b2987e02d12ca7109a33a61

  mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT

  commit 182db972c9568dc530b2f586a2f82dfd039d9f2a upstream.

This is fixed in a later 6.12.x kernel, but we're still running 6.12.13
currently.
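
In the meantime I'm keeping an eye on the reclaim counters with a trivial
loop along these lines (nothing sophisticated; the field names below are
simply what /proc/vmstat exposes on our 6.12 kernel, so adjust as needed):

  # sample the reclaim/stall counters every 10s so we have
  # before/after deltas from the next episode
  while :; do
      date
      grep -E -e '^pgscan_(kswapd|direct) ' -e '^pgsteal_(kswapd|direct) ' \
              -e '^allocstall_' -e '^kswapd_low_wmark_hit_quickly ' /proc/vmstat
      sleep 10
  done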
Regards,
Rik

> > The issue is currently not happening, but I've looked at some of our
> > sar statistics from today:
> >
> > # sar -B
> > 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s    pgscank/s  pgscand/s  pgsteal/s  %vmeff
> > 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24  10817339.49       0.00   10623.60    0.10
> > 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28   8464619.82       0.00    7756.98    0.09
> > 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36   9689653.88       0.00    8123.21    0.08
> > 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27   9268350.05      11.73    5339.54    0.06
> > 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10   9797882.50       0.00    7892.46    0.08
> > 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92   6885422.57       0.00    7541.08    0.11
> >
> > # sar -r
> > 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers   kbcached  kbcommit  %commit  kbactive    kbinact  kbdirty
> > 04:00:00 PM   3942896 180501232   2652336      1.35  29594476  138477148   3949924     1.50  48038428  120797592    13324
> > 04:10:03 PM   4062416 180601484   2564852      1.31  29574180  138589324   3974652     1.51  47664880  121277920   157472
> > 04:20:05 PM   4131172 180150888   3013128      1.54  29669384  138076684   3969232     1.51  47325688  121184212     4448
> > 04:30:03 PM   4112388 180835756   2344936      1.20  30338956  138145972   3883420     1.48  49014976  120205032     5072
> > 04:40:05 PM   3892332 179390408   3428992      1.75  30559972  137103196   3852380     1.46  48939020  119461684   306336
> > 04:50:02 PM   4328220 180002072   3197120      1.63  30873116  136567640   3891224     1.48  49335740  118841092     3412
> >
> > # sar -W
> > 04:00:00 PM  pswpin/s pswpout/s
> > 04:00:00 PM      0.09      0.29
> > 04:10:03 PM      0.33      0.60
> > 04:20:05 PM      0.20      0.38
> > 04:30:03 PM      0.69      0.33
> > 04:40:05 PM      0.36      0.72
> > 04:50:02 PM      0.30      0.46
> >
> > If I read this correctly, the system is scanning for free pages
> > (pgscank) and freeing some of them (pgfree), but the efficiency is
> > low (%vmeff).
> > At the same time, the amount of memory used (kbmemused / %memused) is
> > quite low as most of the memory is used as cache. There's approx 120G
> > of inactive memory.
> > So I'm at a loss as to why the system is performing these page scans
> > and stalling instead of dropping some of the cache and using that
> > instead.
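
(Side note while quoting this back: if I read the sysstat docs correctly,
%vmeff is simply pgsteal divided by the total pages scanned
(pgscank + pgscand), so for the first sample above that is roughly

  10623.60 / (10817339.49 + 0.00) * 100 ~= 0.10%

i.e. kswapd is scanning on the order of a thousand pages for every page it
actually manages to reclaim.)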
> > > All good questions, to which I don't have immediate answers (but
> > > others may).
> > >
> > > Just FYI: there is a slow-start development TODO to leverage 6.14's
> > > "DONTCACHE" support (particularly in nfsd, but client might benefit
> > > some too) to avoid nfsd writeback stalls due to memory being
> > > fragmented and reclaim having to work too hard (in concert with
> > > kcompactd) to find adequate pages.
> > >
> > > > I've run a perf record for 10s and the top 4 events seem to be:
> > > >
> > > > 1. 54% is swapper in intel_idle_ibrs
> > > > 2. 12% is swapper in intel_idle
> > > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> > >
> > > 10s is pretty short... might consider a longer sample and then use the
> > > perf.data to generate a flamegraph, e.g.:
> > >
> > > - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
> > >   you will likely need to install some missing deps, e.g.:
> > >   yum install perl-open.noarch
> > > - export FLAME=/root/git/FlameGraph
> > > - perf record -F 99 -a -g sleep 120
> > > - this will generate a perf.data output file.
> > >
> > > Once you have perf.data output, generate a flamegraph file (named
> > > perf.svg) using these 2 commands:
> > > perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > > $FLAME/flamegraph.pl out.perf-folded > perf.svg
> > >
> > > Open the perf.svg image with your favorite image viewer (a web browser
> > > works well).
> > >
> > > I just find flamegraph way more useful than 'perf report' ranked
> > > ordering.
> >
> > That's a very good idea, thanks. I will try that when the issue returns.
>
> The kswapd process started to consume some cpu again, so I've followed
> this procedure. See the attached file.
>
> Does this show some sort of locking contention?
>
> Regards,
> Rik
>
> > > > Are there any known memory management changes related to NFS that
> > > > have been introduced that could explain this behavior? What steps
> > > > can I take to debug the root cause of this? Looking at iftop there
> > > > isn't much going on regarding throughput. The top 3 NFS4 server
> > > > operations are sequence (9563/s), putfh (9032/s) and getattr
> > > > (7150/s).
> > >
> > > You'd likely do well to expand the audience to include MM too (now cc'd).
> >
> > Thanks. All ideas on how I can determine the root cause of this are
> > appreciated.
> >
> > Regards,
> > Rik
> >
> > --
> > Rik

--
Rik
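
P.S. On the lock contention question above: next time the load spikes I
will also try to grab lock contention data directly, something along the
lines of

  perf lock contention -a -b -- sleep 30

(assuming the perf build on this box is recent enough for the BPF-based
contention mode), to see which locks the nfsd threads are actually waiting
on rather than inferring it from the native_queued_spin_lock_slowpath
samples.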