Re: Memory reclaim and high nfsd usage

Rik Theys <rik.theys@xxxxxxxxx> · Tue, 1 Apr 2025 22:07:30 +0200

Hi,

On Tue, Apr 1, 2025 at 9:31 PM Rik Theys <rik.theys@xxxxxxxxx> wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> >
> > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > Hi,
> > >
> > > Our fileserver is currently running 6.12.13 with the following 3
> > > patches (from nfsd-testing) applied to it:
> > >
> > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > - skip-sending-CB_RECALL_ANY
> > > - fix-cb_getattr_status-fix
> > >
> > > Frequently the load on the system goes up and top shows a lot of
> > > kswapd and kcompact threads next to nfsd threads. During these period
> > > (which can last for hours), users complain about very slow NFS access.
> > > We have approx 260 systems connecting to this server and the number of
> > > nfs client states (from the states files in the clients directory) are
> > > around 200000.
> >
> > Are any of these clients connecting to a server from the same host?
> > Only reason I ask is I fixed a recursion deadlock that manifested in
> > testing when memory was very low and LOCALIO used to loopback mount on
> > the same host.  See:
> >
> > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> >
> > (I suspect you aren't using NFS loopback mounts at all otherwise your
> > report would indicate breadcrumbs like I mentioned in my commit,
> > e.g. "task kcompactd0:58 blocked for more than 4435 seconds").
>
> Normally the server does not NFS mount itself. We also don't have any
> "blocked task" messages reported in dmesg.
>
> >
> > > When I look at our monitoring logs, the system has frequent direct
> > > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > > is about 50. So it seems the system is out of memory and is constantly
> > > trying to free pages? If I understand it correctly the system hits a
> > > threshold which makes it scan for pages to free, frees some pages and
> > > when it stops it very quickly hits the low watermark again?
> > >
> > > But the system has over 150G of memory dedicated to cache, and
> > > slab_reclaim is only about 16G. Why is the system not dropping more
> > > caches to free memory instead of constantly looking to free memory? Is
> > > there a tunable that we can set so the system will prefer to drop
> > > caches and increase memory usage for other nfsd related things? Any
> > > tips on how to debug where the memory pressure is coming from, or why
> > > the system decides to keep the pages used for cache instead of freeing
> > > some of those?
>
> The issue is currently not happening, but I've looked at some of our
> sar statistics from today:
>
> # sar -B
> 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s
> pgscank/s pgscand/s pgsteal/s    %vmeff
> 04:00:00 PM   6570.43  37504.61   1937.60      0.20 337274.24
> 10817339.49      0.00  10623.60      0.10
> 04:10:03 PM   6266.09  28821.33   4392.91      0.65 266336.28
> 8464619.82      0.00   7756.98      0.09
> 04:20:05 PM   6894.44  33790.76  12713.86      1.86 271167.36
> 9689653.88      0.00   8123.21      0.08
> 04:30:03 PM   6839.52  24451.70   1693.22      0.76 237536.27
> 9268350.05     11.73   5339.54      0.06
> 04:40:05 PM   6197.73  28958.02   4260.95      0.33 306245.10
> 9797882.50      0.00   7892.46      0.08
> 04:50:02 PM   4252.11  31658.28   1849.64      0.58 297727.92
> 6885422.57      0.00   7541.08      0.11
>
> # sar -r
> 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers
> kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
> 04:00:00 PM   3942896 180501232   2652336      1.35  29594476
> 138477148   3949924      1.50  48038428 120797592     13324
> 04:10:03 PM   4062416 180601484   2564852      1.31  29574180
> 138589324   3974652      1.51  47664880 121277920    157472
> 04:20:05 PM   4131172 180150888   3013128      1.54  29669384
> 138076684   3969232      1.51  47325688 121184212      4448
> 04:30:03 PM   4112388 180835756   2344936      1.20  30338956
> 138145972   3883420      1.48  49014976 120205032      5072
> 04:40:05 PM   3892332 179390408   3428992      1.75  30559972
> 137103196   3852380      1.46  48939020 119461684    306336
> 04:50:02 PM   4328220 180002072   3197120      1.63  30873116
> 136567640   3891224      1.48  49335740 118841092      3412
>
> # sar -W
> 04:00:00 PM  pswpin/s pswpout/s
> 04:00:00 PM      0.09      0.29
> 04:10:03 PM      0.33      0.60
> 04:20:05 PM      0.20      0.38
> 04:30:03 PM      0.69      0.33
> 04:40:05 PM      0.36      0.72
> 04:50:02 PM      0.30      0.46
>
> If I read this correctly, the systems is scanning scanning for free
> pages (pgscand) and freeing some of them (pgfree), but the efficiency
> is low (%vmeff).
> At the same time, the amount of memory used (kbmemused / %memused) is
> quite low as most of the memory is used as cache. There's approx 120G
> of inactive memory.
> So I'm at loss as to why the system is performing these page scans and
> stalling instead of dropping some of the cache and using that instead.
>
> >
> > All good questions, to which I don't have immediate answers (but
> > others may).
> >
> > Just FYI: there is a slow-start development TODO to leverage 6.14's
> > "DONTCACHE" support (particularly in nfsd, but client might benefit
> > some too) to avoid nfsd writeback stalls due to memory being
> > fragmented and reclaim having to work too hard (in concert with
> > kcompactd) to find adequate pages.
> >
> > > I've ran a perf record for 10s and the top 4 of the events seem to be:
> > >
> > > 1. 54% is swapper in intel_idle_ibrs
> > > 2. 12% is swapper in intel_idle
> > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath:
> > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> >
> > 10s is pretty short... might consider a longer sample and then use the
> > perf.data to generate a flamegraph, e.g.:
> >
> > - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
> >   you will likely need to install some missing deps, e.g.:
> >   yum install perl-open.noarch
> > - export FLAME=/root/git/FlameGraph
> > - perf record -F 99 -a -g sleep 120
> >   - this will generate a perf.data output file.
> >
> > Once you have perf.data output, generate a flamegraph file (named
> > perf.svg) using these 2 commands:
> > perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > $FLAME/flamegraph.pl out.perf-folded > perf.svg
> >
> > Open the perf.svg image with your favorite image viewer (a web browser
> > works well).
> >
> > I just find flamegraph way more useful than 'perf report' ranked
> > ordering.
>
> That's a very good idea, thanks. I will try that when the issue returns.

The kswapd process started to consume some cpu again, so I've followed
this procedure. See the file in attach.

Does this show some sort of locking contention?

Regards,
Rik

>
> >
> > > Are there any know memory management changes related to NFS that have
> > > been introduced that could explain this behavior? What steps can I
> > > take to debug the root cause of this? Looking at iftop there isn't
> > > much going on regarding throughput. The top 3 NFS4 server operations
> > > are sequence 9563/s), putfh(9032/s) and getattr (7150/s).
> >
> > You'd likely do well to expand the audience to include MM too (now cc'd).
>
> Thanks. All ideas on how I can determine the root cause of this is appreciated.
>
>
> Regards,
> Rik

-- 

Rik
Attachment:
perf2.svg

Description: image/svg