Hi,

On Tue, Apr 1, 2025 at 10:07 PM Rik Theys <rik.theys@xxxxxxxxx> wrote:
>
> Hi,
>
> On Tue, Apr 1, 2025 at 9:31 PM Rik Theys <rik.theys@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 1, 2025 at 7:40 PM Mike Snitzer <snitzer@xxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 31, 2025 at 09:05:54PM +0200, Rik Theys wrote:
> > > > Hi,
> > > >
> > > > Our fileserver is currently running 6.12.13 with the following 3
> > > > patches (from nfsd-testing) applied to it:
> > > >
> > > > - fix-decoding-in-nfs4_xdr_dec_cb_getattr
> > > > - skip-sending-CB_RECALL_ANY
> > > > - fix-cb_getattr_status-fix
> > > >
> > > > Frequently the load on the system goes up and top shows a lot of
> > > > kswapd and kcompactd threads next to nfsd threads. During these
> > > > periods (which can last for hours), users complain about very slow
> > > > NFS access. We have approx 260 systems connecting to this server
> > > > and the number of nfs client states (from the states files in the
> > > > clients directory) is around 200000.
> > >
> > > Are any of these clients connecting to a server from the same host?
> > > Only reason I ask is I fixed a recursion deadlock that manifested in
> > > testing when memory was very low and LOCALIO used to loopback mount on
> > > the same host. See:
> > >
> > > ce6d9c1c2b5cc785 ("NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback")
> > > https://git.kernel.org/linus/ce6d9c1c2b5cc785
> > >
> > > (I suspect you aren't using NFS loopback mounts at all otherwise your
> > > report would indicate breadcrumbs like I mentioned in my commit,
> > > e.g. "task kcompactd0:58 blocked for more than 4435 seconds").
> >
> > Normally the server does not NFS mount itself. We also don't have any
> > "blocked task" messages reported in dmesg.
> >
> > > > When I look at our monitoring logs, the system has frequent direct
> > > > reclaim stalls (allocstall_movable, and some allocstall_normal) and
> > > > pgscan_kswapd goes up to ~10000000. The kswapd_low_wmark_hit_quickly
> > > > is about 50. So it seems the system is out of memory and is
> > > > constantly trying to free pages? If I understand it correctly, the
> > > > system hits a threshold which makes it scan for pages to free, frees
> > > > some pages, and when it stops it very quickly hits the low watermark
> > > > again?
> > > >
> > > > But the system has over 150G of memory dedicated to cache, and
> > > > slab_reclaim is only about 16G. Why is the system not dropping more
> > > > caches to free memory instead of constantly looking to free memory?
> > > > Is there a tunable that we can set so the system will prefer to drop
> > > > caches and increase memory usage for other nfsd related things? Any
> > > > tips on how to debug where the memory pressure is coming from, or
> > > > why the system decides to keep the pages used for cache instead of
> > > > freeing some of those?

Could this be related to
https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-6.12.y&id=e21ce310556ec40b5b2987e02d12ca7109a33a61

  mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT

  commit 182db972c9568dc530b2f586a2f82dfd039d9f2a upstream.

This is fixed in a later 6.12.x kernel, but we're still running 6.12.13
currently.
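
In the meantime I'm keeping an eye on the reclaim counters with a trivial
loop along these lines (nothing sophisticated; the field names below are
simply what /proc/vmstat exposes on our 6.12 kernel, so adjust as needed):

  # sample the reclaim/stall counters every 10s so we have
  # before/after deltas from the next episode
  while :; do
      date
      grep -E -e '^pgscan_(kswapd|direct) ' -e '^pgsteal_(kswapd|direct) ' \
              -e '^allocstall_' -e '^kswapd_low_wmark_hit_quickly ' /proc/vmstat
      sleep 10
  done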
Regards,
Rik

> > The issue is currently not happening, but I've looked at some of our
> > sar statistics from today:
> >
> > # sar -B
> > 04:00:00 PM  pgpgin/s pgpgout/s   fault/s  majflt/s   pgfree/s    pgscank/s  pgscand/s  pgsteal/s  %vmeff
> > 04:00:00 PM   6570.43  37504.61   1937.60      0.20  337274.24  10817339.49       0.00   10623.60    0.10
> > 04:10:03 PM   6266.09  28821.33   4392.91      0.65  266336.28   8464619.82       0.00    7756.98    0.09
> > 04:20:05 PM   6894.44  33790.76  12713.86      1.86  271167.36   9689653.88       0.00    8123.21    0.08
> > 04:30:03 PM   6839.52  24451.70   1693.22      0.76  237536.27   9268350.05      11.73    5339.54    0.06
> > 04:40:05 PM   6197.73  28958.02   4260.95      0.33  306245.10   9797882.50       0.00    7892.46    0.08
> > 04:50:02 PM   4252.11  31658.28   1849.64      0.58  297727.92   6885422.57       0.00    7541.08    0.11
> >
> > # sar -r
> > 04:00:00 PM kbmemfree   kbavail kbmemused  %memused kbbuffers   kbcached  kbcommit  %commit  kbactive    kbinact  kbdirty
> > 04:00:00 PM   3942896 180501232   2652336      1.35  29594476  138477148   3949924     1.50  48038428  120797592    13324
> > 04:10:03 PM   4062416 180601484   2564852      1.31  29574180  138589324   3974652     1.51  47664880  121277920   157472
> > 04:20:05 PM   4131172 180150888   3013128      1.54  29669384  138076684   3969232     1.51  47325688  121184212     4448
> > 04:30:03 PM   4112388 180835756   2344936      1.20  30338956  138145972   3883420     1.48  49014976  120205032     5072
> > 04:40:05 PM   3892332 179390408   3428992      1.75  30559972  137103196   3852380     1.46  48939020  119461684   306336
> > 04:50:02 PM   4328220 180002072   3197120      1.63  30873116  136567640   3891224     1.48  49335740  118841092     3412
> >
> > # sar -W
> > 04:00:00 PM  pswpin/s pswpout/s
> > 04:00:00 PM      0.09      0.29
> > 04:10:03 PM      0.33      0.60
> > 04:20:05 PM      0.20      0.38
> > 04:30:03 PM      0.69      0.33
> > 04:40:05 PM      0.36      0.72
> > 04:50:02 PM      0.30      0.46
> >
> > If I read this correctly, the system is scanning for free pages
> > (pgscank) and freeing some of them (pgfree), but the efficiency is
> > low (%vmeff).
> > At the same time, the amount of memory used (kbmemused / %memused) is
> > quite low as most of the memory is used as cache. There's approx 120G
> > of inactive memory.
> > So I'm at a loss as to why the system is performing these page scans
> > and stalling instead of dropping some of the cache and using that
> > instead.
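
(Side note while quoting this back: if I read the sysstat docs correctly,
%vmeff is simply pgsteal divided by the total pages scanned
(pgscank + pgscand), so for the first sample above that is roughly

  10623.60 / (10817339.49 + 0.00) * 100 ~= 0.10%

i.e. kswapd is scanning on the order of a thousand pages for every page it
actually manages to reclaim.)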
> > > All good questions, to which I don't have immediate answers (but
> > > others may).
> > >
> > > Just FYI: there is a slow-start development TODO to leverage 6.14's
> > > "DONTCACHE" support (particularly in nfsd, but client might benefit
> > > some too) to avoid nfsd writeback stalls due to memory being
> > > fragmented and reclaim having to work too hard (in concert with
> > > kcompactd) to find adequate pages.
> > >
> > > > I've run a perf record for 10s and the top 4 events seem to be:
> > > >
> > > > 1. 54% is swapper in intel_idle_ibrs
> > > > 2. 12% is swapper in intel_idle
> > > > 3. 7.43% is nfsd in native_queued_spin_lock_slowpath
> > > > 4. 5% is kswapd0 in __list_del_entry_valid_or_report
> > >
> > > 10s is pretty short... might consider a longer sample and then use the
> > > perf.data to generate a flamegraph, e.g.:
> > >
> > > - Download Flamegraph project: git clone https://github.com/brendangregg/FlameGraph
> > >   you will likely need to install some missing deps, e.g.:
> > >   yum install perl-open.noarch
> > > - export FLAME=/root/git/FlameGraph
> > > - perf record -F 99 -a -g sleep 120
> > > - this will generate a perf.data output file.
> > >
> > > Once you have perf.data output, generate a flamegraph file (named
> > > perf.svg) using these 2 commands:
> > > perf script | $FLAME/stackcollapse-perf.pl > out.perf-folded
> > > $FLAME/flamegraph.pl out.perf-folded > perf.svg
> > >
> > > Open the perf.svg image with your favorite image viewer (a web browser
> > > works well).
> > >
> > > I just find flamegraph way more useful than 'perf report' ranked
> > > ordering.
> >
> > That's a very good idea, thanks. I will try that when the issue returns.
>
> The kswapd process started to consume some cpu again, so I've followed
> this procedure. See the attached file.
>
> Does this show some sort of locking contention?
>
> Regards,
> Rik
>
> > > > Are there any known memory management changes related to NFS that
> > > > have been introduced that could explain this behavior? What steps
> > > > can I take to debug the root cause of this? Looking at iftop there
> > > > isn't much going on regarding throughput. The top 3 NFS4 server
> > > > operations are sequence (9563/s), putfh (9032/s) and getattr
> > > > (7150/s).
> > >
> > > You'd likely do well to expand the audience to include MM too (now cc'd).
> >
> > Thanks. All ideas on how I can determine the root cause of this are
> > appreciated.
> >
> > Regards,
> > Rik
> >
> > --
> > Rik

--
Rik
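
P.S. On the lock contention question above: next time the load spikes I
will also try to grab lock contention data directly, something along the
lines of

  perf lock contention -a -b -- sleep 30

(assuming the perf build on this box is recent enough for the BPF-based
contention mode), to see which locks the nfsd threads are actually waiting
on rather than inferring it from the native_queued_spin_lock_slowpath
samples.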