Re: non-stop kworker NFS/RPC write traffic even after unmount

Hi,

On 5/16/25 2:59 PM, Chuck Lever wrote:
On 5/16/25 8:36 AM, Rik Theys wrote:
Hi,

On 5/16/25 2:19 PM, Chuck Lever wrote:
On 5/16/25 7:32 AM, Rik Theys wrote:
Hi,

On 5/16/25 11:47 AM, Rik Theys wrote:
Hi,

On 5/16/25 8:17 AM, Rik Theys wrote:
Hi,

On 5/16/25 7:51 AM, Rik Theys wrote:
Hi,

On 4/18/25 3:31 PM, Daniel Kobras wrote:
Hi Rik!

Am 01.04.25 um 14:15 schrieb Rik Theys:
On 4/1/25 2:05 PM, Daniel Kobras wrote:
Am 15.12.24 um 13:38 schrieb Rik Theys:
Suddenly, a number of clients start to send an abnormal amount
of NFS traffic to the server that saturates their link and never
seems to stop. Running iotop on the clients shows kworker-
{rpciod,nfsiod,xprtiod} processes generating the write traffic.
On the server side, the traffic appears to be genuinely processed, as
the disks are handling the write requests.

This behavior continues even after stopping all user processes
on the clients and unmounting the NFS mount on the client. Is
this normal? I was under the impression that once the NFS mount
is unmounted, no further traffic to the server should be visible.
I'm currently looking at an issue that resembles your description
above (excess traffic to the server for data that was already
written and committed), and part of the packet capture also looks
roughly similar to what you've sent in a followup. Before I dig
any deeper: Did you manage to pinpoint or resolve the problem in
the meantime?
Our server is currently running the 6.12 LTS kernel and we haven't
had this specific issue any more. But we were never able to
reproduce it, so unfortunately I can't say for sure if it's fixed,
or what fixed it :-/.
Thanks for the update! Indeed, in the meantime the affected
environment here stopped showing the reported behavior as well
after a few days, and I don't have a clear indication what might
have been the fix, either.

When the issue still occurred, it could (once) be provoked by
dd'ing 4GB of /dev/zero to a test file on an NFSv4.2 mount. The
network trace shows that the file is completely written at wire
speed. But after a five-second pause, the client then starts
sending the same file again in smaller chunks of a few hundred MB
at five-second intervals. So it appears that the file's pages are
background-flushed to storage again, even though they've already
been written out. On the NFS layer, none of the passes look
conspicuous to me: WRITE and COMMIT operations all get NFS4_OK'ed
by the server.
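(The client side of that test was nothing special; roughly the
following, with the target path being whatever sits on the NFSv4.2
mount:

  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096

and the retransmissions then show up in the network trace a few
seconds later.)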

Which kernel version(s) are your server and clients running?
The systems in the affected environment run Debian-packaged
kernels. The servers are on Debian's 6.1.0-32 which corresponds to
upstream's 6.1.129. The issue was seen on clients running the same
kernel version, but also on older systems running Debian's
5.10.0-33, corresponding to 5.10.226 upstream. I've skimmed the
list of patches that went into either of these kernel versions, but
nothing stood out as clearly related.

Our server and clients are currently showing the same behavior
again: clients are sending abnormal amounts of write traffic to the
NFS server and the server is actually processing it as the writes
end up on the disk (which fills up our replication journals). iotop
shows that the kworker-{rpciod,nfsiod,xprtiod} are responsible for
this traffic. A reboot of the server does not solve the issue, and
rebooting individual clients that participate in it does not help
either: after a few minutes of user traffic they show the same
behavior again. We also see this on multiple clients at the same
time.

The NFS operations that are being sent are mostly putfh, sequence
and getattr.
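(An easy way to see that distribution without a full capture is the
per-operation counters, e.g. something like:

  nfsstat -s -4                  # per-op counts on the server
  nfsstat -c -4                  # per-op counts on a client
  cat /proc/self/mountstats      # per-mount, per-op counters

Exact option spelling may differ between nfs-utils versions.)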

The server is running upstream 6.12.25 and the clients are running
Rocky 8 (4.18.0-553.51.1.el8_10) and 9 (5.14.0-503.38.1.el9_5).

What are some of the steps we can take to debug the root cause of
this? Any idea on how to stop this traffic flood?

I took a tcpdump on one of the clients that was doing this. The pcap
was stored on the local disk of the server. When I tried to copy the
pcap to our management server over scp it now hangs at 95%. The
target disk on the management server is also an NFS mount of the
affected server. The scp had copied 565MB and our management server
has now also started to flood the server with non-stop traffic
(basically saturating its link).
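(Nothing special about the capture itself; something along these lines
is enough, with the interface name and server address as placeholders:

  tcpdump -i eth0 -s 0 -w /tmp/nfs-client.pcap \
      host nfs-server.example.com and port 2049

The port 2049 filter keeps the pcap limited to NFS traffic.)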

The management server is running Debian's 6.1.135 kernel.

It seems that once a client has triggered some bad state in the
server, other clients that write a large file to the server also
start to participate in this behavior. Rebooting the server does not
seem to help as the same state is triggered almost immediately again
by some client.

Now that the server is in this state, I can very easily reproduce this
on a client. I've installed the 6.14.6 kernel on a Rocky 9 client.

1. On a different machine, create a 3 MB file of zeros using "dd
if=/dev/zero of=3M bs=3M count=1".

2. Reboot the Rocky 9 client and log in as root. Verify that there are
no active NFS mounts to the server. Start dstat and watch the output.

3. From the machine where you created the 3M file, scp the 3M file to
the Rocky 9 client, to a location that is an NFS mount from the
server. In this case it's my home directory, which is automounted.
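In other words, the whole reproduction boils down to the following,
where the hostname and target path are placeholders for my test client
and its automounted home directory:

  dd if=/dev/zero of=3M bs=3M count=1
  scp 3M rocky9-client.example.com:/home/user/

and then watching dstat on the client.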
I've reproduced the issue with rpcdebug enabled for the rpc and nfs
modules (see attachment).
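The debug flags were toggled with something like this; the resulting
messages end up in the kernel log:

  rpcdebug -m rpc -s all    # enable all RPC debug flags
  rpcdebug -m nfs -s all    # enable all NFS client debug flags
  ... run the scp ...
  rpcdebug -m rpc -c all    # clear the flags again
  rpcdebug -m nfs -c all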
The file copies normally, but when you look at the amount of data
transferred from the client to the server, it seems to be more than
the 3M file size.
The client seems to write the file twice during the initial copy. The
first attempt starts on line 13623 and results in a lot of commit
mismatch error messages.

The second attempt starts on line 13842 and results in the same commit
mismatch errors.

These two attempts happen without any delay. This confirms my previous
observation that the outbound traffic to the server is twice the file
size.

Then there's an NFS release call on the file.

30s later, on line 14106, there's another attempt to write the file. This
again results in the same commit mismatch errors.

This process repeats itself every 30s.

So it seems the server always returns a mismatch? Now, how can I solve
this situation? I tried rebooting the server last night, but the
situation reappears as soon as clients start to perform writes.
Usually the write verifier will mismatch only after a server restart.

However, there are some other rare cases where NFSD will bump the
write verifier. If an error occurs when the server tries to sync
unstable NFS WRITEs to persistent storage, NFSD will change the
write verifier to force the client to send the write payloads again.

A writeback error might be caused by a failing disk or a full file system,
so that's the first place you should look.
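A quick first pass on the server side would be something along these
lines (the export path is only an example):

  # free space and inodes on the exported filesystem
  df -h /export
  df -i /export
  # I/O or writeback errors in the kernel log
  journalctl -k | grep -iE 'error|fail'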


But why the clients don't switch to FILE_SYNC when retrying the
writes is still a mystery. When they do that, the disk errors will
be exposed to the client and application and you can figure out
immediately what is going wrong.

There are no indications of a failing disk on the system (the disks
are FC-attached SAN disks), and the file systems that receive the high
write I/O have sufficient free space available. Or can a "disk full"
error also be caused by an exceeded disk quota? We do use disk quotas.
That seems like something to explore.

It's also strange that this would affect clients writing to the same
NFS filesystem as a user that hasn't exceeded any quota limits, no? Or
does the server interpret a "quota exceeded" error for one user as a
global error for that filesystem?
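If quota turns out to be the trigger, the per-user usage should show
up in the quota reports, e.g. for an export mounted at /export (a
placeholder path here):

  repquota -u /export                    # ext4 + quota tools
  xfs_quota -x -c 'report -h' /export    # XFS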



The problem is that the NFS protocol does not have a mechanism to
expose write errors that occur on the server after it has responded to
an UNSTABLE WRITE with "NFS_OK, we received your data", but before the
COMMIT occurs.

When a COMMIT fails in this way, clients are supposed to change to
FILE_SYNC and try the writes again. A FILE_SYNC WRITE flushes all the
way to disk so any recurring error appears as part of the NFS
operation's status code. The client is supposed to treat this as a
permanent error and stop the loop.

Then there's probably a bug in the client code somewhere, as the
client(s) did not do that...
Based on your last paragraph I conclude this is a client side issue? The
client should switch to FILE_SYNC instead? We do export the NFS share
"async". Does that make a difference?
I don't know, because anyone who uses async is asking for trouble
so we don't test it as an option that should be deployed in a
production environment. All I can say is "don't do that."
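(Concretely, that means using "sync" instead of "async" in
/etc/exports, e.g. something like

  /export  *(rw,sync,no_subtree_check)

followed by "exportfs -ra"; the export path and the other options here
are only an example.)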


So it's a normal operation for the server to change the write verifier?
It's not a protocol compliance issue, if that's what you mean. Clients
are supposed to be prepared for a write verifier change on COMMIT, full
stop. That's why the verifier is there in the protocol.


The clients that showed this behavior ran a lot of different kernel
versions: the RHEL 8/9 kernels, the Debian 12 (6.1 series) and Fedora
42 kernels, and the 6.14.6 kernel on a Rocky 9 userland. So this must
be an issue that has been present in the client code for a very long
time now.

Since approx. 14:00 the issue has disappeared as suddenly as it
started. I can no longer reproduce it now.
Then hitting a capacity or metadata quota might be the proximate cause.

If this is an NFSv4.2 mount, is it possible that the clients are
trying to do a large COPY operation but falling back to read/write?

All mounts are indeed 4.2 mounts. It's possible that some clients were
doing COPY operations, but when I reproduced the issue on my test
client it did not perform a COPY operation (as far as I understand the
pcap/rpcdebug output).
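(On client kernels recent enough to list the operation, the per-op
counters in /proc/self/mountstats should also show whether a COPY was
ever sent, e.g.:

  grep -w 'COPY:' /proc/self/mountstats

but I'm not sure all of the involved client kernels expose that
counter.)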

Are there any debugging steps we can take when this issue happens again?

Regards,

--
Rik Theys
System Engineer
KU Leuven - Dept. Elektrotechniek (ESAT)
Kasteelpark Arenberg 10 bus 2440  - B-3001 Leuven-Heverlee
+32(0)16/32.11.07
----------------------------------------------------------------
<<Any errors in spelling, tact or fact are transmission errors>>




