On 5/16/25 8:36 AM, Rik Theys wrote:
> Hi,
>
> On 5/16/25 2:19 PM, Chuck Lever wrote:
>> On 5/16/25 7:32 AM, Rik Theys wrote:
>>> Hi,
>>>
>>> On 5/16/25 11:47 AM, Rik Theys wrote:
>>>> Hi,
>>>>
>>>> On 5/16/25 8:17 AM, Rik Theys wrote:
>>>>> Hi,
>>>>>
>>>>> On 5/16/25 7:51 AM, Rik Theys wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 4/18/25 3:31 PM, Daniel Kobras wrote:
>>>>>>> Hi Rik!
>>>>>>>
>>>>>>> On 01.04.25 at 14:15, Rik Theys wrote:
>>>>>>>> On 4/1/25 2:05 PM, Daniel Kobras wrote:
>>>>>>>>> On 15.12.24 at 13:38, Rik Theys wrote:
>>>>>>>>>> Suddenly, a number of clients start to send an abnormal amount
>>>>>>>>>> of NFS traffic to the server that saturates their link and
>>>>>>>>>> never seems to stop. Running iotop on the clients shows
>>>>>>>>>> kworker-{rpciod,nfsiod,xprtiod} processes generating the write
>>>>>>>>>> traffic. On the server side, the system seems to process the
>>>>>>>>>> traffic, as the disks are processing the write requests.
>>>>>>>>>>
>>>>>>>>>> This behavior continues even after stopping all user processes
>>>>>>>>>> on the clients and unmounting the NFS mount on the client. Is
>>>>>>>>>> this normal? I was under the impression that once the NFS
>>>>>>>>>> mount is unmounted, no further traffic to the server should be
>>>>>>>>>> visible.
>>>>>>>>> I'm currently looking at an issue that resembles your
>>>>>>>>> description above (excess traffic to the server for data that
>>>>>>>>> was already written and committed), and part of the packet
>>>>>>>>> capture also looks roughly similar to what you've sent in a
>>>>>>>>> followup. Before I dig any deeper: did you manage to pinpoint
>>>>>>>>> or resolve the problem in the meantime?
>>>>>>>> Our server is currently running the 6.12 LTS kernel and we
>>>>>>>> haven't had this specific issue any more. But we were never able
>>>>>>>> to reproduce it, so unfortunately I can't say for sure if it's
>>>>>>>> fixed, or what fixed it :-/.
>>>>>>> Thanks for the update! Indeed, in the meantime the affected
>>>>>>> environment here stopped showing the reported behavior as well
>>>>>>> after a few days, and I don't have a clear indication what might
>>>>>>> have been the fix, either.
>>>>>>>
>>>>>>> When the issue still occurred, it could (once) be provoked by
>>>>>>> dd'ing 4GB of /dev/zero to a test file on an NFSv4.2 mount. The
>>>>>>> network trace shows that the file is completely written at wire
>>>>>>> speed. But after a five second pause, the client then starts
>>>>>>> sending the same file again in smaller chunks of a few hundred MB
>>>>>>> at five second intervals. So it appears that the file's pages are
>>>>>>> background-flushed to storage again, even though they've already
>>>>>>> been written out. On the NFS layer, none of the passes look
>>>>>>> conspicuous to me: WRITE and COMMIT operations all get NFS4_OK'ed
>>>>>>> by the server.
>>>>>>>
>>>>>>>> Which kernel version(s) are your server and clients running?
>>>>>>> The systems in the affected environment run Debian-packaged
>>>>>>> kernels. The servers are on Debian's 6.1.0-32, which corresponds
>>>>>>> to upstream's 6.1.129. The issue was seen on clients running the
>>>>>>> same kernel version, but also on older systems running Debian's
>>>>>>> 5.10.0-33, corresponding to 5.10.226 upstream. I've skimmed the
>>>>>>> list of patches that went into either of these kernel versions,
>>>>>>> but nothing stood out as clearly related.
>>>>>>>
>>>>>> Our server and clients are currently showing the same behavior
>>>>>> again: clients are sending abnormal amounts of write traffic to
>>>>>> the NFS server, and the server is actually processing it, as the
>>>>>> writes end up on the disks (which fills up our replication
>>>>>> journals). iotop shows that the kworker-{rpciod,nfsiod,xprtiod}
>>>>>> threads are responsible for this traffic. A reboot of the server
>>>>>> does not solve the issue. Rebooting individual clients that are
>>>>>> participating in this does not help either: after a few minutes of
>>>>>> user traffic they show the same behavior again. We also see this
>>>>>> on multiple clients at the same time.
>>>>>>
>>>>>> The NFS operations that are being sent are mostly putfh, sequence
>>>>>> and getattr.
>>>>>>
>>>>>> The server is running upstream 6.12.25 and the clients are running
>>>>>> Rocky 8 (4.18.0-553.51.1.el8_10) and 9 (5.14.0-503.38.1.el9_5).
>>>>>>
>>>>>> What are some of the steps we can take to debug the root cause of
>>>>>> this? Any idea on how to stop this traffic flood?
>>>>>>
>>>>> I took a tcpdump on one of the clients that was doing this. The
>>>>> pcap was stored on the local disk of the server. I tried to copy
>>>>> the pcap to our management server over scp, and it now hangs at
>>>>> 95%. The target disk on the management server is also an NFS mount
>>>>> of the affected server. The scp had copied 565MB, and our
>>>>> management server has now also started to flood the server with
>>>>> non-stop traffic (basically saturating its link).
>>>>>
>>>>> The management server is running Debian's 6.1.135 kernel.
>>>>>
>>>>> It seems that once a client has triggered some bad state in the
>>>>> server, other clients that write a large file to the server also
>>>>> start to participate in this behavior. Rebooting the server does
>>>>> not seem to help, as the same state is triggered almost immediately
>>>>> again by some client.
>>>>>
>>>> Now that the server is in this state, I can very easily reproduce
>>>> this on a client. I've installed the 6.14.6 kernel on a Rocky 9
>>>> client.
>>>>
>>>> 1. On a different machine, create a 3M file of zeroes using
>>>> "dd if=/dev/zero of=3M bs=3M count=1".
>>>>
>>>> 2. Reboot the Rocky 9 client and log in as root. Verify that there
>>>> are no active NFS mounts to the server. Start dstat and watch the
>>>> output.
>>>>
>>>> 3. From the machine where you created the 3M file, scp the 3M file
>>>> to the Rocky 9 client, to a location that is an NFS mount of the
>>>> server. In this case it's my home directory, which is automounted.
>>> I've reproduced the issue with rpcdebug enabled for rpc and nfs calls
>>> (see attachment).
>>>> The file copies normally, but when you look at the amount of data
>>>> transferred out of the client to the server, it seems to be more
>>>> than the 3M file size.
>>> The client seems to copy the file twice in the initial copy. The
>>> first attempt starts on line 13623 and results in a lot of commit
>>> mismatch error messages.
>>>
>>> The second attempt starts on line 13842 and results in the same
>>> commit mismatch errors.
>>>
>>> These two attempts happen without any delay. This confirms my
>>> previous observation that the outbound traffic to the server is twice
>>> the file size.
>>>
>>> Then there's an NFS release call on the file.
>>>
>>> 30s later, on line 14106, there's another attempt to write the file.
>>> This again results in the same commit mismatch errors.
>>>
>>> This process repeats itself every 30s.
>>>
>>> So it seems the server always returns a mismatch?
>>> Now, how can I solve this situation? I've tried rebooting the server
>>> last night, but the situation reappears as soon as clients start to
>>> perform writes.
>> Usually the write verifier will mismatch only after a server restart.
>>
>> However, there are some other rare cases where NFSD will bump the
>> write verifier. If an error occurs when the server tries to sync
>> unstable NFS WRITEs to persistent storage, NFSD will change the
>> write verifier to force the client to send the write payloads again.
>>
>> A writeback error might be caused by a failing disk or a full file
>> system, so that's the first place you should look.
>>
>> But why the clients don't switch to FILE_SYNC when retrying the
>> writes is still a mystery. When they do that, the disk errors will
>> be exposed to the client and application and you can figure out
>> immediately what is going wrong.
>>
> There are no indications of a failing disk on the system (the disks
> are FC-attached SAN disks), and the file systems that have the high
> write I/O have sufficient free space available. Or can a "disk full"
> error also be caused by a disk quota being exceeded? We do use disk
> quotas.

That seems like something to explore. The problem is that the NFS
protocol does not have a mechanism to expose write errors that occur
on the server after it responds to an NFS UNSTABLE WRITE ("NFS_OK, we
received your data") but before the COMMIT occurs. When a COMMIT fails
in this way, clients are supposed to change to FILE_SYNC and try the
writes again. A FILE_SYNC WRITE flushes all the way to disk, so any
recurring error appears as part of the NFS operation's status code.
The client is supposed to treat this as a permanent error and stop the
loop.

> Based on your last paragraph I conclude this is a client-side issue?
> The client should switch to FILE_SYNC instead? We do export the NFS
> share "async". Does that make a difference?

I don't know, because anyone who uses async is asking for trouble, so
we don't test it as an option that should be deployed in a production
environment. All I can say is "don't do that."

> So it's a normal operation for the server to change the write
> verifier?

It's not a protocol compliance issue, if that's what you mean. Clients
are supposed to be prepared for a write verifier change on COMMIT,
full stop. That's why the verifier is there in the protocol.

> The clients that showed this behavior ran a lot of different kernel
> versions, from the RHEL 8/9 kernels, the Debian 12 (6.1 series) and
> Fedora 42 kernels, to the 6.14.6 kernel on a Rocky 9 userland. So
> this must be an issue that has been present in the client code for a
> very long time now.
>
> Since approx 14:00 the issue has disappeared as suddenly as it
> started. I can no longer reproduce it now.

Then hitting a capacity or metadata quota might be the proximate
cause. If this is an NFSv4.2 mount, is it possible that the clients
are trying to do a large COPY operation but falling back to
read/write?

--
Chuck Lever
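
For what it's worth, a minimal sketch of how the server-side causes
mentioned above (full file system, failing disk, exceeded quota) could
be ruled out while the flood is happening. The export path
/srv/nfs/home and the choice of quota tool are only placeholders;
substitute your own export and file system:

  # Free space and free inodes on the exported file system
  df -h /srv/nfs/home
  df -i /srv/nfs/home

  # Recent block-layer or file-system errors in the kernel log
  dmesg -T | grep -iE 'i/o error|enospc|quota' | tail -n 50

  # Per-user quota usage (ext4-style quotas); for XFS, use xfs_quota
  repquota -s /srv/nfs/home
  # xfs_quota -x -c 'report -h' /srv/nfs/home

If a user or group turns out to be at its block or inode limit on that
export, that would line up with the kind of writeback failure that
makes NFSD bump the write verifier.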
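
Similarly, on a client that is looping, a rough way to confirm that
the same data is being resent is to watch the NFS operation counters
while no application I/O is running, and to capture the same rpcdebug
logging that produced the attached trace. The mount point below is
only an example:

  # WRITE and COMMIT counts should not keep climbing on an idle client
  watch -d 'nfsstat -c -4'

  # Per-mount operation counts, bytes and retransmissions
  mountstats /home/example

  # RPC/NFS debug logging, as used for the attached rpcdebug trace
  rpcdebug -m rpc -s call
  rpcdebug -m nfs -s all
  # ... reproduce the copy ...
  rpcdebug -m rpc -c call
  rpcdebug -m nfs -c all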