Re: non-stop kworker NFS/RPC write traffic even after unmount

On 5/16/25 9:09 AM, Rik Theys wrote:
> Hi,
> 
> On 5/16/25 2:59 PM, Chuck Lever wrote:
>> On 5/16/25 8:36 AM, Rik Theys wrote:
>>> Hi,
>>>
>>> On 5/16/25 2:19 PM, Chuck Lever wrote:
>>>> On 5/16/25 7:32 AM, Rik Theys wrote:
>>>>> Hi,
>>>>>
>>>>> On 5/16/25 11:47 AM, Rik Theys wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 5/16/25 8:17 AM, Rik Theys wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 5/16/25 7:51 AM, Rik Theys wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 4/18/25 3:31 PM, Daniel Kobras wrote:
>>>>>>>>> Hi Rik!
>>>>>>>>>
>>>>>>>>> On 01.04.25 at 14:15, Rik Theys wrote:
>>>>>>>>>> On 4/1/25 2:05 PM, Daniel Kobras wrote:
>>>>>>>>>>> On 15.12.24 at 13:38, Rik Theys wrote:
>>>>>>>>>>>> Suddenly, a number of clients start to send an abnormal amount
>>>>>>>>>>>> of NFS traffic to the server that saturates their link and
>>>>>>>>>>>> never seems to stop. Running iotop on the clients shows
>>>>>>>>>>>> kworker-{rpciod,nfsiod,xprtiod} processes generating the write
>>>>>>>>>>>> traffic. On the server side, the system seems to process the
>>>>>>>>>>>> traffic, as the disks are processing the write requests.
>>>>>>>>>>>>
>>>>>>>>>>>> This behavior continues even after stopping all user processes
>>>>>>>>>>>> on the clients and unmounting the NFS mount on the client. Is
>>>>>>>>>>>> this normal? I was under the impression that once the NFS mount
>>>>>>>>>>>> is unmounted, no further traffic to the server should be
>>>>>>>>>>>> visible?
>>>>>>>>>>> I'm currently looking at an issue that resembles your
>>>>>>>>>>> description above (excess traffic to the server for data that
>>>>>>>>>>> was already written and committed), and part of the packet
>>>>>>>>>>> capture also looks roughly similar to what you've sent in a
>>>>>>>>>>> followup. Before I dig any deeper: did you manage to pinpoint
>>>>>>>>>>> or resolve the problem in the meantime?
>>>>>>>>>> Our server is currently running the 6.12 LTS kernel and we
>>>>>>>>>> haven't had this specific issue anymore. But we were never able
>>>>>>>>>> to reproduce it, so unfortunately I can't say for sure if it's
>>>>>>>>>> fixed, or what fixed it :-/.
>>>>>>>>> Thanks for the update! Indeed, in the meantime the affected
>>>>>>>>> environment here stopped showing the reported behavior as well
>>>>>>>>> after a few days, and I don't have a clear indication what might
>>>>>>>>> have been the fix, either.
>>>>>>>>>
>>>>>>>>> When the issue still occurred, it could (once) be provoked by
>>>>>>>>> dd'ing 4GB of /dev/zero to a test file on an NFSv4.2 mount. The
>>>>>>>>> network trace shows that the file is completely written at wire
>>>>>>>>> speed. But after a five second pause, the client then starts
>>>>>>>>> sending the same file again in smaller chunks of a few hundred MB
>>>>>>>>> at five second intervals. So it appears that the file's pages are
>>>>>>>>> background-flushed to storage again, even though they've already
>>>>>>>>> been written out. On the NFS layer, none of the passes look
>>>>>>>>> conspicuous to me: WRITE and COMMIT operations all get NFS4_OK'ed
>>>>>>>>> by the server.
>>>>>>>>>
>>>>>>>>>> Which kernel version(s) are your server and clients running?
>>>>>>>>> The systems in the affected environment run Debian-packaged
>>>>>>>>> kernels. The servers are on Debian's 6.1.0-32, which corresponds
>>>>>>>>> to upstream's 6.1.129. The issue was seen on clients running the
>>>>>>>>> same kernel version, but also on older systems running Debian's
>>>>>>>>> 5.10.0-33, corresponding to 5.10.226 upstream. I've skimmed the
>>>>>>>>> list of patches that went into either of these kernel versions,
>>>>>>>>> but nothing stood out as clearly related.
>>>>>>>>>
>>>>>>>> Our server and clients are currently showing the same behavior
>>>>>>>> again: clients are sending abnormal amounts of write traffic to the
>>>>>>>> NFS server and the server is actually processing it, as the writes
>>>>>>>> end up on the disk (which fills up our replication journals). iotop
>>>>>>>> shows that the kworker-{rpciod,nfsiod,xprtiod} threads are
>>>>>>>> responsible for this traffic. A reboot of the server does not solve
>>>>>>>> the issue. Rebooting individual clients that are participating in
>>>>>>>> this does not help either; after a few minutes of user traffic they
>>>>>>>> show the same behavior again. We also see this on multiple clients
>>>>>>>> at the same time.
>>>>>>>>
>>>>>>>> The NFS operations that are being sent are mostly putfh, sequence
>>>>>>>> and getattr.
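>>>>>>>>
>>>>>>>> For reference, the per-operation mix can be confirmed on the server
>>>>>>>> with nfsstat; a minimal sketch (the 2-second interval is just an
>>>>>>>> example):
>>>>>>>>
>>>>>>>>   # NFSv4 per-operation counters on the server
>>>>>>>>   nfsstat -s -4
>>>>>>>>   # refresh every 2s to see which counters are climbing
>>>>>>>>   watch -d -n 2 'nfsstat -s -4'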
>>>>>>>>
>>>>>>>> The server is running upstream 6.12.25 and the clients are running
>>>>>>>> Rocky 8 (4.18.0-553.51.1.el8_10) and 9 (5.14.0-503.38.1.el9_5).
>>>>>>>>
>>>>>>>> What are some of the steps we can take to debug the root cause of
>>>>>>>> this? Any idea on how to stop this traffic flood?
>>>>>>>>
>>>>>>> I took a tcpdump on one of the clients that was doing this. The pcap
>>>>>>> was stored on the local disk of the server. When I tried to copy the
>>>>>>> pcap to our management server over scp it now hangs at 95%. The
>>>>>>> target disk on the management server is also an NFS mount of the
>>>>>>> affected server. The scp had copied 565MB and our management server
>>>>>>> has now also started to flood the server with non-stop traffic
>>>>>>> (basically saturating its link).
>>>>>>>
>>>>>>> The management server is running Debian's 6.1.135 kernel.
>>>>>>>
>>>>>>> It seems that once a client has triggered some bad state in the
>>>>>>> server, other clients that write a large file to the server also
>>>>>>> start to participate in this behavior. Rebooting the server does not
>>>>>>> seem to help as the same state is triggered almost immediately again
>>>>>>> by some client.
>>>>>>>
>>>>>> Now that the server is in this state, I can very easily reproduce
>>>>>> this on a client. I've installed the 6.14.6 kernel on a Rocky 9
>>>>>> client.
>>>>>>
>>>>>> 1. On a different machine, create a 3M file of zeroes using
>>>>>> "dd if=/dev/zero of=3M bs=3M count=1"
>>>>>>
>>>>>> 2. Reboot the Rocky 9 client and log in as root. Verify that there
>>>>>> are no active NFS mounts to the server. Start dstat and watch the
>>>>>> output.
>>>>>>
>>>>>> 3. From the machine where you created the 3M file, scp the 3M file to
>>>>>> the Rocky 9 client in a location that is an NFS mount of the server.
>>>>>> In this case it's my home directory which is automounted.
>>>>> I've reproduced the issue with rpcdebug on for rpc and nfs calls (see
>>>>> attachment).
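>>>>>
>>>>> For anyone who wants to collect the same logging, rpcdebug can be
>>>>> toggled roughly like this ('all' is the broadest flag set; output
>>>>> goes to the kernel log):
>>>>>
>>>>>   rpcdebug -m rpc -s all
>>>>>   rpcdebug -m nfs -s all
>>>>>   # ...reproduce the problem...
>>>>>   rpcdebug -m rpc -c all
>>>>>   rpcdebug -m nfs -c all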
>>>>>> The file copies normally, but when you look at the amount of data
>>>>>> transferred out of the client to the server, it seems to be more
>>>>>> than the 3M file size.
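>>>>>>
>>>>>> One crude way to quantify that (a sketch; assuming the client's NFS
>>>>>> traffic leaves via eth0, adjust the interface name):
>>>>>>
>>>>>>   # client TX byte counter before the copy
>>>>>>   cat /sys/class/net/eth0/statistics/tx_bytes
>>>>>>   # ...scp the 3M file onto the NFS mount...
>>>>>>   # counter afterwards; compare the delta against the 3M file size
>>>>>>   cat /sys/class/net/eth0/statistics/tx_bytes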
>>>>> The client seems to copy the file twice in the initial copy. The
>>>>> first attempt starts on line 13623 and results in a lot of commit
>>>>> mismatch error messages.
>>>>>
>>>>> Then again on line 13842 which results in the same commit mismatch
>>>>> errors.
>>>>>
>>>>> These two attempts happen without any delay. This confirms my previous
>>>>> observation that the outbound traffic to the server is twice the file
>>>>> size.
>>>>>
>>>>> Then there's an NFS release call on the file.
>>>>>
>>>>> 30s later, on line 14106, there's another attempt to write the file.
>>>>> This again results in the same commit mismatch errors.
>>>>>
>>>>> This process repeats itself every 30s.
>>>>>
>>>>> So it seems the server always returns a mismatch? Now, how can I solve
>>>>> this situation? I've tried rebooting the server last night, but the
>>>>> situation reappears as soon as clients start to perform writes.
>>>> Usually the write verifier will mismatch only after a server restart.
>>>>
>>>> However, there are some other rare cases where NFSD will bump the
>>>> write verifier. If an error occurs when the server tries to sync
>>>> unstable NFS WRITEs to persistent storage, NFSD will change the
>>>> write verifier to force the client to send the write payloads again.
>>>>
>>>> A writeback error might be caused by a failing disk or a full file
>>>> system, so that's the first place you should look.
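>>>>
>>>> A quick first pass on the server might look like this (a sketch;
>>>> /srv/export stands in for wherever the export actually lives):
>>>>
>>>>   # free space and free inodes on the exported filesystem
>>>>   df -h /srv/export
>>>>   df -i /srv/export
>>>>   # recent kernel messages about I/O errors, ENOSPC or quota trouble
>>>>   journalctl -k --since=-1h | grep -iE 'i/o error|enospc|quota'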
>>>>
>>>>
>>>> But why the clients don't switch to FILE_SYNC when retrying the
>>>> writes is still a mystery. When they do that, the disk errors will
>>>> be exposed to the client and application and you can figure out
>>>> immediately what is going wrong.
>>>>
>>> There are no indications of a failing disk on the system (and the disks
>>> are FC attached SAN disks) and the file systems that have the high write
>>> I/O have sufficient free space available. Or can a "disk full" message
>>> also be caused by a disk quota being exceeded? We do use disk quotas.
>> That seems like something to explore.
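>>
>> A quick way to check, as a sketch (assuming XFS on the server; repquota
>> is the ext4 equivalent, and /srv/export is a placeholder path):
>>
>>   # per-user block/inode usage and limits on the exported filesystem
>>   xfs_quota -x -c 'report -h' /srv/export
>>   # or, on ext4:
>>   repquota -s /srv/export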
> 
> It's also strange that this would affect clients that are writing to the
> same NFS filesystem as a user that hasn't exceeded any quota limits, no?
> Or does the server interpret a "quota exceeded" error for one user on
> that filesystem as a global error for the whole filesystem?
> 
> 
>>
>> The problem is that the NFS protocol does not have a mechanism to expose
>> write errors that occur on the server after it responds to an NFS
>> UNSTABLE WRITE ("NFS_OK, we received your data") but before the COMMIT
>> occurs.
>>
>> When a COMMIT fails in this way, clients are supposed to change to
>> FILE_SYNC and try the writes again. A FILE_SYNC WRITE flushes all the
>> way to disk so any recurring error appears as part of the NFS
>> operation's status code. The client is supposed to treat this as a
>> permanent error and stop the loop.
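>>
>> One way to see what a client actually does is to capture a retry cycle
>> and look at the stable field in the WRITE arguments; a rough sketch
>> (exact field labels depend on your Wireshark/tshark version):
>>
>>   # capture NFS traffic on the client during one retry cycle
>>   tcpdump -s 0 -w /tmp/retry.pcap port 2049
>>   # then look for the stable_how value in the retried WRITEs
>>   tshark -r /tmp/retry.pcap -Y nfs -V | grep -i stable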
>>
> Then there's probably a bug in the client code somewhere as the
> client(s) did not do that...
>>> Based on your last paragraph I conclude this is a client-side issue? The
>>> client should switch to FILE_SYNC instead? We do export the NFS share
>>> "async". Does that make a difference?
>> I don't know, because anyone who uses async is asking for trouble
>> so we don't test it as an option that should be deployed in a
>> production environment. All I can say is "don't do that."
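>>
>> Switching is just the export option: a minimal /etc/exports example,
>> with a placeholder path and client pattern, then re-export with
>> "exportfs -ra":
>>
>>   /srv/export  *.example.com(rw,sync,no_subtree_check)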
>>
>>
>>> So it's a normal operation for the server to change the write verifier?
>> It's not a protocol compliance issue, if that's what you mean. Clients
>> are supposed to be prepared for a write verifier change on COMMIT, full
>> stop. That's why the verifier is there in the protocol.
>>
>>
>>> The clients that showed this behavior ran a lot of different kernel
>>> versions, from the RHEL 8/9 kernels, the Debian 12 (6.1 series), Fedora
>>> 42 kernel and the 6.14.6 kernel on a Rocky 9 userland. So this must be
>>> an issue that has been present in the client code for a very long time.
>>>
>>> Since approx. 14:00 the issue has disappeared as suddenly as it
>>> started. I can no longer reproduce it now.
>> Then hitting a capacity or metadata quota might be the proximate cause.
>>
>> If this is an NFSv4.2 mount, is it possible that the clients are
>> trying to do a large COPY operation but falling back to read/write?
>>
> All mounts are indeed 4.2 mounts. It's possible that some clients were
> doing COPY operations, but when I reproduced the issue on my test client
> it did not perform a COPY operation (as far as I understand the
> pcap/rpcdebug output).
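>
> For what it's worth, the per-operation counters on a client should show
> whether COPY is being used at all; a quick sketch (the mount point is
> just an example):
>
>   # per-op statistics for one NFS mount; look for COPY in the list
>   mountstats /home/someuser
>   # or aggregate NFSv4 client op counts
>   nfsstat -c -4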
> 
> Are there any debugging steps we can take when this issue happens again?

Since v5.17, there is a tracepoint in NFSD called
trace_nfsd_writeverf_reset.
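
Enabling it on the server is the usual tracefs dance; a minimal sketch,
assuming tracefs is mounted at /sys/kernel/tracing:

  trace-cmd record -e nfsd:nfsd_writeverf_reset
  # or, without trace-cmd:
  echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_writeverf_reset/enable
  cat /sys/kernel/tracing/trace_pipe

That should at least pin down when the verifier gets reset so it can be
correlated with server-side errors.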

I really don't have much experience to share regarding your filesystem-
and storage-specific failures. You will have to use Google and
StackOverflow for those details.


-- 
Chuck Lever




