On 5/16/25 9:09 AM, Rik Theys wrote:
> Hi,
>
> On 5/16/25 2:59 PM, Chuck Lever wrote:
>> On 5/16/25 8:36 AM, Rik Theys wrote:
>>> Hi,
>>>
>>> On 5/16/25 2:19 PM, Chuck Lever wrote:
>>>> On 5/16/25 7:32 AM, Rik Theys wrote:
>>>>> Hi,
>>>>>
>>>>> On 5/16/25 11:47 AM, Rik Theys wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 5/16/25 8:17 AM, Rik Theys wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 5/16/25 7:51 AM, Rik Theys wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 4/18/25 3:31 PM, Daniel Kobras wrote:
>>>>>>>>> Hi Rik!
>>>>>>>>>
>>>>>>>>> On 01.04.25 at 14:15, Rik Theys wrote:
>>>>>>>>>> On 4/1/25 2:05 PM, Daniel Kobras wrote:
>>>>>>>>>>> On 15.12.24 at 13:38, Rik Theys wrote:
>>>>>>>>>>>> Suddenly, a number of clients start to send an abnormal amount of NFS traffic to the server that saturates their link and never seems to stop. Running iotop on the clients shows kworker-{rpciod,nfsiod,xprtiod} processes generating the write traffic. On the server side, the system seems to process the traffic, as the disks are processing the write requests.
>>>>>>>>>>>>
>>>>>>>>>>>> This behavior continues even after stopping all user processes on the clients and unmounting the NFS mount on the client. Is this normal? I was under the impression that once the NFS mount is unmounted, no further traffic to the server should be visible?
>>>>>>>>>>>
>>>>>>>>>>> I'm currently looking at an issue that resembles your description above (excess traffic to the server for data that was already written and committed), and part of the packet capture also looks roughly similar to what you've sent in a followup. Before I dig any deeper: did you manage to pinpoint or resolve the problem in the meantime?
>>>>>>>>>>
>>>>>>>>>> Our server is currently running the 6.12 LTS kernel and we haven't had this specific issue any more. But we were never able to reproduce it, so unfortunately I can't say for sure if it's fixed, or what fixed it :-/.
>>>>>>>>>
>>>>>>>>> Thanks for the update! Indeed, in the meantime the affected environment here stopped showing the reported behavior as well after a few days, and I don't have a clear indication what might have been the fix, either.
>>>>>>>>>
>>>>>>>>> When the issue still occurred, it could (once) be provoked by dd'ing 4GB of /dev/zero to a test file on an NFSv4.2 mount. The network trace shows that the file is completely written at wire speed. But after a five second pause, the client then starts sending the same file again in smaller chunks of a few hundred MB at five second intervals. So it appears that the file's pages are background-flushed to storage again, even though they've already been written out. On the NFS layer, none of the passes look conspicuous to me: WRITE and COMMIT operations all get NFS4_OK'ed by the server.
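[A minimal sketch of that reproducer, for reference; the mount point /mnt/nfs, the file name, and the interface name are illustrative:

    # write 4 GB of zeroes through an NFSv4.2 mount
    dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096

    # afterwards, check whether the client keeps re-sending the data
    nfsstat -c                                # client-side operation counters
    tcpdump -i eth0 -w nfs.pcap port 2049     # or capture the traffic on the wire
]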
>>>>>>>>>> Which kernel version(s) are your server and clients running?
>>>>>>>>>
>>>>>>>>> The systems in the affected environment run Debian-packaged kernels. The servers are on Debian's 6.1.0-32, which corresponds to upstream's 6.1.129. The issue was seen on clients running the same kernel version, but also on older systems running Debian's 5.10.0-33, corresponding to 5.10.226 upstream. I've skimmed the list of patches that went into either of these kernel versions, but nothing stood out as clearly related.
>>>>>>>>
>>>>>>>> Our server and clients are currently showing the same behavior again: clients are sending abnormal amounts of write traffic to the NFS server, and the server is actually processing it, as the writes end up on the disk (which fills up our replication journals). iotop shows that the kworker-{rpciod,nfsiod,xprtiod} threads are responsible for this traffic. A reboot of the server does not solve the issue. Rebooting individual clients that are participating in this does not help either; after a few minutes of user traffic, they show the same behavior again. We also see this on multiple clients at the same time.
>>>>>>>>
>>>>>>>> The NFS operations that are being sent are mostly PUTFH, SEQUENCE and GETATTR.
>>>>>>>>
>>>>>>>> The server is running upstream 6.12.25 and the clients are running Rocky 8 (4.18.0-553.51.1.el8_10) and Rocky 9 (5.14.0-503.38.1.el9_5).
>>>>>>>>
>>>>>>>> What are some of the steps we can take to debug the root cause of this? Any idea on how to stop this traffic flood?
>>>>>>>
>>>>>>> I took a tcpdump on one of the clients that was doing this. The pcap was stored on the local disk of the server. When I tried to copy the pcap to our management server over scp, it now hangs at 95%. The target disk on the management server is also an NFS mount of the affected server. The scp had copied 565MB, and our management server has now also started to flood the server with non-stop traffic (basically saturating its link).
>>>>>>>
>>>>>>> The management server is running Debian's 6.1.135 kernel.
>>>>>>>
>>>>>>> It seems that once a client has triggered some bad state in the server, other clients that write a large file to the server also start to participate in this behavior. Rebooting the server does not seem to help, as the same state is triggered almost immediately again by some client.
>>>>>>
>>>>>> Now that the server is in this state, I can very easily reproduce this on a client. I've installed the 6.14.6 kernel on a Rocky 9 client.
>>>>>>
>>>>>> 1. On a different machine, create a 3M file of zeroes using "dd if=/dev/zero of=3M bs=3M count=1".
>>>>>>
>>>>>> 2. Reboot the Rocky 9 client and log in as root. Verify that there are no active NFS mounts to the server. Start dstat and watch the output.
>>>>>>
>>>>>> 3. From the machine where you created the 3M file, scp the 3M file to the Rocky 9 client, to a location that is an NFS mount of the server. In this case it's my home directory, which is automounted.
>>>>>
>>>>> I've reproduced the issue with rpcdebug enabled for rpc and nfs calls (see attachment).
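[For reference, a minimal sketch of this kind of client-side capture, assuming nfs-utils' rpcdebug run as root; the debug messages land in the kernel log:

    # enable verbose RPC and NFS client debugging (very noisy)
    rpcdebug -m rpc -s all
    rpcdebug -m nfs -s all

    # reproduce the problem while watching the kernel log
    journalctl -kf            # or: dmesg -w

    # clear the debug flags afterwards
    rpcdebug -m rpc -c all
    rpcdebug -m nfs -c all
]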
>>>>>> The file copies normally, but when you look at the amount of data transferred out of the client to the server, it seems to be more than the 3M file size.
>>>>>
>>>>> The client seems to copy the file twice during the initial copy. The first pass starts on line 13623 and results in a lot of commit mismatch error messages.
>>>>>
>>>>> The second pass starts on line 13842 and results in the same commit mismatch errors.
>>>>>
>>>>> These two attempts happen without any delay. This confirms my previous observation that the outbound traffic to the server is twice the file size.
>>>>>
>>>>> Then there's an NFS release call on the file.
>>>>>
>>>>> 30s later, on line 14106, there's another attempt to write the file. This again results in the same commit mismatch errors.
>>>>>
>>>>> This process repeats itself every 30s.
>>>>>
>>>>> So it seems the server always returns a mismatch? Now, how can I resolve this situation? I tried rebooting the server last night, but the situation reappears as soon as clients start to perform writes.
>>>>
>>>> Usually the write verifier will mismatch only after a server restart.
>>>>
>>>> However, there are some other rare cases where NFSD will bump the write verifier. If an error occurs when the server tries to sync unstable NFS WRITEs to persistent storage, NFSD will change the write verifier to force the client to send the write payloads again.
>>>>
>>>> A writeback error might be caused by a failing disk or a full file system, so that's the first place you should look.
>>>>
>>>> But why the clients don't switch to FILE_SYNC when retrying the writes is still a mystery. When they do that, the disk errors will be exposed to the client and application, and you can figure out immediately what is going wrong.
>>>
>>> There are no indications of a failing disk on the system (the disks are FC-attached SAN disks), and the file systems that see the high write I/O have sufficient free space available. Or can a "disk full" error also be caused by an exceeded disk quota? We do use disk quotas.
>>
>> That seems like something to explore.
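[A quick checklist for ruling out the writeback-error causes discussed above; note that an exceeded quota (EDQUOT) surfaces to writeback much like a full filesystem (ENOSPC). Run on the server; the export path /export and the grep pattern are illustrative:

    # free space and free inodes on the exported filesystem
    df -h /export
    df -i /export

    # per-user quota usage, if quotas are enabled
    repquota -as

    # kernel log entries from around the time the write flood started
    journalctl -k | grep -iE 'i/o error|enospc|edquot'
]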
> It's also strange that this would affect clients that are writing to the same NFS filesystem, but as a user that doesn't have any quota limits exceeded, no? Or does the server interpret the "quota exceeded" for one user on that filesystem as a global error for that filesystem?
>
>> The problem is that the NFS protocol does not have a mechanism to expose write errors that occur on the server after it has responded to an UNSTABLE NFS WRITE with NFS_OK ("we received your data") but before the COMMIT occurs.
>>
>> When a COMMIT fails in this way, clients are supposed to change to FILE_SYNC and try the writes again. A FILE_SYNC WRITE flushes all the way to disk, so any recurring error appears as part of the NFS operation's status code. The client is supposed to treat this as a permanent error and stop the loop.
>
> Then there's probably a bug in the client code somewhere, as the client(s) did not do that...
>
>>> Based on your last paragraph, I conclude this is a client-side issue? The client should switch to FILE_SYNC instead? We do export the NFS share "async". Does that make a difference?
>>
>> I don't know, because anyone who uses async is asking for trouble, so we don't test it as an option that should be deployed in a production environment. All I can say is "don't do that."
>>
>>> So it's a normal operation for the server to change the write verifier?
>>
>> It's not a protocol compliance issue, if that's what you mean. Clients are supposed to be prepared for a write verifier change on COMMIT, full stop. That's why the verifier is there in the protocol.
>>
>>> The clients that showed this behavior ran a lot of different kernel versions: the RHEL 8/9 kernels, Debian 12 (6.1 series), the Fedora 42 kernel, and the 6.14.6 kernel on a Rocky 9 userland. So this must be an issue that has been present in the client code for a very long time now.
>>>
>>> Since approx. 14:00, the issue has disappeared as suddenly as it started. I can no longer reproduce it now.
>>
>> Then hitting a capacity or metadata quota might be the proximate cause.
>>
>> If this is an NFSv4.2 mount, is it possible that the clients are trying to do a large COPY operation but falling back to read/write?
>
> All mounts are indeed 4.2 mounts. It's possible that some clients were doing COPY operations, but when I reproduced the issue on my test client, it did not perform a COPY operation (as far as I understand the pcap/rpcdebug output).
>
> Are there any debugging steps we can take when this issue happens again?

Since v5.17, there is a tracepoint in NFSD called trace_nfsd_writeverf_reset.

I really don't have much experience to share regarding your filesystem- and storage-specific failures. You will have to use Google and StackOverflow for those details.

--
Chuck Lever
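[To follow up on the tracepoint hint: a minimal sketch for watching nfsd_writeverf_reset on the server via tracefs, assuming tracefs is mounted at the usual /sys/kernel/tracing:

    # confirm the tracepoint exists (kernel v5.17 or later)
    ls /sys/kernel/tracing/events/nfsd/nfsd_writeverf_reset

    # enable it and watch for verifier resets while the flood reproduces
    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_writeverf_reset/enable
    cat /sys/kernel/tracing/trace_pipe

    # equivalent one-shot recording with trace-cmd, if installed
    trace-cmd record -e nfsd:nfsd_writeverf_reset

Each hit shows when NFSD discarded the old write verifier, which should line up in time with the writeback error described above.]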