Hi Lukáš

On 6/3/25 7:57 AM, Lukáš Hejtmánek wrote:
>
> Hello,
>
> We are experiencing repeated PostgreSQL process freezes when using NFSv3-mounted storage on clients running Ubuntu kernels in the 6.x series (specifically tested on 6.2, 6.8, and 6.11).
>
> Setup:
> - Storage: All-flash disk array exported via NFS (v3)
> - Client OS: Ubuntu with kernel versions 6.2, 6.8, 6.11
> - Application: PostgreSQL using the NFS volume as its data directory
>
> Symptoms:
> On affected systems, PostgreSQL processes (particularly autovacuum workers) intermittently hang.
> The stack trace shows a consistent pattern involving __nfs_lookup_revalidate:
> [<0>] __nfs_lookup_revalidate+0x113/0x160 [nfs]
> [<0>] nfs_lookup_revalidate+0x15/0x30 [nfs]
> [<0>] lookup_fast+0x87/0x100
> [<0>] open_last_lookups+0x5f/0x400
> [<0>] path_openat+0x99/0x2d0
> [<0>] do_filp_open+0xaf/0x170
> [<0>] do_sys_openat2+0xb3/0xe0
> [<0>] __x64_sys_openat+0x55/0xa0
> [<0>] x64_sys_call+0x1eb1/0x25a0
> [<0>] do_syscall_64+0x7f/0x180
> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
>
> Process Tree Example:
> 863813 ? Zsl 12:24 \_ [manager] <defunct>
> 3644504 ? Ds 0:00 \_ postgres: mlflow: autovacuum worker template1
>
> The autovacuum worker is most commonly affected.
>
> Workaround Attempt:
> We observed some improvement by modifying the NFS client source fs/nfs/dir.c (around line 1833):
>
> Change:
> dentry->d_fsdata = NFS_FSDATA_BLOCKED;
>
> To:
> smp_store_release(&dentry->d_fsdata, NFS_FSDATA_BLOCKED);
>
> While this mitigates the issue somewhat, it does not fully resolve the hangs.
>
> Is this a known issue with NFS in 6.x kernels?
> Is there a recommended patch or workaround?
> Are there any known regressions related to __nfs_lookup_revalidate or dentry locking?

I'm not aware of this being a known issue or any regressions in __nfs_lookup_revalidate(), so I can't recommend a patch or workaround to try.
Have you tried an upstream kernel to verify if it's still an issue there?
There were a handful of patches that went into v6.14 that touch the lookup path, and I'm curious if they make a difference either way.

Anna

>
> Problem can be related to the all-flash array, that is able to provide about 30k IOPS over NFS and 5633 TPS in pgbench (pgbench -T 300 -c100 -j20 -r).
>
> Other NFS connections to the same NFS servers are not affected and are usable, however, the process cannot be killed obviously and the client node reboot is required.
>
> I believe that in 5.x kernel series it was more stable.
>
> --
> Lukáš Hejtmánek
>
> Linux Administrator only because
> Full Time Multitasking Ninja
> is not an official job title
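
When comparing kernels, it may help to capture the same evidence each time a hang recurs. The following is only a minimal sketch, not something from the report above: it assumes a Linux client with /proc mounted, root privileges, and a kernel that exposes /proc/<pid>/stack. The function names (parse_stat_line, blocked_tasks, kernel_stack) are made up for illustration. It lists processes in uninterruptible sleep (state D, as in the "Ds" autovacuum worker above) and prints their kernel stacks where readable.

```python
#!/usr/bin/env python3
"""List tasks in uninterruptible sleep (state D) and try to print their kernel stacks."""
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' rather than on whitespace.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, comm) for every process whose main thread is in state D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or unexpected format
        if state == "D":
            yield pid, comm


def kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack text for pid, or None if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as f:
            return f.read()
    except OSError:
        return None


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(f"{pid} {comm}")
        print(kernel_stack(pid) or "  (stack not readable; try running as root)")
```

Running this on both the distribution kernel and an upstream kernel would show whether blocked tasks still sit in __nfs_lookup_revalidate.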