Re: [PATCH] nfsd: remove long-standing revoked delegations by force

Li Lingfeng <lilingfeng3@xxxxxxxxxx> · Tue, 2 Sep 2025 21:08:48 +0800

Hi, Ben

On 2025/9/2 20:43, Benjamin Coddington wrote:
On 2 Sep 2025, at 8:10, Li Lingfeng wrote:

Our expected outcome was that the client would release the abnormal
delegation via TEST_STATEID/FREE_STATEID upon detecting its invalidity.
However, this problematic delegation is no longer present in the
client's server->delegations list—whether due to client-side timeouts or
the server-side bug [1].
How does the client time out TEST_STATEID - are you mounting with 'soft'?
I have never actually hit a real timeout. On 5.10 I triggered one by
forcibly injecting -ETIMEDOUT in nfs4_delegreturn_prepare():

--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6509,6 +6509,10 @@ static void nfs4_delegreturn_prepare(struct rpc_task *task, void *data)
                        &d_data->args.seq_args,
                        &d_data->res.seq_res,
                        task);
+
+       printk("%s force inject err\n", __func__);
+       task->tk_rpc_status = -ETIMEDOUT;
+       rpc_exit(task, -ETIMEDOUT);
 }
We should find the server-side bug and fix it rather than write code to
paper over it.  I do think the synchronization of state here is a bit
fragile and wish the protocol had a generation, sequence, or marker for
setting SEQ4_STATUS_ bits.
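
To make the "generation" idea concrete, here is a minimal userspace sketch
of one possible reading of it. It is not kernel code and not part of the
protocol: the gen field, both structs and client_should_recheck() are
hypothetical names made up for illustration. Only the flag value is meant
to mirror SEQ4_STATUS_RECALLABLE_STATE_REVOKED from RFC 8881. The
assumption is that the server bumps gen each time it revokes something,
so the client can tell a new revocation from a bit that is still set for
one it has already handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors SEQ4_STATUS_RECALLABLE_STATE_REVOKED (RFC 8881). */
#define MODEL_RECALLABLE_STATE_REVOKED 0x00000040u

/* Hypothetical SEQUENCE reply: the real reply has status flags only,
 * the gen field is the assumed protocol extension. */
struct model_seq_reply {
	uint32_t status_flags;
	uint32_t gen;	/* assumed: bumped by the server on each revocation */
};

/* Hypothetical per-client bookkeeping. */
struct model_client {
	uint32_t last_gen;	/* newest generation already handled */
};

/*
 * Return true when the reply reports a revocation the client has not
 * processed yet, i.e. when a TEST_STATEID walk would be worthwhile.
 */
static bool client_should_recheck(struct model_client *clp,
				  const struct model_seq_reply *res)
{
	if (!(res->status_flags & MODEL_RECALLABLE_STATE_REVOKED))
		return false;

	/* Wraparound-safe "is newer" comparison. */
	if ((int32_t)(res->gen - clp->last_gen) <= 0)
		return false;	/* same revocation already handled */

	clp->last_gen = res->gen;
	return true;
}
```

With something like this, a reply that repeats an already-seen generation
would not trigger another walk, while a later revocation would.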
I was able to reproduce a server-side bug by adding delays, without any
fault injection. It is described in [1]. I would appreciate any
suggestions on how to fix it.
Should we instead just administratively evict the client since it's
clearly not behaving right in this case?
Thanks for the suggestion. While administratively evicting the client would
certainly resolve the immediate delegation issue, I'm concerned that approach
might be a bit heavy-handed.
The problematic behavior seems isolated to a single delegation. Meanwhile,
the client itself likely has numerous other open files and active state on
the server. Forcing a complete client reconnect would tear down all that
state, which could cause significant application disruption and be perceived
as a service outage from the client's perspective.

[1] https://lore.kernel.org/all/de669327-c93a-49e5-a53b-bda9e67d34a2@xxxxxxxxxx/
^^ in this thread you reference v5.10 - there was a knfsd fix for a
cl_revoked leak "3b816601e279", and there have been 3 or 4 fixes to fix
problems and optimize the client walk of delegations since then.  Jeff
pointed out that there have been fixes in these areas.  Are you finding this
problem still with all those fixes included?
As shown in [1], the problem can be reproduced on master (commit
b320789d6883), so I believe all of those fixes are included.

Thanks,
Lingfeng


Ben