On 2025/9/3 14:45, Li Lingfeng wrote:
> Hi,
>
> On 2025/9/3 11:46, zhangjian (CG) wrote:
>> Hello experts,
>>
>> Suppose we see that all delegations on a hard-mounted NFS client that
>> are also on the server's cl_revoked list have changed from
>> NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED|
>> NFS_DELEGATION_TEST_EXPIRED
>> to NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED. Can we form
>> some hypothesis about this problem?
>>
>> By the way, this problem can be masked by decreasing the file count on
>> the server.
>>
>> Thanks,
>> zhangjian
> I think NFS_DELEGATION_TEST_EXPIRED is cleared as follows:
> nfs4_state_manager
>  nfs4_do_reclaim
>   nfs4_reclaim_open_state
>    __nfs4_reclaim_open_state // get nfs4_state from sp->so_states
>     nfs41_open_expired // status = ops->recover_open
>      nfs41_check_delegation_stateid
>       test_and_clear_bit // NFS_DELEGATION_TEST_EXPIRED
> After the bug in [1] is triggered, although the delegation is no longer on
> server->delegations, it can still be obtained by traversing sp->so_states.
> However, I cannot find the connection between the number of files on the
> server and this issue.
>
> Thanks,
> Lingfeng
>
Thanks a lot.

NFS_DELEGATION_TEST_EXPIRED can only be set when
delegation->stateid.type != NFS4_INVALID_STATEID_TYPE. But when
NFS_DELEGATION_REVOKED is set, nfs_mark_delegation_revoked sets
delegation->stateid.type to NFS4_INVALID_STATEID_TYPE, so TEST_EXPIRED
must have been set before REVOKED. This implies the order could be:

1. Deleg A is on the server's cl_revoked list.
2. Deleg B is marked NFS_DELEGATION_TEST_EXPIRED on the client.
3. Deleg B is revoked by the server's callback procedure and the server
   hits [1]; deleg B is added to the cl_revoked list.
4. Deleg B is marked NFS_DELEGATION_REVOKED on the client.

Why is the first deleg A on the server's cl_revoked list? Is [1] the only
condition? Why can this only happen when the file count is large? I have
seen 700 delegations on the server but more than 400,000 (40w+)
delegations on the client. Might this give some clue about the problem?
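To make the ordering argument above concrete, here is a toy userspace
model. It is not kernel code: the struct, the DELEG_* constants, and all
function names are hypothetical simplifications of the real flags and of
nfs_mark_delegation_revoked, and it only assumes the two rules stated
above (TEST_EXPIRED needs a valid stateid type; marking revoked
invalidates the type).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the delegation stateid type and flag bits. */
enum stateid_type { STATEID_DELEGATION, STATEID_INVALID };

enum {
	DELEG_RETURN_IF_CLOSED = 1 << 0,
	DELEG_REVOKED          = 1 << 1,
	DELEG_TEST_EXPIRED     = 1 << 2,
};

struct deleg {
	enum stateid_type type;
	unsigned int flags;
};

/* TEST_EXPIRED may only be set while the stateid type is still valid. */
bool mark_test_expired(struct deleg *d)
{
	if (d->type == STATEID_INVALID)
		return false;
	d->flags |= DELEG_TEST_EXPIRED;
	return true;
}

/* Marking revoked also invalidates the stateid type. */
void mark_revoked(struct deleg *d)
{
	d->type = STATEID_INVALID;
	d->flags |= DELEG_REVOKED;
}

/* Models test_and_clear_bit() on TEST_EXPIRED: returns the old value. */
bool test_and_clear_test_expired(struct deleg *d)
{
	bool was_set = (d->flags & DELEG_TEST_EXPIRED) != 0;

	d->flags &= ~(unsigned int)DELEG_TEST_EXPIRED;
	return was_set;
}

/* Replays steps 2 and 4 for deleg B, then the reclaim-path clear. */
unsigned int replay_sequence(void)
{
	struct deleg b = { STATEID_DELEGATION, DELEG_RETURN_IF_CLOSED };
	bool ok;

	ok = mark_test_expired(&b);		/* step 2 */
	assert(ok);
	mark_revoked(&b);			/* step 4 */

	ok = mark_test_expired(&b);		/* too late: type is invalid */
	assert(!ok);

	ok = test_and_clear_test_expired(&b);	/* open-state reclaim path */
	assert(ok);
	(void)ok;
	return b.flags;
}
```

In this model the only way to reach REVOKED|TEST_EXPIRED is to set
TEST_EXPIRED first, and the later clear leaves
RETURN_IF_CLOSED|REVOKED, which matches the flag transition reported at
the start of the thread.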
>>
>> On 2025/9/2 20:43, Benjamin Coddington wrote:
>>> On 2 Sep 2025, at 8:10, Li Lingfeng wrote:
>>>
>>>> Our expected outcome was that the client would release the abnormal
>>>> delegation via TEST_STATEID/FREE_STATEID upon detecting its invalidity.
>>>> However, this problematic delegation is no longer present in the
>>>> client's server->delegations list, whether due to client-side
>>>> timeouts or the server-side bug [1].
>>> How does the client timeout TEST_STATEID - are you mounting with 'soft'?
>>>
>>> We should find the server-side bug and fix it rather than write code to
>>> paper over it. I do think the synchronization of state here is a bit
>>> fragile and wish the protocol had a generation, sequence, or marker for
>>> setting SEQ4_STATUS_ bits.
>>>
>>>>> Should we instead just administratively evict the client since it's
>>>>> clearly not behaving right in this case?
>>>> Thanks for the suggestion. While administratively evicting the client
>>>> would certainly resolve the immediate delegation issue, I'm concerned
>>>> that approach might be a bit heavy-handed.
>>>> The problematic behavior seems isolated to a single delegation.
>>>> Meanwhile, the client itself likely has numerous other open files and
>>>> active state on the server. Forcing a complete client reconnect would
>>>> tear down all that state, which could cause significant application
>>>> disruption and be perceived as a service outage from the client's
>>>> perspective.
>>>>
>>>> [1] https://lore.kernel.org/all/de669327-c93a-49e5-a53b-bda9e67d34a2@xxxxxxxxxx/
>>> ^^ in this thread you reference v5.10 - there was a knfsd fix for a
>>> cl_revoked leak "3b816601e279", and there have been 3 or 4 fixes to fix
>>> problems and optimize the client walk of delegations since then. Jeff
>>> pointed out that there have been fixes in these areas. Are you finding
>>> this problem still with all those fixes included?
>>>
>>> Ben