Re: [PATCH] nfsd: remove long-standing revoked delegations by force

"zhangjian (CG)" <zhangjian496@xxxxxxxxxx> · Wed, 3 Sep 2025 18:06:43 +0800

On 2025/9/3 14:45, Li Lingfeng wrote:
> Hi,
> 
> On 2025/9/3 11:46, zhangjian (CG) wrote:
>> Hello, experts.
>>
>> If we see that all delegations on a hard-mounted NFS client that are
>> also on the server's cl_revoked list have changed from
>> NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED|
>> NFS_DELEGATION_TEST_EXPIRED
>> to NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED, can anyone
>> offer a hypothesis about this problem?
>>
>> By the way, this problem can be masked by decreasing the file count on
>> the server.
>>
>> Thanks,
>> zhangjian
> I think NFS_DELEGATION_TEST_EXPIRED is cleared as follows:
> nfs4_state_manager
>  nfs4_do_reclaim
>   nfs4_reclaim_open_state
>    __nfs4_reclaim_open_state // get nfs4_state from sp->so_states
>     nfs41_open_expired // status = ops->recover_open
>      nfs41_check_delegation_stateid
>       test_and_clear_bit // NFS_DELEGATION_TEST_EXPIRED
> After the bug in [1] is triggered, although the delegation is no longer on
> server->delegations, it can still be obtained by traversing sp->so_states.
> However, I cannot find the connection between the number of files on the
> server and this issue.
> 
> Thanks,
> Lingfeng
> 
Thanks a lot.

NFS_DELEGATION_TEST_EXPIRED can only be set while
delegation->stateid.type != NFS4_INVALID_STATEID_TYPE. But
nfs_mark_delegation_revoked sets delegation->stateid.type to
NFS4_INVALID_STATEID_TYPE when it sets NFS_DELEGATION_REVOKED, so
TEST_EXPIRED must have been set before REVOKED.
This implies the order could be:
1. Deleg A is on the server's cl_revoked list.
2. Deleg B is marked NFS_DELEGATION_TEST_EXPIRED on the client.
3. Deleg B is revoked by the server's callback procedure and the server
   hits [1]. Deleg B is added to the cl_revoked list.
4. Deleg B is marked NFS_DELEGATION_REVOKED on the client.
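
To make the ordering argument concrete, here is a minimal userspace
sketch of the flag transitions for deleg B. It is not kernel code: every
model_* and MODEL_* name is hypothetical and only mirrors the behaviour
described above (TEST_EXPIRED is refused once the stateid type is
invalid, revocation invalidates the stateid, and the reclaim path does a
test-and-clear of TEST_EXPIRED).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the client's delegation flag bits. */
enum model_flag {
	MODEL_RETURN_IF_CLOSED = 1 << 0,
	MODEL_REVOKED          = 1 << 1,
	MODEL_TEST_EXPIRED     = 1 << 2,
};

/* Hypothetical stand-ins for the stateid types that matter here. */
enum model_stateid_type {
	MODEL_DELEGATION_STATEID,
	MODEL_INVALID_STATEID,
};

struct model_delegation {
	unsigned int flags;
	enum model_stateid_type stateid_type;
};

/* TEST_EXPIRED is refused once the stateid has been invalidated. */
static bool model_mark_test_expired(struct model_delegation *d)
{
	if (d->stateid_type == MODEL_INVALID_STATEID)
		return false;
	d->flags |= MODEL_TEST_EXPIRED;
	return true;
}

/* Revocation sets REVOKED and invalidates the stateid together. */
static void model_mark_revoked(struct model_delegation *d)
{
	if (d->flags & MODEL_REVOKED)
		return;
	d->flags |= MODEL_REVOKED;
	d->stateid_type = MODEL_INVALID_STATEID;
}

/* Stand-in for the test_and_clear_bit() step in the reclaim path. */
static bool model_test_and_clear_expired(struct model_delegation *d)
{
	bool was_set = (d->flags & MODEL_TEST_EXPIRED) != 0;

	d->flags &= ~(unsigned int)MODEL_TEST_EXPIRED;
	return was_set;
}

/* Replays steps 2 and 4 for deleg B, then one reclaim pass. */
static unsigned int model_replay_deleg_b(void)
{
	struct model_delegation b = {
		.flags = MODEL_RETURN_IF_CLOSED,
		.stateid_type = MODEL_DELEGATION_STATEID,
	};
	bool ok = model_mark_test_expired(&b);	/* step 2 */

	assert(ok);
	(void)ok;
	model_mark_revoked(&b);			/* step 4 */
	assert(b.flags == (unsigned int)(MODEL_RETURN_IF_CLOSED |
					 MODEL_REVOKED | MODEL_TEST_EXPIRED));
	model_test_and_clear_expired(&b);	/* reclaim pass */
	return b.flags;
}
```

In this model, calling model_mark_revoked() before
model_mark_test_expired() never produces the three-flag state, which is
why I think step 2 has to come before step 4.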

Why is deleg A on the server's cl_revoked list in the first place? Is [1]
the only condition that can cause this? And why does it only happen when
the file count is large?
I once saw 700 delegations on the server but more than 400,000 on the
client. Could this give some clue about the problem?
>>
>> On 2025/9/2 20:43, Benjamin Coddington wrote:
>>> On 2 Sep 2025, at 8:10, Li Lingfeng wrote:
>>>
>>>> Our expected outcome was that the client would release the abnormal
>>>> delegation via TEST_STATEID/FREE_STATEID upon detecting its invalidity.
>>>> However, this problematic delegation is no longer present in the
>>>> client's server->delegations list, whether due to client-side timeouts
>>>> or the server-side bug [1].
>>> How does the client timeout TEST_STATEID - are you mounting with 'soft'?
>>>
>>> We should find the server-side bug and fix it rather than write code to
>>> paper over it.  I do think the synchronization of state here is a bit
>>> fragile and wish the protocol had a generation, sequence, or marker for
>>> setting SEQ4_STATUS_ bits..
>>>
>>>>> Should we instead just administratively evict the client since it's
>>>>> clearly not behaving right in this case?
>>>> Thanks for the suggestion. While administratively evicting the client
>>>> would certainly resolve the immediate delegation issue, I'm concerned
>>>> that approach might be a bit heavy-handed.
>>>> The problematic behavior seems isolated to a single delegation.
>>>> Meanwhile, the client itself likely has numerous other open files and
>>>> active state on the server. Forcing a complete client reconnect would
>>>> tear down all that state, which could cause significant application
>>>> disruption and be perceived as a service outage from the client's
>>>> perspective.
>>>>
>>>> [1] https://lore.kernel.org/all/de669327-c93a-49e5-a53b-bda9e67d34a2@xxxxxxxxxx/
>>> ^^ in this thread you reference v5.10 - there was a knfsd fix for a
>>> cl_revoked leak "3b816601e279", and there have been 3 or 4 fixes to fix
>>> problems and optimize the client walk of delegations since then.  Jeff
>>> pointed out that there have been fixes in these areas.  Are you finding
>>> this problem still with all those fixes included?
>>>
>>> Ben
>>>
>>>
>