Re: [PATCH] nfsd: remove long-standing revoked delegations by force

Hi,

On 2025/9/3 18:06, zhangjian (CG) wrote:

On 2025/9/3 14:45, Li Lingfeng wrote:
Hi,

On 2025/9/3 11:46, zhangjian (CG) wrote:
Hello experts,

On a hard-mounted NFS client, we see that all delegations that are also
on the server's cl_revoked list have changed from
NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED|
NFS_DELEGATION_TEST_EXPIRED
to NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED. Given that, can
we form any hypothesis about this problem?

By the way, the problem can be masked by reducing the number of files on
the server.

Thanks,
zhangjian
I think NFS_DELEGATION_TEST_EXPIRED is cleared as follows:
nfs4_state_manager
  nfs4_do_reclaim
   nfs4_reclaim_open_state
    __nfs4_reclaim_open_state // get nfs4_state from sp->so_states
     nfs41_open_expired // status = ops->recover_open
      nfs41_check_delegation_stateid
       test_and_clear_bit // NFS_DELEGATION_TEST_EXPIRED
After the bug in [1] is triggered, although the delegation is no longer on
server->delegations, it can still be obtained by traversing sp->so_states.
However, I cannot find the connection between the number of files on the
server and this issue.
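
For readers following the call chain above, here is a minimal userspace
sketch of the test-and-clear semantics at its last step. It is not the
kernel's code: the bit numbers and the `model_` names are hypothetical,
invented for illustration, and C11 atomics stand in for the kernel's
test_and_clear_bit().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers for this sketch only; they are not claimed to
 * match the kernel's definitions. */
enum model_deleg_bit {
    MODEL_RETURN_IF_CLOSED = 0,
    MODEL_REVOKED          = 1,
    MODEL_TEST_EXPIRED     = 2,
};

/* Userspace stand-in for test_and_clear_bit(): atomically clear bit `nr`
 * in `*flags` and report whether it was set beforehand. */
bool model_test_and_clear_bit(int nr, atomic_ulong *flags)
{
    assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * CHAR_BIT));
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_and(flags, ~mask);
    return (old & mask) != 0;
}
```

Starting from RETURN_IF_CLOSED|REVOKED|TEST_EXPIRED, the first call on the
TEST_EXPIRED bit reports that it was set and leaves
RETURN_IF_CLOSED|REVOKED, which matches the flag change reported at the
start of the thread. A second call finds nothing left to clear.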

Thanks,
Lingfeng

Thanks a lot.

NFS_DELEGATION_TEST_EXPIRED can only be set when
delegation->stateid.type != NFS4_INVALID_STATEID_TYPE. But when
NFS_DELEGATION_REVOKED is set, delegation->stateid.type will be
NFS4_INVALID_STATEID_TYPE in nfs_mark_delegation_revoked.
This implies the order could be like:
1. Deleg A is on the server's cl_revoked list.
2. Deleg B is marked NFS_DELEGATION_TEST_EXPIRED on the client.
3. Deleg B is revoked by the server's callback procedure and the server
hits [1]; Deleg B is added to the cl_revoked list.
4. Deleg B is marked NFS_DELEGATION_REVOKED on the client.
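
The ordering constraint behind this list can be sketched as a small
userspace model. It encodes only the two rules stated above
(TEST_EXPIRED can be set only while the stateid type is still valid, and
marking a delegation revoked invalidates the stateid type). All type,
macro, and function names are hypothetical simplifications, not the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the stateid type and flag bits. */
enum model_stateid_type { MODEL_STATEID_INVALID, MODEL_STATEID_DELEG };

#define MODEL_F_RETURN_IF_CLOSED (1u << 0)
#define MODEL_F_REVOKED          (1u << 1)
#define MODEL_F_TEST_EXPIRED     (1u << 2)

struct model_deleg {
    enum model_stateid_type type;
    unsigned int flags;
};

/* Rule 1 from the message: TEST_EXPIRED is set only if the stateid type
 * is not already invalid. Returns true if the flag was set. */
bool model_mark_test_expired(struct model_deleg *d)
{
    assert(d != NULL);
    if (d->type == MODEL_STATEID_INVALID)
        return false;
    d->flags |= MODEL_F_TEST_EXPIRED;
    return true;
}

/* Rule 2 from the message: marking revoked sets REVOKED and invalidates
 * the stateid type. */
void model_mark_revoked(struct model_deleg *d)
{
    assert(d != NULL);
    d->type = MODEL_STATEID_INVALID;
    d->flags |= MODEL_F_REVOKED;
}
```

In this model, a delegation can end up with both REVOKED and TEST_EXPIRED
only if model_mark_test_expired() runs before model_mark_revoked(), which
is the order of steps 2 and 4. In the reverse order the TEST_EXPIRED
request is refused.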
I think Deleg A was added to the server's cl_revoked list due to [1]. For
the file corresponding to Deleg B, no access conflict occurred, which
means no deleg return was triggered. Therefore, unlike Deleg A, it would
not go through the process of nfs4_delegreturn_done -->
nfs_delegation_mark_returned --> nfs_mark_delegation_revoked to be set
with NFS4_INVALID_STATEID_TYPE, and thus could be flagged with
NFS_DELEGATION_TEST_EXPIRED.
Why is the first delegation, Deleg A, on the server's cl_revoked list? Is
[1] the only condition that can cause this? And why does this only happen
when the file count is large?
I once saw 700 delegations on the server but 400,000+ delegations on the
client. Could this give a clue about the problem?
I'm afraid I cannot explain why there is such a significant discrepancy in
the number of delegations between the client and the server. I truly don't
know what is happening.

Thanks,
Lingfeng

On 2025/9/2 20:43, Benjamin Coddington wrote:
On 2 Sep 2025, at 8:10, Li Lingfeng wrote:

Our expected outcome was that the client would release the abnormal
delegation via TEST_STATEID/FREE_STATEID upon detecting its invalidity.
However, this problematic delegation is no longer present in the
client's server->delegations list, whether due to client-side timeouts or
the server-side bug [1].
How does the client time out TEST_STATEID? Are you mounting with 'soft'?

We should find the server-side bug and fix it rather than write code to
paper over it.  I do think the synchronization of state here is a bit
fragile and wish the protocol had a generation, sequence, or marker for
setting SEQ4_STATUS_ bits.

Should we instead just administratively evict the client since it's
clearly not behaving right in this case?
Thanks for the suggestion. While administratively evicting the client would
certainly resolve the immediate delegation issue, I'm concerned that approach
might be a bit heavy-handed.
The problematic behavior seems isolated to a single delegation. Meanwhile,
the client itself likely has numerous other open files and active state on
the server. Forcing a complete client reconnect would tear down all that
state, which could cause significant application disruption and be perceived
as a service outage from the client's perspective.

[1] https://lore.kernel.org/all/de669327-c93a-49e5-a53b-bda9e67d34a2@xxxxxxxxxx/
^^ in this thread you reference v5.10 - there was a knfsd fix for a
cl_revoked leak "3b816601e279", and there have been 3 or 4 fixes to fix
problems and optimize the client walk of delegations since then.  Jeff
pointed out that there have been fixes in these areas.  Are you finding this
problem still with all those fixes included?

Ben






