Re: [RFC PATCH 1/1] nfsd: delay re-registering of listeners after listener removal

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 25 Aug 2025 10:58:23 -0400

On 8/23/25 7:52 PM, NeilBrown wrote:
> On Fri, 22 Aug 2025, Chuck Lever wrote:
>> On 8/21/25 4:43 PM, Olga Kornievskaia wrote:
>>> This patch tries to address the following failure:
>>> nfsdctl threads 0
>>> nfsdctl listener +rdma::20049
>>> nfsdctl listener +tcp::2049
>>> nfsdctl listener -tcp::2049
>>> nfsdctl: Error: Cannot assign requested address
>>>
>>> The reason for the failure is due to the fact that socket cleanup only
>>> happens in __svc_rdma_free() which is a deferred work triggers when an
>>> rdma transport is destroyed. To remove a listener nfsdctl is forced to
>>> first remove all transports via svc_xprt_destroy_all() and then re-add
>>> the ones that are left. Due to the fact that there isn't a way to
>>> delete a particular entry from a list where permanent sockets are
>>> stored.
>>
>> The issue is specifically with llist, which does not permit the
>> deletion of any entry other than the first on the list.
> 
> That is true but not really relevant.
> We can remove everything, walk the resulting list removing individual
> things, an requeue everything we didn't want to remove.
> This is exactly what svc_clean_up_xprts() does.

The semantics of svc_xprt_destroy_all() is to actually destroy all
transports, not just dequeue them. It targets both listeners and active
connections.

Couldn't that result in unwanted disruption of NFS client activity if
the NFS server still has threads and other listeners?

We want to be able to remove an arbitrary listener (and I will include
the UDP socket in this) while the NFS server is still operational. The
goal is maintaining server availability while enabling administrators to
adjust the server's network configuration.

I'm wondering why we can't use a simpler mechanism here such as simply
calling svc_xprt_deferred_close() on a specific xprt, after having
provided a reliable facility to wait for an individual transport to shut
down.

svc_xprt_destroy_all() seems appropriate for full server shutdown, but
it doesn't align well with deleting a single listener.

> To remove a signle xprt We just need to set XPT_CLOSE, call
> svc_delete_xprt() then run svc_clean_up_xprts().

svc_clean_up_xprts() already sets XPT_CLOSE and calls svc_delete_xprt().
I think you are suggesting a new API that does what svc_clean_up_xprts()
does, but only to a single specific xprt.

Is there also some expectation that, on return from
svc_xprt_destroy_all(), the doomed xprt is fully gone? If that is the
case, the API contract for svc_xprt_destroy_all() should state that it
is indeed a synchronous call that returns only when all transports
associated with { @serv, @net } have been released.

And in that case, the RDMA transport has always been a lingerer. We have
never noticed it because svc_xprt_destroy_all() is currently done only
on full server shutdown.

> I think svc_clean_up_xprts() should call svc_pool_wake_idle_thread()
> after lwq_enqueue_batch() in case one of the xprts that we kept became
> ready while it was temporarily not on the queue.

> Thanks,
> NeilBrown
> 
>>
>>
>>> Going back to the deferred work done in __svc_rdma_free(), the
>>> work might not get to run before nfsd_nl_listener_set_doit() creates
>>> the new transports. As a result, it finds that something is still
>>> listening of the rdma port and rdma_bind_addr() fails.
>>>
>>> Proposed solution is to add a delay after svc_xprt_destroy_all() to
>>> allow for the deferred work to run.
>>>
>>> --- Is the chosen value of 1s enough to ensure socket goes away?
>>> I can't guarantee that.
>>
>> Adding a sleep and hoping it works is ... not a proper fix. The
>> msleep() in svc_xprt_destroy_all() is part of a polling loop,
>> and it is only waiting for the xprt lists to become empty. You're
>> not polling here (ie, checking for completion before sleeping).
>>
>>
>>> --- Alternatives that i can think of:
>>> (1) to go back to listener removal approach that added removal of
>>> entry to the llist api. That would not require a removal of all
>>> transport causing this problem to occur. Earlier it was preferred
>>> not to change llist api.
>>> (2) some method of checking that all deferred work occuring in
>>> svc_xprt_destroy_all() completed.
>>
>> Jeff (and perhaps Lorenzo) need to go back to the original reasons why
>> this was done and rework it. I think we were avoiding holding the
>> nfsd mutex in here?
>>
>> Complete shutdown of a transport always involve some deferred
>> activity, and there's a reference count involved as well. I can't
>> see how the current destroy/re-insert mechanism can be made reliable.
>>
>>
>>> Fixes: d093c90892607 ("nfsd: fix management of listener transports")
>>> Signed-off-by: Olga Kornievskaia <okorniev@xxxxxxxxxx>
>>> ---
>>>  fs/nfsd/nfsctl.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
>>> index dd3267b4c203..f9f5670abcc3 100644
>>> --- a/fs/nfsd/nfsctl.c
>>> +++ b/fs/nfsd/nfsctl.c
>>> @@ -1998,8 +1998,10 @@ int nfsd_nl_listener_set_doit(struct sk_buff *skb, struct genl_info *info)
>>>  	 * Since we can't delete an arbitrary llist entry, destroy the
>>>  	 * remaining listeners and recreate the list.
>>>  	 */
>>> -	if (delete)
>>> +	if (delete) {
>>>  		svc_xprt_destroy_all(serv, net, false);
>>> +		ssleep(1);
>>> +	}
>>>  
>>>  	/* walk list of addrs again, open any that still don't exist */
>>>  	nlmsg_for_each_attr(attr, info->nlhdr, GENL_HDRLEN, rem) {
>>
>>
>> -- 
>> Chuck Lever
>>
> 

-- 
Chuck Lever