I posted a different patch with the suggested approach. Tigran. ----- Original Message ----- > From: "Trond Myklebust" <trondmy@xxxxxxxxxx> > To: "Tigran Mkrtchyan" <tigran.mkrtchyan@xxxxxxx>, "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx> > Cc: "Anna Schumaker" <anna@xxxxxxxxxx> > Sent: Wednesday, 25 June, 2025 21:39:15 > Subject: Re: [PATCH 1/1] pNFS/flexfiles: mark device unavailable on fatal connection error > On Wed, 2025-06-25 at 21:19 +0200, Mkrtchyan, Tigran wrote: >> >> Hi Folks, >> >> Do you have any opinion on this one? Would you like me to address it >> differently? >> > > I don't think we should mark the device as being unavailable just > because someone signalled the RPC task. > > It would be better to have nfs4_ff_layout_prepare_ds() return any fatal > errors that it encounters using ERR_PTR(), so that the callers can > handle them. Then maybe return ERR_PTR(-EAGAIN) for the case where we > currently return NULL so that those callers don't have to use the hated > IS_ERR_OR_NULL() test. > >> Tigran. >> >> ----- Original Message ----- >> > From: "Tigran Mkrtchyan" <tigran.mkrtchyan@xxxxxxx> >> > To: "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx> >> > Cc: "Trond Myklebust" <trondmy@xxxxxxxxxx>, "Anna Schumaker" >> > <anna@xxxxxxxxxx>, "Tigran Mkrtchyan" >> > <tigran.mkrtchyan@xxxxxxx> >> > Sent: Monday, 9 June, 2025 23:43:03 >> > Subject: [PATCH 1/1] pNFS/flexfiles: mark device unavailable on >> > fatal connection error >> >> > Fixes: 260f32adb88 ("pNFS/flexfiles: Check the result of >> > nfs4_pnfs_ds_connect") >> > >> > When an applications get killed (SIGTERM/SIGINT) while pNFS client >> > performs a >> > connection >> > to DS, client ends in an infinite loop of connect-disconnect. This >> > source of the issue, it that >> > flexfilelayoutdev#nfs4_ff_layout_prepare_ds gets an >> > error >> > on nfs4_pnfs_ds_connect with status ERESTARTSYS, which is set by >> > rpc_signal_task, but >> > the error is treated as transient, thus retried. >> > >> > The issue is reproducible with script as (there should be ~1000 >> > files in >> > a directory, client should must not have any connections to DSes): >> > >> > ``` >> > echo 3 > /proc/sys/vm/drop_caches >> > >> > for i in * >> > do >> > head -1 $i & >> > PP=$! >> > sleep 10e-03 >> > kill -TERM $PP >> > done >> > ``` >> > >> > Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@xxxxxxx> >> > --- >> > fs/nfs/flexfilelayout/flexfilelayoutdev.c | 4 ++++ >> > 1 file changed, 4 insertions(+) >> > >> > diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c >> > b/fs/nfs/flexfilelayout/flexfilelayoutdev.c >> > index 4a304cf17c4b..0008a8180c9b 100644 >> > --- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c >> > +++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c >> > @@ -410,6 +410,10 @@ nfs4_ff_layout_prepare_ds(struct >> > pnfs_layout_segment *lseg, >> > mirror->mirror_ds->ds_versions[0].wsize = >> > max_payload; >> > goto out; >> > } >> > + /* There is a fatal error to connect to DS. Mark it >> > unavailable to avoid >> > infinite retry loop. */ >> > + if (nfs_error_is_fatal(status)) >> > + nfs4_mark_deviceid_unavailable(&mirror->mirror_ds- >> > >id_node); >> > + >> > noconnect: >> > ff_layout_track_ds_error(FF_LAYOUT_FROM_HDR(lseg- >> > >pls_layout), >> > mirror, lseg->pls_range.offset, >> > -- >> > 2.49.0 > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trondmy@xxxxxxxxxx, trond.myklebust@xxxxxxxxxxxxxxx
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature