Re: [PATCH 1/1] pNFS/flexfiles: mark device unavailable on fatal connection error

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I posted a different patch with the suggested approach. 

Tigran.

----- Original Message -----
> From: "Trond Myklebust" <trondmy@xxxxxxxxxx>
> To: "Tigran Mkrtchyan" <tigran.mkrtchyan@xxxxxxx>, "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> Cc: "Anna Schumaker" <anna@xxxxxxxxxx>
> Sent: Wednesday, 25 June, 2025 21:39:15
> Subject: Re: [PATCH 1/1] pNFS/flexfiles: mark device unavailable on fatal connection error

> On Wed, 2025-06-25 at 21:19 +0200, Mkrtchyan, Tigran wrote:
>> 
>> Hi Folks,
>> 
>> Do you have any opinion on this one? Would you like me to address it
>> differently?
>> 
> 
> I don't think we should mark the device as being unavailable just
> because someone signalled the RPC task.
> 
> It would be better to have nfs4_ff_layout_prepare_ds() return any fatal
> errors that it encounters using ERR_PTR(), so that the callers can
> handle them. Then maybe return ERR_PTR(-EAGAIN) for the case where we
> currently return NULL so that those callers don't have to use the hated
> IS_ERR_OR_NULL() test.
> 
>> Tigran.
>> 
>> ----- Original Message -----
>> > From: "Tigran Mkrtchyan" <tigran.mkrtchyan@xxxxxxx>
>> > To: "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
>> > Cc: "Trond Myklebust" <trondmy@xxxxxxxxxx>, "Anna Schumaker"
>> > <anna@xxxxxxxxxx>, "Tigran Mkrtchyan"
>> > <tigran.mkrtchyan@xxxxxxx>
>> > Sent: Monday, 9 June, 2025 23:43:03
>> > Subject: [PATCH 1/1] pNFS/flexfiles: mark device unavailable on
>> > fatal connection error
>> 
>> > Fixes: 260f32adb88 ("pNFS/flexfiles: Check the result of
>> > nfs4_pnfs_ds_connect")
>> > 
>> > When an applications get killed (SIGTERM/SIGINT) while pNFS client
>> > performs a
>> > connection
>> > to DS, client ends in an infinite loop of connect-disconnect. This
>> > source of the issue, it that
>> > flexfilelayoutdev#nfs4_ff_layout_prepare_ds gets an
>> > error
>> > on nfs4_pnfs_ds_connect with status ERESTARTSYS, which is set by
>> > rpc_signal_task, but
>> > the error is treated as transient, thus retried.
>> > 
>> > The issue is reproducible with script as (there should be ~1000
>> > files in
>> > a directory, client should must not have any connections to DSes):
>> > 
>> > ```
>> > echo 3 > /proc/sys/vm/drop_caches
>> > 
>> > for i in *
>> > do
>> >        head -1 $i &
>> >        PP=$!
>> >        sleep 10e-03
>> >        kill -TERM $PP
>> > done
>> > ```
>> > 
>> > Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@xxxxxxx>
>> > ---
>> > fs/nfs/flexfilelayout/flexfilelayoutdev.c | 4 ++++
>> > 1 file changed, 4 insertions(+)
>> > 
>> > diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
>> > b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
>> > index 4a304cf17c4b..0008a8180c9b 100644
>> > --- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
>> > +++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
>> > @@ -410,6 +410,10 @@ nfs4_ff_layout_prepare_ds(struct
>> > pnfs_layout_segment *lseg,
>> > 			mirror->mirror_ds->ds_versions[0].wsize =
>> > max_payload;
>> > 		goto out;
>> > 	}
>> > +	/* There is a fatal error to connect to DS. Mark it
>> > unavailable to avoid
>> > infinite retry loop. */
>> > +	if (nfs_error_is_fatal(status))
>> > +		nfs4_mark_deviceid_unavailable(&mirror->mirror_ds-
>> > >id_node);
>> > +
>> > noconnect:
>> > 	ff_layout_track_ds_error(FF_LAYOUT_FROM_HDR(lseg-
>> > >pls_layout),
>> > 				 mirror, lseg->pls_range.offset,
>> > --
>> > 2.49.0
> 
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trondmy@xxxxxxxxxx, trond.myklebust@xxxxxxxxxxxxxxx

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux