Re: [PATCH] NFS: Avoid flushing data while holding directory locks in nfs_rename()

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 29 Apr 2025 23:22:47 +0000

On Tue, 2025-04-29 at 12:14 -0400, Chuck Lever wrote:
> On 4/27/25 7:01 PM, trondmy@xxxxxxxxxx wrote:
> > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > 
> > The Linux client assumes that all filehandles are non-volatile for
> > renames within the same directory (otherwise sillyrename cannot
> > work).
> > However, the existence of the Linux 'subtree_check' export option
> > has
> > meant that nfs_rename() has always assumed it needs to flush writes
> > before attempting to rename.
> > 
> > Since NFSv4 does allow the client to query whether or not the
> > server
> > exhibits this behaviour, and since knfsd does actually set the
> > appropriate flag when 'subtree_check' is enabled on an export, it
> > should be OK to optimise away the write flushing behaviour in the
> > cases
> > where it is clearly not needed.
> 
> No objection to the high level goal, seems like a sensible change.
> 
> So NFSv3 still has the flushing requirement, but NFSv4 can disable
> that
> requirement as long as the server in question asserts
> FH4_VOLATILE_ANY
> and FH4_VOL_RENAME, correct?
> 
> I'm wondering how confident we are that other server implementations
> behave reasonably. Yes, they are broken if they don't behave, but
> there
> is still risk of introducing a regression.

I'd prefer to deal with that through other means if it turns out that
we have to. The problem we have right now is that we are forcing a
writeback of the entire file while holding several directory mutexes
(as well as the filesystem-global s_vfs_rename_mutex). That's terrible
for performance and scalability.

> 
> 
> > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > ---
> >  fs/nfs/client.c           |  2 ++
> >  fs/nfs/dir.c              | 15 ++++++++++++++-
> >  include/linux/nfs_fs_sb.h |  6 ++++++
> >  3 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> > index 2115c1189c2d..6d63b958c4bb 100644
> > --- a/fs/nfs/client.c
> > +++ b/fs/nfs/client.c
> > @@ -1105,6 +1105,8 @@ struct nfs_server *nfs_create_server(struct
> > fs_context *fc)
> >  		if (server->namelen == 0 || server->namelen >
> > NFS2_MAXNAMLEN)
> >  			server->namelen = NFS2_MAXNAMLEN;
> >  	}
> > +	/* Linux 'subtree_check' borkenness mandates this setting
> > */
> 
> Assuming you are going to respin this patch to deal with the build
> bot
> warnings, can you make this comment more useful? "borkenness" sounds
> like you are dealing with a server bug here, but that's not really
> the case. subtree_check is working as designed: it requires the extra
> flushing. The no_subtree_check case does not, IIUC.

subtree checking is working as designed, but that doesn't change the
fact that it is a violation of the NFSv3 protocol. You can't to change
the filehandle in mid flight in any stateless protocol, because that
will break applications.

> It would be better to explain this need strictly in terms of file
> handle
> volatility, no?
> 
> 
> > +	server->fh_expire_type = NFS_FH_VOL_RENAME;
> >  
> >  	if (!(fattr->valid & NFS_ATTR_FATTR)) {
> >  		error = ctx->nfs_mod->rpc_ops->getattr(server,
> > ctx->mntfh,
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index bd23fc736b39..d0e0b435a843 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -2676,6 +2676,18 @@ nfs_unblock_rename(struct rpc_task *task,
> > struct nfs_renamedata *data)
> >  	unblock_revalidate(new_dentry);
> >  }
> >  
> > +static bool nfs_rename_is_unsafe_cross_dir(struct dentry
> > *old_dentry,
> > +					   struct dentry
> > *new_dentry)
> > +{
> > +	struct nfs_server *server = NFS_SB(old_dentry->d_sb);
> > +
> > +	if (old_dentry->d_parent != new_dentry->d_parent)
> > +		return false;
> > +	if (server->fh_expire_type & NFS_FH_RENAME_UNSAFE)
> > +		return !(server->fh_expire_type &
> > NFS_FH_NOEXPIRE_WITH_OPEN);
> > +	return true;
> > +}
> > +
> >  /*
> >   * RENAME
> >   * FIXME: Some nfsds, like the Linux user space nfsd, may generate
> > a
> > @@ -2763,7 +2775,8 @@ int nfs_rename(struct mnt_idmap *idmap,
> > struct inode *old_dir,
> >  
> >  	}
> >  
> > -	if (S_ISREG(old_inode->i_mode))
> > +	if (S_ISREG(old_inode->i_mode) &&
> > +	    nfs_rename_is_unsafe_cross_dir(old_dentry,
> > new_dentry))
> >  		nfs_sync_inode(old_inode);
> >  	task = nfs_async_rename(old_dir, new_dir, old_dentry,
> > new_dentry,
> >  				must_unblock ? nfs_unblock_rename
> > : NULL);
> > diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
> > index 71319637a84e..6732c7e04d19 100644
> > --- a/include/linux/nfs_fs_sb.h
> > +++ b/include/linux/nfs_fs_sb.h
> > @@ -236,6 +236,12 @@ struct nfs_server {
> >  	u32			acl_bitmask;	/* V4 bitmask
> > representing the ACEs
> >  						   that are
> > supported on this
> >  						   filesystem */
> > +	/* The following #defines numerically match the NFSv4
> > equivalents */
> > +#define NFS_FH_NOEXPIRE_WITH_OPEN (0x1)
> > +#define NFS_FH_VOLATILE_ANY (0x2)
> > +#define NFS_FH_VOL_MIGRATION (0x4)
> > +#define NFS_FH_VOL_RENAME (0x8)
> > +#define NFS_FH_RENAME_UNSAFE (NFS_FH_VOLATILE_ANY |
> > NFS_FH_VOL_RENAME)
> >  	u32			fh_expire_type;	/* V4
> > bitmask representing file
> >  						   handle
> > volatility type for
> >  						   this filesystem
> > */
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx