On Wed, 11 Jun 2025, Al Viro wrote: > On Mon, Jun 09, 2025 at 05:34:10PM +1000, NeilBrown wrote: > > Once we support directory operations (e.g. create) without requiring the > > parent to be locked, the current practice locking a directory while > > processing rmdir() or similar will not be sufficient to wait for > > operations to complete and to block further operations. > > > > This patch introduced a new inode flag S_DYING. It indicates that > > a rmdir or similar is being processed and new directory operations must > > not commence in the directory. They should not abort either as the > > rmdir might fail - instead they should block. They can do this by > > waiting for a lock on the inode. > > > > A new interface rmdir_lock() locks the inode, sets this flag, and waits > > for any children with DCACHE_LOCK set to complete their operation, and > > for any d_in_lookup() children to complete the lookup. It should be > > called before attempted to delete the directory or set S_DEAD. Matching > > rmdir_unlock() clears the flag and unlocks the inode. > > > > dentry_lock() and d_alloc_parallel() are changed to block while this > > flag it set and to fail if the parent IS_DEADDIR(), though dentry_lock() > > doesn't block for d_in_lookup() dentries. > > > diff --git a/fs/namei.c b/fs/namei.c > > index 4ad76df21677..c590f25d0d49 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -1770,8 +1770,11 @@ static bool __dentry_lock(struct dentry *dentry, > > struct dentry *base, const struct qstr *last, > > unsigned int subclass, int state) > > { > > + struct dentry *parent; > > + struct inode *dir; > > int err; > > > > +retry: > > lock_acquire_exclusive(&dentry->dentry_map, subclass, 0, NULL, _THIS_IP_); > > spin_lock(&dentry->d_lock); > > err = wait_var_event_any_lock(&dentry->d_flags, > > @@ -1782,10 +1785,43 @@ static bool __dentry_lock(struct dentry *dentry, > > spin_unlock(&dentry->d_lock); > > return false; > > } > > - > > - dentry->d_flags |= DCACHE_LOCK; > > + parent = dentry->d_parent; > > Why will it stay the parent? Matter of fact, why will it stay positive? As long as we continue to hold dentry->d_lock it will stay the parent, and so will have a reference, and so will stay positive. > > > + dir = igrab(parent->d_inode); > > ... and not oops right here? Still holding dentry->d_lock here so parent cannot have changed. > > > + lock_map_release(&dentry->dentry_map); > > spin_unlock(&dentry->d_lock); > > - return true; > > + > > + if (state == TASK_KILLABLE) { > > + err = down_write_killable(&dir->i_rwsem); > > + if (err) { > > + iput(dir); > > + return false; > > + } > > + } else > > + inode_lock(dir); > > + /* S_DYING much be clear now */ > > + inode_unlock(dir); > > + iput(dir); > > + goto retry; > > OK, I'm really confused now. Is it allowed to call dentry_lock() while holding > ->i_rwsem of the parent? Yes. > > Where does your dentry lock nest wrt ->i_rwsem? As a bonus (well, malus, I guess) > question, where does it nest wrt parent *and* child inodes' ->i_rwsem for rmdir > and rename? Between inode of parent of the dentry and inode of the dentry. In this case we aren't holding the dentry lock when we lock the parent. We might already have the parent locked when calling dentry_lock() if the filesystem hasn't opted out) but in that case S_DYING will not be set. It is only set while holding the i_rw_sem in case which doesn't lock dentries. So if S_DYING is set here, then it must be safe to lock the parent. > > Tangentially connected question: which locks are held for ->unlink() in your > scheme? You do need *something* on the victim inode to protect ->i_nlink > modifications, and anything on dentries of victim or their parent directories > is not going to give that. > I haven't change the locking on non-directories at all. The target of ->unlink() will be locked after the dentry is locked (which is after the parent is locked if the fs requires that). ->i_nlink for non-directories is still protected by ->i_rwsem on the inode. ->i_nlink for directories is something the fs will have to handle when opting out of i_rwsem on directory ops. NFS, for example, already takes inode->i_lock when calling inc_nlink() or drop_nlink(). The set_nlink() in nfs_update_inode() isn't obviously protected but as the number is informational it possibly doesn't matter. Thanks, NeilBrown