[PATCH 0/8 preview] demonstrate proposed new locking strategy for directories

NeilBrown <neil@xxxxxxxxxx> · Mon, 9 Jun 2025 17:34:05 +1000

These patches are still under development.  In particular some proper
documentation is needed.  They are sufficient to demonstrate my design.

They add an alternate mechanism for providing the locking that the VFS
needs for directory operations.  This includes:
 - only one operation per name at a time
 - no operations in a directory being removed
 - no concurrent cross-directory renames which might result in an
    ancestor loop

I had originally hoped to push the locking of i_rw_sem down into the
filesystems and have the new locking on top of that.  This turned out to
be impractical.  This series leaves the i_rw_sem locking where it is,
introduces new locking that happens while the directory is locked, and
gives the filesystem the option of disabling (most of) the i_rw_sem
locking.  Once all filesystems are converted the i_rw_sem locking can be
removed.

A shared lock on i_rw_sem is still used for readdir and simple lookup, to
exclude them while rmdir is happening.

The problem with pushing i_rw_sem down is that I still want to use it to
exclude readdir while rmdir is happening.  Some readdir implementations
use the result to prime the dcache which means creating d_in_lookup()
dentries in the directory.  If we can do this while holding i_rw_sem,
then it is not safe to take i_rw_sem while holding a d_in_lookup()
dentry.  So i_rw_sem CANNOT be taken after a lookup has been performed -
it must be before, or never.

Another issue is that after taking i_rw_sem in rmdir() I need to wait
for any dentries that are still locked.  Waiting for the dentry lock
while holding i_rw_sem means we cannot take i_rw_sem after getting a
dentry lock.

So we take i_rw_sem for filesystems that still require it (initially
all) but still do the other locking which will be uncontended.  This
exercises the code to help ensure it is ready when we remove the
i_rw_sem requirement for any given filesystem.

The central feature is a per-dentry lock implemented with a couple of
d_flags and wait_var_event/wake_up_var.  A single thread can take 1,
sometimes 2, occasionally 3 locks on different dentries.
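
To give a feel for the shape of such a lock, here is a minimal userspace
sketch.  It is not code from the series: the flag names and the
demo_dentry structure are made up for illustration, and a pthread
mutex/condvar pair stands in for d_lock and
wait_var_event()/wake_up_var().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical flag names; the series uses a couple of d_flags bits. */
#define DEMO_LOCKED  0x1u	/* an operation currently owns this name */
#define DEMO_WAITERS 0x2u	/* someone is sleeping, so unlock must wake */

struct demo_dentry {
	unsigned int flags;
	pthread_mutex_t mu;	/* stands in for d_lock */
	pthread_cond_t cv;	/* stands in for wait_var_event()/wake_up_var() */
};

void demo_dentry_init(struct demo_dentry *d)
{
	d->flags = 0;
	pthread_mutex_init(&d->mu, NULL);
	pthread_cond_init(&d->cv, NULL);
}

/* Take the lock only if it is free; never sleeps waiting for the owner. */
bool demo_dentry_trylock(struct demo_dentry *d)
{
	bool ok;

	pthread_mutex_lock(&d->mu);
	ok = !(d->flags & DEMO_LOCKED);
	if (ok)
		d->flags |= DEMO_LOCKED;
	pthread_mutex_unlock(&d->mu);
	return ok;
}

/* Sleep until the lock is free, then take it. */
void demo_dentry_lock(struct demo_dentry *d)
{
	pthread_mutex_lock(&d->mu);
	while (d->flags & DEMO_LOCKED) {
		d->flags |= DEMO_WAITERS;
		pthread_cond_wait(&d->cv, &d->mu);
	}
	d->flags |= DEMO_LOCKED;
	pthread_mutex_unlock(&d->mu);
}

/* Drop the lock and wake sleepers only if somebody asked to be woken. */
void demo_dentry_unlock(struct demo_dentry *d)
{
	bool wake;

	pthread_mutex_lock(&d->mu);
	wake = (d->flags & DEMO_WAITERS) != 0;
	d->flags &= ~(DEMO_LOCKED | DEMO_WAITERS);
	pthread_mutex_unlock(&d->mu);
	if (wake)
		pthread_cond_broadcast(&d->cv);
}
```

The second flag is there so that the common uncontended unlock does not
need to issue a wakeup at all.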

A second lock is needed for rename - we lock the two dentries in
address-order after confirming there is no hierarchical relationship.
It is also needed for silly-rename as part of unlink.  In this case the
plan is for the second dentry to always be a d_in_lookup dentry so the
lock is guaranteed to be uncontended.  I'm not sure I got that finished
yet.
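
The address-ordering rule for the rename case can be sketched in
userspace as follows.  This is only an illustration under assumptions:
demo_name and the helper names are hypothetical, plain pthread mutexes
stand in for the per-dentry lock, and the check for a hierarchical
relationship that must come first is not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a dentry; only the lock matters here. */
struct demo_name {
	pthread_mutex_t lock;
};

/* Return whichever of the two objects lives at the lower address. */
struct demo_name *demo_lower(struct demo_name *a, struct demo_name *b)
{
	return (uintptr_t)a < (uintptr_t)b ? a : b;
}

/*
 * Lock two distinct names.  Every caller takes the lower address first,
 * so two threads locking the same pair in opposite argument order
 * cannot deadlock against each other.
 */
void demo_lock_pair(struct demo_name *a, struct demo_name *b)
{
	struct demo_name *first = demo_lower(a, b);
	struct demo_name *second = (first == a) ? b : a;

	assert(a != b);
	pthread_mutex_lock(&first->lock);
	pthread_mutex_lock(&second->lock);
}

/* Unlock order does not matter for correctness. */
void demo_unlock_pair(struct demo_name *a, struct demo_name *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}
```

Address order is only safe once the two names are known not to be
ancestor and descendant, which is why that check comes first in the
description above.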

The three-dentry case is a rename which results in a silly-rename of the
target.

For rmdir we introduce S_DYING so that marking a directory as S_DEAD is
two-stage.  We mark it S_DYING, which will prevent more dentry locks
being taken, then we wait for the locks that were already taken, then
set S_DEAD.
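
A rough userspace model of that two-stage sequence is below.  It is a
sketch under assumptions, not the patch itself: the enum, the demo_dir
structure and the counter of locked children are all invented for
illustration (the real S_DYING and S_DEAD are inode flags, and the
series waits on the dentry locks themselves rather than on a counter).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical states; DEMO_DYING and DEMO_DEAD model S_DYING and S_DEAD. */
enum demo_dir_state { DEMO_LIVE, DEMO_DYING, DEMO_DEAD };

struct demo_dir {
	enum demo_dir_state state;
	int locked_children;	/* assumed counter of held name locks */
	pthread_mutex_t mu;
	pthread_cond_t cv;
};

void demo_dir_init(struct demo_dir *dir)
{
	dir->state = DEMO_LIVE;
	dir->locked_children = 0;
	pthread_mutex_init(&dir->mu, NULL);
	pthread_cond_init(&dir->cv, NULL);
}

/* Take a name lock in dir; refused once removal has started. */
bool demo_child_lock(struct demo_dir *dir)
{
	bool ok;

	pthread_mutex_lock(&dir->mu);
	ok = (dir->state == DEMO_LIVE);
	if (ok)
		dir->locked_children++;
	pthread_mutex_unlock(&dir->mu);
	return ok;
}

/* Release a name lock and wake a remover waiting for the last one. */
void demo_child_unlock(struct demo_dir *dir)
{
	pthread_mutex_lock(&dir->mu);
	if (--dir->locked_children == 0)
		pthread_cond_broadcast(&dir->cv);
	pthread_mutex_unlock(&dir->mu);
}

/* Stage one: stop new name locks from being taken. */
void demo_dir_begin_remove(struct demo_dir *dir)
{
	pthread_mutex_lock(&dir->mu);
	dir->state = DEMO_DYING;
	pthread_mutex_unlock(&dir->mu);
}

/* Stage two: wait for locks taken before stage one, then mark dead. */
void demo_dir_finish_remove(struct demo_dir *dir)
{
	pthread_mutex_lock(&dir->mu);
	while (dir->locked_children > 0)
		pthread_cond_wait(&dir->cv, &dir->mu);
	dir->state = DEMO_DEAD;
	pthread_mutex_unlock(&dir->mu);
}
```

The intermediate state is what lets the remover sleep safely: nothing
new can start in the directory while it waits for the old holders to
finish.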

For rename ...  maybe just read the patch.  I tried to explain it
thoroughly.

The goal is to perform create/remove/rename without any mutex/semaphore
held by the VFS.  This will allow concurrent operations in a directory
and prepare the way for async operation so that e.g.  io_uring could be
given a list of many names in a directory to unlink and it could unlink
them in parallel.  We probably need to make changes to the locking on
the inode being removed before this can be fully achieved - I haven't
explored that in detail yet.

Thanks,
NeilBrown