On Tue, Apr 22, 2025 at 09:31:14AM +0200, Christian Brauner wrote:
> On Mon, Apr 21, 2025 at 05:29:47PM +0100, Al Viro wrote:
> > On Mon, Apr 21, 2025 at 09:56:20AM +0200, Christian Brauner wrote:
> > > On Mon, Apr 21, 2025 at 04:35:09AM +0100, Al Viro wrote:
> > > > Not since 8f2918898eb5 "new helpers: vfs_create_mount(), fc_mount()"
> > > > back in 2018.  Get rid of the dead checks...
> > > >
> > > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> > > > ---
> > >
> > > Good idea. Fwiw, I've put this into vfs-6.16.mount with some other minor
> > > stuff. If you're keeping it yourself let me know.
> >
> > Not sure... I'm going through documenting the struct mount
> > lifecycle/locking/etc. and it already looks like there will be more
> > patches, but then some are going to be #fixes fodder.
> >
> > Example caught just a couple of minutes ago: do_lock_mount()
> >         if (beneath) {
> >                 m = real_mount(mnt);
> >                 read_seqlock_excl(&mount_lock);
> >                 dentry = dget(m->mnt_mountpoint);
> >                 read_sequnlock_excl(&mount_lock);
> >         } else {
> >                 dentry = path->dentry;
> >         }
> >
> >         inode_lock(dentry->d_inode);
> > What's to prevent the 'beneath' case from getting mnt mount --move'd
> > away *AND* the ex-parent from getting unmounted while we are blocked
> > in inode_lock?  At this point we are not holding any locks whatsoever
> > (and all mount-related locks nest inside inode_lock(), so we couldn't
> > hold them there anyway).
> >
> > Hit that race and watch a very unhappy umount...
>
> If it gets unmounted or moved we immediately detect this in the next line:
>
>         if (beneath && (!is_mounted(mnt) || m->mnt_mountpoint != dentry)) {

Sure, we would - *AFTER* we get through that inode_lock().
Consider the following setup:

	mkdir foo
	mkdir bar
	mkdir splat
	mount -t tmpfs none foo		# mount 1
	mount -t tmpfs none bar		# mount 2
	mkdir bar/baz
	mount -t tmpfs none bar/baz	# mount 3

then

A: move_mount(AT_FDCWD, "foo", AT_FDCWD, "bar/baz", MOVE_MOUNT_BENEATH)
gets to do_move_mount() and into do_lock_mount() called by it.
path->mnt points to mount 3, path->dentry - to its root.  Both are
pinned.  do_lock_mount() goes into the first iteration of the loop.
beneath is true, so it picks dentry - that of #3's mountpoint, i.e.
"/baz" on the #2 tmpfs instance.  At that point the refcount of that
dentry is 3 - one from being a positive dentry on tmpfs, one from
being a mountpoint and one more just grabbed by do_lock_mount().

Now we enter inode_lock(dentry->d_inode).  Note that at that point A
is not holding any locks.  Suppose it gets preempted at this moment
for whatever reason.

B: mount --move bar/baz splat
Proceeds without any problems, mount #3 gets moved to "splat".  Now
the refcount of mount #2 is not pinned by anything and the refcount
of "/baz" on it is 2, since it's no longer a mountpoint.

B: umount bar
... and now it hits the fan, since the refcount of mount #2 is not
elevated by anything, so we do not hit -EBUSY and proceed through
umount(2) all the way to kill_litter_super(), which drops the refcount
of "/baz" to 1 and calls kill_anon_super().  Which gets to
shrink_dcache_for_umount() and from there - to umount_check() on that
dentry.  You get yelled at, then you get yelled at again for busy
inodes after umount (that dentry is pinning the inode down), etc.
Superblock of #2 is freed.

A: regains CPU.  is_mounted() is true (now at splat instead of
bar/baz, but still mounted), ->mnt_mountpoint does not match.  All
right, inode_unlock(dentry->d_inode), then dput(dentry) and now the
refcount of that dentry finally hits zero.  We get iput() on its
inode, followed by shmem_evict_inode(), which is where we finally oops.

As for the second issue...
Normal callers of unlock_mount() do have a struct path somewhere that
pins the location we are dealing with.  However, the 'beneath' case of
do_move_mount() does not - it relies upon the sucker being a mountpoint
all along.  Which is fine until you drop namespace_sem.  As soon as
namespace_unlock() has been called, there's no guarantee that it will
_stay_ a mountpoint.  Moving that inode_unlock() before the
namespace_unlock() avoids that scenario.
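[Editor's aside: a sketch of the ordering change being described.  This
is hypothetical, not the actual patch - the real unlock_mount() in
fs/namespace.c may differ in detail - but it shows the reordering:]

```c
/* Sketch only.  While namespace_sem is still held, the dentry is
 * guaranteed to remain a mountpoint, so dropping the inode lock
 * before namespace_unlock() closes the window described above.
 */
static void unlock_mount(struct mountpoint *where)
{
	struct dentry *dentry = where->m_dentry;

	read_seqlock_excl(&mount_lock);
	put_mountpoint(where);
	read_sequnlock_excl(&mount_lock);

	inode_unlock(dentry->d_inode);	/* was: after namespace_unlock() */
	namespace_unlock();
}
```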