Re: [PATCH][RFC] ->mnt_devname is never NULL

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 22 Apr 2025 15:40:54 +0200

> Consider the following setup:
> 
> mkdir foo
> mkdir bar
> mkdir splat
> mount -t tmpfs none foo		# mount 1
> mount -t tmpfs none bar		# mount 2
> mkdir bar/baz
> mount -t tmpfs none bar/baz	# mount 3
> 
> then
> 
> A: move_mount(AT_FDCWD, "foo", AT_FDCWD, "bar/baz", MOVE_MOUNT_BENEATH)
> gets to do_move_mount() and into do_lock_mount() called by it.
> 
> path->mnt points to mount 3, path->dentry - to its root.  Both are pinned.
> do_lock_mount() goes into the first iteration of loop.  beneath is true,
> so it picks dentry - that of #3 mountpoint, i.e. "/baz" on #2 tmpfs instance.
> 
> At that point refcount of that dentry is 3 - one from being a positive on
> tmpfs, one from being a mountpoint and one more just grabbed by do_lock_mount().
> 
> Now we enter inode_lock(dentry->d_inode).  Note that at that point A is not
> holding any locks.  Suppose it gets preempted at this moment for whatever reason.
> 
> B: mount --move bar/baz splat
> Proceeds without any problems, mount #3 gets moved to "splat".  Now refcount
> of mount #2 is not pinned by anything and refcount of "/baz" on it is 2, since
> it's no longer a mountpoint.
> 
> B: umount bar
> ... and now it hits the fan, since the refcount of mount #2 is not elevated by
> anything, so we do not hit -EBUSY and proceed through umount(2) all the way to

Ok, what you want to say is that we're not keeping a refcount on
path->mnt where the mountpoint is located that we're looking up in the
beneath case.

First iteration, @path->mnt_3 will hold a reference to mount 3. We now
get a reference to @m->mnt_mountpoint that resides on mnt_2 but
we're not holding a reference to mnt_2.

> As for the second issue...  Normal callers of unlock_mount() do have a struct path
> somewhere that pins the location we are dealing with.  However, 'beneath' case
> of do_move_mount() does not - it relies upon the sucker being a mountpoint all
> along.  Which is fine until you drop namespace_sem.  As soon as namespace_unlock()
> has been called, there's no warranty that it will _stay_ a mountpoint.  Moving
> that inode_unlock() before the namespace_unlock() avoids that scenario.

Right.

Thanks for the details, that helped!

Can you resend your changes as a proper commit, please? I have a set of
fixes for -rc4 already and ideally I'd like to include and backport this
right away (We also should really have at least a test with exactly that
layout you're describing.).

It'd be nice if we had infra to force context switches so things like
that were reliably testable...