> Consider the following setup: > > mkdir foo > mkdir bar > mkdir splat > mount -t tmpfs none foo # mount 1 > mount -t tmpfs none bar # mount 2 > mkdir bar/baz > mount -t tmpfs none bar/baz # mount 3 > > then > > A: move_mount(AT_FDCWD, "foo", AT_FDCWD, "bar/baz", MOVE_MOUNT_BENEATH) > gets to do_move_mount() and into do_lock_mount() called by it. > > path->mnt points to mount 3, path->dentry - to its root. Both are pinned. > do_lock_mount() goes into the first iteration of loop. beneath is true, > so it picks dentry - that of #3 mountpoint, i.e. "/baz" on #2 tmpfs instance. > > At that point refcount of that dentry is 3 - one from being a positive on > tmpfs, one from being a mountpoint and one more just grabbed by do_lock_mount(). > > Now we enter inode_lock(dentry->d_inode). Note that at that point A is not > holding any locks. Suppose it gets preempted at this moment for whatever reason. > > B: mount --move bar/baz splat > Proceeds without any problems, mount #3 gets moved to "splat". Now refcount > of mount #2 is not pinned by anything and refcount of "/baz" on it is 2, since > it's no longer a mountpoint. > > B: umount bar > ... and now it hits the fan, since the refcount of mount #2 is not elevated by > anything, so we do not hit -EBUSY and proceed through umount(2) all the way to Ok, what you want to say is that we're not keeping a refcount on path->mnt where the mountpoint is located that we're looking up in the beneath case. First iteration, @path->mnt_3 will hold a reference to mount 3. We now get a reference to @m->mnt_mountpoint that resides on mnt_2 but we're not holding a reference to mnt_2. > As for the second issue... Normal callers of unlock_mount() do have a struct path > somewhere that pins the location we are dealing with. However, 'beneath' case > of do_move_mount() does not - it relies upon the sucker being a mountpoint all > along. Which is fine until you drop namespace_sem. As soon as namespace_unlock() > has been called, there's no warranty that it will _stay_ a mountpoint. Moving > that inode_unlock() before the namespace_unlock() avoids that scenario. Right. Thanks for the details, that helped! Can you resend your changes as a proper commit, please? I have a set of fixes for -rc4 already and ideally I'd like to include and backport this right away (We also should really have at least a test with exactly that layout you're describing.). It'd be nice if we had infra to force context switches so things like that were reliably testable...