On Tue, Apr 22, 2025 at 09:31:14AM +0200, Christian Brauner wrote:
> On Mon, Apr 21, 2025 at 05:29:47PM +0100, Al Viro wrote:
> > On Mon, Apr 21, 2025 at 09:56:20AM +0200, Christian Brauner wrote:
> > > On Mon, Apr 21, 2025 at 04:35:09AM +0100, Al Viro wrote:
> > > > Not since 8f2918898eb5 "new helpers: vfs_create_mount(), fc_mount()"
> > > > back in 2018.  Get rid of the dead checks...
> > > >
> > > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> > > > ---
> > >
> > > Good idea. Fwiw, I've put this into vfs-6.16.mount with some other minor
> > > stuff. If you're keeping it yourself let me know.
> >
> > Not sure... I'm going through documenting the struct mount
> > lifecycle/locking/etc. and it already looks like there will be more
> > patches, but then some are going to be #fixes fodder.
> >
> > Example caught just a couple of minutes ago: do_lock_mount()
> >         if (beneath) {
> >                 m = real_mount(mnt);
> >                 read_seqlock_excl(&mount_lock);
> >                 dentry = dget(m->mnt_mountpoint);
> >                 read_sequnlock_excl(&mount_lock);
> >         } else {
> >                 dentry = path->dentry;
> >         }
> >
> >         inode_lock(dentry->d_inode);
> > What's to prevent the 'beneath' case from getting mnt mount --move'd
> > away *AND* the ex-parent from getting unmounted while we are blocked
> > in inode_lock?  At this point we are not holding any locks whatsoever
> > (and all mount-related locks nest inside inode_lock(), so we couldn't
> > hold them there anyway).
> >
> > Hit that race and watch a very unhappy umount...
>
> If it gets unmounted or moved we immediately detect this in the next line:
>
>         if (beneath && (!is_mounted(mnt) || m->mnt_mountpoint != dentry)) {

Sure, we would - *AFTER* we get through that inode_lock().
Consider the following setup:

	mkdir foo
	mkdir bar
	mkdir splat
	mount -t tmpfs none foo		# mount 1
	mount -t tmpfs none bar		# mount 2
	mkdir bar/baz
	mount -t tmpfs none bar/baz	# mount 3

then

A: move_mount(AT_FDCWD, "foo", AT_FDCWD, "bar/baz", MOVE_MOUNT_BENEATH)
gets to do_move_mount() and into do_lock_mount() called by it.
path->mnt points to mount 3, path->dentry - to its root.  Both are
pinned.  do_lock_mount() goes into the first iteration of the loop.
beneath is true, so it picks dentry - that of #3's mountpoint, i.e.
"/baz" on the #2 tmpfs instance.  At that point the refcount of that
dentry is 3 - one from being a positive dentry on tmpfs, one from
being a mountpoint and one more just grabbed by do_lock_mount().

Now we enter inode_lock(dentry->d_inode).  Note that at that point A
is not holding any locks.  Suppose it gets preempted at this moment
for whatever reason.

B: mount --move bar/baz splat
Proceeds without any problems, mount #3 gets moved to "splat".  Now
the refcount of mount #2 is not pinned by anything and the refcount
of "/baz" on it is 2, since it's no longer a mountpoint.

B: umount bar
... and now it hits the fan, since the refcount of mount #2 is not
elevated by anything, so we do not hit -EBUSY and proceed through
umount(2) all the way to kill_litter_super(), which drops the refcount
of "/baz" to 1 and calls kill_anon_super().  Which gets to
shrink_dcache_for_umount() and from there - to umount_check() on that
dentry.  You get yelled at, then you get yelled at again for busy
inodes after umount (that dentry is pinning the inode down), etc.
Superblock of #2 is freed.

A: regains CPU.  is_mounted() is true (now at splat instead of
bar/baz, but still mounted), ->mnt_mountpoint does not match.  All
right, inode_unlock(dentry->d_inode), then dput(dentry) and now the
refcount of that dentry finally hits zero.  We get iput() on its
inode, followed by shmem_evict_inode(), which is where we finally oops.

As for the second issue...
Normal callers of unlock_mount() do have a struct path somewhere that
pins the location we are dealing with.  However, the 'beneath' case of
do_move_mount() does not - it relies upon the sucker being a mountpoint
all along.  Which is fine until you drop namespace_sem.  As soon as
namespace_unlock() has been called, there's no guarantee that it will
_stay_ a mountpoint.  Moving that inode_unlock() before the
namespace_unlock() avoids that scenario.
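[Editor's aside: a sketch of the ordering change being described.  This
is hypothetical, not the actual patch - the real unlock_mount() in
fs/namespace.c may differ in detail - but it shows the reordering:]

```c
/* Sketch only.  While namespace_sem is still held, the dentry is
 * guaranteed to remain a mountpoint, so dropping the inode lock
 * before namespace_unlock() closes the window described above.
 */
static void unlock_mount(struct mountpoint *where)
{
	struct dentry *dentry = where->m_dentry;

	read_seqlock_excl(&mount_lock);
	put_mountpoint(where);
	read_sequnlock_excl(&mount_lock);

	inode_unlock(dentry->d_inode);	/* was: after namespace_unlock() */
	namespace_unlock();
}
```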