Re: [RFC] MNT_LOCKED vs. finish_automount()

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 5 May 2025 20:02:15 +0100

On Mon, May 05, 2025 at 08:52:16AM -0500, Eric W. Biederman wrote:

> How would you deal with this in attach_recursive_mnt?  The problem case
> appears to be an automount on top of a recursive mount.  So I don't
> see the recursive mount code path being involved at all.
> 
> 
> 
> Given that only new mounts explicitly specified by a user and
> finish_automount call do_add_mount this looks safe.  Having
> MNT_LOCKED set will always be inappropriate in these two
> cases.

do_add_mount() -> graft_tree() -> attach_recursive_mnt()

and the last one is where we actually attach them to mount tree.
And that's really where all subtrees are attached, except for the very
local surgery in pivot_root().

Said that, from digging I'd done last night there might be a different
long-term approach; the thing is, there are very few places where we
ever look at MNT_LOCKED for mounts with !mnt_has_parent():
	do_umount()
	can_umount()
	do_move_mount()
	pivot_root()
and two of those four are of dubious use - can_umount() will be followed
by rechecking in do_umount(), so there's not much point in bailing out
early in a very pathological case.  And do_move_mount() might check
MNT_LOCKED on such, but only in move-from-anon case, where we do *NOT*
set MNT_LOCKED on namespace root.

IOW, looking at the way things are now, I'm no longer sure that the trick
you've done in da362b09e42 "umount: Do not allow unmounting rootfs"
had been a good idea long-term - it certainly made for smaller fix,
but it overloaded semantics of MNT_LOCKED, making it not just about
protecting the mountpoint against exposure.

What if we add explicit checks for mnt_has_parent() in do_umount() and
pivot_root() next to existing checks for MNT_LOCKED?  Then we can
	* have clone_mnt() treat that flag as "internal, ignore" (matter
of fact, I would simply add MNT_LOCKED to MNT_INTERNAL_FLAGS and have
clone_mnt() strip that, same as we do in do_add_mount()).
	* have copy_tree() set it right next to attach_mnt(), if the source
had it.  Yes, leaving it clear for root of copied tree.
	* no need to clear it in the end of __do_loopback() (racy at the
moment, BTW - no mount_lock held there, so race with mount -o remount,ro
for the original is possible, so something is needed there)
	* have lock_mnt_tree() skip the MNT_LOCKED not just for the expirable
but for the root of subtree as well.
	* don't bother stripping MNT_LOCKED for roots of propagation copies
(or anyone, for that matter) in attach_recursive_mnt()
	* don't bother setting it on absolute root of initial namespace

Looks like we end up with overall simpler code that way; it's certainly
conceptually simpler - MNT_LOCKED is set only on the mounts that do
have parent we care to protect, with that being done atomically wrt
mount becoming reachable (lock_mnt_tree() is in the same lock_mount_hash()
scope where we call commit_tree() that makes the entire subtree reachable).

Your argument in "mnt: Move the clear of MNT_LOCKED from copy_tree to
it's callers" about wanting to keep it in collect_mounts() for audit
purposes is wrong, AFAICS - audit does not see or care about MNT_LOCKED;
the only thing we use the result of collect_mounts() for is passing
it to iterate_mounts(), which literally "apply this callback to each
vfsmount in the set", completely ignoring anything else.  What's more,
all callbacks audit is passing to it actually look only at the inode of
that mount's root...

Anyway, that's longer-term stuff; I'll put together a patch along those
lines on top of this one.  Do you have any objections to the minimal
fix as posted?  Or, for that matter, to a variant that simply adds
MNT_LOCKED to MNT_INTERNAL_FLAGS (not used anywhere outside of
do_add_mount()):

[PATCH] finish_automount(): don't leak MNT_LOCKED from parent to child

Intention for MNT_LOCKED had always been to protect the internal
mountpoints within a subtree that got copied across the userns boundary,
not the mountpoint that tree got attached to - after all, it _was_
exposed before the copying.

For roots of secondary copies that is enforced in attach_recursive_mnt() -
MNT_LOCKED is explicitly stripped for those.  For the root of primary
copy we are almost always guaranteed that MNT_LOCKED won't be there,
so attach_recursive_mnt() doesn't bother.  Unfortunately, one call
chain got overlooked - triggering e.g. NFS referral will have the
submount inherit the public flags from parent; that's fine for such
things as read-only, nosuid, etc., but not for MNT_LOCKED.

This is particularly pointless since the mount attached by finish_automount()
is usually expirable, which makes any protection granted by MNT_LOCKED
null and void; just wait for a while and that mount will go away on its own.

Include MNT_LOCKED into the set of flags to be ignored by do_add_mount() - it
really is an internal flag.

Fixes: 5ff9d8a65ce8 ("vfs: Lock in place mounts from more privileged users")
---

diff --git a/include/linux/mount.h b/include/linux/mount.h
index 5b69f8ede215..c1bb278b475c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -49,8 +49,8 @@ struct path;
 				 | MNT_READONLY | MNT_NOSYMFOLLOW)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
-#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
-			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
+#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_DOOMED | MNT_MARKED | MNT_LOCKED \
+			    | MNT_WRITE_HOLD | MNT_SYNC_UMOUNT | MNT_INTERNAL)
 
 #define MNT_INTERNAL	0x4000