Re: [RFC] MNT_LOCKED vs. finish_automount()

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Tue, 6 May 2025 03:25:11 +0100

On Mon, May 05, 2025 at 08:02:15PM +0100, Al Viro wrote:

> Said that, from digging I'd done last night there might be a different
> long-term approach; the thing is, there are very few places where we
> ever look at MNT_LOCKED for mounts with !mnt_has_parent():
> 	do_umount()
> 	can_umount()
> 	do_move_mount()
> 	pivot_root()
> and two of those four are of dubious use - can_umount() will be followed
> by rechecking in do_umount(), so there's not much point in bailing out
> early in a very pathological case.  And do_move_mount() might check
> MNT_LOCKED on such, but only in move-from-anon case, where we do *NOT*
> set MNT_LOCKED on namespace root.

Even better, we have a check for !mnt_has_parent() downstream of that in
pivot_root() and we reject such there.  So having that set on absolute
root of non-anon namespaces buys us very little - do_umount() is the only
place that would need to explicitly check for that.

> IOW, looking at the way things are now, I'm no longer sure that the trick
> you've done in da362b09e42 "umount: Do not allow unmounting rootfs"
> had been a good idea long-term - it certainly made for smaller fix,
> but it overloaded semantics of MNT_LOCKED, making it not just about
> protecting the mountpoint against exposure.
> 
> What if we add explicit checks for mnt_has_parent() in do_umount() and
> pivot_root() next to existing checks for MNT_LOCKED?  Then we can
> 	* have clone_mnt() treat that flag as "internal, ignore" (matter
> of fact, I would simply add MNT_LOCKED to MNT_INTERNAL_FLAGS and have
> clone_mnt() strip that, same as we do in do_add_mount()).
> 	* have copy_tree() set it right next to attach_mnt(), if the source
> had it.  Yes, leaving it clear for root of copied tree.
> 	* no need to clear it in the end of __do_loopback() (racy at the
> moment, BTW - no mount_lock held there, so race with mount -o remount,ro
> for the original is possible, so something is needed there)
> 	* have lock_mnt_tree() skip the MNT_LOCKED not just for the expirable
> but for the root of subtree as well.
> 	* don't bother stripping MNT_LOCKED for roots of propagation copies
> (or anyone, for that matter) in attach_recursive_mnt()
> 	* don't bother setting it on absolute root of initial namespace
> 
> Looks like we end up with overall simpler code that way; it's certainly
> conceptually simpler - MNT_LOCKED is set only on the mounts that do
> have parent we care to protect, with that being done atomically wrt
> mount becoming reachable (lock_mnt_tree() is in the same lock_mount_hash()
> scope where we call commit_tree() that makes the entire subtree reachable).
> 
> Your argument in "mnt: Move the clear of MNT_LOCKED from copy_tree to
> it's callers" about wanting to keep it in collect_mounts() for audit
> purposes is wrong, AFAICS - audit does not see or care about MNT_LOCKED;
> the only thing we use the result of collect_mounts() for is passing
> it to iterate_mounts(), which literally "apply this callback to each
> vfsmount in the set", completely ignoring anything else.  What's more,
> all callbacks audit is passing to it actually look only at the inode of
> that mount's root...
> 
> Anyway, that's longer-term stuff; I'll put together a patch along those
> lines on top of this one.

Here it is, on top of the previous.  Objections, comments?

[PATCH] don't set MNT_LOCKED on parentless mounts

Originally MNT_LOCKED meant only one thing - "don't let this mount to
be peeled off its parent, we don't want to have mountpoint exposed".
Accordingly, it had only been set on mounts that *do* have a parent.
Later it had been overloaded with another use - setting it on the
absolute root had given free protection against umount(2) on absolute
root (possible to trigger, oopsed).  Not a bad trick, but it ended
up costing more than it bought us.  Unfortunately, the cost included
both hard-to-reason-about logics and a subtle race between
mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in
the end of __do_loopback() could race with sb_prepare_remount_readonly()
setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should
be).

Turns out that nobody except umount(2) had ever made use of having
MNT_LOCKED set on absolute root.  So let's give up on that trick,
clever as it had been, add an explicit check in do_umount() and
return to using MNT_LOCKED only for mounts that have a parent.

It means that
	* clone_mnt() no longer copies MNT_LOCKED
	* copy_tree() sets it on submounts if their counterparts had
been marked such, and do that right next to attach_mnt() in there,
in the same mount_lock scope.
	* __do_loopback() no longer needs to strip MNT_LOCKED off the
root of subtree it's about to return; no store, no race.
	* init_mount_tree() doesn't bother setting MNT_LOCKED on absolute
root.
	* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root;
accordingly, its caller (loop in attach_recursive_mnt()) does not need to
bother stripping that MNT_LOCKED.  Note that lock_mnt_tree() setting
MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt()
(from commit_tree()) that makes them reachable.

Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
---

diff --git a/fs/namespace.c b/fs/namespace.c
index 04a9bb9f31fa..165b3bd26857 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1363,7 +1363,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	}
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
-	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL|MNT_LOCKED);
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt));
@@ -2038,6 +2038,9 @@ static int do_umount(struct mount *mnt, int flags)
 	if (mnt->mnt.mnt_flags & MNT_LOCKED)
 		goto out;
 
+	if (!mnt_has_parent(mnt))
+		goto out;
+
 	event++;
 	if (flags & MNT_DETACH) {
 		if (mnt_ns_attached(mnt) || !list_empty(&mnt->mnt_list))
@@ -2308,6 +2311,8 @@ struct mount *copy_tree(struct mount *src_root, struct dentry *dentry,
 			if (IS_ERR(dst_mnt))
 				goto out;
 			lock_mount_hash();
+			if (src_mnt->mnt.mnt_flags & MNT_LOCKED)
+				dst_mnt->mnt.mnt_flags |= MNT_LOCKED;
 			list_add_tail(&dst_mnt->mnt_list, &res->mnt_list);
 			attach_mnt(dst_mnt, dst_parent, src_parent->mnt_mp, false);
 			unlock_mount_hash();
@@ -2547,7 +2552,7 @@ static void lock_mnt_tree(struct mount *mnt)
 		if (flags & MNT_NOEXEC)
 			flags |= MNT_LOCK_NOEXEC;
 		/* Don't allow unprivileged users to reveal what is under a mount */
-		if (list_empty(&p->mnt_expire))
+		if (list_empty(&p->mnt_expire) && p != mnt)
 			flags |= MNT_LOCKED;
 		p->mnt.mnt_flags = flags;
 	}
@@ -2758,7 +2763,6 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		/* Notice when we are propagating across user namespaces */
 		if (child->mnt_parent->mnt_ns->user_ns != user_ns)
 			lock_mnt_tree(child);
-		child->mnt.mnt_flags &= ~MNT_LOCKED;
 		commit_tree(child);
 	}
 	put_mountpoint(smp);
@@ -3027,26 +3031,21 @@ static inline bool may_copy_tree(struct path *path)
 
 static struct mount *__do_loopback(struct path *old_path, int recurse)
 {
-	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+	struct mount *old = real_mount(old_path->mnt);
 
 	if (IS_MNT_UNBINDABLE(old))
-		return mnt;
+		return ERR_PTR(-EINVAL);
 
 	if (!may_copy_tree(old_path))
-		return mnt;
+		return ERR_PTR(-EINVAL);
 
 	if (!recurse && has_locked_children(old, old_path->dentry))
-		return mnt;
+		return ERR_PTR(-EINVAL);
 
 	if (recurse)
-		mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+		return copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
 	else
-		mnt = clone_mnt(old, old_path->dentry, 0);
-
-	if (!IS_ERR(mnt))
-		mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
-	return mnt;
+		return clone_mnt(old, old_path->dentry, 0);
 }
 
 /*
@@ -4789,11 +4788,11 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	if (!path_mounted(&root))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(root_mnt))
-		goto out4; /* not attached */
+		goto out4; /* absolute root */
 	if (!path_mounted(&new))
 		goto out4; /* not a mountpoint */
 	if (!mnt_has_parent(new_mnt))
-		goto out4; /* not attached */
+		goto out4; /* absolute root */
 	/* make sure we can reach put_old from new_root */
 	if (!is_path_reachable(old_mnt, old.dentry, &new))
 		goto out4;
@@ -6191,7 +6190,6 @@ static void __init init_mount_tree(void)
 
 	root.mnt = mnt;
 	root.dentry = mnt->mnt_root;
-	mnt->mnt_flags |= MNT_LOCKED;
 
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);