On Mon, May 05, 2025 at 08:02:15PM +0100, Al Viro wrote: > Said that, from digging I'd done last night there might be a different > long-term approach; the thing is, there are very few places where we > ever look at MNT_LOCKED for mounts with !mnt_has_parent(): > do_umount() > can_umount() > do_move_mount() > pivot_root() > and two of those four are of dubious use - can_umount() will be followed > by rechecking in do_umount(), so there's not much point in bailing out > early in a very pathological case. And do_move_mount() might check > MNT_LOCKED on such, but only in move-from-anon case, where we do *NOT* > set MNT_LOCKED on namespace root. Even better, we have a check for !mnt_has_parent() downstream of that in pivot_root() and we reject such there. So having that set on absolute root of non-anon namespaces buys us very little - do_umount() is the only place that would need to explicitly check for that. > IOW, looking at the way things are now, I'm no longer sure that the trick > you've done in da362b09e42 "umount: Do not allow unmounting rootfs" > had been a good idea long-term - it certainly made for smaller fix, > but it overloaded semantics of MNT_LOCKED, making it not just about > protecting the mountpoint against exposure. > > What if we add explicit checks for mnt_has_parent() in do_umount() and > pivot_root() next to existing checks for MNT_LOCKED? Then we can > * have clone_mnt() treat that flag as "internal, ignore" (matter > of fact, I would simply add MNT_LOCKED to MNT_INTERNAL_FLAGS and have > clone_mnt() strip that, same as we do in do_add_mount()). > * have copy_tree() set it right next to attach_mnt(), if the source > had it. Yes, leaving it clear for root of copied tree. > * no need to clear it in the end of __do_loopback() (racy at the > moment, BTW - no mount_lock held there, so race with mount -o remount,ro > for the original is possible, so something is needed there) > * have lock_mnt_tree() skip the MNT_LOCKED not just for the expirable > but for the root of subtree as well. > * don't bother stripping MNT_LOCKED for roots of propagation copies > (or anyone, for that matter) in attach_recursive_mnt() > * don't bother setting it on absolute root of initial namespace > > Looks like we end up with overall simpler code that way; it's certainly > conceptually simpler - MNT_LOCKED is set only on the mounts that do > have parent we care to protect, with that being done atomically wrt > mount becoming reachable (lock_mnt_tree() is in the same lock_mount_hash() > scope where we call commit_tree() that makes the entire subtree reachable). > > Your argument in "mnt: Move the clear of MNT_LOCKED from copy_tree to > it's callers" about wanting to keep it in collect_mounts() for audit > purposes is wrong, AFAICS - audit does not see or care about MNT_LOCKED; > the only thing we use the result of collect_mounts() for is passing > it to iterate_mounts(), which literally "apply this callback to each > vfsmount in the set", completely ignoring anything else. What's more, > all callbacks audit is passing to it actually look only at the inode of > that mount's root... > > Anyway, that's longer-term stuff; I'll put together a patch along those > lines on top of this one. Here it is, on top of the previous. Objections, comments? [PATCH] don't set MNT_LOCKED on parentless mounts Originally MNT_LOCKED meant only one thing - "don't let this mount to be peeled off its parent, we don't want to have mountpoint exposed". Accordingly, it had only been set on mounts that *do* have a parent. Later it had been overloaded with another use - setting it on the absolute root had given free protection against umount(2) on absolute root (possible to trigger, oopsed). Not a bad trick, but it ended up costing more than it bought us. Unfortunately, the cost included both hard-to-reason-about logics and a subtle race between mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in the end of __do_loopback() could race with sb_prepare_remount_readonly() setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should be). Turns out that nobody except umount(2) had ever made use of having MNT_LOCKED set on absolute root. So let's give up on that trick, clever as it had been, add an explicit check in do_umount() and return to using MNT_LOCKED only for mounts that have a parent. It means that * clone_mnt() no longer copies MNT_LOCKED * copy_tree() sets it on submounts if their counterparts had been marked such, and do that right next to attach_mnt() in there, in the same mount_lock scope. * __do_loopback() no longer needs to strip MNT_LOCKED off the root of subtree it's about to return; no store, no race. * init_mount_tree() doesn't bother setting MNT_LOCKED on absolute root. * lock_mnt_tree() does not set MNT_LOCKED on the subtree's root; accordingly, its caller (loop in attach_recursive_mnt()) does not need to bother stripping that MNT_LOCKED. Note that lock_mnt_tree() setting MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt() (from commit_tree()) that makes them reachable. Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> --- diff --git a/fs/namespace.c b/fs/namespace.c index 04a9bb9f31fa..165b3bd26857 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1363,7 +1363,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, } mnt->mnt.mnt_flags = old->mnt.mnt_flags; - mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL); + mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL|MNT_LOCKED); atomic_inc(&sb->s_active); mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt)); @@ -2038,6 +2038,9 @@ static int do_umount(struct mount *mnt, int flags) if (mnt->mnt.mnt_flags & MNT_LOCKED) goto out; + if (!mnt_has_parent(mnt)) + goto out; + event++; if (flags & MNT_DETACH) { if (mnt_ns_attached(mnt) || !list_empty(&mnt->mnt_list)) @@ -2308,6 +2311,8 @@ struct mount *copy_tree(struct mount *src_root, struct dentry *dentry, if (IS_ERR(dst_mnt)) goto out; lock_mount_hash(); + if (src_mnt->mnt.mnt_flags & MNT_LOCKED) + dst_mnt->mnt.mnt_flags |= MNT_LOCKED; list_add_tail(&dst_mnt->mnt_list, &res->mnt_list); attach_mnt(dst_mnt, dst_parent, src_parent->mnt_mp, false); unlock_mount_hash(); @@ -2547,7 +2552,7 @@ static void lock_mnt_tree(struct mount *mnt) if (flags & MNT_NOEXEC) flags |= MNT_LOCK_NOEXEC; /* Don't allow unprivileged users to reveal what is under a mount */ - if (list_empty(&p->mnt_expire)) + if (list_empty(&p->mnt_expire) && p != mnt) flags |= MNT_LOCKED; p->mnt.mnt_flags = flags; } @@ -2758,7 +2763,6 @@ static int attach_recursive_mnt(struct mount *source_mnt, /* Notice when we are propagating across user namespaces */ if (child->mnt_parent->mnt_ns->user_ns != user_ns) lock_mnt_tree(child); - child->mnt.mnt_flags &= ~MNT_LOCKED; commit_tree(child); } put_mountpoint(smp); @@ -3027,26 +3031,21 @@ static inline bool may_copy_tree(struct path *path) static struct mount *__do_loopback(struct path *old_path, int recurse) { - struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt); + struct mount *old = real_mount(old_path->mnt); if (IS_MNT_UNBINDABLE(old)) - return mnt; + return ERR_PTR(-EINVAL); if (!may_copy_tree(old_path)) - return mnt; + return ERR_PTR(-EINVAL); if (!recurse && has_locked_children(old, old_path->dentry)) - return mnt; + return ERR_PTR(-EINVAL); if (recurse) - mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE); + return copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE); else - mnt = clone_mnt(old, old_path->dentry, 0); - - if (!IS_ERR(mnt)) - mnt->mnt.mnt_flags &= ~MNT_LOCKED; - - return mnt; + return clone_mnt(old, old_path->dentry, 0); } /* @@ -4789,11 +4788,11 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, if (!path_mounted(&root)) goto out4; /* not a mountpoint */ if (!mnt_has_parent(root_mnt)) - goto out4; /* not attached */ + goto out4; /* absolute root */ if (!path_mounted(&new)) goto out4; /* not a mountpoint */ if (!mnt_has_parent(new_mnt)) - goto out4; /* not attached */ + goto out4; /* absolute root */ /* make sure we can reach put_old from new_root */ if (!is_path_reachable(old_mnt, old.dentry, &new)) goto out4; @@ -6191,7 +6190,6 @@ static void __init init_mount_tree(void) root.mnt = mnt; root.dentry = mnt->mnt_root; - mnt->mnt_flags |= MNT_LOCKED; set_fs_pwd(current->fs, &root); set_fs_root(current->fs, &root);