[RFC] separate the internal mount flags from the rest

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Tue, 3 Jun 2025 21:47:04 +0100

	Currently we use ->mnt_flags for all kinds of things, including
the ones only fs/{namespace,pnode}.c care about.

	That wouldn't be a problem if not for the locking.  ->mnt_flags is
protected by mount_lock.  All writers MUST grab that.  Having lockless
readers is unsurprising - after all, for something like noexec we want
the current state of mount, whatever it happens to be.  If userland
remounts something noexec and that races with execve(2), there's nothing
to be done - it is a race, but not the kernel one.

	However, for a bunch of flags we rely upon the fact that all
changes in one of those are always done under namespace_sem (exclusive).
It's not a lockless read - we do depend upon namespace_sem (shared) being
sufficient to stabilize those particular bits.	MNT_SHARED is one such.
Note that this flag is a part of propagation graph representation and
namespace_sem is what protects the entire thing, so setting or clearing
it under mount_lock alone would be a very bad idea.

Since MNT_SHARED sits in ->mnt_flags, we have to take mount_lock to set or
clear it, leading to places like this:
        if (IS_MNT_SHARED(from)) {
		to->mnt_group_id = from->mnt_group_id;
		list_add(&to->mnt_share, &from->mnt_share);
		lock_mount_hash();
		set_mnt_shared(to);
		unlock_mount_hash();
	}
Non-empty ->mnt_share on a mount without MNT_SHARED would be a hard bug;
these changes belong together.  Incidentally, lock_mount_hash() is an
overkill here - read_seqlock_excl(&mount_lock) would be better...

The rules would be easier to follow if we took MNT_SHARED, MNT_UNBINDABLE,
MNT_MARKED and possibly MNT_LOCKED and MNT_LOCK_* to separate field, protected
by namespace_sem alone (MNT_LOCKED - depending upon how the path_parent() mess
settles).

Yes, it's 4 bytes added into struct mount.  However, this
        int mnt_id;                     /* mount identifier, reused */
	u64 mnt_id_unique;              /* mount ID unique until reboot */
	int mnt_group_id;               /* peer group identifier */
	int mnt_expiry_mark;            /* true if marked for expiry */
is preceded and followed by a pointer, so we already have a gap there,
_and_ there are other pending changes that kill ->mnt_umounting.  So the
size of struct mount won't grow even on 32bit and would actually go down
on 64bit.

Objections?  The thing I really want is clear locking rules for that
stuff.  Reduced contention on mount_lock and fewer increments of its
seqcount component wouldn't hurt either, but that's secondary...

PS: despite being strictly namespace.c-internal, two flags must stay in
->mnt_flags - MNT_DOOMED and MNT_SYNC_UMOUNT - __legitimize_mnt() needs
them there.  Could be folded together with a bit of massage, though, but
that's a separate story...