Re: [RFC] separate the internal mount flags from the rest

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 4 Jun 2025 09:36:31 +0200

On Tue, Jun 03, 2025 at 09:47:04PM +0100, Al Viro wrote:
> 	Currently we use ->mnt_flags for all kinds of things, including
> the ones only fs/{namespace,pnode}.c care about.
> 
> 	That wouldn't be a problem if not for the locking.  ->mnt_flags is
> protected by mount_lock.  All writers MUST grab that.  Having lockless
> readers is unsurprising - after all, for something like noexec we want
> the current state of mount, whatever it happens to be.  If userland
> remounts something noexec and that races with execve(2), there's nothing
> to be done - it is a race, but not the kernel one.
> 
> 	However, for a bunch of flags we rely upon the fact that all
> changes in one of those are always done under namespace_sem (exclusive).
> It's not a lockless read - we do depend upon namespace_sem (shared) being
> sufficient to stabilize those particular bits.	MNT_SHARED is one such.
> Note that this flag is a part of propagation graph representation and
> namespace_sem is what protects the entire thing, so setting or clearing
> it under mount_lock alone would be a very bad idea.
> 
> Since MNT_SHARED sits in ->mnt_flags, we have to take mount_lock to set or
> clear it, leading to places like this:
>         if (IS_MNT_SHARED(from)) {
> 		to->mnt_group_id = from->mnt_group_id;
> 		list_add(&to->mnt_share, &from->mnt_share);
> 		lock_mount_hash();
> 		set_mnt_shared(to);
> 		unlock_mount_hash();
> 	}
> Non-empty ->mnt_share on a mount without MNT_SHARED would be a hard bug;
> these changes belong together.  Incidentally, lock_mount_hash() is an
> overkill here - read_seqlock_excl(&mount_lock) would be better...
> 
> The rules would be easier to follow if we took MNT_SHARED, MNT_UNBINDABLE,
> MNT_MARKED and possibly MNT_LOCKED and MNT_LOCK_* to separate field, protected
> by namespace_sem alone (MNT_LOCKED - depending upon how the path_parent() mess
> settles).

Agreed.

> 
> Yes, it's 4 bytes added into struct mount.  However, this
>         int mnt_id;                     /* mount identifier, reused */
> 	u64 mnt_id_unique;              /* mount ID unique until reboot */
> 	int mnt_group_id;               /* peer group identifier */
> 	int mnt_expiry_mark;            /* true if marked for expiry */
> is preceded and followed by a pointer, so we already have a gap there,
> _and_ there are other pending changes that kill ->mnt_umounting.  So the
> size of struct mount won't grow even on 32bit and would actually go down
> on 64bit.

I'm not at all worried about the size in this case.

> 
> Objections?  The thing I really want is clear locking rules for that
> stuff.  Reduced contention on mount_lock and fewer increments of its
> seqcount component wouldn't hurt either, but that's secondary...

Seems good to me!

> PS: despite being strictly namespace.c-internal, two flags must stay in
> ->mnt_flags - MNT_DOOMED and MNT_SYNC_UMOUNT - __legitimize_mnt() needs

Sure.