[PATCHES v2][RFC][CFT] mount-related stuff

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 29 Aug 2025 00:07:06 +0100

Branch force-pushed into
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount
(also visible as #v2.mount, #v1.mount being the previous version)
Individual patches in followups.

Still -rc3-based, seems to survive local beating.  Please, help with
review and testing.

Note: no links in commits, I still don't understand what kind of use is
expected in this situation.

Changes since v1 (aside from Reviewed-by tags applied):

	In #13, #14 and #15 scoped_guard replaced with guard.  I don't like
it, but I can live with it.
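	For readers less familiar with the two forms, a minimal userspace
sketch of the difference follows.  It is not the kernel implementation: it
relies on the GCC/Clang cleanup attribute, and toy_lock, TOY_GUARD and
TOY_SCOPED_GUARD are hypothetical stand-ins for a real lock, guard() and
scoped_guard().  The guard-like form keeps the lock until the end of the
enclosing scope; the scoped form drops it at the end of the statement it
controls.

```c
#include <assert.h>

/* Toy lock; every name in this sketch is hypothetical, not kernel API. */
struct toy_lock { int held; };

static struct toy_lock *toy_take(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	return l;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void toy_drop(struct toy_lock **lp)
{
	(*lp)->held = 0;
}

/* guard()-like: the lock is held until the enclosing scope ends. */
#define TOY_GUARD(l) \
	struct toy_lock *toy_g_ __attribute__((cleanup(toy_drop))) = toy_take(l)

/* scoped_guard()-like: the lock is held only for the next statement. */
#define TOY_SCOPED_GUARD(l) \
	for (struct toy_lock *toy_s_ __attribute__((cleanup(toy_drop))) = \
		     toy_take(l), *toy_once_ = toy_s_; \
	     toy_once_; toy_once_ = 0)

static struct toy_lock ns_lock;
static int seen_inside, seen_after;

static void with_guard(void)
{
	TOY_GUARD(&ns_lock);
	seen_inside = ns_lock.held;
	/* still locked here; dropped when the function returns */
}

static void with_scoped_guard(void)
{
	TOY_SCOPED_GUARD(&ns_lock) {
		seen_inside = ns_lock.held;
	}
	seen_after = ns_lock.held;	/* lock already dropped */
}
```

The scoped form makes the extent of the critical section explicit at the
cost of one more indentation level; the plain form is shorter but ties the
unlock to wherever the enclosing block happens to end.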

	Between old #18 and #19: do_new_mount_fc() switched to use of fc_mount().
vfs_get_tree() call moved from the caller into the function itself, unlock +
vfs_create_mount() reordered to before the checks in there and collapsed with
vfs_get_tree() into a call of fc_mount().  Cleanup aside, that avoids the
difference between the lexical scope of mnt and the actual lifetime of that
reference.
	Differs from the variant posted in https://lore.kernel.org/all/20250826182124.GV39973@ZenIV/
only by fixing an obvious braino - fetching fc->root->d_sb should be done after
successful fc_mount(), not before it.
	That change modifies old #25 (now #26) "do_new_mount_fc(): use __free()
to deal with dropping mnt on failure".
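	The general pattern behind "use __free() to drop the reference on
failure" can be sketched in userspace as below.  This is not the code from
the patch: toy_mnt, toy_create(), toy_put() and toy_new_mount() are
hypothetical, and the cleanup attribute stands in for the kernel's __free()
annotation.  The point is that the variable owns the reference for exactly
as long as it is non-NULL, so lexical scope and lifetime coincide.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mount reference. */
struct toy_mnt { int unused; };

static int live_mounts;			/* number of undropped references */
static struct toy_mnt *attached;	/* where a successful mount ends up */

static struct toy_mnt *toy_create(int fail)
{
	struct toy_mnt *m = fail ? NULL : calloc(1, sizeof(*m));

	if (m)
		live_mounts++;
	return m;
}

static void toy_put(struct toy_mnt *m)
{
	if (m) {
		live_mounts--;
		free(m);
	}
}

/* Runs when the annotated variable goes out of scope. */
static void toy_put_cleanup(struct toy_mnt **p)
{
	toy_put(*p);
}
#define TOY_FREE_MNT __attribute__((cleanup(toy_put_cleanup)))

static int toy_new_mount(int create_fails, int check_fails)
{
	struct toy_mnt *mnt TOY_FREE_MNT = toy_create(create_fails);

	if (!mnt)
		return -1;
	if (check_fails)
		return -2;	/* mnt is dropped automatically */

	attached = mnt;
	mnt = NULL;		/* ownership handed over; nothing to drop */
	return 0;
}
```

Every early return drops the reference without an explicit error path, and
the success path disarms the cleanup by clearing the variable once
ownership has moved elsewhere.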

	Added to the end of queue: cleanup of populating a new namespace with
a tree (open_detached_copy() and copy_mnt_ns()); both end up using guards, BTW. 
	5 commits, #54..#58
	* open_detached_copy(): don't bother with mount_lock_hash()
It's useless there right now - namespace_excl is quite enough.
	* open_detached_copy(): separate creation of namespace into helper
Creation of namespace and opening that FMODE_NEED_UNMOUNT file are better
off separated - cleaner that way.
	* mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list
Currently it (and free_mnt_ns()) can't be used with a non-anonymous namespace
before the insertion into mnt_ns_tree; it's very easy to make it work in that
situation as well - in fact, the old "is it non-anonymous" check is not needed
anymore.
	* copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure
Use the previous patch to avoid weird open-coding of free_mnt_ns().
	* copy_mnt_ns(): use guards
... and __free(mntput) for rootmnt/pwdmnt.
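	One common way to make a removal helper safe for objects that were
never inserted is to initialize the linkage to an "unlinked" state and test
for it on removal.  The sketch below illustrates that idea only; it is an
assumption about the general technique, not the actual mnt_ns_tree_remove()
change, and toy_ns, all_ns and the node_*() helpers are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly-linked node, in the spirit of list_head. */
struct node { struct node *prev, *next; };

static void node_init(struct node *n)
{
	n->prev = n->next = n;		/* self-linked means "not on a list" */
}

static int node_unlinked(const struct node *n)
{
	return n->next == n;
}

static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Hypothetical namespace object and the global set it may join. */
struct toy_ns { struct node link; };
static struct node all_ns = { &all_ns, &all_ns };
static int freed;

static struct toy_ns *toy_ns_alloc(void)
{
	struct toy_ns *ns = malloc(sizeof(*ns));

	if (ns)
		node_init(&ns->link);
	return ns;
}

static void toy_ns_publish(struct toy_ns *ns)
{
	node_add(&ns->link, &all_ns);
}

/* Works whether or not toy_ns_publish() has ever been called. */
static void toy_ns_free(struct toy_ns *ns)
{
	if (!node_unlinked(&ns->link))
		node_del_init(&ns->link);
	free(ns);
	freed++;
}
```

With that in place a failure path can call the regular destructor at any
point after allocation, which is what removes the need for an open-coded
variant.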

	Added to the end of queue: handling of ->s_mounts/->mnt_instance and
mnt_hold_writers().
	Each mount is associated with the same dentry (sub)tree of the same
filesystem through its entire lifetime.  They are allocated empty, then (in the
same function that had called allocator) attached to dentry tree and stay like
that all the way to destructor (cleanup_mnt()).
	Unfortunately, as soon as they are attached to a tree, they become
reachable from shared data structures - we maintain the set of all mounts
associated with given superblock.  Having to worry about that while we are
still setting them up is inconvenient.  Thankfully, the accesses via that set
are *very* limited - only sb_prepare_remount_readonly() goes there and the
only thing it does to a mount is setting/clearing MNT_WRITE_HOLD and checking
the write count (guaranteed to be zero during setup, since there's nobody
who could've asked for write access by that point).
	Turns out it's easy to take MNT_WRITE_HOLD out of ->mnt_flags and
basically move it into the same thing that establishes linkage in per-superblock
set of mounts.  That makes accesses via that set isolated from the rest of
struct mount; as far as we are concerned, this set is no longer a way to reach
the mount from shared data structures and mount remains private to caller
until it is explicitly made reachable (by mounting, attaching to overlayfs as
a layer, etc.).
	FWIW, I think we should get rid of the "empty" state of struct mount
and have the allocator take the root dentry as an additional argument.  Haven't done
that yet; this series removes the need to delay attaching a partially set up
mount to filesystem - we can do that from the very beginning now.
	5 commits, #59..#63
	* setup_mnt(): primitive for connecting a mount to filesystem
Identical logic in clone_mnt() and vfs_create_mount() => common helper.
	* preparations to taking MNT_WRITE_HOLD out of ->mnt_flags
Change the representation of the set from a list_head list to something
equivalent to an hlist, with forward links pointing to the entire struct mount
rather than to an embedded hlist_node.
	* struct mount: relocate MNT_WRITE_HOLD bit
Steal the LSB of back links in the set representation to store it.  We only
traverse the list forwards and all changes are under mount_lock, same as
for all mnt_hold_writers()/mnt_unhold_writers() pairs, so it's pretty
uncomplicated.
	* simplify the callers of mnt_unhold_writers()
	* WRITE_HOLD machinery: no need to bump mount_lock seqcount
The last part is another group of "we only need mount_locked_reader" cases.
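
	The "steal the LSB of back links" idea can be illustrated with a
small userspace sketch.  It is not the layout used in the series: toy_mount,
its field names and the helpers are hypothetical, and locking is omitted
entirely.  It only shows an hlist-like set where the forward link points to
the whole object and the back link carries a one-bit flag, which works
because the back link always points at a pointer-aligned slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HOLD ((uintptr_t)1)		/* flag kept in bit 0 of the back link */

struct toy_mount {
	struct toy_mount *next;	/* forward link to the whole object */
	uintptr_t back;		/* address of the slot pointing at us, plus flag */
};

static struct toy_mount **back_link(const struct toy_mount *m)
{
	return (struct toy_mount **)(m->back & ~TOY_HOLD);
}

static int is_held(const struct toy_mount *m)
{
	return (m->back & TOY_HOLD) != 0;
}

static void hold(struct toy_mount *m)   { m->back |= TOY_HOLD; }
static void unhold(struct toy_mount *m) { m->back &= ~TOY_HOLD; }

/* Re-point a back link while preserving the flag bit. */
static void set_back(struct toy_mount *m, struct toy_mount **slot)
{
	m->back = (uintptr_t)slot | (m->back & TOY_HOLD);
}

/* Insert at the head of the set; a new member starts with the flag clear. */
static void set_add(struct toy_mount *m, struct toy_mount **head)
{
	m->next = *head;
	if (m->next)
		set_back(m->next, &m->next);
	*head = m;
	m->back = (uintptr_t)head;
}

/* Unlink from the set; the neighbour's flag bit is left untouched. */
static void set_del(struct toy_mount *m)
{
	struct toy_mount **slot = back_link(m);

	*slot = m->next;
	if (m->next)
		set_back(m->next, slot);
	m->next = NULL;
	m->back = 0;
}
```

Forward traversal never looks at the back links, so it is unaffected by
the flag; only insertion and removal need to mask the bit off and carry it
over.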

Diffstat:
 fs/ecryptfs/dentry.c          |  14 +-
 fs/ecryptfs/ecryptfs_kernel.h |  27 +-
 fs/ecryptfs/file.c            |  15 +-
 fs/ecryptfs/inode.c           |  19 +-
 fs/ecryptfs/main.c            |  24 +-
 fs/internal.h                 |   4 +-
 fs/mount.h                    |  16 +-
 fs/namespace.c                | 989 +++++++++++++++++++-----------------------
 fs/pnode.c                    |  75 +++-
 fs/pnode.h                    |   1 +
 fs/super.c                    |   3 +-
 include/linux/fs.h            |   2 +-
 include/linux/mount.h         |   7 +-
 kernel/audit_tree.c           |  12 +-
 14 files changed, 573 insertions(+), 635 deletions(-)