On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote:
> 	Most of this pile is basically an attempt to see how well
> cleanup.h-style mechanisms apply in mount handling.  That stuff lives
> in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount
> Rebased to -rc3 (used to be a bit past -rc2, branched at the mount
> fixes merge).  Individual patches in followups.
>
> 	Please, help with review and testing.  It seems to survive the
> local beating and code generation seems to be OK, but more testing
> would be a good thing and I would really like to see comments on that
> stuff.

Btw, I just realized that basically none of your commits have any lore
links in them. That kinda sucks because I very often just look at a
commit and then use the link to jump to the mailing list discussion for
more context about a change and how it came about. So pretty please can
you start adding lore links to your commits when applying, if it's not
fucking up your workflow too much?

> 	This is not all I've got around mount handling, but I'd rather
> get that thing out for review before starting to sort out other local
> mount-related branches.
>
> Series overview:
>
> Part 1: guards.
>
> 	This part starts with infrastructure, followed by one-by-one
> conversions to guard/scoped_guard in some of the places that fit it
> well enough.  Note that one of those places turned out to be taking
> mount_lock for no reason whatsoever; I already see places where we do
> write_seqlock when read_seqlock_excl would suffice, etc.
>
> 	Folks, _please_ don't do any bulk conversions in that area.
> IMO one area where RAII becomes dangerous is locking; usually it's not
> a big deal to delay freeing some object a bit, but delay dropping a
> lock and you risk introducing deadlocks that will be bloody hard to
> spot.  It _has_ to be done carefully; we had trouble in that area
> several times over the last year or so in fs/namespace.c alone.
> 	Another fun problem is that quite a few comments regarding the
> locking in there are stale.  We still have comments that talk about
> mount_lock as if it were an rwlock-like thing.  It hasn't been that
> for more than a decade now.  It needs to be documented sanely; so do
> the access rules for the data structures involved.  I hope to get some
> of that into the tree this cycle, but it's still in progress.
>
> 1/52) fs/namespace.c: fix the namespace_sem guard mess
> 	New guards: namespace_excl and namespace_shared.  The former
> implies the latter, as for anything rwsem-like.  No inode locks, no
> dropping the final references, no opening files, etc. in the scope of
> those.
> 2/52) introduced guards for mount_lock
> 	New guards: mount_writer, mount_locked_reader.  That's
> write_seqlock and read_seqlock_excl on mount_lock; obviously, nothing
> blocking should be done in the scope of those.
> 3/52) fs/namespace.c: allow to drop vfsmount references via __free(mntput)
> 	Missing DEFINE_FREE (for mntput()); local in fs/namespace.c, to
> be used only for keeping shit out of namespace_... and mount_...
> scopes.
> 4/52) __detach_mounts(): use guards
> 5/52) __is_local_mountpoint(): use guards
> 6/52) do_change_type(): use guards
> 7/52) do_set_group(): use guards
> 8/52) mark_mounts_for_expiry(): use guards
> 9/52) put_mnt_ns(): use guards
> 10/52) mnt_already_visible(): use guards
> 	A bunch of clear-cut conversions, with explanations of the
> reasons why this or that guard is needed.
> 11/52) check_for_nsfs_mounts(): no need to take locks
> 	... and here we have one where it turns out that the locking
> had been excessive.  Iterating through a subtree in
> mount_locked_reader scope is safe, all right, but (1) mount_writer is
> not needed here at all and (2) namespace_shared + a reference held to
> the root of the subtree is also enough.  All callers had (2) already.
> Documented the locking requirements for the function, removed
> {,un}lock_mount_hash() in it...
> 12/52) propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint()
> 	This one is interesting - the existing code had been equivalent
> to scoped_guard(mount_locked_reader), and that's right for this call.
> However, mnt_set_mountpoint() generally requires mount_writer - the
> only reason we get away without it here is that the mount in question
> had never been reachable from the mounts visible to other threads.
> 13/52) has_locked_children(): use guards
> 14/52) mnt_set_expiry(): use guards
> 15/52) path_is_under(): use guards
> 	More clear-cut conversions with explanations.
> 16/52) current_chrooted(): don't bother with follow_down_one()
> 17/52) current_chrooted(): use guards
> 	This pair might be better off with #16 taken to the beginning
> of the series (or to a separate branch merged into this one); no
> better reason for doing it as I did than wanting to keep the guard
> infrastructure at the very beginning.
>
> Part 2: turning unlock_mount() into __cleanup.
>
> 	The environment for mounting something on a given location
> consists of:
> 1) namespace_excl scope
> 2) parent mount - the one we'll be attaching things to.
> 3) mountpoint-to-be, protected from disappearing under us.
> 4) inode of that mountpoint's dentry held exclusive.
> 	Unfortunately, we can't take inode locks in namespace_excl
> scopes.  And we want to cope with the possibility that somebody has
> managed to mount something on that place while we'd been taking
> locks.  The "cope" part is simple for finish_automount() ("drop our
> mount and go away quietly; somebody triggered it before we did"), but
> for everything else it's trickier - "use whatever's overmounting that
> place now (with the right locks, please)".
> 	lock_mount() does all of that (do_lock_mount(), actually), with
> unlock_mount() closing the scope.  And it's definitely a good
> candidate for a __cleanup()-based approach, except that
> * the damn thing can return an error, and conditional variants of
> that infrastructure are too revolting.
> * the parent mount is returned in a fucking awful way - we modify the
> struct path passed to us as the location to mount on, and then its
> ->mnt is the parent-to-be... except for the "beneath" variant, where
> we play convoluted games with "no, here we want the parent of that".
> The implementation is also vulnerable to umount propagation races.
> * the structure we set up (everything except the parent) is inserted
> into a linked list by lock_mount().  That excludes DEFINE_CLASS() -
> it wants the value formed and then copied into the variable we are
> defining.
> * it contains an implicit namespace_excl scope, so path_put() and its
> ilk *must* be done after the unlock_mount().  And most of the users
> have gotos past that.
> 	The first two problems are solved by adding an explicit pointer
> to the parent mount into struct pinned_mountpoint.  Having a
> lock_mount() failure reported by setting it to ERR_PTR(-E...) allows
> us to avoid the problem with expressing constructor failure.  The
> third one is dealt with by defining local macros to be used instead
> of CLASS - I went with LOCK_MOUNT(mp, path), which defines struct
> pinned_mountpoint mp with __cleanup(unlock_mount) and sets it up.  If
> anybody has better suggestions, I'll be glad to hear them.
> 	The last one is dealt with by massaging the users into a form
> that would have all post-unlock_mount() stuff done by __free().
>
> 	First, several trivial cleanups:
> 18/52) do_move_mount(): trim local variables
> 19/52) do_move_mount(): deal with the checks on old_path early
> 20/52) move_mount(2): take sanity checks in 'beneath' case into do_lock_mount()
> 21/52) finish_automount(): simplify the ELOOP check
>
> 	Getting rid of post-unlock_mount() stuff:
> 22/52) do_loopback(): use __free(path_put) to deal with old_path
> 23/52) pivot_root(2): use __free() to deal with struct path in it
> 24/52) finish_automount(): take the lock_mount() analogue into a helper
> 	This one turns the open-coded logic into lock_mount_exact(),
> with the same kind of calling conventions as lock_mount() and
> do_lock_mount().
> 25/52) do_new_mount_rc(): use __free() to deal with dropping mnt on failure
> 26/52) finish_automount(): use __free() to deal with dropping mnt on failure
>
> 	This is the main part:
> 27/52) change calling conventions for lock_mount() et al.
>
> 	Followups, cleaning up the games with the parent mount in the
> users:
> 28/52) do_move_mount(): use the parent mount returned by do_lock_mount()
> 29/52) do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path
> 30/52) graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint
>
> Part 3: getting rid of mutating struct path there.
>
> 	do_lock_mount() is still playing silly buggers with the struct
> path it had been given - the logic in that thing hasn't changed.  It's
> not a pretty function and it's racy as well; the thing is, by this
> point its users have almost no use for the changed contents of struct
> path - the dentry can be derived from struct mountpoint, the parent
> mount to use is provided directly, and we want that a lot more than
> the modified path->mnt.  There's only one place (in
> can_move_mount_beneath()) where we still want that, and it's not hard
> to reconstruct the value from the *original* path->mnt value + the
> parent mount to be used.
>
> 	Getting rid of ->dentry uses:
> 31/52) pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry
> 32/52) don't bother passing new_path->dentry to can_move_mount_beneath()
>
> 	A helper, already open-coded in a couple of places; carved out
> of the next patch to keep it reasonably small:
> 33/52) new helper: topmost_overmount()
>
> 	Rewrite of do_lock_mount() to keep the path constant + a
> trivial change in do_move_mount() to adjust the argument it passes to
> can_move_mount_beneath():
> 34/52) do_lock_mount(): don't modify path.
>
> Part 5: a bunch of trivial cleanups (mostly constifications)
>
> 35/52) constify check_mnt()
> 36/52) do_mount_setattr(): constify path argument
> 37/52) do_set_group(): constify path arguments
> 38/52) drop_collected_paths(): constify arguments
> 39/52) collect_paths(): constify the return value
> 40/52) do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)
> 41/52) mnt_warn_timestamp_expiry(): constify struct path argument
> 42/52) do_new_mount{,_fc}(): constify struct path argument
> 43/52) do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument
> 44/52) path_mount(): constify struct path argument
> 45/52) may_copy_tree(), __do_loopback(): constify struct path argument
> 46/52) path_umount(): constify struct path argument
> 47/52) constify can_move_mount_beneath() arguments
> 48/52) do_move_mount_old(): use __free(path_put)
> 49/52) do_mount(): use __free(path_put)
>
> Part 6: assorted stuff, will grow.
>
> 50/52) umount_tree(): take all victims out of propagation graph at once
> 	[had been earlier]
> 	For each removed mount we need to calculate where its slaves
> will end up.  To avoid duplicating that work, do it for all mounts to
> be removed at once, taking the mounts themselves out of the
> propagation graph as we go, then do all transfers; the duplicate work
> on finding destinations is avoided, since if we run into a mount that
> already had its destination found, we don't need to trace the rest of
> the way.  That's guaranteed O(removed mounts) for finding destinations
> and removing from the propagation graph, and O(surviving mounts that
> have their master removed) for transfers.
>
> 51/52) ecryptfs: get rid of pointless mount references in ecryptfs dentries
> 	->lower_path.mnt has the same value for all dentries on a given
> ecryptfs instance, and if somebody goes for a mountpoint-crossing
> variant where that would not be true, we can deal with it when it
> happens (and _not_ by duplicating these references into each dentry).
> 	As it is, we are better off just sticking a reference into the
> ecryptfs-private part of the superblock and keeping it pinned until
> ->kill_sb().  That way we can stick a reference to the underlying
> dentry right into ->d_fsdata of the ecryptfs one, getting rid of the
> indirection through struct ecryptfs_dentry_info, along with the
> entire struct ecryptfs_dentry_info machinery.
>
> 52/52) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
> 	The comments regarding "shadow mounts" were stale - no such
> thing anymore.  Document the locking requirements for __lookup_mnt()...
>
>
> FWIW, the current diffstat:
>
>  fs/ecryptfs/dentry.c          |  14 +-
>  fs/ecryptfs/ecryptfs_kernel.h |  27 +-
>  fs/ecryptfs/file.c            |  15 +-
>  fs/ecryptfs/inode.c           |  19 +-
>  fs/ecryptfs/main.c            |  24 +-
>  fs/internal.h                 |   4 +-
>  fs/mount.h                    |  12 +
>  fs/namespace.c                | 775 +++++++++++++++++++-----------------------
>  fs/pnode.c                    |  75 ++--
>  fs/pnode.h                    |   1 +
>  include/linux/mount.h         |   4 +-
>  kernel/audit_tree.c           |  12 +-
>  12 files changed, 464 insertions(+), 518 deletions(-)