On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote:
> 	Most of this pile is basically an attempt to see how well
> cleanup.h-style mechanisms apply in mount handling.  That stuff lives
> in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount
> Rebased to -rc3 (used to be a bit past -rc2, branched at the mount
> fixes merge).  Individual patches in followups.
>
> 	Please, help with review and testing.  It seems to survive the
> local beating and code generation seems to be OK, but more testing
> would be a good thing and I would really like to see comments on that
> stuff.

Btw, I just realized that basically none of your commits have any lore
links in them. That kinda sucks because I very often just look at a
commit and then use the link to jump to the mailing list discussion for
more context about a change and how it came about. So pretty please can
you start adding lore links to your commits when applying, if it's not
fucking up your workflow too much?

> 	This is not all I've got around mount handling, but I'd rather
> get that thing out for review before starting to sort out other local
> mount-related branches.
>
> Series overview:
>
> Part 1: guards.
>
> 	This part starts with infrastructure, followed by one-by-one
> conversions to guard/scoped_guard in some of the places that fit it
> well enough.  Note that one of those places turned out to be taking
> mount_lock for no reason whatsoever; I already see places where we do
> write_seqlock when read_seqlock_excl would suffice, etc.
>
> 	Folks, _please_ don't do any bulk conversions in that area.
> IMO one area where RAII becomes dangerous is locking; usually it's not
> a big deal to delay freeing some object a bit, but delay dropping a
> lock and you risk introducing deadlocks that will be bloody hard to
> spot.  It _has_ to be done carefully; we had trouble in that area
> several times over the last year or so in fs/namespace.c alone.
> 	Another fun problem is that quite a few comments regarding the
> locking in there are stale.  We still have comments that talk about
> mount_lock as if it were an rwlock-like thing.  It hasn't been that
> for more than a decade now.  It needs to be documented sanely; so do
> the access rules for the data structures involved.  I hope to get some
> of that into the tree this cycle, but it's still in progress.
>
> 1/52) fs/namespace.c: fix the namespace_sem guard mess
> 	New guards: namespace_excl and namespace_shared.  The former
> implies the latter, as for anything rwsem-like.  No inode locks, no
> dropping the final references, no opening files, etc. in the scope of
> those.
> 2/52) introduced guards for mount_lock
> 	New guards: mount_writer, mount_locked_reader.  That's
> write_seqlock and read_seqlock_excl on mount_lock; obviously, nothing
> blocking should be done in the scope of those.
> 3/52) fs/namespace.c: allow to drop vfsmount references via __free(mntput)
> 	Missing DEFINE_FREE (for mntput()); local in fs/namespace.c, to
> be used only for keeping shit out of namespace_... and mount_...
> scopes.
> 4/52) __detach_mounts(): use guards
> 5/52) __is_local_mountpoint(): use guards
> 6/52) do_change_type(): use guards
> 7/52) do_set_group(): use guards
> 8/52) mark_mounts_for_expiry(): use guards
> 9/52) put_mnt_ns(): use guards
> 10/52) mnt_already_visible(): use guards
> 	A bunch of clear-cut conversions, with explanations of the
> reasons why this or that guard is needed.
> 11/52) check_for_nsfs_mounts(): no need to take locks
> 	... and here we have one where it turns out that the locking
> had been excessive.  Iterating through a subtree in
> mount_locked_reader scope is safe, all right, but (1) mount_writer is
> not needed here at all and (2) namespace_shared + a reference held to
> the root of the subtree is also enough.  All callers had (2) already.
> Documented the locking requirements for the function, removed
> {,un}lock_mount_hash() in it...
> 12/52) propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint()
> 	This one is interesting - the existing code had been equivalent
> to scoped_guard(mount_locked_reader), and that's right for this call.
> However, mnt_set_mountpoint() generally requires mount_writer - the
> only reason we get away without it here is that the mount in question
> had never been reachable from the mounts visible to other threads.
> 13/52) has_locked_children(): use guards
> 14/52) mnt_set_expiry(): use guards
> 15/52) path_is_under(): use guards
> 	More clear-cut conversions with explanations.
> 16/52) current_chrooted(): don't bother with follow_down_one()
> 17/52) current_chrooted(): use guards
> 	This pair might be better off with #16 taken to the beginning
> of the series (or to a separate branch merged into this one); no
> better reason for doing it as I did than wanting to keep the guard
> infrastructure at the very beginning.
>
> Part 2: turning unlock_mount() into __cleanup.
>
> 	The environment for mounting something on a given location
> consists of:
> 1) namespace_excl scope
> 2) parent mount - the one we'll be attaching things to.
> 3) mountpoint-to-be, protected from disappearing under us.
> 4) inode of that mountpoint's dentry held exclusive.
> 	Unfortunately, we can't take inode locks in namespace_excl
> scopes.  And we want to cope with the possibility that somebody has
> managed to mount something on that place while we'd been taking
> locks.  The "cope" part is simple for finish_automount() ("drop our
> mount and go away quietly; somebody triggered it before we did"), but
> for everything else it's trickier - "use whatever's overmounting that
> place now (with the right locks, please)".
> 	lock_mount() does all of that (do_lock_mount(), actually), with
> unlock_mount() closing the scope.  And it's definitely a good
> candidate for a __cleanup()-based approach, except that
> * the damn thing can return an error, and conditional variants of
> that infrastructure are too revolting.
> * the parent mount is returned in a fucking awful way - we modify the
> struct path passed to us as the location to mount on, and then its
> ->mnt is the parent-to-be... except for the "beneath" variant, where
> we play convoluted games with "no, here we want the parent of that".
> The implementation is also vulnerable to umount propagation races.
> * the structure we set up (everything except the parent) is inserted
> into a linked list by lock_mount().  That excludes DEFINE_CLASS() -
> it wants the value formed and then copied into the variable we are
> defining.
> * it contains an implicit namespace_excl scope, so path_put() and its
> ilk *must* be done after the unlock_mount().  And most of the users
> have gotos past that.
> 	The first two problems are solved by adding an explicit pointer
> to the parent mount into struct pinned_mountpoint.  Having a
> lock_mount() failure reported by setting it to ERR_PTR(-E...) allows
> us to avoid the problem with expressing constructor failure.  The
> third one is dealt with by defining local macros to be used instead
> of CLASS - I went with LOCK_MOUNT(mp, path), which defines struct
> pinned_mountpoint mp with __cleanup(unlock_mount) and sets it up.  If
> anybody has better suggestions, I'll be glad to hear them.
> 	The last one is dealt with by massaging the users into a form
> that would have all post-unlock_mount() stuff done by __free().
>
> 	First, several trivial cleanups:
> 18/52) do_move_mount(): trim local variables
> 19/52) do_move_mount(): deal with the checks on old_path early
> 20/52) move_mount(2): take sanity checks in 'beneath' case into do_lock_mount()
> 21/52) finish_automount(): simplify the ELOOP check
>
> 	Getting rid of post-unlock_mount() stuff:
> 22/52) do_loopback(): use __free(path_put) to deal with old_path
> 23/52) pivot_root(2): use __free() to deal with struct path in it
> 24/52) finish_automount(): take the lock_mount() analogue into a helper
> 	This one turns the open-coded logic into lock_mount_exact(),
> with the same kind of calling conventions as lock_mount() and
> do_lock_mount().
> 25/52) do_new_mount_rc(): use __free() to deal with dropping mnt on failure
> 26/52) finish_automount(): use __free() to deal with dropping mnt on failure
>
> 	This is the main part:
> 27/52) change calling conventions for lock_mount() et al.
>
> 	Followups, cleaning up the games with the parent mount in the
> users:
> 28/52) do_move_mount(): use the parent mount returned by do_lock_mount()
> 29/52) do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path
> 30/52) graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint
>
> Part 3: getting rid of mutating struct path there.
>
> 	do_lock_mount() is still playing silly buggers with the struct
> path it had been given - the logic in that thing hasn't changed.  It's
> not a pretty function and it's racy as well; the thing is, by this
> point its users have almost no use for the changed contents of struct
> path - the dentry can be derived from struct mountpoint, the parent
> mount to use is provided directly, and we want that a lot more than
> the modified path->mnt.  There's only one place (in
> can_move_mount_beneath()) where we still want that, and it's not hard
> to reconstruct the value from the *original* path->mnt value + the
> parent mount to be used.
>
> 	Getting rid of ->dentry uses:
> 31/52) pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry
> 32/52) don't bother passing new_path->dentry to can_move_mount_beneath()
>
> 	A helper, already open-coded in a couple of places; carved out
> of the next patch to keep it reasonably small:
> 33/52) new helper: topmost_overmount()
>
> 	Rewrite of do_lock_mount() to keep the path constant + a
> trivial change in do_move_mount() to adjust the argument it passes to
> can_move_mount_beneath():
> 34/52) do_lock_mount(): don't modify path.
>
> Part 5: a bunch of trivial cleanups (mostly constifications)
>
> 35/52) constify check_mnt()
> 36/52) do_mount_setattr(): constify path argument
> 37/52) do_set_group(): constify path arguments
> 38/52) drop_collected_paths(): constify arguments
> 39/52) collect_paths(): constify the return value
> 40/52) do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)
> 41/52) mnt_warn_timestamp_expiry(): constify struct path argument
> 42/52) do_new_mount{,_fc}(): constify struct path argument
> 43/52) do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument
> 44/52) path_mount(): constify struct path argument
> 45/52) may_copy_tree(), __do_loopback(): constify struct path argument
> 46/52) path_umount(): constify struct path argument
> 47/52) constify can_move_mount_beneath() arguments
> 48/52) do_move_mount_old(): use __free(path_put)
> 49/52) do_mount(): use __free(path_put)
>
> Part 6: assorted stuff, will grow.
>
> 50/52) umount_tree(): take all victims out of propagation graph at once
> 	[had been earlier]
> 	For each removed mount we need to calculate where its slaves
> will end up.  To avoid duplicating that work, do it for all mounts to
> be removed at once, taking the mounts themselves out of the
> propagation graph as we go, then do all transfers; the duplicate work
> on finding destinations is avoided, since if we run into a mount that
> already had its destination found, we don't need to trace the rest of
> the way.  That's guaranteed O(removed mounts) for finding destinations
> and removing from the propagation graph, and O(surviving mounts that
> have their master removed) for transfers.
>
> 51/52) ecryptfs: get rid of pointless mount references in ecryptfs dentries
> 	->lower_path.mnt has the same value for all dentries on a given
> ecryptfs instance, and if somebody goes for a mountpoint-crossing
> variant where that would not be true, we can deal with it when it
> happens (and _not_ by duplicating these references into each dentry).
> 	As it is, we are better off just sticking a reference into the
> ecryptfs-private part of the superblock and keeping it pinned until
> ->kill_sb().  That way we can stick a reference to the underlying
> dentry right into ->d_fsdata of the ecryptfs one, getting rid of the
> indirection through struct ecryptfs_dentry_info, along with the
> entire struct ecryptfs_dentry_info machinery.
>
> 52/52) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
> 	The comments regarding "shadow mounts" were stale - no such
> thing anymore.  Document the locking requirements for __lookup_mnt()...
>
>
> FWIW, the current diffstat:
>
>  fs/ecryptfs/dentry.c          |  14 +-
>  fs/ecryptfs/ecryptfs_kernel.h |  27 +-
>  fs/ecryptfs/file.c            |  15 +-
>  fs/ecryptfs/inode.c           |  19 +-
>  fs/ecryptfs/main.c            |  24 +-
>  fs/internal.h                 |   4 +-
>  fs/mount.h                    |  12 +
>  fs/namespace.c                | 775 +++++++++++++++++++-----------------------
>  fs/pnode.c                    |  75 ++--
>  fs/pnode.h                    |   1 +
>  include/linux/mount.h         |   4 +-
>  kernel/audit_tree.c           |  12 +-
>  12 files changed, 464 insertions(+), 518 deletions(-)