On Mon, Aug 25, 2025 at 02:46:04PM +0100, Al Viro wrote:
> Basically, there are 3 kinds of contexts here:
> 1) lockless, must be under RCU, fairly limited in which pointers they
> can traverse, read-only access to the structures in question. Must
> sample the seqcount side of mount_lock first, then verify that it has
> not changed once everything's done.
>
> 2) hold the spinlock side of mount_lock, _without_ bumping the seqcount
> one. Can be used for reads and writes, as long as the stuff being
> modified is not among the things that are traversed locklessly. Does
> not disrupt the previous class, has full exclusion with classes 2 and 3.
>
> 3) hold the spinlock side of mount_lock, and bump the seqcount one on
> entry and on leaving. Any reads and writes. Full exclusion with classes
> 2 and 3, invalidates the checks for class 1 (i.e. will push it into
> retries/fallbacks/whatnot).

FWIW, partial dump from what I hope to push out as docs:

* all modifications of mount hash chains must be mount_writer.

* only one function is allowed to traverse hash chains - __lookup_mnt().
The important part here is reachability - the hash is a shared data
structure, but a struct mount instance can be reached that way only if
its parent is equal to the argument you've been able to pass to
__lookup_mnt().

* callers of __lookup_mnt() must either be at least mount_locked_reader
OR hold rcu_read_lock through the entire thing, sample the seqcount side
of mount_lock before the call, validate it afterwards and discard the
attempt entirely if validation fails. Note that __legitimize_mnt()
contains such validation.

* being hashed contributes 1 to refcount.

* (sub)tree topology (encoded in ->mnt_parent, ->mnt_mounts/->mnt_child,
->mnt_mp, ->mnt_mountpoint and ->overmount) is stabilized by either
mount_locked_reader OR by namespace_shared + positive refcount for the
root of the subtree. namespace_shared by itself is *NOT* enough. When the
last reference to a mount past umount_tree() (i.e.
already with NULL ->mnt_ns) goes away, any subtree stuck to it will be
detached from it and have its root unhashed and dropped. In other words,
such a tree (e.g. the result of umount -l) decays from root to leaves -
once all references to the root are gone, it's cut off and all pieces are
left to decay. That is done with mount_writer (it has to be - there are
mount hash changes and for those mount_writer is a hard requirement) and
only after the final reference to the root has been dropped.
All other topology changes happen under namespace_excl and, at least,
mount_locked_reader. Normally - with mount_writer; the only exception is
that setting the parent of a newly allocated subtree is fine with
mount_locked_reader; we are not hashing it yet (that's done only in
commit_tree()), so there's no need to disrupt the lockless readers; note
that RCU pathwalk *is* such a reader, so blind use of mount_writer has an
effect on performance.
->mnt_mounts/->mnt_child is never traversed unless the tree is stabilized
by either lock (note that list modifications there are not done with
..._rcu() primitives). ->overmount, ->mnt_parent and ->mnt_mountpoint can
be; those need sample/validate on the seqcount side; that *would* require
mount_writer from those who modify them, except that for the ones that
have never been reachable yet we don't need to bother. In practice,
->overmount is changed along with the mount hash, so we need mount_writer
anyway; ->mnt_parent/->mnt_mountpoint/->mnt_mp need it only for reachable
mounts.
[[ FWIW, I'm considering the possibility of having copy_tree() delay
hashing the nodes in the copy and having them all hashed at once; fewer
disruptions for lockless readers that way. All nodes in the copy are
reachable only for the caller; we do need mount_locked_reader for
attaching a new node to the copy (it has to be inserted into the
per-mountpoint lists of mounts), but we don't need to bump the seqcount
every time - and we can't hold a spinlock over allocations.
It's not even that hard; all we'd need is a bit of a change in
commit_tree() and in a couple of places where we create a namespace with
more than one node - we already have loops in those places where we
insert the mounts into per-namespace rbtrees; the same loops could handle
hashing them. ]]

* propagation graph (->mnt_share, ->mnt_slave/->mnt_slave_list,
->mnt_master, ->mnt_group_id, IS_MNT_SHARED()) is modified only under
namespace_excl; all accesses are under at least namespace_shared. Only
mounts that belong to a namespace may be reached via those; umount_tree()
removes all victims from the graph before it returns, and it's impossible
to include something that isn't a part of some namespace into the graph
afterwards.

* ->mnt_expire is accessed (both traversals and modifications) under
mount_locked_reader. No lockless traversals there.

* per-namespace rbtree (->mnt_node linkage) is modified only under
namespace_excl and all traversals are at least namespace_shared. A mount
leaving a namespace is removed from the rbtree before the end of the
namespace_excl scope.

* ->mnt_root and ->mnt_sb are assign-once; never changed. So are
->mnt_devname, ->mnt_id and ->mnt_id_unique.

* per-mountpoint mount lists (->mnt_mp_list) are mount_locked_reader for
all accesses (modification and traversal alike).

* ->prev_ns is a fucking mess.

* ->mnt_umount has only transient uses; umount_tree() uses it to link the
victims to be dropped at namespace_unlock(), the final mntput links the
stuck children into a list stashed in ->mnt_stuck_children, also for
eventual dropping (by cleanup_mnt()). mount_writer for gathering them
into those lists, nothing for "dissolve and drop everything on the list"
- in both cases the lists are visible only to a single thread by that
point.