On Mon, Aug 25, 2025 at 02:46:04PM +0100, Al Viro wrote:
> Basically, there are 3 kinds of contexts here:
> 1) lockless, must be under RCU, fairly limited in which pointers they
> can traverse, read-only access to the structures in question. Must
> sample the seqcount side of mount_lock first, then verify that it has
> not changed once everything's done.
>
> 2) hold the spinlock side of mount_lock, _without_ bumping the seqcount
> one. Can be used for reads and writes, as long as the stuff being
> modified is not among the things that are traversed locklessly. Does
> not disrupt the previous class, has full exclusion with classes 2 and 3.
>
> 3) hold the spinlock side of mount_lock, and bump the seqcount one on
> entry and on leaving. Any reads and writes. Full exclusion with classes
> 2 and 3, invalidates the checks for class 1 (i.e. will push it into
> retries/fallbacks/whatnot).

FWIW, partial dump from what I hope to push out as docs:

* all modifications of mount hash chains must be mount_writer.

* only one function is allowed to traverse hash chains - __lookup_mnt().
The important part here is reachability - the hash is a shared data
structure, but a struct mount instance can be reached that way only if
its parent is equal to the argument you've been able to pass to
__lookup_mnt().

* callers of __lookup_mnt() must either be at least mount_locked_reader
OR hold rcu_read_lock through the entire thing, sample the seqcount side
of mount_lock before the call, validate it afterwards and discard the
attempt entirely if validation fails. Note that __legitimize_mnt()
contains such validation.

* being hashed contributes 1 to refcount.

* (sub)tree topology (encoded in ->mnt_parent, ->mnt_mounts/->mnt_child,
->mnt_mp, ->mnt_mountpoint and ->overmount) is stabilized by either
mount_locked_reader OR by namespace_shared + positive refcount for the
root of the subtree. namespace_shared by itself is *NOT* enough. When the
last reference to a mount past umount_tree() (i.e.
already with NULL ->mnt_ns) goes away, any subtree stuck to it will be
detached from it and have its root unhashed and dropped. In other words,
such a tree (e.g. the result of umount -l) decays from root to leaves -
once all references to the root are gone, it's cut off and all pieces are
left to decay. That is done with mount_writer (it has to be - there are
mount hash changes and for those mount_writer is a hard requirement) and
only after the final reference to the root has been dropped.
All other topology changes happen under namespace_excl and, at least,
mount_locked_reader. Normally - with mount_writer; the only exception is
that setting the parent of a newly allocated subtree is fine with
mount_locked_reader; we are not hashing it yet (that's done only in
commit_tree()), so there's no need to disrupt the lockless readers; note
that RCU pathwalk *is* such a reader, so blind use of mount_writer has an
effect on performance.
->mnt_mounts/->mnt_child is never traversed unless the tree is stabilized
by either lock (note that list modifications there are not done with
..._rcu() primitives). ->overmount, ->mnt_parent and ->mnt_mountpoint can
be; those need sample/validate on the seqcount side; that *would* require
mount_writer from those who modify them, except that for the ones that
have never been reachable yet we don't need to bother. In practice,
->overmount is changed along with the mount hash, so we need mount_writer
anyway; ->mnt_parent/->mnt_mountpoint/->mnt_mp need it only for reachable
mounts.
[[ FWIW, I'm considering the possibility of having copy_tree() delay
hashing the nodes in the copy and having them all hashed at once; fewer
disruptions for lockless readers that way. All nodes in the copy are
reachable only for the caller; we do need mount_locked_reader for
attaching a new node to the copy (it has to be inserted into the
per-mountpoint lists of mounts), but we don't need to bump the seqcount
every time - and we can't hold a spinlock over allocations.
It's not even that hard; all we'd need is a bit of a change in
commit_tree() and in a couple of places where we create a namespace with
more than one node - we already have loops in those places where we
insert the mounts into per-namespace rbtrees; the same loops could handle
hashing them. ]]

* propagation graph (->mnt_share, ->mnt_slave/->mnt_slave_list,
->mnt_master, ->mnt_group_id, IS_MNT_SHARED()) is modified only under
namespace_excl; all accesses are under at least namespace_shared. Only
mounts that belong to a namespace may be reached via those; umount_tree()
removes all victims from the graph before it returns, and it's impossible
to include something that isn't a part of some namespace into the graph
afterwards.

* ->mnt_expire is accessed (both traversals and modifications) under
mount_locked_reader. No lockless traversals there.

* per-namespace rbtree (->mnt_node linkage) is modified only under
namespace_excl and all traversals are at least namespace_shared. A mount
leaving a namespace is removed from the rbtree before the end of the
namespace_excl scope.

* ->mnt_root and ->mnt_sb are assign-once; never changed. So are
->mnt_devname, ->mnt_id and ->mnt_id_unique.

* per-mountpoint mount lists (->mnt_mp_list) are mount_locked_reader for
all accesses (modification and traversal alike).

* ->prev_ns is a fucking mess.

* ->mnt_umount has only transient uses; umount_tree() uses it to link the
victims to be dropped at namespace_unlock(), the final mntput links the
stuck children into a list stashed in ->mnt_stuck_children, also for
eventual dropping (by cleanup_mnt()). mount_writer for gathering them
into those lists, nothing for "dissolve and drop everything on the list"
- in both cases the lists are visible only to a single thread by that
point.