Re: [PATCH 02/52] introduced guards for mount_lock

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Tue, 26 Aug 2025 00:44:13 +0100

On Mon, Aug 25, 2025 at 09:21:41PM +0100, Al Viro wrote:

> 	FWIW, I'm considering the possibility of having copy_tree() delay
> hashing all nodes in the copy and having them hashed all at once; fewer disruptions
> for lockless readers that way.  All nodes in the copy are reachable only for the
> caller; we do need mount_locked_reader for attaching a new node to copy (it has
> to be inserted into the per-mountpoint lists of mounts), but we don't need to
> bump the seqcount every time - and we can't hold a spinlock over allocations.
> It's not even that hard; all we'd need is a bit of a change in commit_tree()
> and in a couple of places where we create a namespace with more than one node -
> we have the loops in those places already where we insert the mounts into
> per-namespace rbtrees; same loops could handle hashing them.

The main issue I'm having with that is that currently "in list of children" implies
"hashed"; equivalent, even, except for a transient state seen only in mount_writer.
OTOH, having that not true for unreachable mounts...  I'm trying to find anything
that might care, but I don't see any candidates.

It would be nice to have regardless of doing fewer mount_lock seqcount bumps -
better isolation from shared data structures until we glue them in place would
make for simpler correctness proofs...

Anyway,
	copy_tree() call chains:
1.  copy_tree() <- propagate_mnt() <- attach_recursive_mnt(), with the call
chain prior to that point being one the
		<- graft_tree() <- do_loopback()
		<- graft_tree() <- do_add_mount() <- do_new_mount_fc()
		<- graft_tree() <- do_add_mount() <- finish_automount()
		<- do_move_mount().
All of those start inside a lock_mount scope.
Result gets passed (prior to return from attach_recursive_mnt(), within
an mnt_writer scope there) either to commit_tree() or to umount_tree(),
without having been visible to others prior to that.
	That's creation of secondary copies from mount propagation, for
various pathways to mounting stuff.

2.  copy_tree() <- __do_loopback() <- do_loopback().  Inside a lock_mount scope.
Result gets passed into graft_tree() -> attach_recursive_mnt().  In the latter
either it gets passed to commit_tree() (within mount_writer scope, without
having been visible to others prior to that), in which case success is reported,
or it is left alone and error gets reported; in that case back in do_loopback()
it gets passed to umount_tree(), again in mount_writer scope and without having
been visible to others prior to that.
	That's MS_BIND|MS_REC mount(2).

3.  copy_tree() <- __do_loopback() <- open_detached_copy().  In namespace_excl
scope.  Result is fed through a loop that inserts those mounts into rbtree
of new namespace (in mount_writer scope) and its root is stored as ->root
of that new namespace.  Once out of namespace_excl scope, the tree becomes
visible (and an extra reference is attached to the file we are opening).
	That's open_tree(2)/open_tree_attr(2) with OPEN_TREE_CLONE.
	BTW, a bit of mystery there: insertions into rbtree don't need to be in
mount_writer - we do have places where it's done without that, all readers are
in namespace_shared scopes *and* the namespace, along with its rbtree, is not
visible to anyone yet to start with.  If we delay hashing until there it will
need mount_writer, though.

4.  copy_tree() <- copy_mnt_ns().  In namespace_excl scope.  Somewhat similar
to the previous, but the namespace is not an anonymous one and we have a couple
of extra passes - one might do lock_mnt_tree() (under mount_writer, almost
certainly excessive - mount_locked_reader would do just fine) and another
(combined with rbtree insertions) finds the counterparts of root and pwd of
the caller and flips over to those.  Old ones get dropped after we leave
the scope.

Looks like we should be able to unify quite a bit of logics in populating
a new namespace and yes, delaying hash insertions past copy_tree() looks
plausible...

	Incidentally, destruction of new namespace on copy_tree() failure
is another mystery: here we do
                ns_free_inum(&new_ns->ns);
		dec_mnt_namespaces(new_ns->ucounts);
		mnt_ns_release(new_ns);
and in open_detached_copy() it's
	free_mnt_ns(ns);

They are similar - free_mnt_ns() is
	if (!is_anon_ns(ns))
		ns_free_inum(&ns->ns);
	dec_mnt_namespaces(ns->ucounts);
	mnt_ns_tree_remove(ns);
and mnt_ns_tree_remove() is a bunch of !is_anon_ns() code, followed by
an rcu-delayed mnt_ns_release().  So in case of open_detached_copy(),
where the namespace is anonymous, it boils down to an RCU-delayed
call of mnt_ns_release()...

AFAICS the only possible reasons not to use free_mnt_ns() here are
	1) avoiding an RCU-delayed call and
	2) conditional removal of ns from mnt_ns_tree.

As for the second, couldn't we simply use !list_empty(&ns->mnt_ns_list)
as a condition?  And avoiding an RCU delay... nice, in principle, but
the case when that would've saved us anything is CLONE_NEWNS clone(2) or
unshare(2) failing due to severe OOM.  Do we give a damn about one extra
call_rcu() for each of such failures?

mnt_ns_tree handling is your code; do you see any problems with

static void mnt_ns_tree_remove(struct mnt_namespace *ns)
{
	/* remove from global mount namespace list */
	if (!list_empty(&ns->mnt_ns_list)) {
		mnt_ns_tree_write_lock();
		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
		list_bidir_del_rcu(&ns->mnt_ns_list);
		mnt_ns_tree_write_unlock();
	}

	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
}
and
	mnt = __do_loopback(path, recursive);
	if (IS_ERR(mnt)) {
		emptied_ns = ns;
		namespace_unlock();
		return ERR_CAST(mnt);
	}
in open_detached_copy() and
	new = copy_tree(old, old->mnt.mnt_root, copy_flags);
	if (IS_ERR(new)) {
		emptied_ns = new_ns;
		namespace_unlock();
		return ERR_CAST(new);
	}
in copy_mnt_ns()?