More audit fallout. Rules for using the mount hash (will go into the docs I'm putting together):

* all insertions are under EXCL(namespace_sem) and write_seqlock(mount_lock)
* all removals are under write_seqlock(mount_lock)
* all removals are either under EXCL(namespace_sem) or have parent's refcount equal to zero.
* since all changes to hash chains are under write_seqlock(mount_lock), hash search is safe if you have at least read_seqlock_excl(mount_lock). In that case the result is guaranteed to be accurate.
* removals are done with hlist_del_init_rcu(); freeing of the removed object is always RCU-delayed, so holding rcu_read_lock() over the traversal is memory-safe. HOWEVER, it is entirely possible to step into an object being removed from the hash chain at the time of search and get a false negative. If you are not holding at least read_seqlock_excl(mount_lock), you *must* sample the seqcount component of mount_lock before the search and check it afterwards.
* acquiring a reference to an object found by the traversal needs to be done carefully - it is safe under mount_lock, but when used under rcu_read_lock() alone we do need __legitimize_mnt(); otherwise we are risking a race with the final mntput() and/or the busy-mount checks in sync umount.

Search is done by __lookup_mnt().

Callers under write_seqlock(mount_lock):
* all callers in fs/pnode.c and one in attach_recursive_mnt().

Callers under rcu_read_lock():
* lookup_mnt() - the seqcount component of mount_lock is used to deal with races; the check is done by legitimize_mnt().
* path_overmounted() - the callers are under EXCL(namespace_sem) and they are holding a reference to the parent, so removal of the object we are searching for is excluded. No reference to the child is acquired, so questions about its validity do not arise. Unfortunately, removals of other objects in the same hash chain are *not* excluded, so a false negative is possible there.
The bug in path_overmounted() appears to have come from the corresponding chunk of finish_automount(), which had the same race (and got turned into a call of path_overmounted()). One possibility is to wrap the use of __lookup_mnt() there into a sample-and-recheck loop; for the call of path_overmounted() in finish_automount() that gives the right behaviour. The other caller, though...

The thing is, it's not really specific to the "move beneath" case - the secondary copies always have to deal with the possibility of a "slip under existing mount" situation. And yes, it can lead to problems - for all attach_recursive_mnt() callers. Note that do_loopback() can race with mount(2) on top of old_path - that might have happened between the pathname resolution for the source and lock_mount() for the mountpoint. We *can't* hold namespace_sem over the pathname resolution - it would be very easy to deadlock.

One possibility would be to have all of them traverse the chain of overmounts on top of the thing we are going to copy/move, basically pretending that we'd lost the race to whatever had done the overmount. The problem with that is that there might have been no race at all - the source might have come from "." or a descriptor + empty path. In that case we currently copy/move the entire thing, and that just might be used deliberately.

Another possibility is to teach attach_recursive_mnt() to cope with overmounts in the source - it's not that hard; all we need is to find the top of the overmount chain there at the very beginning and, in all "slip under existing mount" cases, have the existing mount reparented to the top of that chain.

I'm not sure which one gives better semantics, TBH. I'd be more inclined towards the second option - on the theory that we can combine it with the first one in cases where that's applicable. Comments?