Re: [BUG][RFC] broken use of __lookup_mnt() in path_overmounted()

Christian Brauner <brauner@xxxxxxxxxx> · Mon, 2 Jun 2025 11:17:25 +0200

On Sat, May 31, 2025 at 09:57:00PM +0100, Al Viro wrote:
> 	More audit fallout.
> 
> Rules for using mount hash (will go into the docs I'm putting together):
> 	* all insertions are under EXCL(namespace_sem) and write_seqlock(mount_lock)
> 	* all removals are under write_seqlock(mount_lock)
> 	* all removals are either under EXCL(namespace_sem) or have parent's
> refcount equal to zero.
> 	* since all changes to hash chains are under write_seqlock(mount_lock),
> hash seatch is safe if you have at least read_seqlock_excl(mount_lock).  In
> that case the result is guaranteed to be accurate.
> 	* removals are done with hlist_del_init_rcu(); freeing of the removed
> object is always RCU-delayed, so holding rcu_read_lock() over traversal is
> memory safe.  HOWEVER, it is entirely possible to step into an object being
> removed from hash chain at the time of search and get a false negative.
> If you are not holding at least read_seqlock_excl(mount_lock), you *must*
> sample mount_lock seqcount component before the search and check it afterwards.
> 	* acquiring a reference to object found as the result of traversal needs
> to be done carefully - it is safe under mount_lock, but when used under rcu_read_lock()
> alone we do need __legitimize_mnt(); otherwise we are risking a race with
> final mntput() and/or busy mount checks in sync umount.
> 
> Search is done by __lookup_mnt().
> Callers under write_seqlock(mount_lock):
> 	* all callers in fs/pnode.c and one in attach_recursive_mnt().
> Callers under rcu_read_lock():
> 	* lookup_mnt() - seqcount component of mount_lock used to deal
> with races.  Check is done by legitimize_mnt().
> 	* path_overmounted() - the callers are under EXCL(namespace_sem)
> and they are holding a reference to parent, so removal of the object we
> are searching for is excluded.  Reference to child is not acquired, so
> the questions about validity of that do not arise.  Unfortunately, removals
> of objects in the same hash chain are *not* excluded, so a false negative
> is possible there.
> 
> Bug in path_overmounted() appears to have come from the corresponding
> chunk of finish_automount(), which had the same race (and got turned
> into a call of path_overmounted()).
> 
> One possibility is to wrap the use of __lookup_mnt() into a sample-and-recheck
> loop there; for the call of path_overmounted() in finish_automount() it'll
> give the right behaviour.
> 
> The other one, though...  The thing is, it's not really specific to
> "move beneath" case - the secondary copies always have to deal with the
> possibility of "slip under existing mount" situation.  And yes, it can
> lead to problems - for all attach_recursive_mnt() callers.

Fwiw, I have pointed this out in one of my really early submission of
this work and had asked whether we generally want the same check. That's
also why I added the shadow-mount check into the automount path because
that could be used for that sort of issue to afair but my memory is
fuzzy here.

> 
> Note that do_loopback() can race with mount(2) on top of the old_path
> - it might've happened between the pathname resolution for source and
> lock_mount() for mountpoint.  We *can't* hold namespace_sem over the
> pathname resolution - it'd be very easy to deadlock.

Yes.

> 
> One possibility would be to have all of them traverse the chain of
> overmounts on top of the thing we are going to copy/move, basically
> pretending that we'd lost the race to whatever had done the overmount.
> The problem with that is there might have been no race at all - it
> might've come from "." or a descriptor + empty path.  In this case
> we currently copy/move the entire thing and it just might be used
> deliberately.
> 
> Another possibility is to teach attach_recursive_mnt() to cope with the
> possibility of overmounts in source - it's not that hard, all we need is
> to find the top of overmount chain there in the very beginning and in all
> "slip under existing mount" cases have the existing mount reparented to
> the top of that chain.

I'm seeing you have already sent a second mail so I take it you made a
decision already but the second option seems sanest to me.