On Sat, May 31, 2025 at 09:57:00PM +0100, Al Viro wrote: > More audit fallout. > > Rules for using mount hash (will go into the docs I'm putting together): > * all insertions are under EXCL(namespace_sem) and write_seqlock(mount_lock) > * all removals are under write_seqlock(mount_lock) > * all removals are either under EXCL(namespace_sem) or have parent's > refcount equal to zero. > * since all changes to hash chains are under write_seqlock(mount_lock), > hash seatch is safe if you have at least read_seqlock_excl(mount_lock). In > that case the result is guaranteed to be accurate. > * removals are done with hlist_del_init_rcu(); freeing of the removed > object is always RCU-delayed, so holding rcu_read_lock() over traversal is > memory safe. HOWEVER, it is entirely possible to step into an object being > removed from hash chain at the time of search and get a false negative. > If you are not holding at least read_seqlock_excl(mount_lock), you *must* > sample mount_lock seqcount component before the search and check it afterwards. > * acquiring a reference to object found as the result of traversal needs > to be done carefully - it is safe under mount_lock, but when used under rcu_read_lock() > alone we do need __legitimize_mnt(); otherwise we are risking a race with > final mntput() and/or busy mount checks in sync umount. > > Search is done by __lookup_mnt(). > Callers under write_seqlock(mount_lock): > * all callers in fs/pnode.c and one in attach_recursive_mnt(). > Callers under rcu_read_lock(): > * lookup_mnt() - seqcount component of mount_lock used to deal > with races. Check is done by legitimize_mnt(). > * path_overmounted() - the callers are under EXCL(namespace_sem) > and they are holding a reference to parent, so removal of the object we > are searching for is excluded. Reference to child is not acquired, so > the questions about validity of that do not arise. Unfortunately, removals > of objects in the same hash chain are *not* excluded, so a false negative > is possible there. > > Bug in path_overmounted() appears to have come from the corresponding > chunk of finish_automount(), which had the same race (and got turned > into a call of path_overmounted()). > > One possibility is to wrap the use of __lookup_mnt() into a sample-and-recheck > loop there; for the call of path_overmounted() in finish_automount() it'll > give the right behaviour. > > The other one, though... The thing is, it's not really specific to > "move beneath" case - the secondary copies always have to deal with the > possibility of "slip under existing mount" situation. And yes, it can > lead to problems - for all attach_recursive_mnt() callers. Fwiw, I have pointed this out in one of my really early submission of this work and had asked whether we generally want the same check. That's also why I added the shadow-mount check into the automount path because that could be used for that sort of issue to afair but my memory is fuzzy here. > > Note that do_loopback() can race with mount(2) on top of the old_path > - it might've happened between the pathname resolution for source and > lock_mount() for mountpoint. We *can't* hold namespace_sem over the > pathname resolution - it'd be very easy to deadlock. Yes. > > One possibility would be to have all of them traverse the chain of > overmounts on top of the thing we are going to copy/move, basically > pretending that we'd lost the race to whatever had done the overmount. > The problem with that is there might have been no race at all - it > might've come from "." or a descriptor + empty path. In this case > we currently copy/move the entire thing and it just might be used > deliberately. > > Another possibility is to teach attach_recursive_mnt() to cope with the > possibility of overmounts in source - it's not that hard, all we need is > to find the top of overmount chain there in the very beginning and in all > "slip under existing mount" cases have the existing mount reparented to > the top of that chain. I'm seeing you have already sent a second mail so I take it you made a decision already but the second option seems sanest to me.