More audit fallout. Rules for using the mount hash (will go into the docs I'm putting together):

* all insertions are under EXCL(namespace_sem) and write_seqlock(mount_lock)
* all removals are under write_seqlock(mount_lock)
* all removals are either under EXCL(namespace_sem) or have parent's refcount equal to zero.
* since all changes to hash chains are under write_seqlock(mount_lock), hash search is safe if you have at least read_seqlock_excl(mount_lock). In that case the result is guaranteed to be accurate.
* removals are done with hlist_del_init_rcu(); freeing of the removed object is always RCU-delayed, so holding rcu_read_lock() over the traversal is memory-safe. HOWEVER, it is entirely possible to step into an object being removed from the hash chain at the time of search and get a false negative. If you are not holding at least read_seqlock_excl(mount_lock), you *must* sample the seqcount component of mount_lock before the search and check it afterwards.
* acquiring a reference to an object found by the traversal needs to be done carefully - it is safe under mount_lock, but when used under rcu_read_lock() alone we do need __legitimize_mnt(); otherwise we are risking a race with the final mntput() and/or the busy-mount checks in sync umount.

Search is done by __lookup_mnt().

Callers under write_seqlock(mount_lock):
* all callers in fs/pnode.c and one in attach_recursive_mnt().

Callers under rcu_read_lock():
* lookup_mnt() - the seqcount component of mount_lock is used to deal with races; the check is done by legitimize_mnt().
* path_overmounted() - the callers are under EXCL(namespace_sem) and they are holding a reference to the parent, so removal of the object we are searching for is excluded. No reference to the child is acquired, so questions about its validity do not arise. Unfortunately, removals of other objects in the same hash chain are *not* excluded, so a false negative is possible there.
The bug in path_overmounted() appears to have come from the corresponding chunk of finish_automount(), which had the same race (and got turned into a call of path_overmounted()). One possibility is to wrap the use of __lookup_mnt() there into a sample-and-recheck loop; for the call of path_overmounted() in finish_automount() that gives the right behaviour. The other caller, though...

The thing is, it's not really specific to the "move beneath" case - the secondary copies always have to deal with the possibility of a "slip under existing mount" situation. And yes, it can lead to problems - for all attach_recursive_mnt() callers. Note that do_loopback() can race with mount(2) on top of old_path - that might have happened between the pathname resolution for the source and lock_mount() for the mountpoint. We *can't* hold namespace_sem over the pathname resolution - it would be very easy to deadlock.

One possibility would be to have all of them traverse the chain of overmounts on top of the thing we are going to copy/move, basically pretending that we'd lost the race to whatever had done the overmount. The problem with that is that there might have been no race at all - the source might have come from "." or a descriptor + empty path. In that case we currently copy/move the entire thing, and that just might be used deliberately.

Another possibility is to teach attach_recursive_mnt() to cope with overmounts in the source - it's not that hard; all we need is to find the top of the overmount chain there at the very beginning and, in all "slip under existing mount" cases, have the existing mount reparented to the top of that chain.

I'm not sure which one gives better semantics, TBH. I'd be more inclined towards the second option - on the theory that we can combine it with the first one in cases where that's applicable. Comments?