Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > reproducer: > ------------------------------------------------------------ > # create a playground > mkdir /tmp/foo > mount -t tmpfs none /tmp/foo > mount --make-private /tmp/foo > cd /tmp/foo > > # set one-way propagation from A to B > mkdir A > mkdir B > mount -t tmpfs none A > mount --make-shared A > mount --bind A B > mount --make-slave B > > # A/1 -> B/1, A/1/2 -> B/1/2 > mkdir A/1 > mount -t tmpfs none A/1 > mkdir A/1/2 > mount -t tmpfs none A/1/2 > > # overmount the entire B/1/2 > mount -t tmpfs none B/1/2 > > # make sure it's busy - set a mount at B/1/2/x > mkdir B/1/2/x > mount -t tmpfs none B/1/2/x > > stat B/1/x # shouldn't exist > > umount -l A/1 > > stat B/1/x # ... and now it does > ------------------------------------------------------------ > > What happens is that mounts on B/1 and B/1/2 had been considered > as victims - and taken out, since the overmount on top of B/1/2 > overmounted the root of the first mount on B/1/2 and it got > reparented - all the way to B/1. Yes, that behavior is incorrect since it causes a userspace visible change on where the mount is visible. > Correct behaviour would be to have B/1 left in place and upper > B/1/2 to be reparented once. As I read __propagate_umount that is what is trying to be implemented. I am a bit mystified why the semantics aren't simply to lazily umount (aka MNT_DETACH) that overmount. But that is not what the code is trying to do. It probably isn't worth considering a change in semantics at this point. It looks like the challenge is in the __propgate_umount loop. If I am reading thing correctly: - __propagate_umount recognizes that B/1/2 can be unmounted from the overmount. - The code then considers the parent of B/1/2 B/1. - When considering B/1 there is only one child B/1/2 that has been umounted and it has already been marked to be umounted so it is ignored (Sigh). So a minimal fix would go up the mount pile to see if there is anything that must remain. Or probably use a bit like use a bit like MNT_MARK to recognize there is an overmount remaining. So despite a child being unmounted it still should count as if it was mounted. > As an aside, that's a catch from several days of attempts to prove > correctness of propagate_umount(); I'm still not sure there's > nothing else wrong with it. _Maybe_ it's the only problem in > there, but reconstructing the proof of correctness has turned > out to be a real bitch ;-/ > > I seriously suspect that a lot of headache comes from trying > to combine collecting the full set of potential victims with > deciding what can and what can not be taken out - gathering > all of them first would simplify things. First pass would've > logics similar to your variant, but without __propagate_umount() > part[*] This is there be dragons talk. With out care it is easy to get the code to go non-linear in the number of mounts. That said I don't see any immediate problem with a first pass without my __propgate_umount. As I read the current code the __propagate_umount loop is just about propagating the information up from the leaves. > After the set is collected, we could go through it, doing the > something along the lines of > how = 0 > for each child in children(m) > if child in set > continue > how = 1 > if child is not mounted on root > how = 2 > break > if how == 2 > kick_out_ancestors(m) > remove m itself from set // needs to cooperate with outer loop > else if how == 1 > for (p = m; p in set && p is mounted on root; p = p->mnt_parent) > ; > if p in set > kick_out_ancestors(p) > else if children(m) is empty && m is not locked // to optimize things a bit > commit to unmounting m (again, needs to cooperate with the outer loop) > > "Cooperate with the outer loop" might mean something like > having this per-element work leave removal of its argument to > caller and report whether its argument needs to be removed. > > After that we'd be left with everything still in the set > having no out-of-set children that would be obstacles. > The only thing that remains after that is MNT_LOCKED and > that's as simple as > while set is not empty > m = first element of set > for (p = m; p is locked && p in set; p = p->mnt_parent) > ; > if p not in set { > if p is not committed to unmount > remove everything from m to p from set > continue > } else { > p = p->mnt_parent > } > commit everything from m to p to unmount, removing from set > > I'll try to put something of that sort together, along with > detailed explanation of what it's doing - in D/f/*, rather than > buring it in commit messages, and along with "read and update > D/f/... if you are ever touch this function" in fs/pnode.c itself; > this fun is not something I would like to repeat several years > down the road ;-/ I think I understand what you are saying. But I will have to see the actually code. > We *REALLY* need a good set of regression tests for that stuff. > If you have anything along those lines sitting somewhere, please > post a reference. The same goes for everybody else who might > have something in that general area. I will have to look. As I recall everything I have is completely manual but it could be a starting point at least. > [*] well, that and with fixed skip_propagation_subtree() logics; it's > easier to combine it with propagation_next() rather than trying to set > the things up so that the next call of propagation_next() would DTRT - > it's less work and yours actually has a corner case if the last element > of ->mnt_slave_list has non-empty ->mnt_slave_list itself. I hope that helps a little, Eric