Re: [BUG] propagate_umount() breakage

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Sun, 11 May 2025 23:50:40 -0500

Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> reproducer:
> ------------------------------------------------------------
> # create a playground
> mkdir /tmp/foo
> mount -t tmpfs none /tmp/foo
> mount --make-private /tmp/foo
> cd /tmp/foo
>
> # set one-way propagation from A to B
> mkdir A
> mkdir B
> mount -t tmpfs none A
> mount --make-shared A
> mount --bind A B
> mount --make-slave B
>
> # A/1 -> B/1, A/1/2 -> B/1/2
> mkdir A/1
> mount -t tmpfs none A/1
> mkdir A/1/2
> mount -t tmpfs none A/1/2
>
> # overmount the entire B/1/2
> mount -t tmpfs none B/1/2
>
> # make sure it's busy - set a mount at B/1/2/x
> mkdir B/1/2/x
> mount -t tmpfs none B/1/2/x
>
> stat B/1/x # shouldn't exist
>
> umount -l A/1
>
> stat B/1/x # ... and now it does
> ------------------------------------------------------------
>
> What happens is that mounts on B/1 and B/1/2 had been considered
> as victims - and taken out, since the overmount on top of B/1/2
> overmounted the root of the first mount on B/1/2 and it got
> reparented - all the way to B/1.

Yes, that behavior is incorrect since it causes a userspace visible
change on where the mount is visible.

> Correct behaviour would be to have B/1 left in place and upper
> B/1/2 to be reparented once.

As I read __propagate_umount that is what is trying to be implemented.

I am a bit mystified why the semantics aren't simply to lazily umount
(aka MNT_DETACH) that overmount.  But that is not what the code is
trying to do.  It probably isn't worth considering a change in semantics
at this point.

It looks like the challenge is in the __propgate_umount loop.
If I am reading thing correctly:
- __propagate_umount recognizes that B/1/2 can be unmounted from the
  overmount.
- The code then considers the parent of B/1/2 B/1.
- When considering B/1 there is only one child B/1/2 that has been
  umounted and it has already been marked to be umounted so it
  is ignored (Sigh).

So a minimal fix would go up the mount pile to see if there is anything
that must remain.  Or probably use a bit like use a bit like MNT_MARK to
recognize there is an overmount remaining.  So despite a child being
unmounted it still should count as if it was mounted.

> As an aside, that's a catch from several days of attempts to prove
> correctness of propagate_umount(); I'm still not sure there's
> nothing else wrong with it.  _Maybe_ it's the only problem in
> there, but reconstructing the proof of correctness has turned
> out to be a real bitch ;-/
>
> I seriously suspect that a lot of headache comes from trying
> to combine collecting the full set of potential victims with
> deciding what can and what can not be taken out - gathering
> all of them first would simplify things.  First pass would've
> logics similar to your variant, but without __propagate_umount()
> part[*]

This is there be dragons talk.

With out care it is easy to get the code to go non-linear in
the number of mounts.

That said I don't see any immediate problem with a first pass
without my __propgate_umount.

As I read the current code the __propagate_umount loop is just
about propagating the information up from the leaves.

> After the set is collected, we could go through it, doing the
> something along the lines of
> 	how = 0
> 	for each child in children(m)
> 		if child in set
> 			continue
> 		how = 1
> 		if child is not mounted on root
> 			how = 2
> 			break
> 	if how == 2
> 		kick_out_ancestors(m)
> 		remove m itself from set // needs to cooperate with outer loop
> 	else if how == 1
> 		for (p = m; p in set && p is mounted on root; p = p->mnt_parent)
> 			;
> 		if p in set
> 			kick_out_ancestors(p)
> 	else if children(m) is empty && m is not locked	// to optimize things a bit
> 		commit to unmounting m (again, needs to cooperate with the outer loop)
>
> "Cooperate with the outer loop" might mean something like
> having this per-element work leave removal of its argument to
> caller and report whether its argument needs to be removed.
>
> After that we'd be left with everything still in the set
> having no out-of-set children that would be obstacles.
> The only thing that remains after that is MNT_LOCKED and
> that's as simple as
> 	while set is not empty
> 		m = first element of set
> 		for (p = m; p is locked && p in set; p = p->mnt_parent)
> 			;
> 		if p not in set {
> 			if p is not committed to unmount
> 				remove everything from m to p from set
> 				continue
> 		} else {
> 			p = p->mnt_parent
> 		}
> 		commit everything from m to p to unmount, removing from set
>
> I'll try to put something of that sort together, along with
> detailed explanation of what it's doing - in D/f/*, rather than
> buring it in commit messages, and along with "read and update
> D/f/... if you are ever touch this function" in fs/pnode.c itself;
> this fun is not something I would like to repeat several years
> down the road ;-/

I think I understand what you are saying.  But I will have to see the
actually code.

> We *REALLY* need a good set of regression tests for that stuff.
> If you have anything along those lines sitting somewhere, please
> post a reference.  The same goes for everybody else who might
> have something in that general area.

I will have to look.  As I recall everything I have is completely manual
but it could be a starting point at least.

> [*] well, that and with fixed skip_propagation_subtree() logics; it's
> easier to combine it with propagation_next() rather than trying to set
> the things up so that the next call of propagation_next() would DTRT -
> it's less work and yours actually has a corner case if the last element
> of ->mnt_slave_list has non-empty ->mnt_slave_list itself.

I hope that helps a little,

Eric