[BUG] propagate_umount() breakage

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 12 May 2025 00:27:32 +0100

reproducer:
------------------------------------------------------------
# create a playground
mkdir /tmp/foo
mount -t tmpfs none /tmp/foo
mount --make-private /tmp/foo
cd /tmp/foo

# set one-way propagation from A to B
mkdir A
mkdir B
mount -t tmpfs none A
mount --make-shared A
mount --bind A B
mount --make-slave B

# A/1 -> B/1, A/1/2 -> B/1/2
mkdir A/1
mount -t tmpfs none A/1
mkdir A/1/2
mount -t tmpfs none A/1/2

# overmount the entire B/1/2
mount -t tmpfs none B/1/2

# make sure it's busy - set a mount at B/1/2/x
mkdir B/1/2/x
mount -t tmpfs none B/1/2/x

stat B/1/x # shouldn't exist

umount -l A/1

stat B/1/x # ... and now it does
------------------------------------------------------------

What happens is that mounts on B/1 and B/1/2 had been considered
as victims - and taken out, since the overmount on top of B/1/2
overmounted the root of the first mount on B/1/2 and it got
reparented - all the way to B/1.

Correct behaviour would be to have B/1 left in place and upper
B/1/2 to be reparented once.

As an aside, that's a catch from several days of attempts to prove
correctness of propagate_umount(); I'm still not sure there's
nothing else wrong with it.  _Maybe_ it's the only problem in
there, but reconstructing the proof of correctness has turned
out to be a real bitch ;-/

I seriously suspect that a lot of headache comes from trying
to combine collecting the full set of potential victims with
deciding what can and what can not be taken out - gathering
all of them first would simplify things.  First pass would've
logics similar to your variant, but without __propagate_umount()
part[*]

After the set is collected, we could go through it, doing the
something along the lines of
	how = 0
	for each child in children(m)
		if child in set
			continue
		how = 1
		if child is not mounted on root
			how = 2
			break
	if how == 2
		kick_out_ancestors(m)
		remove m itself from set // needs to cooperate with outer loop
	else if how == 1
		for (p = m; p in set && p is mounted on root; p = p->mnt_parent)
			;
		if p in set
			kick_out_ancestors(p)
	else if children(m) is empty && m is not locked	// to optimize things a bit
		commit to unmounting m (again, needs to cooperate with the outer loop)

"Cooperate with the outer loop" might mean something like
having this per-element work leave removal of its argument to
caller and report whether its argument needs to be removed.

After that we'd be left with everything still in the set
having no out-of-set children that would be obstacles.
The only thing that remains after that is MNT_LOCKED and
that's as simple as
	while set is not empty
		m = first element of set
		for (p = m; p is locked && p in set; p = p->mnt_parent)
			;
		if p not in set {
			if p is not committed to unmount
				remove everything from m to p from set
				continue
		} else {
			p = p->mnt_parent
		}
		commit everything from m to p to unmount, removing from set

I'll try to put something of that sort together, along with
detailed explanation of what it's doing - in D/f/*, rather than
buring it in commit messages, and along with "read and update
D/f/... if you are ever touch this function" in fs/pnode.c itself;
this fun is not something I would like to repeat several years
down the road ;-/

We *REALLY* need a good set of regression tests for that stuff.
If you have anything along those lines sitting somewhere, please
post a reference.  The same goes for everybody else who might
have something in that general area.

[*] well, that and with fixed skip_propagation_subtree() logics; it's
easier to combine it with propagation_next() rather than trying to set
the things up so that the next call of propagation_next() would DTRT -
it's less work and yours actually has a corner case if the last element
of ->mnt_slave_list has non-empty ->mnt_slave_list itself.