Several regression fixes of varying severity; live in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git, individual patches in followups. Help with testing and reviewing would be very welcome.

1) attach_recursive_mnt() slowdown - severe in some pathological cases. This cycle's regression, fortunately trivial to fix. The underlying mechanism of the slowdown (bypassed by that fix) is the bad semantics of skip_mnt_tree(), but that's a separate story. The case with severe slowdown is mounting on top of a long chain of overmounts, all peers of each other. It ends up being time-quadratic in the depth of the chain and easily triggers softlockup warnings. Since the slow part is under mount_lock, all pathname resolution pretty much stops for the duration. Kudos to "Lai, Yi" <yi1.lai@xxxxxxxxxxxxxxx> for catching that...

2) reparenting in propagate_umount() should *NOT* be done to mounts that are themselves going to be taken out. I'd even put "<argument> is not going away, and it overmounts the top of a stack of mounts that are going away" right on top of reparent()... and then lost the logic in its caller that used to check that, at some point during reshuffling of the queue ;-/ This cycle's regression, and this one is not just a slowdown - in some cases it can end up corrupting data. Trivial to fix...

3) permission checks for --make-{shared,slave,private,unbindable} (i.e. MS_{SHARED,...} in mount(2) flags) ended up being too restrictive. The checks used to be too weak until the last cycle. Then they were made stricter - basically to "mount is a part of our namespace and we have admin permissions there". Turns out that CRIU userland depended upon the weak^Wabsent checks we used to have there before. A sane replacement that would suffice for them is "mount is a part of *SOME* namespace that we have admin permissions for". Last cycle's userland regression...

4) change_mnt_propagation() slowdown in some cases. On umount we want all victims out of the propagation graph and propagation between the surviving mounts to be unchanged. So if a victim used to have slaves, they need to be transferred to its peer (if any) or master. In the case when the victim had many peers, all taken out by that umount(), that ended up with all its slaves being transferred from one peer to the next until we finally ran out of peers. That can easily lead to quadratic time. The patch in -rc1 switched that to "just find where they'll end up upfront, and move them once", which eliminated that... except that I hadn't noticed that in the massage of change_mnt_propagation() we ended up calculating the place where they'd be transferred even in cases when there was nothing to transfer. With obvious effects when a large peer group had been taken out entirely, with not a single slave among them. The minimal fix ("call propagation_source() only if we are going to use its return value") is enough to recover in all cases. Longer term we should kick all victims out of the propagation graph at once, and I have that plotted out, but that's for the next merge window; for now the minimal obvious fix is good enough.
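
A few illustrative sketches, in case they help reviewers. For 1), here's a toy userspace model of the complexity class only - the structure and the walk below are made up for illustration and look nothing like the real attach_recursive_mnt()/skip_mnt_tree() code:

#include <stdio.h>
#include <stdlib.h>

struct mnt {
	struct mnt *overmount;	/* what sits on top of our root, if anything */
};

int main(void)
{
	enum { DEPTH = 30000 };
	struct mnt *chain = calloc(DEPTH, sizeof(*chain));
	long steps = 0;
	int i;

	if (!chain)
		return 1;
	/* a linear chain of overmounts, depth DEPTH */
	for (i = 0; i < DEPTH - 1; i++)
		chain[i].overmount = &chain[i + 1];
	/*
	 * all members of the chain are peers, so propagating one event
	 * visits every one of them; doing an O(depth) walk of the stack
	 * from each visited mount gives O(depth^2) total
	 */
	for (i = 0; i < DEPTH; i++) {
		struct mnt *m = &chain[i];
		while (m->overmount) {
			m = m->overmount;
			steps++;
		}
	}
	printf("%d mounts, %ld steps\n", DEPTH, steps);
	free(chain);
	return 0;
}

Bump DEPTH and the step count grows quadratically; in the kernel the equivalent walk runs under mount_lock, hence the stalls in pathname resolution.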
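
For 2), a simplified model of the invariant that got lost - the struct and helpers below are hypothetical stand-ins, not the real struct mount or the fs/pnode.c code:

#include <stdbool.h>
#include <stdio.h>

struct mnt {
	const char *name;
	struct mnt *parent;	/* what it is mounted on */
	bool doomed;		/* marked for removal by this umount */
};

/* find the nearest ancestor that survives the umount */
static struct mnt *surviving_ancestor(struct mnt *m)
{
	struct mnt *p = m->parent;

	while (p && p->doomed)
		p = p->parent;
	return p;
}

static void reparent(struct mnt *m)
{
	/* the invariant 2) is about: never reparent something that is
	 * itself going away */
	if (m->doomed)
		return;
	m->parent = surviving_ancestor(m);
}

int main(void)
{
	struct mnt root = { "root", NULL, false };
	struct mnt a = { "a", &root, true };	/* being unmounted */
	struct mnt b = { "b", &a, true };	/* overmounts a, also dying */
	struct mnt c = { "c", &b, false };	/* overmounts b, survives */

	reparent(&b);	/* must be a no-op: b is doomed */
	reparent(&c);	/* c ends up on root */
	printf("c now on %s\n", c.parent->name);
	return 0;
}

Reparenting b here would leave a dying mount reachable through a surviving parent pointer - that's the path by which this goes from "slowdown" to "might corrupt data".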
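
For 3), the semantic change reduces to the following; all types and helpers are made-up stand-ins (admin_over() stands in for ns_capable(..., CAP_SYS_ADMIN)), this only encodes the shape of the check:

#include <stdbool.h>
#include <stdio.h>

struct user_ns { const char *name; };
struct mnt_ns  { struct user_ns *user_ns; };
struct mount   { struct mnt_ns *mnt_ns; };	/* NULL if not in any namespace */

/* stand-in for ns_capable(ns, CAP_SYS_ADMIN) */
static bool admin_over(const struct user_ns *ns)
{
	(void)ns;
	return true;	/* pretend the caller is privileged there */
}

/* last cycle - too strict: the mount must live in *our* namespace */
static bool may_change_propagation_strict(const struct mount *m,
					  const struct mnt_ns *ours)
{
	return m->mnt_ns == ours && admin_over(ours->user_ns);
}

/* the fix: *some* namespace is fine, as long as we are admin over it */
static bool may_change_propagation_fixed(const struct mount *m)
{
	return m->mnt_ns && admin_over(m->mnt_ns->user_ns);
}

int main(void)
{
	struct user_ns uns = { "userns" };
	struct mnt_ns theirs = { &uns };	/* e.g. a CRIU restore target */
	struct mnt_ns ours = { &uns };
	struct mount m = { &theirs };

	printf("strict: %d, fixed: %d\n",
	       may_change_propagation_strict(&m, &ours),
	       may_change_propagation_fixed(&m));
	return 0;
}

The strict variant rejects the CRIU case (mount sits in a namespace other than the caller's); the fixed one admits it while still requiring admin permissions over the namespace the mount actually lives in.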
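
And for 4), a toy model of the minimal fix - again, simplified made-up structures; the real propagation_source() hunt is more involved, but the point is the early return:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct mount {
	struct mount *master;		/* whom we receive propagation from */
	struct mount *next_peer;	/* circular peer list, simplified */
	int n_slaves;			/* simplified: just a count */
	bool doomed;
};

/*
 * stand-in for propagation_source(): hunt for a surviving peer, fall
 * back to the master.  On a large peer group that is doomed in its
 * entirety this walk is what went quadratic when called pointlessly.
 */
static struct mount *propagation_source(struct mount *m)
{
	struct mount *p;

	for (p = m->next_peer; p && p != m; p = p->next_peer)
		if (!p->doomed)
			return p;
	return m->master;
}

static void transfer_slaves(struct mount *victim)
{
	struct mount *target;

	if (!victim->n_slaves)
		return;		/* the minimal fix: don't go hunting for a
				 * target when there is nothing to move;
				 * the -rc1 code did the hunt unconditionally */
	target = propagation_source(victim);
	if (target)
		target->n_slaves += victim->n_slaves;
	victim->n_slaves = 0;	/* in the real code they'd stop being slaves */
}

int main(void)
{
	/* a two-member doomed peer group with no slaves at all */
	struct mount a = { .master = NULL, .doomed = true };
	struct mount b = { .master = NULL, .doomed = true };

	a.next_peer = &b;
	b.next_peer = &a;
	transfer_slaves(&a);	/* returns immediately: nothing to move */
	transfer_slaves(&b);
	puts("no pointless propagation_source() calls");
	return 0;
}

Without the early return, every member of an all-doomed peer group pays for a full walk of the group despite having no slaves to hand over - O(size) per member, O(size^2) for the lot.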