On Fri, May 16, 2025 at 06:21:39AM +0100, Al Viro wrote: > > On Thu, May 15, 2025 at 12:41:50PM +0100, Al Viro wrote: > > > On Tue, May 13, 2025 at 04:56:22AM +0100, Al Viro wrote: > > > > > > > And yes, it's going to be posted along with the proof of correctness - > > > > I've had it with the amount of subtle corner cases in that area ;-/ > > > > > > FWIW, it seems to survive testing; it's _not_ final - I'm still editing > > > the promised proof, but by this point it's stylistic work. I hadn't > > > touched that kind of formal writing for quite a while, so the text is clumsy. > > > The code changes are pretty much in the final shape. Current diff of > > > code (all in fs/pnode.*) relative to mainline: > > > > ... and the current notes are below; they obviously need more editing ;-/ > > With some cleaning the notes up, see > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git propagate_umount > > The only commit there (on top of -rc5) below; it seems to survive tests I'd > been able to find. Please, review and hit that thing with any tests you might > have sitting around. > > Note that in new variant skip_propagation_subtree() call is *not* followed > by propagation_next() - it does what combination would've done (and it > skips everything that gets propagation from that mount, peers included). > > Nobody except propagate_umount() had been using it, so there's no other > callers to be broken by semantics change... > > D/f/propagate_umount.txt is done as plain text; if anyone cares to turn that > into ReST - wonderful. If anyone has editing suggestions (it's still pretty > raw as a text and I'm pretty sure that it can be improved stylistically) - > even better. > > --- > [PATCH] Rewrite of propagate_umount() > > The variant currently in the tree has problems; trying to prove > correctness has caught at least one class of bugs (reparenting > that ends up moving the visible location of reparented mount, due > to not excluding some of the counterparts on propagation that > should've been included). > > I tried to prove that it's the only bug there; I'm still not sure > whether it is. If anyone can reconstruct and write down an analysis > of the mainline implementation, I'll gladly review it; as it is, > I ended up doing a different implementation. Candidate collection > phase is similar, but trimming the set down until it satisfies the > constraints turned out pretty different. > > I hoped to do transformation as a massage series, but that turns out > to be too convoluted. So it's a single patch replacing propagate_umount() > and friends in one go, with notes and analysis in D/f/propagate_umount.txt > (in addition to inline comments). > > As far I can tell, it is provably correct and provably linear by the number > of mounts we need to look at in order to decide what should be unmounted. > It even builds and seems to survive testing... > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > --- > diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt > new file mode 100644 > index 000000000000..d8367ba3ea82 > --- /dev/null > +++ b/Documentation/filesystems/propagate_umount.txt > @@ -0,0 +1,433 @@ > + Notes on propagate_umount() > + > +Umount propagation starts with a set of mounts we are already going to > +take out. Ideally, we would like to add all downstream cognates to > +that set - anything with the same mountpoint as one of the removed > +mounts and with parent that would receive events from the parent of that > +mount. However, there are some constraints the resulting set must > +satisfy. > + > +It is convenient to define several properties of sets of mounts: > + > +1) A set S of mounts is non-shifting if for any mount X belonging > +to S all subtrees mounted strictly inside of X (i.e. not overmounting > +the root of X) contain only elements of S. I think "shifting" is misleading. I would suggest either "isolated" or "contained" or ideally "closed" which would mean... > + > +2) A set S is non-revealing if all locked mounts that belong to S have > +parents that also belong to S. > + > +3) A set S is closed if it contains all children of its elements. ... that should probably become "complete" or "saturated". > + > +The set of mounts taken out by umount(2) must be non-shifting and > +non-revealing; the first constraint is what allows to reparent > +any remaining mounts and the second is what prevents the exposure > +of any concealed mountpoints. > + > +propagate_umount() takes the original set as an argument and tries to > +extend that set. The original set is a full subtree and its root is > +unlocked; what matters is that it's closed and non-revealing. > +Resulting set may not be closed; there might still be mounts outside > +of that set, but only on top of stacks of root-overmounting elements > +of set. They can be reparented to the place where the bottom of > +stack is attached to a mount that will survive. NOTE: doing that > +will violate a constraint on having no more than one mount with > +the same parent/mountpoint pair; however, the caller (umount_tree()) I would prefer if this would insert the term "shadow mounts" since that's what we've traditionally used for that. > +will immediately remedy that - it may keep unmounted element attached > +to parent, but only if the parent itself is unmounted. Since all > +conflicts created by reparenting have common parent *not* in the > +set and one side of the conflict (bottom of the stack of overmounts) > +is in the set, it will be resolved. However, we rely upon umount_tree() > +doing that pretty much immediately after the call of propagate_umount(). > + > +Algorithm is based on two statements: > + 1) for any set S, there is a maximal non-shifting subset of S > +and it can be calculated in O(#S) time. > + 2) for any non-shifting set S, there is a maximal non-revealing > +subset of S. That subset is also non-shifting and it can be calculated > +in O(#S) time. > + > + Finding candidates. > + > +We are given a closed set U and we want to find all mounts that have > +the same mountpoint as some mount m in U *and* whose parent receives > +propagation from the parent of the same mount m. Naive implementation > +would be > + S = {} > + for each m in U > + add m to S > + p = parent(m) > + for each q in Propagation(p) - {p} > + child = look_up(q, mountpoint(m)) > + if child > + add child to S > +but that can lead to excessive work - there might be propagation among the > +subtrees of U, in which case we'd end up examining the same candidates > +many times. Since propagation is transitive, the same will happen to > +everything downstream of that candidate and it's not hard to construct > +cases where the approach above leads to the time quadratic by the actual > +number of candidates. > + > +Note that if we run into a candidate we'd already seen, it must've been > +added on an earlier iteration of the outer loop - all additions made > +during one iteration of the outer loop have different parents. So > +if we find a child already added to the set, we know that everything > +in Propagation(parent(child)) with the same mountpoint has been already > +added. > + S = {} > + for each m in U > + if m in S > + continue > + add m to S > + p = parent(m) > + q = propagation_next(p, p) > + while q > + child = look_up(q, mountpoint(m)) > + if child > + if child in S > + q = skip_them(q, p) > + continue; > + add child to S > + q = propagation_next(q, p) > +where > +skip_them(q, p) > + keep walking Propagation(p) from q until we find something > + not in Propagation(q) > + > +would get rid of that problem, but we need a sane implementation of > +skip_them(). That's not hard to do - split propagation_next() into > +"down into mnt_slave_list" and "forward-and-up" parts, with the > +skip_them() being "repeat the forward-and-up part until we get NULL > +or something that isn't a peer of the one we are skipping". > + > +Note that there can be no absolute roots among the extra candidates - > +they all come from mount lookups. Absolute root among the original > +set is _currently_ impossible, but it might be worth protecting > +against. > + > + Maximal non-shifting subsets. > + > +Let's call a mount m in a set S forbidden in that set if there is a > +subtree mounted strictly inside m and containing mounts that do not > +belong to S. > + > +The set is non-shifting when none of its elements are forbidden in it. > + > +If mount m is forbidden in a set S, it is forbidden in any subset S' it > +belongs to. In other words, it can't belong to any of the non-shifting > +subsets of S. If we had a way to find a forbidden mount or show that > +there's none, we could use it to find the maximal non-shifting subset > +simply by finding and removing them until none remain. > + > +Suppose mount m is forbidden in S; then any mounts forbidden in S - {m} > +must have been forbidden in S itself. Indeed, since m has descendents > +that do not belong to S, any subtree that fits into S will fit into > +S - {m} as well. > + > +So in principle we could go through elements of S, checking if they > +are forbidden in S and removing the ones that are. Removals will > +not invalidate the checks done for earlier mounts - if they were not > +forbidden at the time we checked, they won't become forbidden later. > +It's too costly to be practical, but there is a similar approach that > +is linear by size of S. > + > +Let's say that mount x in a set S is forbidden by mount y, if > + * both x and y belong to S. > + * there is a chain of mounts starting at x and leaving S > + immediately after passing through y, with the first > + mountpoint strictly inside x. > +Note 1: x may be equal to y - that's the case when something not > +belonging to S is mounted strictly inside x. > +Note 2: if y does not belong to S, it can't forbid anything in S. > +Note 3: if y has no children outside of S, it can't forbid anything in S. > + > +It's easy to show that mount x is forbidden in S if and only if x is > +forbidden in S by some mount y. And it's easy to find all mounts in S > +forbidden by a given mount. > + > +Consider the following operation: > + Trim(S, m) = S - {x : x is forbidden by m in S} > + > +Note that if m does not belong to S or has no children outside of S we > +are guaranteed that Trim(S, m) is equal to S. > + > +The following is true: if x is forbidden by y in Trim(S, m), it was > +already forbidden by y in S. > + > +Proof: Suppose x is forbidden by y in Trim(S, m). Then there is a > +chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1} > +is the first element that doesn't belong to Trim(S, m) and the > +mountpoint of x_1 is strictly inside x. If mount r belongs to S, it must > +have been removed by Trim(S, m), i.e. it was forbidden in S by m. > +Then there was a mount chain from r to some child of m that stayed in > +S all the way until m, but that's impossible since x belongs to Trim(S, m) > +and prepending (x_0, ..., x_k) to that chain demonstrates that x is also > +prohibited in S by m, and thus can't belong to Trim(S, m). > +Therefore r can not belong to S and our chain demonstrates that > +x is prohibited by y in S. QED. > + > +Corollary: no mount is forbidden by m in Trim(S, m). Indeed, any > +such mount would have been forbidden by m in S and thus would have been > +in the part of S removed in Trim(S, m). > + > +Corollary: no mount is forbidden by m in Trim(Trim(S, m), n). Indeed, > +any such would have to have been forbidden by m in Trim(S, m), which > +is impossible. > + > +Corollary: after > + S = Trim(S, x_1) > + S = Trim(S, x_2) > + ... > + S = Trim(S, x_k) > +no mount remaining in S will be forbidden by either of x_1,...,x_k. > + > +If the S is ordered, the following will reduce it to the maximal non-shifting > +subset: > + if S is not empty > + m = first(S) > + while true > + S = Trim(S, m) > + if there's no elements of S greater than m > + break > + m = first of such elements > + > +It's easy to see that at the beginning of each iteration all elements still > +remaining in S and preceding the current value of m will have already been > +seen as values of m. By the time we leave the loop all elements remaining > +in S will have been seen as values of m. > + > +We also know that no mount remaining in S will be forbidden by any > +of the values of m we have observed in the loop. In other words, > +no mount remaining in S will be forbidden, i.e. final value of S will > +be non-shifting. It will be the maximal non-shifting subset, since we > +were removing only forbidden elements. > + > +Implementation notes: > + > + The following reduces S to the maximal non-shifting subset > +in O(#S), assuming that S is ordered, we can mark the elements and > +originally no marks are set: > + > + for (x = first(S); x; x = Trim_one(x)) > + ; > +where > +Trim_one(m) > + remove_this = false > + found = false > + for each n in children(m) > + if n not in S > + found = true > + if (mountpoint(n) != root(m)) > + remove_this = true > + break > + if found > + Trim_ancestors(m) > + res = next(m) > + if remove_this > + unmark m and remove it from S > + return res > + > +Trim_ancestors(m) > + for (p = parent(m); p in S; m = p, p = parent(p)) { > + if m is marked // all elements beneath are overmounts > + return > + mark m > + if (mountpoint(m) != root(p)) > + unmark p and remove it from S > + } > + > +Trim_one(m) will > + * replace S with Trim(S, m) > + * return NULL if no elements greater than m remain in S > + * return the smallest of such elements otherwise > + * maintain the following invariants: > + * only elements of S are marked > + * if p is marked and > + (x_0, x_1, ..., x_k = p) is an ancestry chain and > + all x_i belong to S > + then > + for any i > 0 x_i overmounts the root of x_{i-1} > +Proof: > + Consider the chains excluding elements from Trim(S, m). The last > +two elements in each are m and some child of m that does not belong to S. > +If there's no such children, Trim(S, m) is equal to S and next(m) is the > +correct return value. > + > + m itself is removed if and only if the chain has exactly two > +elements, i.e. when the last element does not overmount the root of m. > +In other words, if there is a child not in S that does not overmount > +the root of m. Once we are done looking at the children 'remove_this' > +is true if and only if m itself needs to be removed. > + > + All other elements to remove will be ancestors of m, such that > +the entire descent chain from them to m is contained in S. Let > +(x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be > +removed if and only if x_{i+1} does not overmount its root. > + > + Note that due to the invariant for marks, finding x_{i+1} marked > +means that none of its ancestors will qualify for removal. What's more, > +after we have found and removed everything we needed, all remaining > +elements of the chain can be marked - their nearest ancestors that do > +not overmount their parent's root will be outside of S. Since ancestry > +chains can't loop, we can set the marks as we go - we won't be looking > +at that element again until after all removals. > + > + If Trim_one(m) needs to remove m, it does that after all other > +removals. Once those removals (if any) are done m is still an element > +of our set, so the smallest remaining element greater than m is equal > +to next(m). Once it is found, we can finally remove m itself. > + > + Time complexity: > +* we get no more than O(#S) calls of Trim_one() > +* the loop over children in Trim_one() never looks at the same child > +twice through all the calls. > +* iterations of that loop for children in S are no more than O(#S) > +in the worst case > +* at most two children that are not elements of S are considered per > +call of Trim_one(). > +* the second loop in Trim_one() sets mark once per iteration and > +no element of S has is set more than once. > + > + In practice we have a large closed non-revealing subset in S - > +the mounts we are already committed to unmounting. It can be used to > +reduce the amount of work. What's more, we can have all elements of that > +subset removed from their parents' lists of children, which considerably > +simplifies life. > + > + Since the committed subset is closed, passing one of its elements > +to Trim_one() is a no-op - it doesn't have any children outside of S. > +No matter what we pass to Trim_one(), Trim_ancestors() will never look > +at any elements of the committed subset - it's not called by Trim_one() > +if argument belongs to that subset and it can't walk into that subset > +unless it has started in there. > + > + So it's useful to keep the sets of committed and undecided > +candidates separately; Trim_one() needs to be called only for elements > +of the latter. > + > + What's more, if we see that children(m) is empty and m is not > +locked, we can immediately move m into the committed subset (remove > +from the parent's list of children, etc.). That's one fewer mount we'll > +have to look into when we check the list of children of its parent *and* > +when we get to building the non-revealing subset. > + > + Maximal non-revealing subsets > + > +If S is not a non-revealing subset, there is a locked element x in S > +such that parent of x is not in S. > + > +Obviously, no non-revealing subset of S may contain x. Removing such > +elements one by one will obviously end with the maximal non-revealing > +subset (possibly empty one). Note that removal of an element will > +require removal of all its locked children, etc. > + > +If the set had been non-shifting, it will remain non-shifting after > +such removals. > +Proof: suppose S was non-shifting, x is a locked element of S, parent of x > +is not in S and S - {x} is not non-shifting. Then there is an element m > +in S - {x} and a subtree mounted strictly inside m, such that m contains > +an element not in in S - {x}. Since S is non-shifting, everything in > +that subtree must belong to S. But that means that this subtree must > +contain x somewhere *and* that parent of x either belongs that subtree > +or is equal to m. Either way it must belong to S. Contradiction. > + > +// same representation as for finding maximal non-shifting subsets: > +// S is a disjoint union of a non-revealing set U (the ones we are committed > +// to unmount) and a set of candidates. > +// Elements of U are removed from their parents' lists of children. > +// In the end candidates becomes empty and maximal non-revealing non-shifting > +// subset of S is now in U > + while (candidates is non-empty) > + handle_locked(first(candidates)) > + > +handle_locked(m) > + cutoff = m > + for (p = m; p in candidates; p = parent(p)) { > + unmark(p) and remove p from candidates; > + if (!locked(p)) > + cutoff = parent(p) > + } > + if (p in U) > + cutoff = p > + while (m != cutoff) { > + remove m from children(parent(m)); > + add m to U; > + m = parent(m); > + } > + > +Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S. > +* If it contains some elements of U, let x_k be the last one of those. > +Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing. > +* otherwise if all its elements are locked, then none of {x_0, ..., x_n} > +may be elements of a non-revealing subset of S. > +* otherwise let x_k be the first unlocked element of the chain. Then none > +of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of > +S and union of U and {x_k, ..., x_n} is non-revealing. > + > +handle_locked(m) finds which of these cases applies and adjusts C and > +U accordingly. U remains non-revealing, union of C and U still contains > +any non-revealing subset of S and after the call of handle_locked(m) m > +is guaranteed to be not in C. So having it called for each element of S > +would suffice to empty C, leaving U the maximal non-revealing subset of S. > + > +However, handle_locked(m) is a no-op when m belongs to U, so it's enough > +to have it called for elements of C while there are any. > + > +Time complexity: number of calls of handle_locked() is limited by #C, > +each iteration of the first loop in handle_locked() removes an element > +for C, so their total number of executions is also limited by #C; > +number of iterations in the second loop is no greater than the number > +of iterations of the first loop. > + > + > + Reparenting > + > +After we'd calculated the final set, we still need to deal with > +reparenting - if an element of the final set has a child not in it, > +we need to reparent such child. > + > +Such children can only be root-overmounting (otherwise the set wouldn't > +be non-shifting) and their parents can not belong to the original set, > +since the original is guaranteed to be closed. > + > + > + Putting all of that together > + > +The plan is to > + * find all candidates > + * trim down to maximal non-shifting subset > + * trim down to maximal non-revealing subset > + * reparent anything that needs to be reparented > + * return the resulting set to the caller > + > +For the 2nd and 3rd steps we want to separate the set into growing > +non-revealing subset, initially containing the original set ("U" in > +terms of the pseudocode above) and everything we are still not sure about > +("candidates"). It means that for the output of the 1st step we'd like > +the extra candidates separated from the stuff already in the original set. > +For the 4th step we would like the additions to U separate from the > +original set. > + > +So let's go for > + * original set ("set"). Linkage via mnt_list > + * undecided candidates ("candidates"). Linkage via mnt_umounting. > + * anything in U that hadn't been in the original set - elements of > +candidates will gradually be either discarded or moved there. In other > +words, it's the candidates we have already decided to unmount. Its role > +is reasonably close to the old "to_umount", so let's use that name. > +Linkage via mnt_list. > + > +For gather_candidates() we'll need to maintain both candidates (S - > +set) and intersection of S with set, with the latter emptied once we > +are done. Use mnt_umounting for both, that'll give a cheap way to check > +belonging to their union (in gather_candidates()) and to candidates > +itself (at later stages). Call that predicate is_candidate(); it would > +be !list_empty(&m->mnt_umounting) > + > +All elements of the original set are marked with MNT_UMOUNT and we'll > +need the same for elements added when joining the contents of to_umount > +to set in the end. Let's set MNT_UMOUNT at the time we add an element > +to to_umount; that's close to what the old 'umount_one' is doing, so > +let's keep that name. It also gives us another predicate we need - > +"belongs to union of set and to_umount"; will_be_unmounted() for now. > + > +Another helper - unmark and remove from candidates; remove_candidate(m) Stray sentence? > diff --git a/fs/pnode.c b/fs/pnode.c > index fb77427df39e..9b2f1ac80f25 100644 > --- a/fs/pnode.c > +++ b/fs/pnode.c > @@ -24,11 +24,6 @@ static inline struct mount *first_slave(struct mount *p) > return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave); > } > > -static inline struct mount *last_slave(struct mount *p) > -{ > - return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave); > -} > - > static inline struct mount *next_slave(struct mount *p) > { > return list_entry(p->mnt_slave.next, struct mount, mnt_slave); > @@ -136,6 +131,23 @@ void change_mnt_propagation(struct mount *mnt, int type) > } > } > > +static struct mount *__propagation_next(struct mount *m, > + struct mount *origin) > +{ > + while (1) { > + struct mount *master = m->mnt_master; > + > + if (master == origin->mnt_master) { > + struct mount *next = next_peer(m); > + return (next == origin) ? NULL : next; > + } else if (m->mnt_slave.next != &master->mnt_slave_list) > + return next_slave(m); Please add a comment to that helper that explains how it walks the propagation tree. I remember having to fix bugs in that code and the lack of comments was noticable. > + > + /* back at master */ > + m = master; > + } > +} > + > /* > * get the next mount in the propagation tree. > * @m: the mount seen last > @@ -153,31 +165,26 @@ static struct mount *propagation_next(struct mount *m, > if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list)) > return first_slave(m); > > - while (1) { > - struct mount *master = m->mnt_master; > - > - if (master == origin->mnt_master) { > - struct mount *next = next_peer(m); > - return (next == origin) ? NULL : next; > - } else if (m->mnt_slave.next != &master->mnt_slave_list) > - return next_slave(m); > + return __propagation_next(m, origin); > +} > > - /* back at master */ > - m = master; > - } > +static inline bool peers(const struct mount *m1, const struct mount *m2) > +{ > + return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id; > } > > static struct mount *skip_propagation_subtree(struct mount *m, > struct mount *origin) > { > /* > - * Advance m such that propagation_next will not return > - * the slaves of m. > + * Advance m past everything that gets propagation from it. > */ > - if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list)) > - m = last_slave(m); > + struct mount *p = __propagation_next(m, origin); > > - return m; > + while (p && peers(m, p)) > + p = __propagation_next(p, origin); > + > + return p; > } > > static struct mount *next_group(struct mount *m, struct mount *origin) > @@ -216,11 +223,6 @@ static struct mount *next_group(struct mount *m, struct mount *origin) > static struct mount *last_dest, *first_source, *last_source, *dest_master; > static struct hlist_head *list; > > -static inline bool peers(const struct mount *m1, const struct mount *m2) > -{ > - return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id; > -} > - > static int propagate_one(struct mount *m, struct mountpoint *dest_mp) > { > struct mount *child; > @@ -463,181 +465,219 @@ void propagate_mount_unlock(struct mount *mnt) > } > } > > -static void umount_one(struct mount *mnt, struct list_head *to_umount) > +static LIST_HEAD(to_umount); // committed to unmounting those > +static LIST_HEAD(candidates); // undecided unmount candidates > + > +static inline struct mount *first_candidate(void) > { > - CLEAR_MNT_MARK(mnt); > - mnt->mnt.mnt_flags |= MNT_UMOUNT; > - list_del_init(&mnt->mnt_child); > - list_del_init(&mnt->mnt_umounting); > - move_from_ns(mnt, to_umount); > + if (list_empty(&candidates)) > + return NULL; > + return list_first_entry(&candidates, struct mount, mnt_umounting); > } > > -/* > - * NOTE: unmounting 'mnt' naturally propagates to all other mounts its > - * parent propagates to. > - */ > -static bool __propagate_umount(struct mount *mnt, > - struct list_head *to_umount, > - struct list_head *to_restore) > +static inline bool is_candidate(struct mount *m) > { > - bool progress = false; > - struct mount *child; > + return !list_empty(&m->mnt_umounting); > +} > > - /* > - * The state of the parent won't change if this mount is > - * already unmounted or marked as without children. > - */ > - if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED)) > - goto out; > +static inline bool will_be_unmounted(struct mount *m) > +{ > + return m->mnt.mnt_flags & MNT_UMOUNT; > +} > > - /* Verify topper is the only grandchild that has not been > - * speculatively unmounted. > - */ > - list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) { > - if (child->mnt_mountpoint == mnt->mnt.mnt_root) > - continue; > - if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child)) > - continue; > - /* Found a mounted child */ > - goto children; > - } > +static void umount_one(struct mount *m) > +{ > + m->mnt.mnt_flags |= MNT_UMOUNT; > + list_del_init(&m->mnt_child); > + move_from_ns(m, &to_umount); > +} > > - /* Mark mounts that can be unmounted if not locked */ > - SET_MNT_MARK(mnt); > - progress = true; > +static void remove_candidate(struct mount *m) > +{ > + CLEAR_MNT_MARK(m); // unmark on removal from candidates > + list_del_init(&m->mnt_umounting); > +} > > - /* If a mount is without children and not locked umount it. */ > - if (!IS_MNT_LOCKED(mnt)) { > - umount_one(mnt, to_umount); > - } else { > -children: > - list_move_tail(&mnt->mnt_umounting, to_restore); > +static void gather_candidates(struct list_head *set) > +{ > + LIST_HEAD(visited); > + struct mount *m, *p, *q; > + > + list_for_each_entry(m, set, mnt_list) { > + if (is_candidate(m)) > + continue; > + list_add(&m->mnt_umounting, &visited); > + p = m->mnt_parent; > + q = propagation_next(p, p); > + while (q) { > + struct mount *child = __lookup_mnt(&q->mnt, > + m->mnt_mountpoint); > + if (child) { > + struct list_head *head; > + > + /* > + * We might've already run into this one. That > + * must've happened on earlier iteration of the > + * outer loop; in that case we can skip those > + * parents that get propagation from q - there > + * will be nothing new on those as well. > + */ > + if (is_candidate(child)) { > + q = skip_propagation_subtree(q, p); > + continue; > + } > + if (will_be_unmounted(child)) > + head = &visited; > + else > + head = &candidates; > + list_add(&child->mnt_umounting, head); > + } > + q = propagation_next(q, p); > + } > } > -out: > - return progress; > + while (!list_empty(&visited)) // empty visited > + list_del_init(visited.next); > } > > -static void umount_list(struct list_head *to_umount, > - struct list_head *to_restore) > +/* > + * We know that some child of @m can't be unmounted. In all places where the > + * chain of descent of @m has child not overmounting the root of parent, > + * the parent can't be unmounted either. > + */ > +static void trim_ancestors(struct mount *m) > { > - struct mount *mnt, *child, *tmp; > - list_for_each_entry(mnt, to_umount, mnt_list) { > - list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) { > - /* topper? */ > - if (child->mnt_mountpoint == mnt->mnt.mnt_root) > - list_move_tail(&child->mnt_umounting, to_restore); > - else > - umount_one(child, to_umount); > - } > + struct mount *p; > + > + for (p = m->mnt_parent; is_candidate(p); m = p, p = p->mnt_parent) { > + if (IS_MNT_MARKED(m)) // all candidates beneath are overmounts > + return; > + SET_MNT_MARK(m); > + if (m->mnt_mountpoint != p->mnt.mnt_root) > + remove_candidate(p); > } > } > > -static void restore_mounts(struct list_head *to_restore) > +/* > + * Find and exclude all umount candidates forbidden by @m > + * (see Documentation/filesystems/propagate_umount.txt) > + * If we can immediately tell that @m is OK to unmount (unlocked > + * and all children are already committed to unmounting) commit > + * to unmounting it. > + * Returns the next candidate to be trimmed. > + */ > +static struct mount *trim_one(struct mount *m) > { > - /* Restore mounts to a clean working state */ > - while (!list_empty(to_restore)) { > - struct mount *mnt, *parent; > - struct mountpoint *mp; > - > - mnt = list_first_entry(to_restore, struct mount, mnt_umounting); > - CLEAR_MNT_MARK(mnt); > - list_del_init(&mnt->mnt_umounting); > - > - /* Should this mount be reparented? */ > - mp = mnt->mnt_mp; > - parent = mnt->mnt_parent; > - while (parent->mnt.mnt_flags & MNT_UMOUNT) { > - mp = parent->mnt_mp; > - parent = parent->mnt_parent; > - } > - if (parent != mnt->mnt_parent) { > - mnt_change_mountpoint(parent, mp, mnt); > - mnt_notify_add(mnt); > + bool remove_this = false, found = false, umount_this = false; > + struct mount *n; > + struct list_head *next; > + > + list_for_each_entry(n, &m->mnt_mounts, mnt_child) { > + if (!is_candidate(n)) { > + found = true; > + if (n->mnt_mountpoint != m->mnt.mnt_root) { > + remove_this = true; > + break; > + } > } > } > + if (found) { > + trim_ancestors(m); > + } else if (!IS_MNT_LOCKED(m) && list_empty(&m->mnt_mounts)) { > + remove_this = true; > + umount_this = true; > + } > + next = m->mnt_umounting.next; > + if (remove_this) { > + remove_candidate(m); > + if (umount_this) > + umount_one(m); > + } > + if (next == &candidates) > + return NULL; > + > + return list_entry(next, struct mount, mnt_umounting); > } > > -static void cleanup_umount_visitations(struct list_head *visited) > +static void handle_locked(struct mount *m) > { > - while (!list_empty(visited)) { > - struct mount *mnt = > - list_first_entry(visited, struct mount, mnt_umounting); > - list_del_init(&mnt->mnt_umounting); > + struct mount *cutoff = m, *p; > + > + for (p = m; is_candidate(p); p = p->mnt_parent) { > + remove_candidate(p); > + if (!IS_MNT_LOCKED(p)) > + cutoff = p->mnt_parent; > + } > + if (will_be_unmounted(p)) > + cutoff = p; > + while (m != cutoff) { > + umount_one(m); > + m = m->mnt_parent; > } > } > > /* > - * collect all mounts that receive propagation from the mount in @list, > - * and return these additional mounts in the same list. > - * @list: the list of mounts to be unmounted. > + * @m is not to going away, and it overmounts the top of a stack of mounts > + * that are going away. We know that all of those are fully overmounted > + * by the one above (@m being the topmost of the chain), so @m can be slid > + * in place where the bottom of the stack is attached. > * > - * vfsmount lock must be held for write > + * NOTE: here we temporarily violate a constraint - two mounts end up with > + * the same parent and mountpoint; that will be remedied as soon as we > + * return from propagate_umount() - its caller (umount_tree()) will detach > + * the stack from the parent it (and now @m) is attached to. umount_tree() > + * might choose to keep unmounted pieces stuck to each other, but it always > + * detaches them from the mounts that remain in the tree. > */ > -int propagate_umount(struct list_head *list) > +static void reparent(struct mount *m) > { > - struct mount *mnt; > - LIST_HEAD(to_restore); > - LIST_HEAD(to_umount); > - LIST_HEAD(visited); > + struct mount *p = m; > + struct mountpoint *mp; > > - /* Find candidates for unmounting */ > - list_for_each_entry_reverse(mnt, list, mnt_list) { > - struct mount *parent = mnt->mnt_parent; > - struct mount *m; > + do { > + mp = p->mnt_mp; > + p = p->mnt_parent; > + } while (will_be_unmounted(p)); That's nice and clear! > > - /* > - * If this mount has already been visited it is known that it's > - * entire peer group and all of their slaves in the propagation > - * tree for the mountpoint has already been visited and there is > - * no need to visit them again. > - */ > - if (!list_empty(&mnt->mnt_umounting)) > - continue; > + mnt_change_mountpoint(p, mp, m); > + mnt_notify_add(m); > +} > > - list_add_tail(&mnt->mnt_umounting, &visited); > - for (m = propagation_next(parent, parent); m; > - m = propagation_next(m, parent)) { > - struct mount *child = __lookup_mnt(&m->mnt, > - mnt->mnt_mountpoint); > - if (!child) > - continue; > +/** > + * propagate_umount - apply propagation rules to the set of mounts for umount() > + * @set: the list of mounts to be unmounted. > + * > + * Collect all mounts that receive propagation from the mount in @set and have > + * no obstacles to being unmounted. Add these additional mounts to the set. > + * > + * See Documentation/filesystems/propagate_umount.txt if you do anything in > + * this area. > + * > + * Locks held: > + * mount_lock (write_seqlock), namespace_sem (exclusive). > + */ > +void propagate_umount(struct list_head *set) > +{ > + struct mount *m; > > - if (!list_empty(&child->mnt_umounting)) { > - /* > - * If the child has already been visited it is > - * know that it's entire peer group and all of > - * their slaves in the propgation tree for the > - * mountpoint has already been visited and there > - * is no need to visit this subtree again. > - */ > - m = skip_propagation_subtree(m, parent); > - continue; > - } else if (child->mnt.mnt_flags & MNT_UMOUNT) { > - /* > - * We have come across a partially unmounted > - * mount in a list that has not been visited > - * yet. Remember it has been visited and > - * continue about our merry way. > - */ > - list_add_tail(&child->mnt_umounting, &visited); > - continue; > - } > + // collect all candidates > + gather_candidates(set); > > - /* Check the child and parents while progress is made */ > - while (__propagate_umount(child, > - &to_umount, &to_restore)) { > - /* Is the parent a umount candidate? */ > - child = child->mnt_parent; > - if (list_empty(&child->mnt_umounting)) > - break; > - } > - } > - } > + // reduce the set until it's non-shifting Re earlier comments: "Reduce the set until it's closed" seems a lot clearer to me. > + for (m = first_candidate(); m; m = trim_one(m)) > + ; > > - umount_list(&to_umount, &to_restore); > - restore_mounts(&to_restore); > - cleanup_umount_visitations(&visited); > - list_splice_tail(&to_umount, list); > + // ... and non-revealing > + while (!list_empty(&candidates)) > + handle_locked(first_candidate()); > > - return 0; > + // now to_umount consists of all acceptable candidates > + // deal with reparenting of remaining overmounts on those > + list_for_each_entry(m, &to_umount, mnt_list) { > + while (!list_empty(&m->mnt_mounts)) // should be at most one > + reparent(list_first_entry(&m->mnt_mounts, > + struct mount, mnt_child)); > + } > + > + // and fold them into the set > + list_splice_tail_init(&to_umount, set); > } > diff --git a/fs/pnode.h b/fs/pnode.h > index 34b6247af01d..d84a397bfd43 100644 > --- a/fs/pnode.h > +++ b/fs/pnode.h > @@ -39,7 +39,7 @@ static inline void set_mnt_shared(struct mount *mnt) > void change_mnt_propagation(struct mount *, int); > int propagate_mnt(struct mount *, struct mountpoint *, struct mount *, > struct hlist_head *); > -int propagate_umount(struct list_head *); > +void propagate_umount(struct list_head *); > int propagate_mount_busy(struct mount *, int); > void propagate_mount_unlock(struct mount *); > void mnt_release_group_id(struct mount *); Ok, this looks a lot cleaner to me than what we had before and the argument seems sane as well. With the requested changes: Reviewed-by: Christian Brauner <brauner@xxxxxxxxxx>