On Thu, May 15, 2025 at 12:41:50PM +0100, Al Viro wrote: > On Tue, May 13, 2025 at 04:56:22AM +0100, Al Viro wrote: > > > And yes, it's going to be posted along with the proof of correctness - > > I've had it with the amount of subtle corner cases in that area ;-/ > > FWIW, it seems to survive testing; it's _not_ final - I'm still editing > the promised proof, but by this point it's stylistic work. I hadn't > touched that kind of formal writing for quite a while, so the text is clumsy. > The code changes are pretty much in the final shape. Current diff of > code (all in fs/pnode.*) relative to mainline: ... and the current notes are below; they obviously need more editing ;-/ ----- Umount propagation starts with a set of mounts we are already going to take out. Ideally, we would like to add all downstream cognates to that set - anything with the same mountpoint as one of the removed mounts and with parent that would receive events from the parent of that mount. However, there are some constraints the resulting set must satisfy. It is convenient to define several properties of sets of mounts: 1) A set S of mounts is non-shifting if for any mount X belonging to S all subtrees mounted strictly inside of X (i.e. not overmounting the root of X) contain only elements of S. 2) A set S is non-revealing if all locked mounts that belong to S have parents that also belong to S. 3) A set S is closed if it contains all children of its elements. The set of mounts taken out by umount(2) must be non-shifting and non-revealing; the first constraint is what allows to reparent any remaining mounts and the second is what prevents the exposure of any concealed mountpoints. propagate_umount() takes the original set as an argument and tries to extend that set. The original set is a full subtree and its root is unlocked; what matters is that it's closed and non-revealing. Resulting set may not be closed; there might still be mounts outside of that set, but only on top of stacks of root-overmounting elements of set. They can be reparented to the place where the bottom of stack is attached to a mount that will survive. NOTE: doing that will violate a constraint on having no more than one mount with the same parent/mountpoint pair; however, the caller (umount_tree()) will immediately remedy that - it may keep unmounted element attached to parent, but only if the parent itself is unmounted. Since all conflicts created by reparenting have common parent *not* in the set and one side of the conflict (bottom of the stack of overmounts) is in the set, it will be resolved. However, we rely upon umount_tree() doing that pretty much immediately after the call of propagate_umount(). Algorithm is based on two statements: 1) for any set S, there is a maximal non-shifting subset of S and it can be calculated in O(#S) time. 2) for any non-shifting set S, there is a maximal non-revealing subset of S. That subset is also non-shifting and it can be calculated in O(#S) time. Finding candidates. We are given a closed set U and we want to find all mounts that have the same mountpoint as some mount m in U *and* whose parent receives propagation from the parent of the same mount m. Naive implementation would be S = {} for each m in U add m to S p = parent(m) for each q in Propagation(p) - {p} child = look_up(q, mountpoint(m)) if child add child to S but that can lead to excessive work - there might be propagation among the subtrees of U, in which case we'd end up examining the same candidates many times. Since propagation is transitive, the same will happen to everything downstream of that candidate and it's not hard to construct cases where the approach above leads to the time quadratic by the actual number of candidates. Note that if we run into a candidate we'd already seen, it must've been added on an earlier iteration of the outer loop - all additions made during one iteration of the outer loop have different parents. So if we find a child already added to the set, we know that everything in Propagation(parent(child)) with the same mountpoint has been already added. S = {} for each m in U if m in S continue add m to S p = parent(m) q = propagation_next(p, p) while q child = look_up(q, mountpoint(m)) if child if child in S q = skip_them(q, p) continue; add child to S q = propagation_next(q, p) where skip_them(q, p) keep walking Propagation(p) from q until we find something not in Propagation(q) would get rid of that problem, but we need a sane implementation of skip_them(). That's not hard to do - split propagation_next() into "down into mnt_slave_list" and "forward-and-up" parts, with the skip_them() being "repeat the forward-and-up part until we get NULL or something that isn't a peer of the one we are skipping". Note that there can be no absolute roots among the extra candidates - they all come from mount lookups. Absolute root among the original set is _currently_ impossible, but it might be worth protecting against. Maximal non-shifting subsets. To prove (1), consider the following operation: Trim(S, T) = S - {x : there is a sequence (x_0,...,x_k), such that k >= 1, for all i < k x_i is the parent of x_{i+1}, for all i < k x_i belongs to S, x_k does not belong to S, mountpoint(x_1) != root(x_0), x = x_0, x_{k-1} belongs to T} In other words, we remove an element x if there is a chain of descendents starting at x, _not_ overmounting root of x and leaving S right after it passes through some element of T. Properties: A) for any element x that belongs to S - Trim(S, T) there is a subtree mounted strictly inside x and containing some elements that do not belong to S. Directly from definition - subtree that consists of x_1 and all its descendents will satisfy our conditions. B) if U is a non-shifting subset of S, U is a subset of Trim(S, T). Indeed, suppose U is not a subset of Trim(S, T). Then there is an element u that belongs both to U and S - Trim(S, T). By (A) there must be a subtree mounted strictly inside u and containing elements that do not belong to S. But U is non-shifting by assumption, so all elements of such subtree will belong to U, which is a subset of S. Contradiction. C) Trim(Trim(S, T1), T2) = Trim(S, union of T1 and T2). [would be easier with pictures... ;-/] First, let's show that Trim(Trim(S, T1), T2) is a subset of Trim(S, union of T1 and T2). Suppose there is an element x of Trim(Trim(S, T1), T2) that does not belong to Trim(S, union of T1 and T2). Since Trim(Trim(S, T1), T2) is a subset of Trim(S, T1), x must be an element of Trim(S, T1). Since it doesn't belong to Trim(S, union of T1 and T2), there is a chain of descent (x_0 = x, x_1, ..., x_k) such that x_1 does not overmount root of x_0 x_0, ..., x_{k-1} all belong to S x_k doesn't belong to S x_{k-1} belongs to T1 or to T2. If some of x_0,...,x_{k-1} do not belong to Trim(S, T1), let x_i be the first such element. Since x is an element of Trim(S, T1), i can't be 0. But then concatenation of (x_0, ..., x_{i-1}) with the chain that has lead to exclusion of x_i from Trim(S, T1) would yield a chain excluding x itself from Trim(S, T1). Since x belongs to Trim(S, T1), we have shown that x_0, ..., x_{k-1} all belong to Trim(S, T1) Since our chain does not exclude x from Trim(S, T1), x_{k-1} can't be an element of T1. So we have x_{k-1} belongs to T2 and x_k doesn't belong to S, let alone Trim(S, T1) i.e. our chain excludes x from Trim(Trim(S, T1), T2). Contradiction. To prove the inclusion in opposite direction, suppose there is an element x that belongs to Trim(S, union of T1 and T2) but not to Trim(Trim(S, T1), T2). x must belong to Trim(S, T1), since any chain that would exclude it from Trim(S, T1) would've excluded it from Trim(S, union of T1 and T2) as well. Since x belongs to Trim(S, T1), but not Trim(Trim(S, T1), T2), there must be a chain of descent (x_0 = x, x_1, ..., x_k) such that x_1 does not overmount root of x_0 x_0, ..., x_{k-1} all belong to Trim(S, T1) x_k doesn't belong to Trim(S, T1) x_{k-1} belongs to T2. x_k must belong to S, otherwise that chain would've excluded x from Trim(S, union of T1 and T2). But concatenation of (x_0, ..., x_{k-1}) with the chain that excluded x_k from Trim(S, T1) will yield a chain that excludes x from Trim(S, union of T1 and T2). Contradition. D) Trim(S, T) = S when S does not intersect with T. Trivial from definition - for removing something it requires a chain with next-to-last element belonging both to T and S. If S and T do not intersect, nothing get removed... E) Trim(S, S) is a non-shifting subset of S that contains any non-shifting subset of S. Suppose Trim(S, S) is *not* non-shifting. Then it must be an element x belonging to Trim(S, S) and a subtree mounted strictly inside x, with some of its elements not belonging to Trim(S, S). Let y be one of the subtree elements that do not belong to Trim(S, S). Let (x_0 = x, x_1, ..., x_k = y) be its chain of descent. Without loss of generality we can assume that all of x_0,...,x_{k-1} belong to Trim(S, S). y must belong to S, since otherwise that chain would've excluded x from Trim(S, S). Consider the chain excluding y from Trim(S, S); its concatenation with (x_0, ..., x_{k-1}) yields a chain that excludes x from Trim(S, S). Contradiction. The second part immediately follows from (B). F) Let U be a closed subset of S. Then U will be a subset of Trim(S, {x}) for any x. Follows from (B), since any such U will be non-shifting. G) Let U be a closed subset of S. Then Trim(S, {u}) = S for any u belonging to U. Follows directly from definition - if the next-to-last element of a descent chain is equal to u, its last element belongs to U and thus to S. Implementation: H1) Let S be an ordered set. Then the following calculates its maximal non-shifting subset: Max_Non_Shifting(S) if S is empty return S x = first(S) V = S while true V = Trim(V, {x}) if there are elements of V greater than x x = first of such elements else return V Let M be the set of all values of x we have observed in the loop. Consider the final value of V. By (C) it's equal to Trim(S, M) and by (B) it contains any non-shifting subset of S. But for any element that has remained in V until the end there must have been an iteration with x equal to that element, so V is a subset of M. In other words, Trim(V, V) = Trim(Trim(S, M), V) = Trim(S, union of M and V) = Trim(S, M) = V so by (E) V is non-shifting. H2) The following is O(#S) equivalent of S = Max_Non_Shifting(S), assuming we can mark the elements and originally no marks are set: Trim_to_non_shifting for (x = first(S); x; x = Trim_one(x)) ; where Trim_one(m) remove_this = false; found = false; for each n in children(m) { if (n not in S) { found = true; if (mountpoint(n) != root(m)) { remove_this = true; break; } } } if (!found) return next(m); for (this = m, p = parent(m); !marked(this) && p in S; this = p, p = parent(p)) { mark(this); if (mountpoint(this) != root(p)) { unmark(p); remove p from S; } } res = next(m); if (remove_this) { unmark(m); remove m from S; } return res; Trim_one(m) will * replace S with Trim(S, {m}) * return NULL if no elements greater than m remain in S * return the smallest of such elements otherwise * maintain the following invariants: * only elements of S are marked * if p is marked and (x_0, x_1, ..., x_k = p) is an ancestry chain and all x_i belong to S then for any i > 0 x_i overmounts the root of x_{i-1} Proof: Consider the chains excluding elements from Trim(S, {m}). The last two elements in each are m and some child of m that does not belong to S. If no such children exist, Trim(S, {m}) is equal to S and next(m) is the correct return value. m itself is removed if and only if the chain has exactly two elements, i.e. when the last element does not overmount the root of m. In other words, if there is a child not in S that does not overmount the root of m. Once we are done looking at the children 'remove_this' is true if and only if m itself needs to be removed. All other elements to remove will be ancestors of m, such that the entire descent chain from them to m is contained in S. Let (x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be removed if and only if x_{i+1} does not overmount its root. Note that due to the invariant for marks, finding x_{i+1} marked means that none of its ancestors will qualify for removal. What's more, after we have found and removed everything we needed, all remaining elements of the chain can be marked - their nearest ancestors that do not overmount their parent's root will be outside of S. Since ancestry chains can't loop, we can set the marks as we go - we won't be looking at that element again until after all removals. If Trim_one(m) needs to remove m, it does that after all other removals. Once those removals (if any) are done m is still an element of our set, so the smallest remaining element greater than m is equal to next(m). Once it is found, we can finally remove m itself. Time complexity: * we get no more than O(#S) calls of Trim_one() * the loop over children in Trim_one() never looks at the same child twice through all the calls. * iterations of that loop for children in S are no more than O(#S) in the worst case * at most two children that are not elements of S are considered per call of Trim_one(). * the second loop in Trim_one() sets mark once per iteration and no element of S has is set more than once. H3) In practice we have a large closed non-revealing subset in S. It can be used to reduce the amount of work. What's more, we can have all elements of U removed from their parents' lists of children, which considerably simplifies life. // S is a represented as a disjoint union of an non-nrevealing closed // set U and a set C. Everything in U has been removed from their parents' // lists of children. Trim_to_non_shifting2 for (x = first(C); x; x = Trim_one(x)) ; where Trim_one(m) remove_this = false; found = false; umount_this = false; for each n in children(m) { if (n not in C) { found = true; if (mountpoint(n) != root(m)) { remove_this = true; break; } } } if (found) { this = m; for (p = parent(this); !marked(this) && p in C; m = p, p = parent(p)) { mark(this); if (mountpoint(this) != root(p)) { unmark(p); remove p from C; } } } else if (!locked(m) && children(m) is empty) { remove_this = true; umount_this = true; } res = next(m); if (remove_this) { unmark(m); remove m from C; if (umount_this) { remove m from children(parent(m)) add m to U; } } return res; Equivalent to Trim_to_non_shifting, modulo representation. We know that Trim_one() is a no-op when its argument belongs to U, we won't see any elements of U in lists of children, so we only need to check for belonging to C and we know that parent of anything not in U can't be in U, so the ancestor-trimming loop won't run into anything in U - test for belonging to S is equivalent to test for belonging to C there. Simple candidates for adding to U (not locked and no children not in U) get moved from C to U - U remains closed non-revealing and union remains unchanged. Time complexity is O(#C) here. Maximal non-revealing subsets If S is not a non-revealing subset, there is a locked element x in S such that parent of x is not in S. Obviously, no non-revealing subset of S may contain x. Removing such elements one by one will obviously end with the maximal non-revealing subset (possibly empty one). Note that removal of an element will require removal of all its locked children, etc. If the set had been non-shifting, it will remain non-shifting after such removals. Proof: suppose S was non-shifting, x is a locked element of S, parent of x is not in S and S - {x} is not non-shifting. Then there is an element m in S - {x} and a subtree mounted strictly inside m, such that m contains an element not in in S - {x}. Since S is non-shifting, everything in that subtree must belong to S. But that means that this subtree must contain x somewhere *and* that parent of x either belongs that subtree or is equal to m. Either way it must belong to S. Contradiction. // same representation as for finding maximal non-shifting subsets: // S is a disjoint union of non-revealing set U and a set C. // Elements of U are removed from their parents' lists of children. // in the end C becomes empty and maximal non-revealing non-shifting // subset of S is now in U Trim_locked while (C is non-empty) handle_locked(first(C)) handle_locked(m) cutoff = m for (p = m; p in C; p = parent(p)) { unmark(p); remove p from C; if (!locked(p)) cutoff = parent(p) } if (p in U) cutoff = p while (m != cutoff) { remove m from children(parent(m)); add m to U; m = parent(m); } Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S. * If it contains some elements of U, let x_k be the last one of those. Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing. * otherwise if all its elements are locked, then none of {x_0, ..., x_n} may be elements of a non-revealing subset of S. * otherwise let x_k be the first unlocked element of the chain. Then none of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of S and union of U and {x_k, ..., x_n} is non-revealing. handle_locked(m) finds which of these cases applies and adjusts C and U accordingly. U remains non-revealing, union of C and U still contains any non-revealing subset of S and after the call of handle_locked(m) m is guaranteed to be not in C. So having it called for each element of S would suffice to empty C, leaving U the maximal non-revealing subset of S. However, handle_locked(m) is a no-op when m belongs to U, so it's enough to have it called for elements of C while there are any. Time complexity: number of calls of handle_locked() is limited by #C, each iteration of the first loop in handle_locked() removes an element for C, so their total number of executions is also limited by #C; number of iterations in the second loop is no greater than the number of iterations of the first loop. Reparenting After we'd calculated the final set, we still need to deal with reparenting - if an element of the final set has a child not in it, we need to reparent such child. Such children can only be root-overmounting (otherwise the set wouldn't be non-shifting) and their parents can not belong to the original set, since the original is guaranteed to be closed. Putting all of that together The plan is to * find all candidates * trim down to maximal non-shifting subset * trim down to maximal non-revealing subset * reparent anything that needs to be reparented * return the resulting set to the caller For the 2nd and 3rd steps we want to separate the set into growing non-revealing subset, initially containing the original set ("U" in terms of the pseudocode above) and everything we are still not sure about ("C"). It means that for the output of the 1st step we'd like the extra candidates separated from the stuff already in the original set. For the 4th step we would like the additions to U separate from the original set. So let's go for * original set ("set"). Linkage via mnt_list * unclassified candidates ("C"). Linkage via mnt_umounting. * anything in U that hadn't been in the original set - elements of C will gradually be either discarded or moved there. In other words, it's the candidates we have already decided to unmount. Its role is reasonably close to the old "to_umount", so let's use that name. Linkage via mnt_list. For gather_candidates() we'll need to maintain both C (S - set) and intersection of S with set, with the latter emptied once we are done. Use mnt_umounting for both, that'll give a cheap way to check belonging to their union (in gather_candidates()) and to C itself (at later stages). Call that predicate is_candidate(); it would be !list_empty(&m->mnt_umounting) All elements of the original set are marked with MNT_UMOUNT and we'll need the same for elements added when joining the contents of to_umount to set in the end. Let's set MNT_UMOUNT at the time we add an element to to_umount; that's close to what the old 'umount_one' is doing, so let's keep that name. It also gives us another predicate we need - "belongs to union of set and to_umount"; will_be_unmounted() for now. Another helper - unmark and remove from C; remove_candidate(m) propagate_unmount(set) { // C <- candidates gather_candidates(set); // make set + C + to_umount non-shifting for (m = first(C); m; m = trim_one(m)) ; // ... and non-revealing while (!empty(C)) handle_locked(first(C)); // now to_umount consists of all non-rejected candidates // deal with reparenting of remaining overmounts on those list_for_each_entry(m, to_umount, mnt_list) { while (!empty(children(m))) // should be at most one reparent(first(children(m))) } list_splice_tail_init(set, &to_umount); } gather_candidates(set) { LIST_HEAD(visited) // intersection of S with set C = {} // S - set, i.e. candidates list_for_each_entry(m, set, mnt_list) { if (is_candidate(m)) continue; list_add(&m->mnt_umounting, &visited); p = parent(m); q = propagation_next(p, p); while (q) { child = look_up(q, mountpoint(m)); if (child) { if (is_candidate(child)) { q = skip_them(q, p); continue; } if (will_be_unmounted(child)) head = &visited; else head = &C; list_add(&child->mnt_umounting, head); } q = propagation_next(q, p); } } while (!list_empty(&visited)) // empty visited list_del_init(visited.next); } trim_one(m) { bool remove_this = false, found = false, umount_this = false; struct mount *res, *n; for each n in children(m) { if (!is_candidate(n)) { found = true; if (mountpoint(n) != root(m)) { remove_this = true; break; } } } if (found) { struct mount *this = m; for (struct mount *p = parent(this); !marked(this) && is_candidate(p); this = p, p = parent(p)) { mark(this); if (mountpoint(this) != root(p)) remove_candidate(p); } } else if (!locked(m) && children(m) is empty) { remove_this = true; umount_this = true; } res = next(m); if (remove_this) { remove_candidate(m); if (umount_this) umount_one(m); } return res; } handle_locked(m) { struct mount *cutoff = m, *p; for (p = m; is_candidate(p); p = parent(p)) { remove_candidate(p); if (!locked(p)) cutoff = parent(p); } if (will_be_unmounted(p)) cutoff = p; while (m != cutoff) { umount_one(m); m = parent(m); } } reparent(m) { struct mount *p = m; do { mp = p->mnt_mp; p = parent(p); } while (will_be_unmounted(p)); mnt_change_mountpoint(p, mp, m); } is_candidate(m) return !list_empty(&m->mnt_umounting); will_be_unmounted(m) return m->mnt.mnt_flags & MNT_UMOUNT; umount_one(m) // does *not* unmark or remove from C, unlike the old namesake { m->mnt.mnt_flags |= MNT_UMOUNT; list_del_init(&m->mnt_child); // remove from parent's list of children move_from_ns(m, &to_umount); } remove_candidate(m) { unmark(m); list_del_init(&m->mnt_umounting); // remove m from C }