On Thu, Feb 10, 2022 at 05:33:23AM +0000, Al Viro wrote: > On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote: > > > +static void sweep_negative(struct dentry *dentry) > > +{ > > + struct dentry *parent; > > + > > + rcu_read_lock(); > > + parent = lock_parent(dentry); > > + if (!parent) { > > + rcu_read_unlock(); > > + return; > > + } > > + > > + /* > > + * If we did not hold a reference to dentry (as in the case of dput), > > + * and dentry->d_lock was dropped in lock_parent(), then we could now be > > + * holding onto a dead dentry. Be careful to check d_count and unlock > > + * before dropping RCU lock, otherwise we could corrupt freed memory. > > + */ > > + if (!d_count(dentry) && d_is_negative(dentry) && > > + !d_is_tail_negative(dentry)) { > > + dentry->d_flags |= DCACHE_TAIL_NEGATIVE; > > + list_move_tail(&dentry->d_child, &parent->d_subdirs); > > + } > > + > > + spin_unlock(&parent->d_lock); > > + spin_unlock(&dentry->d_lock); > > + rcu_read_unlock(); > > +} > > I'm not sure if it came up the last time you'd posted this series > (and I apologize if it had and I forgot the explanation), but... consider > the comment in dentry_unlist(). What's to prevent the race described there > making d_walk() skip a part of tree, by replacing the "lseek moving cursor > in just the wrong moment" with "dput moving the negative dentry right next > to the one being killed to the tail of the list"? > > The race in question: > d_walk() is leaving a subdirectory. We are here: > rcu_read_lock(); > ascend: > if (this_parent != parent) { > > It isn't - we are not back to the root of tree being walked. > At this point this_parent is the directory we'd just finished looking into. > > struct dentry *child = this_parent; > this_parent = child->d_parent; > > ... and now child points to it, and this_parent points to its parent. > > spin_unlock(&child->d_lock); > > No locks held. Another CPU gets through successful rmdir(). child gets > unhashed and dropped. It's off the ->d_subdirs of this_parent; its > ->d_child.next is still pointing where it used to, and whatever it points > to won't be physically freed until rcu_read_unlock(). > > Moreover, in the meanwhile this next sibling (negative, pinned) got dput(). > And had been moved to the tail of the this_parent->d_subdirs. Since > its ->d_child.prev does *NOT* point to child (which is off-list, about to > be freed shortly, etc.), child->d_dchild.next is not modified - it still > points to that (now moved) sibling. > > spin_lock(&this_parent->d_lock); > Got it. > > /* might go back up the wrong parent if we have had a rename. */ > if (need_seqretry(&rename_lock, seq)) > goto rename_retry; > > Nope, hadn't happened. > > /* go into the first sibling still alive */ > do { > next = child->d_child.next; > ... and this is the moved sibling, now in the end of the ->d_subdirs. > > if (next == &this_parent->d_subdirs) > goto ascend; > > No, it is not - it's the last element of the list, not its anchor. > > child = list_entry(next, struct dentry, d_child); > > Our moved negative dentry. > > } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED)); > > Not killed, that one. > rcu_read_unlock(); > goto resume; > > ... and since that sucker has no children, we proceed to look at it, > ascend and now we are at the end of this_parent->d_subdirs. And we > ascend out of it, having entirely skipped all branches that used to > be between the rmdir victim and the end of the parent's ->d_subdirs. > > What am I missing here? Unlike the trick we used with cursors (see > dentry_unlist()) we can't predict who won't get moved in this case... > > Note that treating "child is has DCACHE_DENTRY_KILLED" same as we do > for rename_lock mismatches would not work unless you grab the spinlock > component of rename_lock every time dentry becomes positive. Which > is obviously not feasible (it's a system-wide lock and cacheline > pingpong alone would hurt us very badly, not to mention the contention > issues due to the frequency of grabbing it going up by several orders > of magnitude). Hi Al, Stephen, How about adding a pointer in union d_u to record the original next of negative dentry when executing `sweep_negative`, as follows: @@ -125,6 +126,7 @@ struct dentry { struct hlist_node d_alias; /* inode alias list */ struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */ struct rcu_head d_rcu; + + /* valid between sweep_negative and recyle_negative */ + struct hlist_node *d_sib_backup; } d_u; }; Then, during `d_walk`, once we find a dentry with `DCACHE_DENTRY_KILLED`, and its next with `DCACHE_TAIL_NEGATIVE`, iterate to get the genuine next dentry in d_children. @@ -1357,8 +1405,22 @@ static void d_walk(struct dentry *parent, void *data, /* might go back up the wrong parent if we have had a rename. */ if (need_seqretry(&rename_lock, seq)) goto rename_retry; + + next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib); + do { + if (!(next && next->d_u.d_sib_backup) || + !(dentry->d_flags & DCACHE_DENTRY_KILLED) || + !(next->d_flags & DCACHE_TAIL_NEGATIVE)) { + dentry = next; + break; + } + dentry = hlist_entry_safe(next->d_u.d_sib_backup, struct dentry, d_sib); + next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib); + } while (true); + /* go into the first sibling still alive */ - hlist_for_each_entry_continue(dentry, d_sib) { + hlist_for_each_entry_from(dentry, d_sib) { And since `d_children` changed from `list_head` to `hlist_head`, we cannot move a negative dentry to the tail of d_children directly. Instead, add another `d_negative` list to cache negative dentries. For the majority of d_children traversal, we only care about positive dentries. @@ -118,6 +118,7 @@ struct dentry { }; struct hlist_node d_sib; /* child of parent list */ struct hlist_head d_children; /* our children */ + struct hlist_head d_negative; /* cached negative dentries */ The commit `41f49be2e51a71` ("fsnotify: clear PARENT_WATCHED flags lazily") has resolved the softlockup in `__fsnotify_parent`, but we are still seeing softlockup in `fsnotify_recalc_mask` when adding watches with millions of negative dentries. As noted in [1], we have encountered the same issue. In our case, the negative dentries are accumulated primarily by opening non-existent files. The `dentry-negative` sysctl introduced in commit `e6957c99dca5f` ("vfs: Add a sysctl for automated deletion of dentry") does not seem to have any effect in our scenario. Given this, we believe that splitting the `d_children` list into positive and negative dentries, and skipping the negatives in `fsnotify_set_children_dentry_flags`, is still a valuable approach. [1] https://lore.kernel.org/linux-fsdevel/CAJfpegs+czRD1=s+o5yNoOp13xH+utQ8jQkJ9ec5283MNT_xmg@xxxxxxxxxxxxxx/ Regards, Diangang