Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

Diangang Li <lidiangang@xxxxxxxxxxxxx> · Wed, 10 Sep 2025 15:38:55 +0800




On Thu, Feb 10, 2022 at 05:33:23AM +0000, Al Viro wrote:
> On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote:
> 
> > +static void sweep_negative(struct dentry *dentry)
> > +{
> > +	struct dentry *parent;
> > +
> > +	rcu_read_lock();
> > +	parent = lock_parent(dentry);
> > +	if (!parent) {
> > +		rcu_read_unlock();
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If we did not hold a reference to dentry (as in the case of dput),
> > +	 * and dentry->d_lock was dropped in lock_parent(), then we could now be
> > +	 * holding onto a dead dentry. Be careful to check d_count and unlock
> > +	 * before dropping RCU lock, otherwise we could corrupt freed memory.
> > +	 */
> > +	if (!d_count(dentry) && d_is_negative(dentry) &&
> > +		!d_is_tail_negative(dentry)) {
> > +		dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> > +		list_move_tail(&dentry->d_child, &parent->d_subdirs);
> > +	}
> > +
> > +	spin_unlock(&parent->d_lock);
> > +	spin_unlock(&dentry->d_lock);
> > +	rcu_read_unlock();
> > +}
> 
> 	I'm not sure if it came up the last time you'd posted this series
> (and I apologize if it had and I forgot the explanation), but... consider
> the comment in dentry_unlist().  What's to prevent the race described there
> making d_walk() skip a part of tree, by replacing the "lseek moving cursor
> in just the wrong moment" with "dput moving the negative dentry right next
> to the one being killed to the tail of the list"?
> 
> 	The race in question:
> d_walk() is leaving a subdirectory.  We are here:
>         rcu_read_lock();
> ascend:
>         if (this_parent != parent) {
> 
> It isn't - we are not back to the root of tree being walked.
> At this point this_parent is the directory we'd just finished looking into.
> 
>                 struct dentry *child = this_parent;
>                 this_parent = child->d_parent;
> 
> ... and now child points to it, and this_parent points to its parent.
> 
>                 spin_unlock(&child->d_lock);
> 
> No locks held.  Another CPU gets through successful rmdir().  child gets
> unhashed and dropped.  It's off the ->d_subdirs of this_parent; its
> ->d_child.next is still pointing where it used to, and whatever it points
> to won't be physically freed until rcu_read_unlock().
> 
> Moreover, in the meanwhile this next sibling (negative, pinned) got dput().
> And had been moved to the tail of the this_parent->d_subdirs.  Since
> its ->d_child.prev does *NOT* point to child (which is off-list, about to
> be freed shortly, etc.), child->d_dchild.next is not modified - it still
> points to that (now moved) sibling.
> 
>                 spin_lock(&this_parent->d_lock);
> Got it.
> 
>                 /* might go back up the wrong parent if we have had a rename. */
>                 if (need_seqretry(&rename_lock, seq))
>                         goto rename_retry;
> 
> Nope, hadn't happened.
> 
>                 /* go into the first sibling still alive */
>                 do {
>                         next = child->d_child.next;
> ... and this is the moved sibling, now in the end of the ->d_subdirs.
> 
>                         if (next == &this_parent->d_subdirs)
>                                 goto ascend;
> 
> No, it is not - it's the last element of the list, not its anchor.
> 
>                         child = list_entry(next, struct dentry, d_child);
> 
> Our moved negative dentry.
> 
>                 } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
> 
> Not killed, that one.
>                 rcu_read_unlock();
>                 goto resume;
> 
> ... and since that sucker has no children, we proceed to look at it,
> ascend and now we are at the end of this_parent->d_subdirs.  And we
> ascend out of it, having entirely skipped all branches that used to
> be between the rmdir victim and the end of the parent's ->d_subdirs.
> 
> What am I missing here?  Unlike the trick we used with cursors (see
> dentry_unlist()) we can't predict who won't get moved in this case...
> 
> Note that treating "child is has DCACHE_DENTRY_KILLED" same as we do
> for rename_lock mismatches would not work unless you grab the spinlock
> component of rename_lock every time dentry becomes positive.  Which
> is obviously not feasible (it's a system-wide lock and cacheline
> pingpong alone would hurt us very badly, not to mention the contention
> issues due to the frequency of grabbing it going up by several orders
> of magnitude).

Hi Al, Stephen,

How about adding a pointer in union d_u to record the original next of negative dentry
when executing `sweep_negative`, as follows:

@@ -125,6 +126,7 @@ struct dentry {
 		struct hlist_node d_alias;	/* inode alias list */
 		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
 	 	struct rcu_head d_rcu;
+
+		/* valid between sweep_negative and recyle_negative */
+		struct hlist_node *d_sib_backup;
 	} d_u;
 };

Then, during `d_walk`, once we find a dentry with `DCACHE_DENTRY_KILLED`, and its next
with `DCACHE_TAIL_NEGATIVE`, iterate to get the genuine next dentry in d_children.

@@ -1357,8 +1405,22 @@ static void d_walk(struct dentry *parent, void *data,
    /* might go back up the wrong parent if we have had a rename. */
    if (need_seqretry(&rename_lock, seq))
        goto rename_retry;
+
+		next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
+		do {
+			if (!(next && next->d_u.d_sib_backup) ||
+			    !(dentry->d_flags & DCACHE_DENTRY_KILLED) ||
+			    !(next->d_flags & DCACHE_TAIL_NEGATIVE)) {
+				dentry = next;
+				break;
+			}
+			dentry = hlist_entry_safe(next->d_u.d_sib_backup, struct dentry, d_sib);
+			next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
+		} while (true);
+
		/* go into the first sibling still alive */
-		hlist_for_each_entry_continue(dentry, d_sib) {
+		hlist_for_each_entry_from(dentry, d_sib) {

And since `d_children` changed from `list_head` to `hlist_head`, we cannot move a negative dentry
to the tail of d_children directly. Instead, add another `d_negative` list to cache negative
dentries. For the majority of d_children traversal, we only care about positive dentries.

@@ -118,6 +118,7 @@ struct dentry {
 	};
 	struct hlist_node d_sib;	/* child of parent list */
 	struct hlist_head d_children;	/* our children */
+	struct hlist_head d_negative;	/* cached negative dentries */

The commit `41f49be2e51a71` ("fsnotify: clear PARENT_WATCHED flags lazily") has resolved the
softlockup in `__fsnotify_parent`, but we are still seeing softlockup in `fsnotify_recalc_mask`
when adding watches with millions of negative dentries. As noted in [1], we have encountered the
same issue. In our case, the negative dentries are accumulated primarily by opening non-existent
files. The `dentry-negative` sysctl introduced in commit `e6957c99dca5f`
("vfs: Add a sysctl for automated deletion of dentry") does not seem to have any effect in our
scenario. Given this, we believe that splitting the `d_children` list into positive and negative
dentries, and skipping the negatives in `fsnotify_set_children_dentry_flags`, is still a
valuable approach.

[1] https://lore.kernel.org/linux-fsdevel/CAJfpegs+czRD1=s+o5yNoOp13xH+utQ8jQkJ9ec5283MNT_xmg@xxxxxxxxxxxxxx/


Regards,
Diangang