Hello,

we have a customer who is mounting a directory over NFS (let's call it hugedir) with many files (there are several million dentries on the d_children list). Now when they do 'mv hugedir hugedir.bak; mkdir hugedir' on the server, which invalidates the NFS cache of this directory, NFS clients get stuck in d_invalidate() for hours (until the customer lost patience). I don't want to discuss the sanity or efficiency of this application architecture here, but I share the opinion that it shouldn't take hours to invalidate a couple million dentries.

Analysis of the crashdump revealed that d_invalidate() can have O(n^2) complexity in the number of dentries it is invalidating, which leads to impractical times when invalidating large numbers of dentries. What happens is the following:

There are several processes accessing the hugedir directory - about 16 in the case I was inspecting. When the directory changes on the server, all these 16 processes quickly enter d_invalidate() -> shrink_dcache_parent() (actually the kernel is old so the function names are somewhat different, but the same seems to be applicable to the current upstream kernel AFAICT). In shrink_dcache_parent() we use d_walk() to collect dentries to invalidate. The processes doing d_walk() are synchronized on hugedir->d_lock so they operate sequentially.

Now the select_collect() handler returns D_WALK_QUIT as soon as need_resched() is set and it has moved at least one dentry to the dispose list. Hence the first task that does d_walk() moves around 90k entries to its dispose list before it decides to take a break. The second task has somewhat harder work - it needs to skip over those 90k entries moved to the dispose list (as they are still in the hugedir->d_children list) and only then starts moving dentries to its own dispose list. It managed to move 50k dentries before it decided to reschedule. Finally, from about the fifth task that called d_invalidate() onwards, each task always spends its whole timeslice skipping over already moved dentries in the hugedir->d_children list, then moves one dentry to its dispose list and aborts scanning to reschedule. So we end up in a situation where a few tasks have tens of thousands of entries in their dispose lists and a larger number of tasks have only one dentry in their dispose lists.

What happens next is that tasks eventually get scheduled again and proceed to process the dentries in their dispose lists (in fact, in the current kernel the first scheduling point seems to be in __dentry_kill() when we are evicting the first dentry from the dispose list). If this was a task that had only a single dentry in its dispose list, it rather quickly evicts it and cycles back to d_walk() - only to scan hugedir->d_children again, skip a huge number of already moved dentries, and finally move one more dentry to its dispose list. While this is happening, all the other tasks that want to evict dentries from their dispose lists are spinning on hugedir->d_lock. So overall progress is very slow because we always evict only a few dentries and then someone gets to scanning hugedir->d_children, blocks everybody for a whole timeslice, only to move one more dentry to its dispose list...
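To make this concrete, below is a heavily simplified sketch of the collect-and-dispose pattern as I read it. It mirrors the shape of select_collect() and shrink_dcache_parent() but it is illustrative pseudo-C only, not the actual upstream code - locking, refcounting, and the handling of unhashed / negative dentries are all glossed over:

/*
 * Illustrative pseudo-C only, simplified from my reading of fs/dcache.c;
 * not the actual upstream code.
 */
struct select_data {
        struct dentry *start;
        long found;
        struct list_head dispose;
};

static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
{
        struct select_data *data = _data;

        /*
         * Dentry already sits on some other task's dispose list - we can
         * only skip it.  This is the part every later walker has to wade
         * through again and again.
         */
        if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                data->found++;
        } else if (!dentry->d_lockref.count) {
                /* Unused dentry - move it to our private dispose list. */
                d_shrink_add(dentry, &data->dispose);
                data->found++;
        }

        /*
         * Once we have collected at least one dentry we may quit the walk
         * as soon as a reschedule is pending, which is what allows a task
         * to bail out after moving just a single dentry.
         */
        if (!list_empty(&data->dispose) && need_resched())
                return D_WALK_QUIT;
        return D_WALK_CONTINUE;
}

void shrink_dcache_parent(struct dentry *parent)
{
        for (;;) {
                struct select_data data = { .start = parent };

                INIT_LIST_HEAD(&data.dispose);
                /* d_walk() serializes all walkers on parent->d_lock. */
                d_walk(parent, &data, select_collect);

                if (!list_empty(&data.dispose)) {
                        /* Evict what we collected, then rescan d_children
                         * from the head all over again. */
                        shrink_dentry_list(&data.dispose);
                        continue;
                }

                cond_resched();
                if (!data.found)
                        break;          /* nothing left to invalidate */
        }
}

The key point is that every iteration of the outer loop rescans hugedir->d_children from the head, skipping everything that other tasks have already claimed for their dispose lists.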
The question is how to best fix this very inefficient behavior. One possible approach would be to somehow completely serialize the shrink_dcache_parent() operation for a directory. That way we'd avoid the expensive rescanning of already moved dentries as well as the contention on d_lock between tasks. On the other hand, deep directory hierarchies can possibly benefit from multiple tasks reaping them, so it isn't a clear universal win.

Another possible approach is to remember the position in the d_children list where we stopped scanning when we decided to reschedule (I suppose we could use DCACHE_DENTRY_CURSOR for that). This way there would still be some amount of skipping of already moved dentries, but at least the overall complexity would go from O(n^2) to O(t*n) where 't' is the number of tasks trying to invalidate the directory, so I think the performance would be acceptable. A completely untested sketch of the shape I have in mind is appended at the end of this mail.

What do people think about these solutions? Anything I'm missing?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
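For illustration, the completely untested sketch I mentioned above. The 'cursor' member and the way it is used are my own invention layered on top of the existing select_data / d_walk() machinery, and the other half - teaching d_walk() to resume scanning from the cursor instead of from the head of d_children - is not shown:

/*
 * Completely untested sketch, only to show the shape of the cursor idea:
 * park a DCACHE_DENTRY_CURSOR dentry in parent->d_children at the point
 * where we gave up, the way readdir cursors already sit in that list.
 */
struct select_data {
        struct dentry *start;
        struct dentry *cursor;          /* resume point, preallocated by caller */
        long found;
        struct list_head dispose;
};

static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
{
        struct select_data *data = _data;

        if (dentry->d_flags & (DCACHE_SHRINK_LIST | DCACHE_DENTRY_CURSOR)) {
                /* Someone else's dispose entry or another cursor - skip. */
                data->found++;
        } else if (!dentry->d_lockref.count) {
                d_shrink_add(dentry, &data->dispose);
                data->found++;
        }

        if (!list_empty(&data->dispose) && need_resched()) {
                /*
                 * Instead of forgetting where we got to, leave our cursor
                 * right behind 'dentry' in the parent's d_children list so
                 * the next scan does not have to re-skip everything before
                 * this point.  d_walk() already holds the parent's d_lock
                 * here, so the list_move() itself should be safe.
                 */
                list_move(&data->cursor->d_sib, &dentry->d_sib);
                return D_WALK_QUIT;
        }
        return D_WALK_CONTINUE;
}

Cursor cleanup (dropping it from d_children once the task is done or the directory goes away) and the interaction with the readdir cursors that already live in that list would obviously need more thought.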