Re: Machine lockup with large d_invalidate()

On Thu, 15 May 2025 at 16:57, Jan Kara <jack@xxxxxxx> wrote:
>
> Hello,
>
> we have a customer who is mounting a directory over NFS (let's call it
> hugedir) with many files (there are several million dentries on its
> d_children list). Now when they do 'mv hugedir hugedir.bak; mkdir hugedir'
> on the server, which invalidates the NFS cache of this directory, NFS
> clients get stuck in d_invalidate() for hours (until the customer lost
> patience).
>
> Now I don't want to discuss the sanity or efficiency of this application
> architecture here, but I share the opinion that it shouldn't take hours to
> invalidate a couple million dentries. Analysis of the crashdump revealed
> that d_invalidate() can have O(n^2) complexity in the number of dentries
> it is invalidating, which leads to impractical times when invalidating
> large numbers of dentries. What happens is the following:
>
> There are several processes accessing the hugedir directory - about 16 in
> the case I was inspecting. When the directory changes on the server, all
> these 16 processes quickly enter d_invalidate() -> shrink_dcache_parent()

The first thing d_invalidate() does is check whether the dentry is already
unhashed and return if so; otherwise it unhashes the dentry. So only the
d_invalidate() caller that won the race for d_lock is going to invoke
shrink_dcache_parent(); the others will return immediately.
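For reference, a trimmed sketch of that prologue as it reads in
fs/dcache.c (exact details vary by kernel version):

void d_invalidate(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	if (d_unhashed(dentry)) {
		/* Somebody else already unhashed it; nothing to do. */
		spin_unlock(&dentry->d_lock);
		return;
	}
	/* We won the race: unhash it ourselves. */
	__d_drop(dentry);
	spin_unlock(&dentry->d_lock);

	/* Negative dentries can be dropped without further checks. */
	if (!dentry->d_inode)
		return;

	/* Only the winner gets this far. */
	shrink_dcache_parent(dentry);
	...
}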

What am I missing?

Thanks,
Miklos



