Re: Machine lockup with large d_invalidate()

Miklos Szeredi <miklos@xxxxxxxxxx> · Thu, 15 May 2025 17:22:57 +0200



On Thu, 15 May 2025 at 17:15, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>
> On Thu, 15 May 2025 at 16:57, Jan Kara <jack@xxxxxxx> wrote:
> >
> > Hello,
> >
> > we have a customer who is mounting over NFS a directory (let's call it
> > hugedir) with many files (there are several million dentries on d_children
> > list). Now when they do 'mv hugedir hugedir.bak; mkdir hugedir' on the
> > server, which invalidates NFS cache of this directory, NFS clients get
> > stuck in d_invalidate() for hours (until the customer lost patience).
> >
> > Now I don't want to discuss here sanity or efficiency of this application
> > architecture but I'm sharing the opinion that it shouldn't take hours to
> > invalidate a couple million dentries. Analysis of the crashdump revealed that
> > d_invalidate() can have O(n^2) complexity with the number of dentries it is
> > invalidating which leads to impractical times to invalidate large numbers
> > of dentries. What happens is the following:
> >
> > There are several processes accessing the hugedir directory - about 16 in
> > the case I was inspecting. When the directory changes on the server all
> > these 16 processes quickly enter d_invalidate() -> shrink_dcache_parent()
>
> The first thing d_invalidate() does is check whether the dentry is unhashed:
> if so it returns, otherwise it unhashes it. So only the d_invalidate() that
> won the race for d_lock is going to invoke shrink_dcache_parent(); the
> others will return immediately.
>
> What am I missing?

If it's an old kernel (<4.18), it might be missing commit
ff17fa561a04 ("d_invalidate(): unhash immediately")
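
To illustrate the "unhash first, then shrink" ordering I described above,
here is a rough user-space toy model. It is not the kernel code: toy_dentry,
toy_invalidate() and the shrink_calls counter are made-up names, and an
atomic exchange stands in for taking d_lock and unhashing. The only point is
that of N callers, just the one that finds the entry still hashed goes on to
the expensive walk.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dentry; NOT the kernel's struct dentry. */
struct toy_dentry {
    atomic_bool hashed;   /* true while the entry is still hashed */
    int shrink_calls;     /* how many callers reached the expensive walk */
};

/*
 * Toy version of the check: unhash immediately, and bail out if somebody
 * else already did. Returns true only for the caller that won the race.
 */
static bool toy_invalidate(struct toy_dentry *d)
{
    /* Atomically clear the flag and look at its previous value
     * (assumption: this models "take d_lock, test, unhash"). */
    if (!atomic_exchange(&d->hashed, false))
        return false;     /* already unhashed: return immediately */

    d->shrink_calls++;    /* stands in for shrink_dcache_parent() */
    return true;
}
```

With 16 callers hitting the same toy_dentry, only the first one to do the
exchange bumps shrink_calls; the rest return right away. If the unhash only
happens after the shrink, every caller gets past the check and walks the
children, which would match what Jan is seeing.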

Thanks,
Miklos