Hello,

we have a customer who is mounting a directory over NFS (let's call it hugedir) with many files (there are several million dentries on the d_children list). Now when they do 'mv hugedir hugedir.bak; mkdir hugedir' on the server, which invalidates the NFS cache of this directory, NFS clients get stuck in d_invalidate() for hours (until the customer lost patience). I don't want to discuss the sanity or efficiency of this application architecture here, but I share the opinion that it shouldn't take hours to invalidate a couple million dentries.

Analysis of the crashdump revealed that d_invalidate() can have O(n^2) complexity in the number of dentries it is invalidating, which leads to impractical times when invalidating large numbers of dentries. What happens is the following:

There are several processes accessing the hugedir directory - about 16 in the case I was inspecting. When the directory changes on the server, all these 16 processes quickly enter d_invalidate() -> shrink_dcache_parent() (actually the kernel is old so the function names are somewhat different, but the same seems to be applicable to the current upstream kernel AFAICT). In shrink_dcache_parent() we use d_walk() to collect dentries to invalidate. The processes doing d_walk() are synchronized on hugedir->d_lock so they operate sequentially.

Now the select_collect() handler returns D_WALK_QUIT as soon as need_resched() is set and it has moved at least one dentry to the dispose list. Hence the first task that does d_walk() moves around 90k entries to its dispose list before it decides to take a break. The second task has somewhat harder work - it needs to skip over those 90k entries moved to the dispose list (as they are still in the hugedir->d_children list) and only then starts moving dentries to its own dispose list. It managed to move 50k dentries before it decided to reschedule. Finally, from about the fifth task that called d_invalidate() onwards, each task always spends its whole timeslice skipping over already moved dentries in the hugedir->d_children list, then moves one dentry to its dispose list and aborts scanning to reschedule. So we end up in a situation where a few tasks have tens of thousands of entries in their dispose lists and a larger number of tasks have only one dentry in their dispose lists.

What happens next is that tasks eventually get scheduled again and proceed to process the dentries in their dispose lists (in fact, in the current kernel the first scheduling point seems to be in __dentry_kill() when we are evicting the first dentry from the dispose list). If this was a task that had only a single dentry in its dispose list, it rather quickly evicts it and cycles back to d_walk() - only to scan hugedir->d_children again, skip a huge number of already moved dentries, and finally move one more dentry to its dispose list. While this is happening, all the other tasks that want to evict dentries from their dispose lists are spinning on hugedir->d_lock. So overall progress is very slow because we always evict only a few dentries and then someone gets to scanning hugedir->d_children, blocks everybody for a whole timeslice, only to move one more dentry to its dispose list...
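To make this concrete, below is a heavily simplified sketch of the collect-and-dispose pattern as I read it. It mirrors the shape of select_collect() and shrink_dcache_parent() but it is illustrative pseudo-C only, not the actual upstream code - locking, refcounting, and the handling of unhashed / negative dentries are all glossed over:

/*
 * Illustrative pseudo-C only, simplified from my reading of fs/dcache.c;
 * not the actual upstream code.
 */
struct select_data {
        struct dentry *start;
        long found;
        struct list_head dispose;
};

static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
{
        struct select_data *data = _data;

        /*
         * Dentry already sits on some other task's dispose list - we can
         * only skip it.  This is the part every later walker has to wade
         * through again and again.
         */
        if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                data->found++;
        } else if (!dentry->d_lockref.count) {
                /* Unused dentry - move it to our private dispose list. */
                d_shrink_add(dentry, &data->dispose);
                data->found++;
        }

        /*
         * Once we have collected at least one dentry we may quit the walk
         * as soon as a reschedule is pending, which is what allows a task
         * to bail out after moving just a single dentry.
         */
        if (!list_empty(&data->dispose) && need_resched())
                return D_WALK_QUIT;
        return D_WALK_CONTINUE;
}

void shrink_dcache_parent(struct dentry *parent)
{
        for (;;) {
                struct select_data data = { .start = parent };

                INIT_LIST_HEAD(&data.dispose);
                /* d_walk() serializes all walkers on parent->d_lock. */
                d_walk(parent, &data, select_collect);

                if (!list_empty(&data.dispose)) {
                        /* Evict what we collected, then rescan d_children
                         * from the head all over again. */
                        shrink_dentry_list(&data.dispose);
                        continue;
                }

                cond_resched();
                if (!data.found)
                        break;          /* nothing left to invalidate */
        }
}

The key point is that every iteration of the outer loop rescans hugedir->d_children from the head, skipping everything that other tasks have already claimed for their dispose lists.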
The question is how to best fix this very inefficient behavior. One possible approach would be to somehow completely serialize the shrink_dcache_parent() operation for a directory. That way we'd avoid the expensive rescanning of already moved dentries as well as the contention on d_lock between tasks. On the other hand, deep directory hierarchies can possibly benefit from multiple tasks reaping them, so it isn't a clear universal win.

Another possible approach is to remember the position in the d_children list where we stopped scanning when we decided to reschedule (I suppose we could use DCACHE_DENTRY_CURSOR for that). This way there would still be some amount of skipping of already moved dentries, but at least the overall complexity would go from O(n^2) to O(t*n) where 't' is the number of tasks trying to invalidate the directory, so I think the performance would be acceptable. A completely untested sketch of the shape I have in mind is appended at the end of this mail.

What do people think about these solutions? Anything I'm missing?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
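For illustration, the completely untested sketch I mentioned above. The 'cursor' member and the way it is used are my own invention layered on top of the existing select_data / d_walk() machinery, and the other half - teaching d_walk() to resume scanning from the cursor instead of from the head of d_children - is not shown:

/*
 * Completely untested sketch, only to show the shape of the cursor idea:
 * park a DCACHE_DENTRY_CURSOR dentry in parent->d_children at the point
 * where we gave up, the way readdir cursors already sit in that list.
 */
struct select_data {
        struct dentry *start;
        struct dentry *cursor;          /* resume point, preallocated by caller */
        long found;
        struct list_head dispose;
};

static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
{
        struct select_data *data = _data;

        if (dentry->d_flags & (DCACHE_SHRINK_LIST | DCACHE_DENTRY_CURSOR)) {
                /* Someone else's dispose entry or another cursor - skip. */
                data->found++;
        } else if (!dentry->d_lockref.count) {
                d_shrink_add(dentry, &data->dispose);
                data->found++;
        }

        if (!list_empty(&data->dispose) && need_resched()) {
                /*
                 * Instead of forgetting where we got to, leave our cursor
                 * right behind 'dentry' in the parent's d_children list so
                 * the next scan does not have to re-skip everything before
                 * this point.  d_walk() already holds the parent's d_lock
                 * here, so the list_move() itself should be safe.
                 */
                list_move(&data->cursor->d_sib, &dentry->d_sib);
                return D_WALK_QUIT;
        }
        return D_WALK_CONTINUE;
}

Cursor cleanup (dropping it from d_children once the task is done or the directory goes away) and the interaction with the readdir cursors that already live in that list would obviously need more thought.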