On 6/24/25 6:15 PM, NeilBrown wrote:
> On Wed, 25 Jun 2025, Chuck Lever wrote:
>> What is more interesting to me is trying out more sophisticated
>> abstract data types for the DRC hashtable. rhashtable is one
>> alternative; so is Maple tree, which is supposed to handle lookups
>> with more memory bandwidth efficiency than walking a linked list.
>>
>
> While I generally like rhashtable there is an awkwardness. It doesn't
> guarantee that an insert will always succeed. If you get lots of new
> records that hash to the same value, it will start failing insert
> requests until it has re-hashed the table with a new seed.

Hm. I hadn't thought of that.

> This is intended to defeat collision attacks. That means we would
> need to drop requests sometimes.

Maybe that is OK. The DRC could be the target of collision attacks, so
maybe we really do want to drop requests if rhashtable refuses to
store them.

Well, I can imagine that, in a large cohort of clients, there is a
pretty good probability of non-malicious XID collisions due to the
birthday paradox.

> I think the other area that could use improvement is pruning old
> entries. I would not include RC_INPROG entries in the lru at all -
> they are always ignored, and will be added when they are switched
> to RC_DONE.

That sounds intriguing.

> I'd generally like to prune less often in larger batches, but
> removing each of the batch from the rbtree could hold the lock for
> longer than we would like.

Have a look at 8847ecc9274a ("NFSD: Optimize DRC bucket pruning").
Pruning frequently by small amounts seems to have the greatest
benefit. It certainly does keep request latency jitter down, since
NFSD prunes while the client is waiting. If we can move some
management of the cache to after the reply is sent, that might offer
opportunities to prune more aggressively without impacting server
responsiveness.

> I wonder if we could have an 'old' and a 'new' rbtree and when the
> 'old' gets too old or the 'new' gets too full, we extract 'old',
> move 'new' to 'old', and outside the spinlock we free all of the
> moved 'old'.

One observation I've had is that nearly every DRC lookup will fail to
find an entry that matches the XID, because when things are operating
smoothly, every incoming RPC contains an XID that hasn't been seen
before. That means DRC lookups walk the entire bucket in the common
case.

Pointer chasing of any kind is a well-known ADT performance killer. My
experience with the kernel's rbtree is that it does not perform well,
due to the number of memory accesses needed for lookups. This is why I
suggested using rhashtable -- it makes an effort to keep bucket sizes
small by widening the table frequently. The downside is that this will
definitely introduce some latency when an insertion triggers a
table-size change.

What might be helpful is a per-bucket Bloom filter that would make
checking whether an XID is in the hashed bucket an O(1) operation --
and in particular, one that requires few, if any, pointer
dereferences.

> But if we switched to rhashtable, we probably wouldn't need an lru -
> just walk the entire table occasionally - there would be little
> conflict with concurrent lookups.

When the DRC is at capacity, pruning needs to find something to evict
on every insertion. My thought is that a pruning walk would need to be
done quite frequently to ensure clients don't overrun the cache. Thus
attention needs to be paid to keeping pruning efficient (although
perhaps an LRU isn't the only choice here).
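
To make a few of these ideas concrete, some rough, untested sketches
follow. Any identifier that isn't in today's nfscache.c (the c_rhash
field, drc_rht_params, and the drc_* helpers especially) is invented
for illustration.

First, dropping on insert failure. rhashtable_lookup_get_insert_fast()
is attractive here because it also gives us retransmission detection
in the same atomic step. Keying on the XID alone for the sketch; real
code would need an obj_cmpfn that also folds in the client address and
payload checksum, exactly because of the collision concern above.

	#include <linux/rhashtable.h>

	static const struct rhashtable_params drc_rht_params = {
		.key_len	= sizeof(__be32),
		.key_offset	= offsetof(struct nfsd_cacherep, c_key.k_xid),
		.head_offset	= offsetof(struct nfsd_cacherep, c_rhash),
		.automatic_shrinking = true,
	};

	static int drc_try_insert(struct rhashtable *ht,
				  struct nfsd_cacherep *rp)
	{
		struct nfsd_cacherep *found;

		found = rhashtable_lookup_get_insert_fast(ht, &rp->c_rhash,
							  drc_rht_params);
		if (IS_ERR(found))
			return RC_DROPIT;	/* re-hash pending or OOM;
						 * client will retransmit */
		if (found)
			return RC_REPLY;	/* duplicate: cached reply */
		return RC_DOIT;			/* new entry: process it */
	}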
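
On the birthday point, a quick back-of-the-envelope (my arithmetic,
worth checking). Treating cached XIDs as uniformly random -- roughly
right across many clients that each start from a random XID -- with k
entries cached at once:

	P(at least one XID collision) ~= 1 - exp(-k^2 / 2^33)

That crosses 50% around k = 77,000 and reaches about 69% at
k = 100,000. So a busy server with a six-figure entry count should
expect duplicate XIDs from well-behaved clients; an insert collision
is not by itself evidence of an attack. (The full DRC key also checks
the client address and a payload checksum, so an XID collision is not
a false retransmission hit -- but it does land in the same bucket,
since nfsd hashes on the XID alone.)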
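
Keeping RC_INPROG entries off the lru could look something like this
in the completion path, along the lines of today's nfsd_cache_update()
(again a sketch; error/NOCACHE paths elided):

	/*
	 * An entry joins the bucket LRU only on the RC_INPROG ->
	 * RC_DONE transition, so pruning never has to skip over
	 * busy entries.
	 */
	static void drc_mark_done(struct nfsd_drc_bucket *b,
				  struct nfsd_cacherep *rp)
	{
		spin_lock(&b->cache_lock);
		rp->c_state = RC_DONE;
		list_add_tail(&rp->c_lru, &b->lru_head);	/* first LRU add */
		spin_unlock(&b->cache_lock);
	}

One wrinkle: entries that never reach RC_DONE (dropped or uncacheable
requests) would need a separate teardown path, since the lru walk
would never find them.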
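
And your two-generation idea -- swap under the lock, free outside it --
might be roughly the following, where the 'old' and 'new' bucket
fields are invented and I assume nothing still references entries in
the retired tree:

	#include <linux/rbtree.h>

	static void drc_retire_old(struct nfsd_drc_bucket *b)
	{
		struct nfsd_cacherep *rp, *tmp;
		struct rb_root doomed;

		spin_lock(&b->cache_lock);
		doomed = b->old;	/* struct copy of the root */
		b->old = b->new;
		b->new = RB_ROOT;
		spin_unlock(&b->cache_lock);

		/* teardown happens with the lock dropped */
		rbtree_postorder_for_each_entry_safe(rp, tmp, &doomed,
						     c_node)
			kmem_cache_free(drc_slab, rp);
	}

Lookups would have to probe both b->new and b->old between swaps,
which doubles the pointer chasing I'm complaining about below.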
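
Finally, the per-bucket Bloom filter. Since the DRC hashes on XID
alone, one u64 per bucket with two hash probes might be enough to turn
the common-case miss into a couple of loads. The 'bloom' bucket field
is invented; the constant is an arbitrary mixing value:

	#include <linux/bits.h>
	#include <linux/hash.h>

	/* two bit positions in a 64-bit filter, derived from the XID */
	static inline u64 drc_bloom_mask(__be32 xid)
	{
		u32 h = (__force u32)xid;

		return BIT_ULL(hash_32(h, 6)) |
		       BIT_ULL(hash_32(h ^ 0x9e3779b9, 6));
	}

	/* insert path: set the entry's bits */
	static inline void drc_bloom_add(struct nfsd_drc_bucket *b,
					 struct nfsd_cacherep *rp)
	{
		b->bloom |= drc_bloom_mask(rp->c_key.k_xid);
	}

	/* lookup path: a clear bit proves the XID is not in this bucket */
	static inline bool drc_bloom_may_contain(const struct nfsd_drc_bucket *b,
						 __be32 xid)
	{
		u64 mask = drc_bloom_mask(xid);

		return (b->bloom & mask) == mask;
	}

The catch is deletion: a plain Bloom filter can't forget, so the
filter would have to be rebuilt from the surviving entries whenever
the bucket is pruned (or aged out with the old/new trick above). False
positives just mean we fall back to walking the chain; false negatives
never happen.

--
Chuck Lever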