> On Jun 26, 2025, at 3:51 PM, NeilBrown <neil@xxxxxxxxxx> wrote: [...] >> Unfortunately, the BPF use case is more complicated. In some cases, >> the callback function cannot be call in rcu critical sections. For >> example, the callback may need to read xatter. For these cases, we >> we cannot use RCU walk at all. > > I really think you should stop using the terms RCU walk and ref-walk. I > think they might be focusing your thinking in an unhelpful direction. > > The key issue about reading xattrs is that it might need to sleep. > Focusing on what might need to sleep and what will never need to sleep > is a useful approach - the distinction is wide spread in the kernel and > several function take a flag indicating if they are permitted to sleep, > or if failure when sleeping would be required. > > So your above observation is better described as > > The vfs_walk_ancestors() API has an (implicit) requirement that the > callback mustn't sleep. This is a problem for some use-cases > where the call back might need to sleep - e.g. for accessing xattrs. > > That is a good and useful observation. I can see three possibly > responses: > > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is > always allowed to sleep. I don't particularly like this approach. > > 2/ Use repeated calls to vfs_walk_parent() when the handling of each > ancestor might need to sleep. I see no problem with supporting both > vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of > precedent for having different interfaces for different use cases. I prefer option 2. > > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. > If the callback finds that it needs to sleep but that "may sleep" > isn't set, it returns some well known status, like -EWOULDBLOCK (or > -ECHILD). It can expect to be called again but with "may sleep" set. > This is my preferred approach. There is precedent with the > d_revalidate callbacks which works like this. > I suspect that accessing xattrs might often be possible without > sleeping. It is conceivable that we could add a "may sleep" argument > to vfs_getxattr() so that it could still often be used without > requiring vfs_walk_ancestors() to permit sleeping. > This would almost certainly require a clear demonstration that > there was a performance cost in not having the option of non-sleeping > vfs_getxattr(). For built-in kernel code, I can see this works. However, I don’t see why it is necessary to introduce the extra complexity of -EWOULDBLOCK, and vfs_get_xattr_cannot_sleep, etc. A separate step-by-step walking API is much cleaner. > >>> I strongly suggest you stop thinking about rcu-walk vs ref-walk. Think >>> about the needs of your code. If you need a high-performance API, then >>> ask for a high-performance API, don't assume what form it will take or >>> what the internal implementation details will be. >> >> At the moment, we need a ref-walk API on the BPF side. The RCU walk >> is a totally separate topic. > > Do you mean "we need step-by-step walking" or do you mean "we need to > potentially sleep for each ancestor"? These are conceptually different > requirements, but I cannot tell which you mean when you talk about "RCU > walk”. To be extra clear, I mean we need "step-by-step and take-reference-on-each-step walking”, for existing use cases. In the future, if it is possible to have a “do-not-take-reference, cannot-sleep, callback-based walking”. We may try to use that for some use cases. But that won’t replace step-by-step walking for all users. Thanks, Song