On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote: > On Fri, 27 Jun 2025, Song Liu wrote: > > > > > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@xxxxxxxxxx> wrote: > > > > [...] > > > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() > > >> initially, so some clarification: > > >> > > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some > > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. > > >> A reliable step-by-step ref-walk, like this set, works well with > > >> BPF, and we want to keep it. > > > > > > The distinction between rcu-walk and ref-walk is an internal > > > implementation detail. You as a caller shouldn't need to think about > > > the difference. You just want to walk. Note that LOOKUP_RCU is > > > documented in namei.h as "semi-internal". The only uses outside of > > > core-VFS code is in individual filesystem's d_revalidate handler - they > > > are checking if they are allowed to sleep or not. You should never > > > expect to pass LOOKUP_RCU to an VFS API - no other code does. > > > > > > It might be reasonable for you as a caller to have some control over > > > whether the call can sleep or not. LOOKUP_CACHED is a bit like that. > > > But for dotdot lookup the code will never sleep - so that is not > > > relevant. > > > > Unfortunately, the BPF use case is more complicated. In some cases, > > the callback function cannot be call in rcu critical sections. For > > example, the callback may need to read xatter. For these cases, we > > we cannot use RCU walk at all. > > I really think you should stop using the terms RCU walk and ref-walk. I > think they might be focusing your thinking in an unhelpful direction. Thank you! I really appreciate you helping to shape this API and it aligns a lot with my thinking. > The key issue about reading xattrs is that it might need to sleep. > Focusing on what might need to sleep and what will never need to sleep > is a useful approach - the distinction is wide spread in the kernel and > several function take a flag indicating if they are permitted to sleep, > or if failure when sleeping would be required. > > So your above observation is better described as > > The vfs_walk_ancestors() API has an (implicit) requirement that the > callback mustn't sleep. This is a problem for some use-cases > where the call back might need to sleep - e.g. for accessing xattrs. > > That is a good and useful observation. I can see three possibly > responses: > > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is > always allowed to sleep. I don't particularly like this approach. Agreed. > > 2/ Use repeated calls to vfs_walk_parent() when the handling of each > ancestor might need to sleep. I see no problem with supporting both > vfs_walk_ancestors() and vfs_walk_parent(). There is plenty of > precedent for having different interfaces for different use cases. Meh. > > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback. I think that's fine.