On Mon, Jun 30, 2025 at 06:48:40PM +1000, NeilBrown wrote: > Checking the name first is definitely cleaner. The fact that > d_in_lookup() allows the rest to be short-circuited is neat but > certainly deserves the comment. FWIW, I wonder what would be the best tree for the commit below; it's a resurrected patch from last year, slightly updated to cover the in-lookup vs. ->d_compare(). By default it'll probably go into viro/vfs.git #work.misc... [PATCH] documenting RCU pathwalk exposure from filesystems' POV Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> --- diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 11a599387266..ce11c9bde4b3 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -41,6 +41,8 @@ algorithms work. caching/index + rcu-exposure + porting Filesystem support layers diff --git a/Documentation/filesystems/rcu-exposure.rst b/Documentation/filesystems/rcu-exposure.rst new file mode 100644 index 000000000000..2c60f115a267 --- /dev/null +++ b/Documentation/filesystems/rcu-exposure.rst @@ -0,0 +1,426 @@ +=================================================================== +The ways in which RCU pathwalk can ruin filesystem maintainer's day +=================================================================== + +Pathname resolution fast path: the scarier conditions for filesystem code +========================================================================= + +The basic safety assumption for any multithreaded OO code is that +objects passed to a function will not have been already destroyed by +another thread. It may guaranteed by various mechanisms (refcounting, +some form of exclusion, not having pointers to the object visible to other +threads, etc.), but no matter how it's done, the caller is expected to +have done that. + +Usually one wants a stronger warranty - that destructors of any objects +passed to a function would not be called by other threads until the +function has either done something that explicitly allows that to happen +(e.g. dropped a reference to refcounted object that had been passed +to it, stored a pointer in a shared data structure, etc.) or finished +its execution. + +Filesystem methods **usually** can count upon such warranties regarding +the lifetime of objects they are called to act upon. + +However, there is one important case where providing such warranties +is rather costly. On the fast path in pathname resolution (the case +when everything we need is in VFS caches), grabbing and dropping dentry +references for every visited directory would cause unpleasant scalability +issues. + +The problem is that access patterns are heavily biased; every system call +getting an absolute pathname will have to start at root directory, etc. +Having each of them in effect write "I'd been here" on the same memory +objects would cost quite a bit. + +To deal with that we try to keep the fast path stores-free, bumping +no refcounts and taking no locks. Details are described elsewhere +(Documentation/filesystems/path-lookup.txt), but the bottom line for +filesystems is that the methods involved in fast path of pathname +resolution may be called with much looser warranties than usual. +The caller deliberately does *not* take the steps that would normally +protect the object from being torn down by another thread while the +method is trying to work with it. This applies not just to dentries - +associated inodes and even the filesystem instance the object belongs +to could be in process of getting torn down (yes, really).\ [#f0]_ +Of course, from the filesystem point of view, every call like that is +a potential source of headache. + +Such unsafe method calls do get *some* warranties. Before going into +the details, keep in mind that + +* few methods are affected, and their defaults (i.e. what we get if the + method is left NULL) are safe. If your filesystem does not need to + override any of those defaults, you are fine (there is a minor nit + regarding the cached fast symlinks, but that's easy to take care of). + +* for most of those methods there is a way to bail out and tell VFS to + leave the fast path and switch to holding proper references from + that point on. That's what has to happen when an instance sees that it + can't go on without a blocking operation, but you can use the same + mechanism to bail out as soon as you see an unsafe call. + + +Which methods are affected? +=========================== + + The list of the methods that could run into that fun: + +======================== ================================== =================== ==================== + method indication that the call is unsafe unprotected objects bailout return value +======================== ================================== =================== ==================== +->d_hash(d, ...) none - any call might be d +->d_compare(d, ...) none - any call might be d +->d_revalidate(d, f) f & LOOKUP_RCU d -ECHILD +->d_manage(d, f) f d -ECHILD +->permission(i, m) m & MAY_NOT_BLOCK i -ECHILD +->get_link(d, i, ...) d == NULL i ERR_PTR(-ECHILD) +->get_inode_acl(i, t, f) f == LOOKUP_RCU i ERR_PTR(-ECHILD) +======================== ================================== =================== ==================== + +Additionally, callback set by set_delayed_call() from unsafe call of +->get_link() will be run in the same environment; that one is usually not +a problem, though. + +For the sake of completeness, three of LSM methods +(->inode_permission(), ->inode_follow_link() and ->task_to_inode()) +might be called in similar environment, but that's a problem for LSM +crowd, not for filesystem folks. + + +Opting out +========== + +To large extent a filesystem can opt out of RCU pathwalk; that loses all +scalability benefits whenever your filesystem gets involved in pathname +resolution, though. If that's the way you choose to go, just make sure +that + +1. any non-default ->d_revalidate(), ->permission(), ->get_link() and + ->get_inode_acl() instance bails out if called by RCU pathwalk (see + below for details). Costs a couple of lines of boilerplate in each. + +2. if some symlink inodes have ->i_link set to a dynamically allocated + object, that object won't be freed without an RCU delay. Anything + coallocated with inode is fine, so's anything freed from ->free_inode(). + Usually comes for free, just remember to avoid freeing directly + from ->destroy_inode(). + +3. any ->d_hash() and ->d_compare() instances (if you have those) do + not access any filesystem objects. + +4. there's no ->d_manage() instances in your filesystem. + +If your case does not fit the above, the easy opt-out is not for you. +If so, you'll have to keep reading... + + +What is guaranteed and what is required? +======================================== + +Any method call is, of course, required not to crash - no stepping on +freed memory, etc. All of the unsafe calls listed above are done under +rcu_read_lock(), so they are not allowed to block. Further requirements +vary between the methods. + +Before going through the list of affected methods, several notes on +the things that *are* guaranteed: + +* if a reference to struct dentry is passed to such call, it will + not be freed until the method returns. The same goes for a reference to + struct inode and to struct super_block pointed to by ->d_sb or ->i_sb + members of dentry and inode respectively. Any of those might be in + process of being torn down or enter such state right under us; + the entire point of those unsafe calls is that we make them without + telling anyone they'd need to wait for us. + +* following ->d_parent and ->d_inode of such dentries is fine, + provided that it's done by READ_ONCE() (for ->d_inode the preferred + form is d_inode_rcu(dentry)). The value of ->d_parent is never going + to be NULL and it will again point to a struct dentry that will not be + freed until the method call finishes. The value of ->d_inode might + be NULL; if non-NULL, it'll be pointing to a struct inode that will + not be freed until the method call finishes. + +* none of the inodes passed to an unsafe call could have reached + fs/inode.c:evict() before the caller grabbed rcu_read_lock(). + +* for inodes 'not freed' means 'not entered ->free_inode()', so + anything that won't be destroyed until ->free_inode() is safe to access. + Anything synchronously destroyed in ->evict_inode() or ->destroy_inode() + is not safe; however, one can count upon the call_rcu() callbacks + issued in those yet to be entered. Note that unlike dentries and + superblocks, inodes are embedded into filesystem-private objects; + anything stored directly in the containing object is safe to access. + +* for dentries anything destroyed by ->d_prune() (synchronously or + not) is not safe; the same goes for the things synchronously destroyed + by ->d_release(). However, call_rcu() callbacks issued in ->d_release() + are yet to be entered. + +* for superblocks we can count upon call_rcu() callbacks issued + from inside the ->kill_sb() (including the ones issued from + ->put_super()) yet to be entered. You can also count upon + ->s_user_ns still being pinned and ->s_security still not + freed. + +* NOTE: we **can not** count upon the things like ->d_parent + being positive (or a directory); a race with rename()+rmdir()+mknod() + and you might find a FIFO as parent's inode. NULL is even easier - + just have the dentry and its ex-parent already past dentry_kill() + (which is a normal situation for eviction on memory pressure) and there + you go. Normally such pathologies are prevented by the locking (and + dentry refcounting), but... the entire point of that stuff is to avoid + informing anyone that we are there, so those mechanisms are bypassed. + What's more, if dentry is not pinned by refcount, grabbing its ->d_lock + will *not* suffice to prevent that kind of mess - the scenario with + eviction by memory pressure won't be prevented by that; you might have + grabbed ->d_lock only after the dentry_kill() had released it, and + at that point ->d_parent still points to what used to be the parent, + but there's nothing to prevent its eviction. + + +->d_compare() +------------- + +For ->d_compare() we just need to make sure it won't crash when called +for dying dentry - an incorrect return value won't harm the caller in +such case. False positives and false negatives alike - the callers take +care of that. To be pedantic, make that "false positives do not cause +problems unless they have ->d_manage()", but ->d_manage() is present +only on autofs and there's no autofs ->d_compare() instances.\ [#f1]_ + +There is no indication that ->d_compare() is called in RCU mode; +the majority of callers are such, anyway, so we need to cope with that. +VFS guarantees that dentry won't be freed under us; the same goes for +the superblock pointed to by its ->d_sb. Name points to memory object +that won't get freed under us and length does not exceed the size of +that object. The contents of that object is *NOT* guaranteed to be +stable; d_move() might race with us, modifying the name. However, in +that case we are free to return an arbitrary result - the callers will +take care of both false positives and false negatives in such case. +The name we are comparing dentry with (passed in qstr) is stable, +thankfully... + +If we need to access any other data, it's up to the filesystem +to protect it. In practice it means that destruction of fs-private part +of superblock (and possibly unicode tables hanging off it, etc.) might +need to be RCU-delayed. + +*IF* you want the behaviour that varies depending upon the parent +directory, you get to be very careful with READ_ONCE() and watch out +for the object lifetimes. + +A dying dentry is not the only unpleasantness we can run into - it's +possible to run into an in-lookup dentry. If ->d_compare() instance +does not care about anything we might set up during ->lookup(), this +is not a problem; if it does, the things get delicate. Only one in-tree +instance is pathological to that extent (proc_sys_compare()). False +positives are permissible in that case; false negatives are not. + +Basically, if the things get that tricky, ask for help. +Currently there are two such instances in the tree - proc_sys_compare() +and generic_ci_d_compare(). Both are... special. + +->d_hash() +---------- + +For ->d_hash() on a dying dentry we are free to report any hash +value; the only extra requirement is that we should not return stray +hard errors. In other words, if we return anything other than 0 or +-ECHILD, we'd better make sure that this error would've been correct +before the parent started dying. Since ->d_hash() error reporting is +usually done to reject unacceptable names (too long, contain unsuitable +characters for this filesystem, etc.), that's really not a problem - +hard errors depend only upon the name, not the parent. + +Again, VFS guarantees that freeing of dentry and of the superblock +pointed to by dentry->d_sb won't happen under us. The name passed to +us (in qstr) is stable. If you need anything beyond that, you are +in the same situation as with ->d_compare(). Might want to RCU-delay +freeing private part of superblock (if that's what we need to access), +might want the same for some objects hanging off that (unicode tables, +etc.). If you need something beyond that - ask for help. + + +->d_revalidate() +---------------- + +For this one we do have an indication of call being unsafe - +flags & LOOKUP_RCU. With ->d_revalidate we are always allowed to bail +out and return -ECHILD; that will have the caller drop out of RCU mode. +We definitely need to do that if revalidate would require any kind of IO, +mutex-taking, etc.; we can't block in RCU mode. + +Quite a few instances of ->d_revalidate() simply treat LOOKUP_RCU +in flags as "return -ECHILD and be done with that"; it's guaranteed to +do the right thing, but you lose the benefits of RCU pathwalks whenever +you run into such dentry. + +Same as with the previous methods, we are guaranteed that +dentry and dentry->d_sb won't be freed under us. We are also guaranteed +that ->d_parent (which is *not* stable, so use READ_ONCE) points to a +struct dentry that won't get freed under us. As always with ->d_parent, +it's not NULL - for a detached dentry it will point to dentry itself. +d_inode_rcu() of dentry and its parent will be either NULL or will +point to a struct inode that won't get freed under us. Anything beyond +than that is not guaranteed. We may find parent to be negative - it can +happen if we race with d_move() and removal of old parent. In that case +just return -ECHILD and be done with that. + +On non-RCU side you could use dget_parent() instead - that +would give a positive dentry and its ->d_inode would remain stable. +dget_parent() has to be paired with dput(), though, so it's not usable +in RCU mode. + +If you need fs-private objects associated with dentry, its parent +inode(s) or superblock - see the general notes above on how to access +those. + + +->d_manage() +------------ + +Can be called in RCU mode; gets an argument telling it if it has +been called so. Pretty much autofs-only; for everyone's sanity sake, +don't inflict more of those on the kernel. Definitely don't do that +without asking first... + + +->permission() +-------------- + +Can be called in RCU mode; that is indicated by MAY_NOT_BLOCK +in mask, and it can only happen for MAY_EXEC checks on directories. +In RCU mode it is not allowed to block, and it is allowed to bail out +by returning -ECHILD. It might be called for an inode that is getting +torn down, possibly along with its filesystem. Errors other than -ECHILD +should only be returned if they would've been returned in non-RCU mode; +several instances in procfs did (prior to 6.8-rc6) run afoul of that one. +That's an instructive example, BTW - what happens is that proc_pid_permission() +uses proc_get_task() to find the relevant process. proc_get_task() +uses PID reference stored in struct proc_inode our inode is embedded +into; inode can't have been freed yet, so fetching ->pid member in that +is safe. However, using the value you've fetched is a different story +- proc_evict_inode() would have passed it to put_pid() and replaced +it with NULL. Unsafe caller has no way to tell if that is happening +right under it. Solution: stop zeroing ->pid in proc_evict_inode() +and move put_pid() from proc_pid_evict_inode() to proc_free_inode(). +That's not all that is needed (there's access to procfs-private part of +superblock as well), but it does make a good example of how such stuff +can be dealt with. + +Note that idmap argument is safe on all calls - its destruction +is rcu-delayed. + +The amount of headache is seriously reduced (for now) by the fact +that a lot of instances boil down to generic_permission() (which will +do the right thing in RCU mode) when mask is MAY_EXEC | MAY_NOT_BLOCK. +If we ever extend RCU mode to other ->permission() callers, the thing will +get interesting; that's not likely to happen, though, unless access(2) +goes there [this is NOT a suggestion, folks]. + + +->get_link() +------------ + +Again, this can be called in RCU mode. Even if your ->d_revalidate() +always returns -ECHILD in RCU mode and kicks the pathwalk out of it, +you can't assume that ->get_link() won't be reached.\ [#f2]_ + +NULL dentry argument is an indicator of unsafe call; if you can't handle +it, just return ERR_PTR(-ECHILD). Any allocations you need to do (and +with this method you really might need that) should be done with GFP_ATOMIC +in the unsafe case. + +Whatever you pass to set_delayed_call() is going to be called +in the same mode as ->get_link() itself; not a problem for most of the +instances. The string you return needs to stay there until the +callback gets called or, if no callback is set, until at least the +freeing of inode. As usual, for an unsafe call the inode might be +in process of teardown, possibly along with the hosting filesystem. +The usual considerations apply. The same, BTW, applies to whatever +you set in ->i_link - it must stay around at least until ->free_inode(). + + +->get_inode_acl() +----------------- + +Very limited exposure for that one - unsafe call is possible +only if you explicitly set ACL_DONT_CACHE as cached ACL value. +Only two filesystems (fuse and overlayfs) even bother. Unsafe call +is indicated by explicit flag (the third argument of the method), +bailout is done by returning ERR_PTR(-CHILD) and the usual considerations +apply for any access to data structures you might need to do. + +.. rubric:: Footnotes + +.. [#f0] + + The fast path of pathname resolution really can run into a dentry on + a filesystem that is getting shut down. + + Here's one of the scenarios for that to happen: + + 1. have two threads sharing fs_struct chdir'ed on that filesystem. + 2. lazy-umount it, so that the only thing holding it alive is + cwd of these threads. + 3. the first thread does relative pathname resolution + and gets to e.g. ->d_hash(). It's holding rcu_read_lock(). + 4. at the same time the second thread does fchdir(), moving to + different directory. + + In fchdir(2) we get to set_fs_pwd(), which set the current directory + to the new place and does mntput() on the old one. No RCU delays here, + we calculate the refcount of that mount and see that we are dropping + the last reference. We make sure that the pathwalk in progress in + the first thread will fail when it comes to legitimize_mnt() and do this + (in mntput_no_expire()):: + + init_task_work(&mnt->mnt_rcu, __cleanup_mnt); + if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) + return; + + As we leave the syscall, we have __cleanup_mnt() run; it calls + cleanup_mnt() on our mount, which hits deactivate_super(). That was + the last reference to superblock. + + Voila - we have a filesystem shutdown right under the nose of + a thread running in ->d_hash() of something on that filesystem. + Mutatis mutandis, one can arrange the same for other methods called + by rcu pathwalk. + + It's not easy to hit (especially if you want to get through the + entire ->kill_sb() before the first thread gets through ->d_hash()), + and it's probably impossible on the real hardware; on KVM it might be + borderline doable. However, it is possible and I would not swear that + other ways of arranging the same thing are equally hard to hit. + + The bottom line: methods that can be called in RCU mode need to be + careful about the per-superblock objects destruction. + +.. [#f1] + + Some callers prevent being called for dying dentry (holding ->d_lock + and having verified !d_unhashed() or finding it in the list of inode's + aliases under ->i_lock). For those the scenario in question simply + cannot arise. + + Some follow the match with lockref_get_not_dead() and treat the failure + as mismatch. That takes care of false positives, and false negatives + on dying dentry are still correct - we simply pretend to have lost + the race. + + The only caller that does not fit into the classes above is + __d_lookup_rcu_op_compare(). There we sample ->d_seq and verify + !d_unhashed() before calling ->d_compare(). That is not enough to + prevent dentry from starting to die right under us; however, the + sampled value of ->d_seq will be rechecked when the caller gets to + step_into(), so for a false positive we will end up with a mismatch. + The corner case around ->d_manage() is due to the handle_mounts() + done before step_into() gets to ->d_seq validation... + +.. [#f2] + + binding a symlink on top of a regular file on another filesystem is + possible and that's all it takes for RCU pathwalk to get there.