Am Di., 17. Juni 2025 um 17:42 Uhr schrieb Christian Brauner <brauner@xxxxxxxxxx>: > > Keep pidfs dentries around after a pidfd has been created for it. The > pidfs dentry will only be cleaned up once the struct pid gets reaped. > > The current scheme allocated pidfs dentries on-demand repeatedly. > This scheme is reaching it's limits as it makes it impossible to pin > information that needs to be available after the task has exited or > coredumped and that should not be lost simply because the pidfd got > closed temporarily. The next opener should still see the stashed > information. > > If someone opens a pidfd for a struct pid a pidfs dentry is allocated > and stashed in pid->stashed. Once the last pidfd for the struct pid is > closed the pidfs dentry is released and removed from pid->stashed. > > So if 10 callers create a pidfs for the same struct pid sequentially, > i.e., each closing the pidfd before the other creates a new one then a > new pidfs dentry is allocated every time. > > Because multiple tasks acquiring and releasing a pidfd for the same > struct pid can race with each another a task may still find a valid > pidfs entry from the previous task in pid->stashed and reuse it. Or it > might find a dead dentry in there and fail to reuse it and so stashes a > new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry > but only one will succeed, the other ones will put their dentry. > > The current scheme aims to ensure that a pidfs dentry for a struct pid > can only be created if the task is still alive or if a pidfs dentry > already existed before the task was reaped and so exit information has > been was stashed in the pidfs inode. > > That's great expect that it's buggy. If a pidfs dentry is stashed in > pid->stashed after pidfs_exit() but before __unhash_process() is called > we will return a pidfd for a reaped task without exit information being > available. > > The pidfds_pid_valid() check does not guard against this race as it > doens't sync at all with pidfs_exit(). The pid_has_task() check might be > successful simply because we're before __unhash_process() but after > pidfs_exit(). > > This switches to a scheme were pidfs entries are retained after a pidfd > was created for the struct pid. So when allocating a pidfds dentry an > extra reference is retained that is owned by the exit path and that will > be put once the task does get reaped. In the new model pidfs dentries > are still allocated on-demand but then kept until the task gets reaped. > > The synchronization mechanism uses the pid->wait_pidfd.lock in struct > pid to synchronize with pidfs_exit() called when the task is reaped. If > the path_from_stashed() fastpath fails, a new pidfs dentry is allocated > and afterwards the pid->wait_pidfd.lock is taken. If no other task > managed to stash its dentry there the callers will be stashed. > > When the task is reaped and calls pidfs_exit() the pid->wait_pidfd.lock > is taken. Once pidfs_exit() holds the pid->wait_pidfd.lock and sees that > no pidfs dentry is available in pid->stashed it knows that no new dentry > can be stashed while it holds the pid->wait_pidfd.lock. It thus sets a > ERR_PTR(-ESRCH) sentinel in pid->stashed. That sentinel allows > pidfs_stash_dentry() to detect that the struct pid has already been > reaped and refuse to stash a new dentry in pid->stashed. That works both > in the fast- and slowpath. > > This in turn allows us to fix the bug mentioned earlier where we hand > out a pidfd for a reaped task without having exit information set as we > now sync with pidfs_exit() and thus release_task(). > > This also has some subtle interactions with the path_from_stashed() > fastpath that need to be considered. The path_from_stashed() fast path > will try go get a reference to an already existing pidfs dentry in > pid->stashed to avoid having to allocate and stash a pidfs dentry. If it > finds a dentry in there it will return it. > > To not confuse path_from_stashed() pidfs_exit() must not replace a pidfs > dentry stashed in pid->stashed with the ERR_PTR(-ESRCH) sentinel as > path_from_stashed() could legitimately obtain another reference before > pidfs_exit() was able to call dput() to put the final pidfs dentry > reference. If it were to put the sentinel into pid->stashed it would > invalidate a struct pid even though a pidfd was just created for it. > > So if a pidfs dentry is stashed in pid->stashed pidfs_exit() must leave > clearing out pid->stashed to dentry->d_prune::pidfs_dentry_prune(). When > pruning a dentry we must take care to not take the pid->wait_pidfd.lock > as this would cause a lock inversion with dentry->d_lock in > pidfs_stash_dentry(). This should fortunately not at all be necessary as > by the time we call pidfs_dentry_prune() we know that the struct pid is > dead as the task is reaped and that anyone concurrently trying to get a > reference to the stashed dentry will fail to do so. > > IOW, it doesn't matter whether the path_from_stashed() fast path sees > NULL, a dead dentry, or the ERR_PTR(-ESRCH) sentinel in pid->stashed. > Any of those forces path_from_stashed() into the slowpath at which point > pid->wait_pidfd.lock must be acquired. The slowpath will then see either > a dead dentry or the ERR_PTR(-ESRCH) sentinel but never NULL and thus > fail the creation of a new pidfs dentry. > > path_from_stashed() must take care to not try and take a reference on > the ERR_PTR(-ESRCH) sentinel. So stashed_dentry_get() must be prepared > to see a ERR_PTR(-ESRCH) sentinel in pid->stashed. > > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> > --- > fs/internal.h | 2 + > fs/libfs.c | 22 +++++++-- > fs/pidfs.c | 143 +++++++++++++++++++++++++++++++++++++++++++++++++++------- > kernel/pid.c | 2 +- > 4 files changed, 147 insertions(+), 22 deletions(-) > > diff --git a/fs/internal.h b/fs/internal.h > index 393f6c5c24f6..180b367c192b 100644 > --- a/fs/internal.h > +++ b/fs/internal.h > @@ -322,6 +322,8 @@ struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns); > struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap); > void mnt_idmap_put(struct mnt_idmap *idmap); > struct stashed_operations { > + struct dentry *(*stash_dentry)(struct dentry **stashed, > + struct dentry *dentry); > void (*put_data)(void *data); > int (*init_inode)(struct inode *inode, void *data); > }; > diff --git a/fs/libfs.c b/fs/libfs.c > index 9ea0ecc325a8..f496373869fb 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -2128,6 +2128,8 @@ struct dentry *stashed_dentry_get(struct dentry **stashed) > dentry = rcu_dereference(*stashed); > if (!dentry) > return NULL; > + if (IS_ERR(dentry)) > + return dentry; > if (!lockref_get_not_dead(&dentry->d_lockref)) > return NULL; > return dentry; > @@ -2218,12 +2220,15 @@ static struct dentry *stash_dentry(struct dentry **stashed, > int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data, > struct path *path) > { > - struct dentry *dentry; > + struct dentry *dentry, *res; > const struct stashed_operations *sops = mnt->mnt_sb->s_fs_info; > > /* See if dentry can be reused. */ > - path->dentry = stashed_dentry_get(stashed); > - if (path->dentry) { > + res = stashed_dentry_get(stashed); > + if (IS_ERR(res)) > + return PTR_ERR(res); > + if (res) { > + path->dentry = res; > sops->put_data(data); > goto out_path; > } > @@ -2234,8 +2239,15 @@ int path_from_stashed(struct dentry **stashed, struct vfsmount *mnt, void *data, > return PTR_ERR(dentry); > > /* Added a new dentry. @data is now owned by the filesystem. */ > - path->dentry = stash_dentry(stashed, dentry); > - if (path->dentry != dentry) > + if (sops->stash_dentry) > + res = sops->stash_dentry(stashed, dentry); > + else > + res = stash_dentry(stashed, dentry); > + if (IS_ERR(res)) Shouldn't we do dput(dentry); in here? > + return PTR_ERR(res); > + path->dentry = res; > + /* A dentry was reused. */ > + if (res != dentry) > dput(dentry); > > out_path: > diff --git a/fs/pidfs.c b/fs/pidfs.c > index c1f0a067be40..ee5e9a18c2d3 100644 > --- a/fs/pidfs.c > +++ b/fs/pidfs.c > @@ -25,6 +25,20 @@ > #include "internal.h" > #include "mount.h" > > +/* > + * Usage: > + * pid->wait_pidfd.lock protects: > + * - insertion of dentry into pid->stashed > + * - deletion of pid[PIDTYPE_TGID] task linkage > + * - pidfs_exit() which sets pid->stashed to NULL > + * > + * Ordering: > + * -> dentry->d_lock > + * -> pid->wait_pidfd.lock > + */ > + > +#define PIDFS_PID_DEAD ERR_PTR(-ESRCH) > + > static struct kmem_cache *pidfs_cachep __ro_after_init; > > /* > @@ -552,30 +566,47 @@ struct pid *pidfd_pid(const struct file *file) > * task has been reaped which cannot happen until we're out of > * release_task(). > * > - * If this struct pid is referred to by a pidfd then > - * stashed_dentry_get() will return the dentry and inode for that struct > - * pid. Since we've taken a reference on it there's now an additional > - * reference from the exit path on it. Which is fine. We're going to put > - * it again in a second and we know that the pid is kept alive anyway. > + * If this struct pid has at least once been referred to by a pidfd then > + * pid->stashed will contain a dentry. One reference exclusively belongs > + * to pidfs_exit(). There might of course be other references. > * > * Worst case is that we've filled in the info and immediately free the > - * dentry and inode afterwards since the pidfd has been closed. Since > - * pidfs_exit() currently is placed after exit_task_work() we know that > - * it cannot be us aka the exiting task holding a pidfd to ourselves. > + * dentry and inode afterwards when no one holds an open pidfd anymore. > + * Since pidfs_exit() currently is placed after exit_task_work() we know > + * that it cannot be us aka the exiting task holding a pidfd to itself. > */ > void pidfs_exit(struct task_struct *tsk) > { > struct dentry *dentry; > + struct pid *pid = task_pid(tsk); > > might_sleep(); > > - dentry = stashed_dentry_get(&task_pid(tsk)->stashed); > - if (dentry) { > - struct inode *inode = d_inode(dentry); > - struct pidfs_exit_info *exit_info = &pidfs_i(inode)->__pei; > + scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) { > + struct inode *inode; > + struct pidfs_exit_info *exit_info; > #ifdef CONFIG_CGROUPS > struct cgroup *cgrp; > +#endif > > + dentry = rcu_dereference_raw(pid->stashed); > + if (!dentry) { > + /* > + * No one held a pidfd for this struct pid. > + * Mark it as dead so no one can add a pidfs > + * entry anymore. We're about to be reaped and > + * so no exit information would be available. > + */ > + rcu_assign_pointer(pid->stashed, PIDFS_PID_DEAD); > + return; > + } > + > + /* We own a reference assert that clearly. */ > + VFS_WARN_ON_ONCE(__lockref_is_dead(&dentry->d_lockref)); > + inode = d_inode(dentry); > + exit_info = &pidfs_i(inode)->__pei; > + > +#ifdef CONFIG_CGROUPS > rcu_read_lock(); > cgrp = task_dfl_cgroup(tsk); > exit_info->cgroupid = cgroup_id(cgrp); > @@ -585,8 +616,15 @@ void pidfs_exit(struct task_struct *tsk) > > /* Ensure that PIDFD_GET_INFO sees either all or nothing. */ > smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei); > - dput(dentry); > } > + > + /* > + * If there was a dentry in there we own a reference to it. So > + * accessing pid->stashed is safe. Note, pid->stashed will be > + * cleared by pidfs. Leave it alone as we could end up in a > + * legitimate race with path_from_stashed()'s fast path. > + */ > + dput(dentry); > } > > #ifdef CONFIG_COREDUMP > @@ -683,9 +721,44 @@ static char *pidfs_dname(struct dentry *dentry, char *buffer, int buflen) > return dynamic_dname(buffer, buflen, "anon_inode:[pidfd]"); > } > > +static void pidfs_dentry_prune(struct dentry *dentry) > +{ > + struct dentry **stashed = dentry->d_fsdata; > + struct inode *inode = d_inode(dentry); > + struct pid *pid; > + > + if (WARN_ON_ONCE(!stashed)) > + return; > + > + if (!inode) > + return; > + > + pid = inode->i_private; > + VFS_WARN_ON_ONCE(!pid); > + /* > + * We can't acquire pid->wait_pidfd.lock as we're holding > + * dentry->d_lock and pidfs_stash_dentry() needs to be able to > + * hold dentry->d_lock under pid->wait_pidfd.lock. > + * > + * Also, we don't really need it... > + */ > + VFS_WARN_ON_ONCE(!__lockref_is_dead(&dentry->d_lockref)); > + VFS_WARN_ON_ONCE(*stashed != dentry); > + > + /* > + * ... Because to get here the struct pid must have been reaped > + * already && no one holds a pidfd open anymore. The only one > + * that can race us is pidfd_stash_dentry(). Either it sees a > + * dead dentry in pid->stashed or it sees our sentinel marking > + * the struct pid as dead. There's no need to synchronize this > + * with a lock. > + */ > + smp_store_release(stashed, PIDFS_PID_DEAD); > +} > + > const struct dentry_operations pidfs_dentry_operations = { > .d_dname = pidfs_dname, > - .d_prune = stashed_dentry_prune, > + .d_prune = pidfs_dentry_prune, > }; > > static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len, > @@ -878,9 +951,47 @@ static void pidfs_put_data(void *data) > put_pid(pid); > } > > +static struct dentry *pidfs_stash_dentry(struct dentry **stashed, > + struct dentry *dentry) > +{ > + struct pid *pid = d_inode(dentry)->i_private; > + struct dentry *d; > + > + VFS_WARN_ON_ONCE(stashed != &pid->stashed); > + > + /* We need to synchronize with pidfs_exit(). */ > + guard(spinlock_irq)(&pid->wait_pidfd.lock); > + guard(rcu)(); > + > + d = rcu_dereference(*stashed); > + if (!d) { > + /* > + * There's nothing stashed in here so no pidfs dentry > + * exists for this task yet and we're definitely not > + * past pidfs_exit() and everyone else will wait for us. > + */ > + rcu_assign_pointer(*stashed, dget(dentry)); > + return dentry; > + } > + > + /* The struct pid is already marked dead. */ > + if (IS_ERR(d)) > + return d; > + > + /* > + * The struct pid isn't yet marked dead but there's a dead > + * dentry in there which is just as well. > + */ > + if (!lockref_get_not_dead(&d->d_lockref)) > + return ERR_PTR(-ESRCH); > + > + return d; > +} > + > static const struct stashed_operations pidfs_stashed_ops = { > - .init_inode = pidfs_init_inode, > - .put_data = pidfs_put_data, > + .stash_dentry = pidfs_stash_dentry, > + .init_inode = pidfs_init_inode, > + .put_data = pidfs_put_data, > }; > > static int pidfs_init_fs_context(struct fs_context *fc) > diff --git a/kernel/pid.c b/kernel/pid.c > index 8317bcbc7cf7..30518852d990 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -100,7 +100,7 @@ void put_pid(struct pid *pid) > > ns = pid->numbers[pid->level].ns; > if (refcount_dec_and_test(&pid->count)) { > - WARN_ON_ONCE(pid->stashed); > + WARN_ON_ONCE(!IS_ERR(pid->stashed)); > kmem_cache_free(ns->pid_cachep, pid); > put_pid_ns(ns); > } > > -- > 2.47.2 >