On 04/03, Christian Brauner wrote: > > We currently report EINVAL whenever a struct pid has no tasked attached > anymore thereby conflating two concepts: > > (1) The task has already been reaped. > (2) The caller requested a pidfd for a thread-group leader but the pid > actually references a struct pid that isn't used as a thread-group > leader. > > This is causing issues for non-threaded workloads as in [1]. > > This patch tries to allow userspace to distinguish between (1) and (2). > This is racy of course but that shouldn't matter. > > Link: https://github.com/systemd/systemd/pull/36982 [1] > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> For this series: Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx> But I have a couple of cosmetic nits... > int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) > { > - bool thread = flags & PIDFD_THREAD; > + int err = 0; > > - if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID)) > - return -EINVAL; > + if (!(flags & PIDFD_THREAD)) { > + /* > + * If this is struct pid isn't used as a thread-group > + * leader pid but the caller requested to create a > + * thread-group leader pidfd then report ENOENT to the > + * caller as a hint. > + */ > + if (!pid_has_task(pid, PIDTYPE_TGID)) > + err = -ENOENT; > + } > + > + /* > + * If this wasn't a thread-group leader struct pid or the task > + * got reaped in the meantime report -ESRCH to userspace. > + * > + * This is racy of course. This could've not been a thread-group > + * leader struct pid and we set ENOENT above but in the meantime > + * the task got reaped. Or there was a multi-threaded-exec by a > + * subthread and we were a thread-group leader but now got > + * killed. The comment about the multi-threaded-exec looks a bit misleading to me. If this pid is a group-leader-pid and we race with de_thread() which does exchange_tids(tsk, leader); transfer_pid(leader, tsk, PIDTYPE_TGID); nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this transition. hlists_swap_heads_rcu() or hlist_replace_rcu() can't make hlist_head->first == NULL during this transition... Or I misunderstood the comment? And... the code looks a bit overcomplicated to me, why not simply int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { if (!pid_has_task(pid, PIDTYPE_PID)) return -ESRCH; if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID)) return -ENOENT; return __pidfd_prepare(pid, flags, ret); } ? Of course, the comments should stay. But again, this is cosmetic/subjective, please do what you like more. Oleg.