Oleg,

I've been thinking about the multi-threaded exec case where a
non-thread-group leader task execs and assumes the thread-group leader's
struct pid.

Back when we implemented support for PIDFD_THREAD we ended up deciding
that if userspace holds:

pidfd_leader_thread = pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)

then exit notification is not strictly defined if a non-thread-group
leader thread execs:

If poll is called before the exec happens, an exit notification may be
observed; if it is called afterwards, no exit notification is generated
for the old thread-group leader.

Or if the exit of the old thread-group leader was observed but poll is
called again, then it would block again.

I was wondering why the following snippet wouldn't work to ensure that
PIDFD_THREAD pidfds for thread-group leaders aren't woken with spurious
exits:

diff --git a/kernel/exit.c b/kernel/exit.c
index 9916305e34d3..b79ded1b3bf5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -745,8 +745,11 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
 	/*
 	 * sub-thread or delay_group_leader(), wake up the
 	 * PIDFD_THREAD waiters.
+	 *
+	 * The thread-group leader will be taken over by the execing
+	 * task so don't cause spurious wakeups.
 	 */
-	if (!thread_group_empty(tsk))
+	if (!thread_group_empty(tsk) && (tsk->signal->notify_count >= 0))
 		do_notify_pidfd(tsk);

 	if (unlikely(tsk->ptrace)) {

Because that would seem more consistent to me. The downside would be
that if userspace performed a series of multi-threaded execs from
non-thread-group leader threads, then waiters wouldn't get woken. But I
think that's probably ok.

To handle this case we could later think about whether we can instead
start generating a separate poll (POLLPRI?) event when exec happens.

I'm probably missing something very obvious why that won't work.

Christian