On Thu, Apr 10, 2025 at 03:10:09PM +0200, Oleg Nesterov wrote:
> On 04/10, Christian Brauner wrote:
> >
> > On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > > On 04/09, Oleg Nesterov wrote:
> > > >
> > > > On 04/09, Christian Brauner wrote:
> > > > >
> > > > > The seqcounter might be
> > > > > useful independent of pidfs.
> > > >
> > > > Are you sure? ;) To me the new pid->pid_seq needs more justification...
> >
> > Yeah, pretty much. I'd make use of this in other cases where we need to
> > detect concurrent changes to struct pid without having to take any
> > locks. Multi-threaded exec in de_thread() comes to mind as well.
>
> Perhaps you are right, but so far I am still not sure it makes sense.
> And we can always add it later if we have another (more convincing)
> use-case.
>
> > > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > > takes pid->wait_pidfd->lock.
> > >
> > > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > > is not possible until we drop pid->wait_pidfd->lock.
> > >
> > > If detach_pid(PIDTYPE_PID) was already called and has passed wake_up_all(),
> > > pid_has_task(PIDTYPE_PID) can't succeed.
> >
> > I know. I was trying to avoid having to take the lock and just make this
> > lockless. But if you think we should use this lock here instead I'm
> > willing to do this. I just find the sequence counter more elegant than
> > the spin_lock_irq().
>
> This is subjective, and quite possibly I am wrong. But yes, I'd prefer
> to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
> penalize __unhash_process(). Simply because this is simpler.
>
> If you really dislike taking wait_pidfd->lock, we can add mb() into
> __unhash_process() or even smp_mb__after_spinlock() into __change_pid(),
> but this will need a lengthy comment...

No, I don't think we should do that.

> As for your patch... it doesn't apply on top of 3/4, but I guess it
> is clear what it does, and (unfortunately ;) it looks correct, so
> I won't insist too much. See a couple of nits below.
>
> > this imho and it would give pidfds a reliable way to detect relevant
> > concurrent changes locklessly without penalizing other critical paths
> > (e.g., under tasklist_lock) in the kernel.
>
> Can't resist... Note that raw_seqcount_begin() in pidfd_prepare() will
> take/drop tasklist_lock if it races with __unhash_process() on PREEMPT_RT.

Eeeeew,

	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
		return seq;						\
									\
	if (preemptible && unlikely(seq & 1)) {				\
		__SEQ_LOCK(lockbase##_lock(s->lock));			\
		__SEQ_LOCK(lockbase##_unlock(s->lock));			\

priority inversion fix, I take it. That's just as ugly as what we had to
do for mnt_get_write_access()...

I actually think what you just pointed out is rather problematic. It's
absolutely wild that raw_seqcount_begin() suddenly implies locking. How
isn't that a huge landmine? On non-rt I can happily do:

	acquire_associated_lock()
	raw_seqcount_begin()
	drop_associated_lock()

But this will immediately turn into a deadlock on preempt-rt, no?
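Fwiw, here is (roughly) what raw_seqcount_begin() ends up calling for a
seqcount_rwlock_t once SEQCOUNT_LOCKNAME() in include/linux/seqlock.h is
expanded. This is my hand-expanded paraphrase, so take the exact details
with a grain of salt:

	static __always_inline unsigned
	__seqprop_rwlock_sequence(const seqcount_rwlock_t *s)
	{
		unsigned seq = smp_load_acquire(&s->seqcount.sequence);

		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
			return seq;

		if (unlikely(seq & 1)) {
			/*
			 * An odd count means a writer is mid-update and, on
			 * PREEMPT_RT, may have been preempted inside its write
			 * section. Lock/unlock the associated rwlock so the
			 * reader blocks on the writer instead of spinning on a
			 * count that cannot make progress.
			 */
			read_lock(s->lock);
			read_unlock(s->lock);

			/* Re-read; the writer should have made progress. */
			seq = smp_load_acquire(&s->seqcount.sequence);
		}

		return seq;
	}

So on RT the "raw" read side silently takes and drops the associated lock
whenever it observes an odd count, which is where the deadlock potential
above comes from.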
> Yes, this is an unlikely case, but still...
>
> Now. Unless I misread your patch, pidfd_prepare() does "err = 0" only
> once before the main loop. And this is correct. But this means that
> we do not need the do/while loop.

Yes, I know. I simply used the common idiom.

> If read_seqcount_retry() returns true, we can safely return -ESRCH. So
> we can do
>
>	seq = raw_seqcount_begin(&pid->pid_seq);
>
>	if (!PIDFD_THREAD && !pid_has_task(PIDTYPE_TGID))
>		err = -ENOENT;
>
>	if (!pid_has_task(PIDTYPE_PID))
>		err = -ESRCH;
>
>	if (read_seqcount_retry(&pid->pid_seq, seq))
>		err = -ESRCH;
>
> In fact we don't even need raw_seqcount_begin(), we could use
> raw_seqcount_try_begin().
>
> And why seqcount_rwlock_t? A plain seqcount_t would work equally well.

Yes, but this way its dependence on tasklist_lock is natively integrated
with lockdep afaict:

/*
 * typedef seqcount_LOCKNAME_t - sequence counter with LOCKNAME associated
 * @seqcount:	The real sequence counter
 * @lock:	Pointer to the associated lock
 *
 * A plain sequence counter with external writer synchronization by
 * LOCKNAME @lock. The lock is associated to the sequence counter in the
 * static initializer or init function. This enables lockdep to validate
 * that the write side critical section is properly serialized.
 *
 * LOCKNAME:	raw_spinlock, spinlock, rwlock or mutex
 */

/*
 * seqcount_LOCKNAME_init() - runtime initializer for seqcount_LOCKNAME_t
 * @s:		Pointer to the seqcount_LOCKNAME_t instance
 * @lock:	Pointer to the associated lock
 */
#define seqcount_LOCKNAME_init(s, _lock, lockname)			\
	do {								\
		seqcount_##lockname##_t *____s = (s);			\
		seqcount_init(&____s->seqcount);			\
		__SEQ_LOCK(____s->lock = (_lock));			\
	} while (0)

#define seqcount_raw_spinlock_init(s, lock)	seqcount_LOCKNAME_init(s, lock, raw_spinlock)
#define seqcount_spinlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, spinlock)
#define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock)
#define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex)

> And, if we use seqcount_rwlock_t,
>
>	lockdep_assert_held_write(&tasklist_lock);
>	...
>	raw_write_seqcount_begin(&pid->pid_seq);
>
> in __unhash_process() looks a bit strange. I'd suggest using
> write_seqcount_begin() which does seqprop_assert() and kill
> lockdep_assert_held_write().
>
> Oleg.
>
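To make sure I've understood your suggestions, this is the shape I'd end
up with. Untested sketch, with pid_seq as a seqcount_rwlock_t associated
with tasklist_lock as in my patch:

	/* struct pid grows the counter; initialized e.g. in alloc_pid(): */
	seqcount_rwlock_init(&pid->pid_seq, &tasklist_lock);

	/* __unhash_process(), called with tasklist_lock write-held: */
	write_seqcount_begin(&pid->pid_seq);	/* seqprop_assert() validates the lock */
	/* ... detach_pid() etc. ... */
	write_seqcount_end(&pid->pid_seq);

	/* pidfd_prepare(), lockless, no retry loop: */
	seq = raw_seqcount_begin(&pid->pid_seq);

	if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
		return -ENOENT;

	if (!pid_has_task(pid, PIDTYPE_PID))
		return -ESRCH;

	if (read_seqcount_retry(&pid->pid_seq, seq))
		return -ESRCH;

(Or raw_seqcount_try_begin() on the read side, as you suggest, to bail
out with -ESRCH straight away when a writer is already in flight.)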