On Mon, 2025-06-30 at 20:38 +0530, K Prateek Nayak wrote:
> Hello Nam,
>
> On 5/27/2025 2:38 PM, Nam Cao wrote:
> > The ready event list of an epoll object is protected by a read-write
> > semaphore:
> >
> > - The consumer (waiter) acquires the write lock and takes items.
> > - The producer (waker) takes the read lock and adds items.
> >
> > The point of this design is to let epoll scale well with a large
> > number of producers, as multiple producers can hold the read lock at
> > the same time.
> >
> > Unfortunately, this implementation may cause a scheduling priority
> > inversion problem. Suppose the consumer has higher scheduling
> > priority than the producer. The consumer needs to acquire the write
> > lock, but may be blocked by the producer holding the read lock. Since
> > the read-write semaphore does not support priority boosting for
> > readers (even with CONFIG_PREEMPT_RT=y), we have a case of priority
> > inversion: a higher-priority consumer is blocked by a lower-priority
> > producer. This problem was reported in [1].
> >
> > Furthermore, this could also cause a stall problem, as described
> > in [2].
> >
> > To fix this problem, make the event list half-lockless:
> >
> > - The consumer acquires a mutex (ep->mtx) and takes items.
> > - The producer locklessly adds items to the list.
> >
> > Performance is not the main goal of this patch, but as the producer
> > can now add items without waiting for the consumer to release the
> > lock, a performance improvement is observed using the stress test
> > from https://github.com/rouming/test-tools/blob/master/stress-epoll.c.
> > This is the same test that justified using the read-write semaphore
> > in the past.
> >
> > Testing using 12 x86_64 CPUs:
> >
> >             Before      After
> > threads   events/ms   events/ms    Diff
> >       8        6932       19753   +185%
> >      16        7820       27923   +257%
> >      32        7648       35164   +360%
> >      64        9677       37780   +290%
> >     128       11166       38174   +242%
> >
> > Testing using 1 riscv64 CPU (averaged over 10 runs, as the numbers
> > are noisy):
> >
> >             Before      After
> > threads   events/ms   events/ms    Diff
> >       1          73         129    +77%
> >       2         151         216    +43%
> >       4         216         364    +69%
> >       8         234         382    +63%
> >      16         251         392    +56%
>
> I gave this patch a spin on top of tip:sched/core (PREEMPT_RT) with
> Jan's reproducer from
> https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@xxxxxxxxxxx/.
>
> On tip:sched/core, I see a hang a few seconds into the run and an RCU
> stall a minute after, when I pin epoll-stall and epoll-stall-writer on
> the same CPU as the bandwidth timer on a 2-vCPU VM. (I'm using a
> printk to log the CPU where the timer was started in pinned mode.)
>
> With this series, I haven't seen any stalls yet over multiple short
> runs (~10min) and even a longer run (~3Hrs).

Many thanks for running those tests and posting the results as comments
to this series. Highly appreciated!

Florian