Hello Nam,

On 7/15/2025 6:16 PM, Nam Cao wrote:
> The ready event list of an epoll object is protected by a read-write
> semaphore:
>
> - The consumer (waiter) acquires the write lock and takes items.
> - The producer (waker) takes the read lock and adds items.
>
> The point of this design is to let epoll scale well with a large number
> of producers, as multiple producers can hold the read lock at the same
> time.
>
> Unfortunately, this implementation may cause a scheduling priority
> inversion problem. Suppose the consumer has higher scheduling priority
> than the producer. The consumer needs to acquire the write lock, but may
> be blocked by the producer holding the read lock. Since read-write
> semaphores do not support priority boosting for the readers (even with
> CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
> higher-priority consumer is blocked by a lower-priority producer. This
> problem was reported in [1].
>
> Furthermore, this could also cause a stall problem, as described in [2].
>
> Fix this problem by replacing the rwlock with a spinlock.
>
> This reduces the event bandwidth, as the producers now have to contend
> with each other for the spinlock. According to the benchmark from
> https://github.com/rouming/test-tools/blob/master/stress-epoll.c:
>
> On 12 x86 CPUs:
>          Before     After        Diff
> threads  events/ms  events/ms
>       8       7162       4956     -31%
>      16       8733       5383     -38%
>      32       7968       5572     -30%
>      64      10652       5739     -46%
>     128      11236       5931     -47%
>
> On 4 riscv CPUs:
>          Before     After        Diff
> threads  events/ms  events/ms
>       8       2958       2833      -4%
>      16       3323       3097      -7%
>      32       3451       3240      -6%
>      64       3554       3178     -11%
>     128       3601       3235     -10%
>
> Although the numbers look bad, it should be noted that this benchmark
> creates multiple threads that do nothing except constantly generate new
> epoll events, so contention on the spinlock is high. For a real
> workload, the event rate is likely much lower, and the performance drop
> is not as bad.
>
> Using another benchmark (perf bench epoll wait) where spinlock
> contention is lower, an improvement is even observed on x86:
>
> On 12 x86 CPUs:
> Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
> After:  Averaged 114577 operations/sec (+- 2.25%), total secs = 8
>
> On 4 riscv CPUs:
> Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
> After:  Averaged 167396 operations/sec (+- 0.23%), total secs = 8
>
> In conclusion, no one is likely to be upset over this change. After all,
> a spinlock was used originally for years, and the commit which converted
> to the rwlock didn't mention a real workload, just that the benchmark
> numbers are nice.
>
> This patch is not an exact revert of commit a218cc491420 ("epoll: use
> rwlock in order to reduce ep_poll_callback() contention"), because git
> revert produces conflicts in some places whose resolution is not
> obvious. This patch is intended to be backported, therefore it goes with
> the obvious approach:
>
> - Replace rwlock_t with spinlock_t one to one.
>
> - Delete list_add_tail_lockless() and chain_epi_lockless(). These were
>   introduced to allow producers to concurrently add items to the list.
>   But now that the spinlock no longer allows producers to touch the
>   event list concurrently, these two functions are no longer necessary.
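For anyone skimming the series, the one-to-one conversion on the producer
side boils down to something like the sketch below. This is only an
illustration, not the actual fs/eventpoll.c hunks: the field names
(ep->lock, ep->rdllist, epi->rdllink) follow fs/eventpoll.c, but the two
functions are made-up stand-ins showing the before and after states of the
producer path in ep_poll_callback().

    /* Before: producers took the read side, so many of them could run
     * concurrently -- which is why the lockless list helpers existed. */
    static void producer_add_rwlock(struct eventpoll *ep, struct epitem *epi)
    {
    	unsigned long flags;

    	read_lock_irqsave(&ep->lock, flags);	/* ep->lock was rwlock_t */
    	/* concurrent producers had to use list_add_tail_lockless() here */
    	read_unlock_irqrestore(&ep->lock, flags);
    }

    /* After: producers serialize on the spinlock, so a plain
     * list_add_tail() is safe and the lockless helpers can be deleted. */
    static void producer_add_spinlock(struct eventpoll *ep, struct epitem *epi)
    {
    	unsigned long flags;

    	spin_lock_irqsave(&ep->lock, flags);	/* ep->lock is now spinlock_t */
    	list_add_tail(&epi->rdllink, &ep->rdllist);
    	spin_unlock_irqrestore(&ep->lock, flags);
    }

The consumer side is then the mirror image: write_lock_irq(&ep->lock)
becomes spin_lock_irq(&ep->lock).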
>
> Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
> Signed-off-by: Nam Cao <namcao@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx

Tested this version with the reproducer that Jan shared in [1] on top of
tip:sched/core (PREEMPT_RT) and I didn't run into any RCU stalls with your
patch applied on top (the VM has been running the repro for over an hour
now and is still responsive). Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>

[1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@xxxxxxxxxxx/

> ---
>  fs/eventpoll.c | 139 +++++++++----------------------------
>  1 file changed, 26 insertions(+), 113 deletions(-)

--
Thanks and Regards,
Prateek