Re: [PATCH v3] eventpoll: Fix priority inversion problem

Hello Nam,

On 5/27/2025 2:38 PM, Nam Cao wrote:
The ready event list of an epoll object is protected by a read-write
semaphore:

   - The consumer (waiter) acquires the write lock and takes items.
   - The producer (waker) takes the read lock and adds items.

The point of this design is to let epoll scale well with a large number
of producers, as multiple producers can hold the read lock at the same
time.

Unfortunately, this implementation may cause a scheduling priority
inversion problem. Suppose the consumer has a higher scheduling priority
than the producer. The consumer needs to acquire the write lock, but may be
blocked by the producer holding the read lock. Since a read-write semaphore
does not support priority-boosting for the readers (even with
CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher
priority consumer is blocked by a lower priority producer. This problem was
reported in [1].

Furthermore, this can also cause a stall problem, as described in [2].

To fix this problem, make the event list half-lockless:

   - The consumer acquires a mutex (ep->mtx) and takes items.
   - The producer locklessly adds items to the list.
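
For readers unfamiliar with the pattern, the two bullets above can be
sketched in user-space C11 roughly as follows. This is only an illustration
of the general technique (lock-free push, detach-everything pop, similar in
spirit to the kernel's llist_add() and llist_del_all()), not the code from
the patch. The names ready_node, ready_list, ready_list_add and
ready_list_take_all are hypothetical, and the sketch assumes the caller
serializes consumers the way ep->mtx does.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the patch's data structures. */
struct ready_node {
    struct ready_node *next;
    int event;
};

struct ready_list {
    _Atomic(struct ready_node *) head;
};

/*
 * Producer side: lock-free push. Any number of producers may call this
 * concurrently, and none of them ever waits for the consumer.
 */
static void ready_list_add(struct ready_list *l, struct ready_node *n)
{
    struct ready_node *old =
        atomic_load_explicit(&l->head, memory_order_relaxed);

    do {
        /* On CAS failure 'old' is refreshed, so relink and retry. */
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/*
 * Consumer side: detach the whole list with one atomic exchange.
 * Assumption: the caller holds a mutex so that only one consumer runs at
 * a time. The returned chain is in newest-first order.
 */
static struct ready_node *ready_list_take_all(struct ready_list *l)
{
    return atomic_exchange_explicit(&l->head, NULL, memory_order_acquire);
}
```

Because the consumer only ever takes the entire list, and never pops single
nodes while producers are pushing, this shape avoids the ABA problem that
a general lock-free stack has to deal with.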

Performance is not the main goal of this patch, but since the producer can
now add items without waiting for the consumer to release the lock, a
performance improvement is observed with the stress test from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c. This is
the same test that justified using a read-write semaphore in the past.

Testing using 12 x86_64 CPUs:

           Before     After        Diff
threads  events/ms  events/ms
       8       6932      19753    +185%
      16       7820      27923    +257%
      32       7648      35164    +360%
      64       9677      37780    +290%
     128      11166      38174    +242%

Testing using 1 riscv64 CPU (averaged over 10 runs, as the numbers are
noisy):

           Before     After        Diff
threads  events/ms  events/ms
       1         73        129     +77%
       2        151        216     +43%
       4        216        364     +69%
       8        234        382     +63%
      16        251        392     +56%


I gave this patch a spin on top of tip:sched/core (PREEMPT_RT) with
Jan's reproducer from
https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@xxxxxxxxxxx/.

On tip:sched/core, I see a hang a few seconds into the run and an RCU
stall a minute later when I pin epoll-stall and epoll-stall-writer on the
same CPU as the bandwidth timer on a 2 vCPU VM. (I'm using a printk to
log the CPU where the timer was started in pinned mode.)

With this series, I haven't seen any stalls yet over multiple short
runs (~10 min each) and one longer run (~3 hours).

Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>

Reported-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@xxxxxxxxxxxxxxxxxxx/ [2]
Signed-off-by: Nam Cao <namcao@xxxxxxxxxxxxx>
---
v3:
   - get rid of the "link_used" and "ready" flags. They are hard to
     understand and unnecessary
   - get rid of the obsolete lockdep_assert_irqs_enabled()
   - Add lockdep_assert_held(&ep->mtx)
   - rewrite some comments
v2:
   - rename link_locked -> link_used
   - replace xchg() with smp_store_release() when applicable
   - make sure llist_node is in clean state when not on a list
   - remove now-unused list_add_tail_lockless()

--
Thanks and Regards,
Prateek




