On 2025-07-17 11:18, Paul E. McKenney wrote:
On Thu, Jul 17, 2025 at 10:46:46AM -0400, Mathieu Desnoyers wrote:
On 2025-07-17 09:14, Mathieu Desnoyers wrote:
On 2025-07-16 18:54, Paul E. McKenney wrote:
[...]
2) I think I'm late to the party in reviewing srcu-fast, I'll
go have a look :)
OK, I'll bite. :) Please let me know where I'm missing something:
Looking at srcu-lite and srcu-fast, I understand that they fundamentally
depend on a trick we published in "The RCU-barrier menagerie"
(https://lwn.net/Articles/573497/), which allows turning, e.g., this Dekker:
volatile int x = 0, y = 0

CPU 0                   CPU 1

x = 1                   y = 1
smp_mb                  smp_mb
r2 = y                  r4 = x

BUG_ON(r2 == 0 && r4 == 0)
into
volatile int x = 0, y = 0

CPU 0                   CPU 1

                        rcu_read_lock()
x = 1                   y = 1
synchronize_rcu()
r2 = y                  r4 = x
                        rcu_read_unlock()

BUG_ON(r2 == 0 && r4 == 0)
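Concretely, in kernel C the second pattern amounts to something like the
following (a minimal sketch; the function names are mine):

static int x, y;

void cpu0(void)                         /* slow side */
{
        int r2;

        WRITE_ONCE(x, 1);
        synchronize_rcu();              /* replaces CPU 0's smp_mb */
        r2 = READ_ONCE(y);
}

void cpu1(void)                         /* fast side */
{
        int r4;

        rcu_read_lock();
        WRITE_ONCE(y, 1);
        /* No barrier needed: both accesses sit in one read-side section. */
        r4 = READ_ONCE(x);
        rcu_read_unlock();
}

The outcome r2 == 0 && r4 == 0 remains forbidden: either the grace period
waits on CPU 1's read-side section (so its y = 1 is visible to r2), or that
section begins too late to be waited on (and then r4 sees x == 1).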
So looking at srcu-fast, we have:
* Note that both this_cpu_inc() and atomic_long_inc() are RCU read-side
* critical sections either because they disable interrupts, because they
* are a single instruction, or because they are a read-modify-write atomic
* operation, depending on the whims of the architecture.
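For reference, the fast path this comment documents looks roughly like the
following (abridged from the srcu-fast code, with error handling and some
details elided):

static inline struct srcu_ctr __percpu *
__srcu_read_lock_fast(struct srcu_struct *ssp)
{
        struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);

        RCU_LOCKDEP_WARN(!rcu_is_watching(),
                         "RCU must be watching srcu_read_lock_fast().");
        if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
                this_cpu_inc(scp->srcu_locks.counter);  /* the "read side" */
        else
                atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); /* NMI-safe */
        barrier();  /* keep the critical section after the increment */
        return scp;
}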
It appears to be pairing, as the RCU read-side:
- the irq off/on implied by this_cpu_inc,
- an atomic operation, or
- a single instruction
with synchronize_rcu within the grace period, hoping that this behaves as an
smp_mb pairing that prevents the srcu read-side critical section from leaking
out of the srcu read lock/unlock.
I note that there is a check for rcu_is_watching() within
__srcu_read_lock_fast, but it is one thing to have RCU watching and
quite another to have an actual read-side critical section. Note that
preemption, irqs, and softirqs can very well be enabled when calling
__srcu_read_lock_fast.
My understanding of how memory barriers implemented with RCU
work is that we need to surround the memory accesses on the fast path
(where we turn smp_mb into barrier) with an RCU read-side critical
section, to make sure they do not span across a synchronize_rcu.
What I am missing here is how an RCU read-side critical section that
consists only of the irq off/on, the atomic, or the single instruction
can cover all the memory accesses we are trying to order, namely those
within the srcu critical section after the compiler barrier(). Is having
RCU watching sufficient to guarantee this?
Good eyes!!!
The trick is that this "RCU read-side critical section" consists only of
either this_cpu_inc() or atomic_long_inc(), with the latter only happening
in systems that have NMIs, but don't have NMI-safe per-CPU operations.
Neither this_cpu_inc() nor atomic_long_inc() can be interrupted, and
thus both act as an interrupts-disabled RCU read-side critical section.
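In other words, as far as RCU is concerned, the increment behaves as a
tiny interrupts-disabled read-side critical section (a conceptual sketch,
not actual kernel code):

        unsigned long flags;

        local_irq_save(flags);          /* implicit: begin read-side section */
        __this_cpu_inc(*ctr);           /* the entire "critical section" */
        local_irq_restore(flags);       /* implicit: end read-side section */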
Therefore, if the SRCU grace-period computation fails to see an
srcu_read_lock_fast() increment, its earlier code is guaranteed to
happen before the corresponding critical section. Similarly, if the SRCU
grace-period computation sees an srcu_read_unlock_fast(), its subsequent
code is guaranteed to happen after the corresponding critical section.
Does that help? If so, would you be interested in nominating a comment?
Or am I missing something subtle here?
Here is the root of my concern: considering a single instruction
as an RCU-barrier "read-side" for a classic Dekker would not work,
because the read-side would not cover both memory accesses that need
to be ordered.
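That is, if the read side were only the single instruction itself, we
would be attempting something like the following, which the pattern above
does not license:

CPU 0                   CPU 1

x = 1                   this_cpu_inc(ctr)   <- the entire "read side"
synchronize_rcu()       y = 1               <- outside any section
r2 = y                  r4 = x              <- outside any section

BUG_ON(r2 == 0 && r4 == 0)  <- no longer guaranteed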
I cannot help but notice the similarity between this pattern of
barrier vs synchronize_rcu and what we allow userspace to do with
barrier vs sys_membarrier, which has one implementation
based on synchronize_rcu (except for TICK_NOHZ_FULL). Originally
when membarrier was introduced, this was based on synchronize_sched(),
and I recall that this was OK because userspace execution acted as
a read-side critical section from the perspective of synchronize_sched().
As commented in kernel v4.10 near synchronize_sched():
* Note that this guarantee implies further memory-ordering guarantees.
* On systems with more than one CPU, when synchronize_sched() returns,
* each CPU is guaranteed to have executed a full memory barrier since the
* end of its last RCU-sched read-side critical section whose beginning
* preceded the call to synchronize_sched(). In addition, each CPU having
* an RCU read-side critical section that extends beyond the return from
* synchronize_sched() is guaranteed to have executed a full memory barrier
* after the beginning of synchronize_sched() and before the beginning of
* that RCU read-side critical section. Note that these guarantees include
* CPUs that are offline, idle, or executing in user mode, as well as CPUs
* that are executing in the kernel.
So even though I see how synchronize_rcu() nowadays is still a good
choice to implement sys_membarrier, it only applies to RCU read-side
critical sections, which cover userspace code and the specific
read-side critical sections in the kernel.
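For completeness, the userspace pattern I have in mind looks like this
(a hedged sketch; MEMBARRIER_CMD_GLOBAL is the synchronize_rcu-based
command, and the function names are mine):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int x, y;

static void slow_side(void)     /* pays for the ordering with the syscall */
{
        int r2;

        x = 1;
        syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0, 0);
        r2 = y;
        (void)r2;
}

static void fast_side(void)     /* all of userspace acts as the read side */
{
        int r4;

        y = 1;
        __asm__ __volatile__("" ::: "memory");  /* barrier() */
        r4 = x;
        (void)r4;
}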
But what I don't get is how synchronize_rcu() can help us promote
the barrier() in SRCU-fast to an smp_mb when we are outside of any RCU
read-side critical section tracked by the synchronize_rcu grace period,
mainly because, unlike the sys_membarrier scenario, this is *not*
userspace code.
And what we want to order here on the read side is the lock/unlock
increments vs the memory accesses within the critical section, but
there is no RCU read-side critical section containing all those memory
accesses to match those synchronize_rcu calls, so the promotion from
barrier to smp_mb does not appear to be valid.
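Spelling this out against the srcu-fast fast path (my sketch of the
timeline; the srcu_locks/srcu_unlocks field names are assumptions):

srcu-fast reader                     grace period

this_cpu_inc(->srcu_locks)   [A]     scans the [A]/[C] counters,
barrier()                            with synchronize_rcu() (or
loads/stores in the c.s.     [B]     equivalent full barriers)
barrier()                            in between
this_cpu_inc(->srcu_unlocks) [C]

The only RCU read-side sections here are [A] and [C] themselves; the
accesses [B] that actually need ordering against the grace period are
contained in no read-side section at all.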
But perhaps there is something more that is specific to the SRCU
algorithm that I am missing here?
Thanks,
Mathieu
Either way, many thanks for digging into this!!!
Thanx, Paul
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com