On Thu, Jul 17, 2025 at 07:57:27AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >
> > On 2025-07-16 18:54, Paul E. McKenney wrote:
> > > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> > >> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > >>> On Fri, 11 Jul 2025 10:05:26 -0700
> > >>> "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
> > >>>
> > >>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > >>>> note that preemption is disabled.  If rcutree.use_softirq is set and
> > >>>> this task is blocking an expedited RCU grace period, it will directly
> > >>>> invoke the non-notrace function raise_softirq_irqoff().  Otherwise,
> > >>>> it will directly invoke the non-notrace function irq_work_queue_on().
> > >>>
> > >>> Just to clarify some things: a function annotated with "notrace" simply
> > >>> will not have the ftrace hook attached to it, but that function may
> > >>> very well have tracing triggered inside of it.
> > >>>
> > >>> Functions with "_notrace" in their names (like preempt_disable_notrace())
> > >>> should not have any tracing instrumentation (as Mathieu stated)
> > >>> inside of them, so that they can be used in the tracing infrastructure.
> > >>>
> > >>> raise_softirq_irqoff() has a tracepoint inside of it.  If we have the
> > >>> tracing infrastructure call that, and we happen to enable that
> > >>> tracepoint, we will have:
> > >>>
> > >>>   raise_softirq_irqoff()
> > >>>     trace_softirq_raise()
> > >>>       [..]
> > >>>       raise_softirq_irqoff()
> > >>>         trace_softirq_raise()
> > >>>           [..]
> > >>>
> > >>> Ad infinitum!
> > >>>
> > >>> I'm not sure if that's what is being proposed or not, but I just wanted
> > >>> to make sure everyone is aware of the above.
> > >>
> > >> OK, I *think* I might actually understand the problem.  Maybe.
> > >>
> > >> I am sure that the usual suspects will not be shy about correcting any
> > >> misapprehensions in the following.  ;-)
> > >>
> > >> My guess is that some users of real-time Linux would like to use BPF
> > >> programs while still getting decent latencies out of their systems.
> > >> (Not something I would have predicted, but then again, I was surprised
> > >> some years back to see people with a 4096-CPU system complaining about
> > >> 200-microsecond latency blows from RCU.)  And the BPF guys (now CCed)
> > >> made some changes some years back to support this, perhaps most notably
> > >> replacing some uses of preempt_disable() with migrate_disable().
> > >>
> > >> Except that the current __DECLARE_TRACE() macro defeats this work
> > >> for tracepoints by disabling preemption across the tracepoint call,
> > >> which might well be a BPF program.  So we need to do something to
> > >> __DECLARE_TRACE() to get the right sort of protection while still
> > >> leaving preemption enabled.
> > >>
> > >> One way of attacking this problem is to use preemptible RCU.  The
> > >> problem is that although one could construct a trace-safe version of
> > >> rcu_read_unlock(), doing so would negate some optimizations that Lai
> > >> Jiangshan worked so hard to put in place.  Plus those optimizations
> > >> also simplified the code quite a bit.  Which is why I was pushing back
> > >> so hard, especially given that I did not realize that real-time systems
> > >> would be running BPF programs concurrently with real-time applications.
> > >> This meant that I was looking for a functional problem with the
> > >> current disabling of preemption, and not finding it.
> > >>
> > >> So another way of dealing with this is to use SRCU-fast, which is
> > >> like SRCU, but dispenses with the smp_mb() calls and the redundant
> > >> read-side array indexing.  Plus it is easy to make _notrace variants
> > >> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> > >> along with the requisite guards.
> > >>
> > >> Re-introducing SRCU requires reverting most of e53244e2c893
> > >> ("tracepoint: Remove SRCU protection"), and I have hacked together
> > >> this and the prerequisites mentioned in the previous paragraph.
> > >>
> > >> These are passing ridiculously light testing, but probably have at
> > >> least their share of bugs.
> > >>
> > >> But first, do I actually finally understand the problem?
> > >
> > > OK, they pass somewhat less ridiculously moderate testing, though I
> > > have not yet hit them over the head with the ftrace selftests.
> > >
> > > So might as well post them.
> > >
> > > Thoughts?
> >
> > Your explanation of the problem context fits my understanding.
> >
> > Note that I've mostly been pulled into this by Sebastian, who wanted
> > to understand better how we could make the tracepoint instrumentation
> > work with bpf probes that need to sleep due to locking.  Hence my
> > original somewhat high-level desiderata.
>
> I still don't understand what problem is being solved.
> As the current tracepoint code stands, there is no issue with it at all
> on PREEMPT_RT from the bpf pov.
> bpf progs that attach to tracepoints are not sleepable.
> They don't call rt_spinlock either.
> Recognizing tracepoints that can sleep/fault and allowing sleepable
> bpf progs there is on our to-do list, but afaik it doesn't need any
> changes to the tracepoint infra.
> There is no need to replace the existing preempt_disable wrappers
> with sleepable srcu_fast or anything else.

As I understand it, functionally there is no problem.  And if the BPF
program has been attached to preemption-disabled code, there is also no
unnecessary damage to real-time latency.  Preemption is presumably
disabled for a reason, and that reason must be respected.  Presumably,
people using BPF in real-time systems are careful either to avoid
attaching BPF programs to non-preemptible code or to carefully limit
the runtime of those BPF programs.

But if the BPF program has been attached to code that runs with
preemption enabled, then the current call to guard(preempt_notrace)()
within __DECLARE_TRACE() means that the entire BPF program runs with
preemption disabled, and for no good reason.  Although this is not a
functional issue, it is a first-class bug in terms of needlessly
degrading real-time latency.  And even though it isn't what Sebastian
originally asked Mathieu for help with, it still needs to be fixed.

My current proposal is to replace that guard(preempt_notrace)() with
its SRCU-fast counterpart, guard(srcu_fast_notrace)(&tracepoint_srcu);
rough sketches appear below the sign-off.  Light testing is going well
thus far, though I clearly need the real-time guys to both review and
test.

I am sure that Mathieu, Sebastian, and the rest won't be shy about
correcting any confusion on my part.  ;-)

							Thanx, Paul
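------------------------------------------------------------------------

For those joining partway through the thread, here is a rough sketch of
the notrace/_notrace distinction that Steven describes above.  This is
illustrative code only: both helpers are made up for the example, and
the choice of softirq is arbitrary.

#include <linux/interrupt.h>
#include <linux/preempt.h>

/*
 * The "notrace" annotation only suppresses the ftrace function-entry
 * hook for this function.  Its body can still fire tracepoints, here
 * trace_softirq_raise() inside raise_softirq_irqoff(), so calling it
 * from the tracing infrastructure can recurse if that tracepoint is
 * enabled.
 */
static notrace void kick_timer_softirq(void)	/* hypothetical helper */
{
	unsigned long flags;

	local_irq_save(flags);
	raise_softirq_irqoff(TIMER_SOFTIRQ);	/* contains a tracepoint */
	local_irq_restore(flags);
}

/*
 * By contrast, a "_notrace" API variant, such as the pair used below,
 * must itself be free of instrumentation, which is what makes it safe
 * to call from inside the tracers.
 */
static void tracer_side_work(void)		/* hypothetical helper */
{
	preempt_disable_notrace();
	/* ... tracer bookkeeping that must not recurse into tracing ... */
	preempt_enable_notrace();
}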
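------------------------------------------------------------------------

And here is the rough shape of the proposed __DECLARE_TRACE() change.
This is a sketch rather than the actual patch: the srcu_fast_notrace
guard below is my guess at the obvious analog of the existing srcu_fast
guard in include/linux/srcu.h, built on the _notrace read-side
primitives mentioned above.

/* Assumed guard, patterned after DEFINE_LOCK_GUARD_1(srcu_fast, ...): */
DEFINE_LOCK_GUARD_1(srcu_fast_notrace, struct srcu_struct,
		    _T->scp = srcu_read_lock_fast_notrace(_T->lock),
		    srcu_read_unlock_fast_notrace(_T->lock, _T->scp),
		    struct srcu_ctr __percpu *scp)

/* And within __DECLARE_TRACE(), roughly: */
	static inline void trace_##name(proto)				\
	{								\
		if (static_branch_unlikely(&__tracepoint_##name.key)) {	\
			if (cond) {					\
				/* was: guard(preempt_notrace)(); */	\
				guard(srcu_fast_notrace)(&tracepoint_srcu); \
				__DO_TRACE_CALL(name, TP_ARGS(args));	\
			}						\
		}							\
	}

The (possibly BPF) callback invoked from __DO_TRACE_CALL() then runs
with preemption enabled, while the reintroduced tracepoint_srcu (hence
the partial revert of e53244e2c893) still ensures that a probe being
unregistered is not freed out from under a concurrent caller.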