Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Thu, 17 Jul 2025 07:57:27 -0700

On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2025-07-16 18:54, Paul E. McKenney wrote:
> > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> >> On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> >>> On Fri, 11 Jul 2025 10:05:26 -0700
> >>> "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
> >>>
> >>>> This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> >>>> note that preemption is disabled.  If rcutree.use_softirq is set and
> >>>> this task is blocking an expedited RCU grace period, it will directly
> >>>> invoke the non-notrace function raise_softirq_irqoff().  Otherwise,
> >>>> it will directly invoke the non-notrace function irq_work_queue_on().
> >>>
> >>> Just to clarify some things: a function annotated by "notrace" simply
> >>> will not have the ftrace hook to that function, but that function may
> >>> very well have tracing triggered inside of it.
> >>>
> >>> Functions with "_notrace" in their names (like preempt_disable_notrace())
> >>> should not have any tracing instrumentation (as Mathieu stated)
> >>> inside of them, so that they can be used in the tracing infrastructure.
> >>>
> >>> raise_softirq_irqoff() has a tracepoint inside of it. If we have the
> >>> tracing infrastructure call that, and we happen to enable that
> >>> tracepoint, we will have:
> >>>
> >>>    raise_softirq_irqoff()
> >>>       trace_softirq_raise()
> >>>         [..]
> >>>           raise_softirq_irqoff()
> >>>              trace_softirq_raise()
> >>>                 [..]
> >>>                   Ad infinitum!
> >>>
> >>> I'm not sure if that's what is being proposed or not, but I just wanted
> >>> to make sure everyone is aware of the above.
> >>
> >> OK, I *think* I might actually understand the problem.  Maybe.
> >>
> >> I am sure that the usual suspects will not be shy about correcting any
> >> misapprehensions in the following.  ;-)
> >>
> >> My guess is that some users of real-time Linux would like to use BPF
> >> programs while still getting decent latencies out of their systems.
> >> (Not something I would have predicted, but then again, I was surprised
> >> some years back to see people with a 4096-CPU system complaining about
> >> 200-microsecond latency blows from RCU.)  And the BPF guys (now CCed)
> >> made some changes some years back to support this, perhaps most notably
> >> replacing some uses of preempt_disable() with migrate_disable().
> >>
> >> Except that the current __DECLARE_TRACE() macro defeats this work
> >> for tracepoints by disabling preemption across the tracepoint call,
> >> which might well be a BPF program.  So we need to do something to
> >> __DECLARE_TRACE() to get the right sort of protection while still leaving
> >> preemption enabled.
> >>
> >> One way of attacking this problem is to use preemptible RCU.  The problem
> >> with this is that although one could construct a trace-safe version
> >> of rcu_read_unlock(), these would negate some optimizations that Lai
> >> Jiangshan worked so hard to put in place.  Plus those optimizations
> >> also simplified the code quite a bit.  Which is why I was pushing back
> >> so hard, especially given that I did not realize that real-time systems
> >> would be running BPF programs concurrently with real-time applications.
> >> This meant that I was looking for a functional problem with the current
> >> disabling of preemption, and not finding it.
> >>
> >> So another way of dealing with this is to use SRCU-fast, which is
> >> like SRCU, but dispenses with the smp_mb() calls and the redundant
> >> read-side array indexing.  Plus it is easy to make _notrace variants
> >> srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> >> along with the requisite guards.
> >>
> >> Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
> >> Remove SRCU protection"), and I have hacked together this and the
> >> prerequisites mentioned in the previous paragraph.
> >>
> >> These are passing ridiculously light testing, but probably have at
> >> least their share of bugs.
> >>
> >> But first, do I actually finally understand the problem?
> >
> > OK, they pass somewhat less ridiculously moderate testing, though I have
> > not yet hit them over the head with the ftrace selftests.
> >
> > So might as well post them.
> >
> > Thoughts?
>
> Your explanation of the problem context fits my understanding.
>
> Note that I've mostly been pulled into this by Sebastian who wanted
> to better understand how we could make the tracepoint
> instrumentation work with bpf probes that need to sleep due to
> locking. Hence my original somewhat high-level desiderata.

I still don't understand what problem is being solved.
As the current tracepoint code stands, there is no issue with it at all
on PREEMPT_RT from the bpf pov.
bpf progs that attach to tracepoints are not sleepable.
They don't call rt_spinlock either.
Recognizing tracepoints that can sleep/fault and allowing
sleepable bpf progs there is on our to-do list,
but afaik it doesn't need any changes to the tracepoint infra.
There is no need to replace existing preempt_disable wrappers
with sleepable srcu_fast or anything else.
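
To make the point concrete, here is a rough userspace model of the
situation, not kernel code. Every model_* name below is made up for
illustration, and the "preempt count" is just an int standing in for the
real thing. It assumes only what is said above: the tracepoint wrapper
disables preemption around the probe call, and the bpf progs attached
today never try to sleep.

```c
/* Userspace model only: none of these are the kernel's real definitions. */
#include <assert.h>
#include <stdbool.h>

/* Stands in for the per-task preemption-disable nesting count. */
static int model_preempt_count;
/* Number of probes that wanted to sleep while preemption was "disabled". */
static int model_sleep_attempts_refused;

static void model_preempt_disable_notrace(void)
{
	model_preempt_count++;
}

static void model_preempt_enable_notrace(void)
{
	assert(model_preempt_count > 0);	/* catch unbalanced enable */
	model_preempt_count--;
}

typedef void (*model_probe_fn)(void *data);

/* Models a non-sleepable prog: it only does bounded, non-blocking work. */
static void model_plain_probe(void *data)
{
	int *hits = data;

	(*hits)++;
}

/*
 * Models a hypothetical sleepable prog: it may only "sleep" when
 * preemption is enabled, so under the wrapper below it is refused.
 */
static void model_sleepable_probe(void *data)
{
	bool *slept = data;

	if (model_preempt_count > 0) {
		model_sleep_attempts_refused++;
		*slept = false;
		return;
	}
	*slept = true;
}

/* Models the existing wrapper: probe runs with preemption disabled. */
static void model_do_trace(model_probe_fn probe, void *data)
{
	model_preempt_disable_notrace();
	probe(data);
	model_preempt_enable_notrace();
}
```

In this model, model_plain_probe() runs fine under model_do_trace(),
which is the case we have today. Only something like
model_sleepable_probe() would care about the wrapper, and no such prog
can attach to a tracepoint right now.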