During rcu_read_unlock_special(), if this happens during irq_exit(), we
can lock up if an IPI is issued. This is because the IPI itself triggers
the irq_exit() path, causing a recursive lockup.

This is precisely what Xiongfeng found when invoking a BPF program on the
trace_tick_stop() tracepoint, as shown in the trace below. Fix by using
context tracking to tell us if we're still in an IRQ: context tracking
keeps track of the IRQ until after the tracepoint, so it cures the issue.

 irq_exit()
   __irq_exit_rcu()
     /* in_hardirq() returns false after this */
     preempt_count_sub(HARDIRQ_OFFSET)
     tick_irq_exit()
       tick_nohz_irq_exit()
         tick_nohz_stop_sched_tick()
           trace_tick_stop()  /* a bpf prog is hooked on this trace point */
             __bpf_trace_tick_stop()
               bpf_trace_run2()
                 rcu_read_unlock_special()
                   /* will send an IPI to itself */
                   irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);

A simple reproducer can also be obtained by doing the following in
tick_irq_exit(). It will hang on boot without the patch:

  static inline void tick_irq_exit(void)
  {
 +	rcu_read_lock();
 +	WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true);
 +	rcu_read_unlock();
 +

While at it, add some comments to this code.

Reported-by: Xiongfeng Wang <wangxiongfeng2@xxxxxxxxxx>
Closes: https://lore.kernel.org/all/9acd5f9f-6732-7701-6880-4b51190aa070@xxxxxxxxxx/
Tested-by: Xiongfeng Wang <wangxiongfeng2@xxxxxxxxxx>
Signed-off-by: Joel Fernandes <joelagnelf@xxxxxxxxxx>
---
 kernel/rcu/tree_plugin.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3c0bbbbb686f..53d8b3415776 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -653,6 +653,9 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 		struct rcu_node *rnp = rdp->mynode;
 
+		// In cases where the RCU-reader is boosted, we'd attempt deboost sooner than
+		// later to prevent inducing latency to other RT tasks. Also, expedited GPs
+		// should not be delayed too much. Track both these needs in expboost.
 		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
 			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
 			   (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
@@ -670,10 +673,15 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			// Also if no expediting and no possible deboosting,
 			// slow is OK. Plus nohz_full CPUs eventually get
 			// tick enabled.
+			//
+			// Also prevent doing this if context-tracking thinks
+			// we're handling an IRQ (including when we're exiting
+			// one -- required to prevent self-IPI deadloops).
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
-			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
+			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu) &&
+			    !ct_in_irq()) {
 				// Get scheduler to re-evaluate and call hooks.
 				// If !IRQ_WORK, FQS scan will eventually IPI.
 				if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) &&
-- 
2.34.1