Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits EEH

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Fri, 28 Mar 2025 12:27:16 +0100

On Thu, Mar 27 2025 at 16:36, Wen Xiong wrote:
> When secondary adapter doing a reset, we use the same code path as 
> removing operation. We can’t free irqs for Secondary adapter since 
> kernel has assigned the irqs for Secondary adapter.
>
> Actually we discussed about "calling pci_free_irq_vectors()" before 
> doing bist reset when we trying to fix in device driver.
> That might cause other problems. It is also not what a user would 
> expect. For example, if they disabled irq balance and manually setup irq 
> binding and affinity, if we go and free and reallocate the interrupts 
> across a reset, this would wipe out those changes, which would not be 
> expected.

You are completely missing the point. This is not a problem restricted
to PCI/MSI interrupts.

You have interrupts, work, queues and whatever in fully operational
state. Then you tell the firmware to reset the adapter and swap the
secondary adapter in. This reset operation brings the hardware temporary
into an undefined state as you have observed. How is any part of the
kernel, which can access and operate on this adapter, supposed to deal
with that correctly? 

You are now focussing solely on papering over the symptom in PCI/MSI
because that's where you actually observed the fallout.

But as you know curing the symptom is not fixing anything because the
underlying root cause still persists and will roar its ugly head at some
other place sooner than later.

So no, instead of sprinkling horrible band-aids all over the place, you
have to sit down and think about a properly orchestrated mechanism for
this failover scenario.

I understand that you want to preserve the state of the secondary
adapter including interrupt numbers and related settings, but for that
you have to properly freeze _all_ related kernel resources first, then
let the firmware do the swap and once that's complete unfreeze
everything again.

That will need support in other subsystems, like the interrupt layer,
but that needs to be properly designed and integrated so that it works
with all possible PCI/MSI backend implementations and under all
circumstances. Setting affinity is not the only way to make this go
wrong, there are other operations which access the underlying hardware
and config space. You just haven't triggered them yet, but they exist.

And it's not a problem restricted to the pSeries MSI backend
implementation either. That's just one way to expose it.

Even if I put the secondary swap problem aside and just look at a single
instance, __ipr_remove() is already broken in itself. As I pointed out
to you before:

    __ipr_remove()
      ipr_initiate_ioa_bringdown() 
	// resets device -> undefined state
	restore_config_space()
      ....
      ipr_free_all_resources()
	free_irqs()

Up to the point where interrupts are freed, they are fully operational
and exposed to user space. The related functionality can hit the
undefined state not only via set_affinity() independent of the swap
problem.

The adapter swap is just a worse variant of the above because it affects
the secondary side behind its back.

This needs quite some work to resolve properly, but that's the only way
to fix it once and forever. Proliferating the 'Plug and Pray' approach
with a lot of handwaving is just helping you to close your ticket, but
it's not a solution.

Thanks,

        tglx