Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot

Chris Li <chrisl@xxxxxxxxxx> · Fri, 1 Aug 2025 16:04:39 -0700

On Thu, Jul 31, 2025 at 8:02 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Tue, Jul 29, 2025 at 06:51:27PM -0700, Chris Li wrote:
>
> > They follow a pattern that the original kernel needs to write to the
> > device and change the device state. The liveupdate device needs to
> > maintain the previous state not changed, therefore needs to prevent
> > such write initialization in liveupdate case.
>
> No, I fundamentally reject this position and your testing methodology.
>
> The new kernel *should* be writing to config space and it *should* be
> doing things like clearing and gaining control over MSI. It is fully
> wrong to be blocking it like you are doing just to satify some
> incorrect qemu based test checking for no config access.

First of all, let me clarify that the PCI PF and VF tests I mention in
the cover letter are run on the real data center servers, not qemu.
QEMU does not have the correct IOMMU simulation for my workstation
anyway. I do use qemu in development to quickly check if I screwed up
something badly. The real test is always on the real machine. Our
internal test dashboard has reached a high two digit number now, all
with real hardware.

With that out of the way. Let me explain why we did it the way we did.
I believe you and I eventually want the same thing, just different
ways to get there. I am also working on a series that allows fine
grain control of  PCI preservation. It allows the driver to select
exactly what needs to be preserved, rather than the current
"preserved" vs "depended" control. With the fine grain control, it can
basically do what you described, allow new kernel writes to config
space they don't want to preserve. However this RFC series is already
getting very long, that is why I did not include the fine grain
control series in this RFC. Keep in mind that this is just RFC, I want
to demonstrate the problem space, and what source code needs to be
modified in order to preserve all config space. It is not the final
version that gets merged. Your feedback is important to us.

My philosophy is that the LUO PCI subsystem is for service of the PCI
device driver. Ultimately it is the PCI device driver who decides what
part of the config space they want to preserve or overwrite. The PCI
layer is just there to facilitate that service.

Regarding the testing. There are many different tests we can write and
run. Preserving all config space is just one of them.  We also have
other tests that partially preserve the config space and write to some
config as it needs to. That is why I need to have the fine grain
control series.

If you still think it is unjustifiable to have one test try to
preserve all config space for liveupdate. Please elaborate your
reasoning. I am very curious.
With the fine grained control we let the driver decide what the driver
wants to preserve vs not, will that remove your objection?

> Only some config accesse are bad. Each and every "bad" one needs to be
> clearly explained *why* it is bad and only then mitigated.

That is exactly the reason why we have the conservative test that
preserves every config space test as a starting point. It does not
mean that is the ending point.  We also have tests that only partially
preserve the config space driver actually needs. When things break, we
can quickly compare to find out not preserving which register will
break which device. This incremental approach is very effective to
deal with very complex devices.

Another constraint is that the data center servers are dependent on
the network device able to connect to the network appropriately. Take
diorite NIC  for example, if I try only preserving ATS/PASID did not
finish the rest of liveupdate, the nic wasn't able to boot up and
connect to the network all the way. Even if the test passes for the
ATS part, the over test fails because the server is not back online. I
can't include that test into the test dashboard, because it brings
down the server. The only way to recover from that is rebooting the
server, which takes a long time for a big server. I can only keep that
non-passing test as my own private developing test, not the regression
test set.

That is the reason we to have some conservative tests passing first,
then expand to the more risky tests. We are actually quickly expanding
our test metrics for doing more and more interesting(and risky) stuff.

I hope that clarifies the eventual end goal and the development
approach we take.

> Most mitigation are far harder than just if'ing around the config
> write. My ATS/PASID/etc example for instance.

Exactly why we can't add those risky(non working) tests into the
dashboard before the conservative passing one.

Chris