On Thu, Jul 31, 2025 at 8:02 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > On Tue, Jul 29, 2025 at 06:51:27PM -0700, Chris Li wrote: > > > They follow a pattern that the original kernel needs to write to the > > device and change the device state. The liveupdate device needs to > > maintain the previous state not changed, therefore needs to prevent > > such write initialization in liveupdate case. > > No, I fundamentally reject this position and your testing methodology. > > The new kernel *should* be writing to config space and it *should* be > doing things like clearing and gaining control over MSI. It is fully > wrong to be blocking it like you are doing just to satify some > incorrect qemu based test checking for no config access. First of all, let me clarify that the PCI PF and VF tests I mention in the cover letter are run on the real data center servers, not qemu. QEMU does not have the correct IOMMU simulation for my workstation anyway. I do use qemu in development to quickly check if I screwed up something badly. The real test is always on the real machine. Our internal test dashboard has reached a high two digit number now, all with real hardware. With that out of the way. Let me explain why we did it the way we did. I believe you and I eventually want the same thing, just different ways to get there. I am also working on a series that allows fine grain control of PCI preservation. It allows the driver to select exactly what needs to be preserved, rather than the current "preserved" vs "depended" control. With the fine grain control, it can basically do what you described, allow new kernel writes to config space they don't want to preserve. However this RFC series is already getting very long, that is why I did not include the fine grain control series in this RFC. Keep in mind that this is just RFC, I want to demonstrate the problem space, and what source code needs to be modified in order to preserve all config space. It is not the final version that gets merged. Your feedback is important to us. My philosophy is that the LUO PCI subsystem is for service of the PCI device driver. Ultimately it is the PCI device driver who decides what part of the config space they want to preserve or overwrite. The PCI layer is just there to facilitate that service. Regarding the testing. There are many different tests we can write and run. Preserving all config space is just one of them. We also have other tests that partially preserve the config space and write to some config as it needs to. That is why I need to have the fine grain control series. If you still think it is unjustifiable to have one test try to preserve all config space for liveupdate. Please elaborate your reasoning. I am very curious. With the fine grained control we let the driver decide what the driver wants to preserve vs not, will that remove your objection? > Only some config accesse are bad. Each and every "bad" one needs to be > clearly explained *why* it is bad and only then mitigated. That is exactly the reason why we have the conservative test that preserves every config space test as a starting point. It does not mean that is the ending point. We also have tests that only partially preserve the config space driver actually needs. When things break, we can quickly compare to find out not preserving which register will break which device. This incremental approach is very effective to deal with very complex devices. Another constraint is that the data center servers are dependent on the network device able to connect to the network appropriately. Take diorite NIC for example, if I try only preserving ATS/PASID did not finish the rest of liveupdate, the nic wasn't able to boot up and connect to the network all the way. Even if the test passes for the ATS part, the over test fails because the server is not back online. I can't include that test into the test dashboard, because it brings down the server. The only way to recover from that is rebooting the server, which takes a long time for a big server. I can only keep that non-passing test as my own private developing test, not the regression test set. That is the reason we to have some conservative tests passing first, then expand to the more risky tests. We are actually quickly expanding our test metrics for doing more and more interesting(and risky) stuff. I hope that clarifies the eventual end goal and the development approach we take. > Most mitigation are far harder than just if'ing around the config > write. My ATS/PASID/etc example for instance. Exactly why we can't add those risky(non working) tests into the dashboard before the conservative passing one. Chris