Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot

Chris Li <chrisl@xxxxxxxxxx> · Tue, 29 Jul 2025 18:51:27 -0700

Hi Thomas,

On Mon, Jul 28, 2025 at 10:23 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 28 2025 at 01:24, Chris Li wrote:
> > The liveupdate devices are already initialized by the kernel before the
> > kexec. During the kexec the device is still running. Avoid write to the
> > liveupdate devices during the new kernel boot up.
>
> This change log is way too meager for this kind of change.

I agree with you. I mention it in the cover letter, I do expect this
part of change to be controversial. This RFC series is just to kick
off the discussion for PCI device liveupdate.

>  1) You want to explain in detail how this works.
>     "initialized by the kernel before the kexec" is as vague as it gets.

Agree. Sorry I haven't included more documents in this series. Working on it.

>
>  2) Avoid write ....
>
>     Again this lacks any information how this is supposed to work correctly.

I guess I haven't presented the big picture of how liveupdate works
with a PCI device.

Let's start with the background why we want to do this. We want to
upgrade a host kernel, which has a VM running with a GPU device
attached to the VM via vfio_pci. We want the host kernel upgrade in a
way that the VM can continue without shutting down and restarting the
VM. The VM will pause during the host kernel kexec. The GPU device
will continue running and DMA without pausing. VM will not be able to
run the interrupt until the new kernel is finished booting and resume
the VM.

Pasha's LUO series already have designs on the liveupdate state, with
callback associated with the state.

https://lore.kernel.org/lkml/20250515182322.117840-1-pasha.tatashin@xxxxxxxxxx/

I copy paste some of Pasha's LUO state here:
==========quote==========
LUO State Machine and Events:

NORMAL:   Default operational state.
PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE
          event. Subsystems have saved initial state.
FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE
          event, just before kexec. Workloads must be suspended.
UPDATED:  Next kernel has booted via live update. Awaiting restoration
          and LIVEUPDATE_FINISH.

Events:
LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
==========quote ends ===========

The PCI core register will register as a subsystem to LUO and
participate in the LUO callbacks.
1) In NORMAL state:
The PCI device will register to the PCI subsystem by setting the
pci_dev->dev.lu.requested flag.

2) PREPARE callback. The PCI subsystem will build the list of the PCI
devices using the PCI device dependency. VF depends on PF, PCI devices
depend on the parent bridge.

The PCI subsystem will save the struct pci_dev part of the pci device
state. Then forward the prepare callback to the PCI devices to
serialize the PCI devices driver state. The VM  is still running but
with some limitations. e.g. can't create new DMA mapping. can't attach
to an additional new vfio_pci device.

3) FREEZE callback: VM is paused. Last change for PCI device to
serialize the device state.

4) kexec booting up the new kernel.

5) PCI device enumeration and probing. Find the PCI device in the
serialized preserved device list, restore the device serialized data
pointer for PCI device. PF device probe(), restores the number of  VF
and creates the VF, the VF device probe()

6) VM re-attach to the requested PCI device via vfio_pci.

7) FINISH callback. PCI subsystem and PCI devices free their preserved
serialized data. System go back to NORMAL state.

8) VM resume running.

>
> >  drivers/pci/ats.c            |  7 ++--
> >  drivers/pci/iov.c            | 58 ++++++++++++++++++------------
> >  drivers/pci/msi/msi.c        | 32 ++++++++++++-----
> >  drivers/pci/msi/pcidev_msi.c |  4 +--
> >  drivers/pci/pci-acpi.c       |  3 ++
> >  drivers/pci/pci.c            | 85 +++++++++++++++++++++++++++++---------------
> >  drivers/pci/pci.h            |  9 ++++-
> >  drivers/pci/pcie/aspm.c      |  7 ++--
> >  drivers/pci/pcie/pme.c       | 11 ++++--
> >  drivers/pci/probe.c          | 43 +++++++++++++++-------
> >  drivers/pci/setup-bus.c      | 10 +++++-
>
> Then you sprinkle this stuff into files, which have completely different
> purposes, without any explanation for the particular instances why they
> are supposed to be correct and how this works.

They follow a pattern that the original kernel needs to write to the
device and change the device state. The liveupdate device needs to
maintain the previous state not changed, therefore needs to prevent
such write initialization in liveupdate case.

I can certainly split it into more patches and group them by functions
in the later series.
This patch does it in a whole sale just to demonstrate what needs to
happen to make a device live update.

>
> I'm just looking at the MSI parts, as I have no expertise with the rest.

Thank you for your feedback, that is very helpful.

>
> > diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> > index 6ede55a7c5e652c80b51b10e58f0290eb6556430..7c40fde1ba0f89ad1d72064ac9e80696faeab426 100644
> > --- a/drivers/pci/msi/msi.c
> > +++ b/drivers/pci/msi/msi.c
> > @@ -113,7 +113,8 @@ static int pci_setup_msi_context(struct pci_dev *dev)
> >
> >  void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
> >  {
> > -     raw_spinlock_t *lock = &to_pci_dev(desc->dev)->msi_lock;
> > +     struct pci_dev *pci_dev = to_pci_dev(desc->dev);
> > +     raw_spinlock_t *lock = &pci_dev->msi_lock;
> >       unsigned long flags;
> >
> >       if (!desc->pci.msi_attrib.can_mask)
> > @@ -122,8 +123,9 @@ void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
> >       raw_spin_lock_irqsave(lock, flags);
> >       desc->pci.msi_mask &= ~clear;
> >       desc->pci.msi_mask |= set;
> > -     pci_write_config_dword(msi_desc_to_pci_dev(desc), desc->pci.mask_pos,
> > -                            desc->pci.msi_mask);
> > +     if (!pci_lu_adopt(pci_dev))
> > +             pci_write_config_dword(pci_dev, desc->pci.mask_pos,
> > +                                    desc->pci.msi_mask);
>
> This results in inconsistent state, which is a bad idea to begin
> with. How is cached software state and hardware state going to be
> brought in sync at some point?

Yes, to make the interrupt fully working we need to tell the new
kernel about the previous kernel's interrupt descriptor in IOMMU etc.
As it is, the liveupdate device interrupt is not fully working yet.
David is working on the interrupt and later there will be an interrupt
series to make interrupt working with liveupdate devices. This is just
the first baby step.

>
> If you analyzed all places, which actually depend on hardware state and
> make decisions based on it, for correctness, then you failed to provide
> that analysis. If not, no comment.

Let me clarify. This avoid writing to devices only applies to
liveupdate devices. Only between FREEZE and FINISH. After the LUO
finish(), LUO is back to normal state again. The device can be
writable again as normal, most likely by the VM. We don't want the
device state to change between FREEZE and FINISH.

>
> >       raw_spin_unlock_irqrestore(lock, flags);
> >  }
> >
> > @@ -190,6 +192,9 @@ static inline void pci_write_msg_msi(struct pci_dev *dev, struct msi_desc *desc,
> >       int pos = dev->msi_cap;
> >       u16 msgctl;
> >
> > +     if (pci_lu_adopt(dev))
> > +             return;
> > +
> >       pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
> >       msgctl &= ~PCI_MSI_FLAGS_QSIZE;
> >       msgctl |= FIELD_PREP(PCI_MSI_FLAGS_QSIZE, desc->pci.msi_attrib.multiple);
> > @@ -214,6 +219,8 @@ static inline void pci_write_msg_msix(struct msi_desc *desc, struct msi_msg *msg
> >
> >       if (desc->pci.msi_attrib.is_virtual)
> >               return;
> > +     if (pci_lu_adopt(to_pci_dev(desc->dev)))
> > +             return;
>
> So you don't allow the new kernel to write the MSI message, but the
> interrupt subsystem has this new message and there are places which
> utilize that cached message. How is this supposed to work?

We don't allow the PCI subsystem or driver to write the MSI message
before FINISH.
There are two possible ways. 1) Have someone save the incoming MSI
message somehow, and re-deliver them after the FINISH call. 2) Don't
save the MSI message between FREEZE and FINISH. At finish, deliver one
spurious interrupt to the device driver, so the device driver can have
a chance to check if there is any pending work that needs to be done.
It is possible that no MSI has been dropped, the driver finds out
there is nothing that needs to be done. We expect the driver can
tolerate such one time spurious interruptions. Because spurious
interruptions can happen for other reasons, that should be fine? Let
me know if there is a case where this kind of spurious interrupt can
cause a problem, we are very interested to know.

>
> >       /*
> >        * The specification mandates that the entry is masked
> >        * when the message is modified:
> > @@ -279,7 +286,8 @@ static void pci_msi_set_enable(struct pci_dev *dev, int enable)
> >       control &= ~PCI_MSI_FLAGS_ENABLE;
> >       if (enable)
> >               control |= PCI_MSI_FLAGS_ENABLE;
> > -     pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);
> > +     if (!pci_lu_adopt(dev))
> > +             pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);
>
> The placement of these conditionals is arbitrary. Some are the begin of
> a function, others just block the write. Is that based on some logic or
> were the places selected by shabby AI queries?
They all can be converted to the pattern as:
if (!pci_luo_adopt(dev))
      pci_write_config_xxx().

Sometimes I choose to return early if there is multiple write but not
data stored in struct pci_dev. Mostly just try to reduce the number of
if (!pci_luo_adopt(dev)). I am not satisfied with this change yet. The
goal of this patch is to show what effect needs to happen, we can
discuss better ways to do it.

>
> >  static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
> > @@ -553,6 +561,7 @@ static void pci_msix_clear_and_set_ctrl(struct pci_dev *dev, u16 clear, u16 set)
> >  {
> >       u16 ctrl;
> >
> > +     BUG_ON(pci_lu_adopt(dev));
>
> Not going to happen. BUG() is only appropriate when there is absolutely
> no way to handle a situation. This is as undocumented as everything else
> here.

Agree. This is some developing/debug stuff left over. I haven't
encountered msix_clear_and_set_ctrl() in my test. I will remove the
bug in the next version.

>
> >       pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &ctrl);
> >       ctrl &= ~clear;
> >       ctrl |= set;
> > @@ -720,8 +729,9 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
> >        * registers can be accessed.  Mask all the vectors to prevent
> >        * interrupts coming in before they're fully set up.
> >        */
> > -     pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL |
> > -                                 PCI_MSIX_FLAGS_ENABLE);
> > +     if (!pci_lu_adopt(dev))
> > +             pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL |
> > +                                         PCI_MSIX_FLAGS_ENABLE);
>
> And for enhanced annoyance you sprinkle this condition everywhere into
> the code and then BUG() when you missed an instance. Because putting it
> into the function which is invoked a gazillion of times would be too
> obvious, right? That would at least be tasteful, but that's not the
> primary problem of all this.
>
> Sprinkling these conditionals all over the place is absolutely
> unmaintainable, error prone and burdens everyone with this insanity and
> the related hard to chase bugs.

If you prefer, I can move them all into the pci_config_write. We
actually start with pci_config_write_xxx(). But that solution has its
own problem as well.  For starters, the function name does not reflect
what the function actually does any more. Also for the complicated
case, where liveupdate does need to write some config register but not
the other. e.g. From the live update point of view, PF devices
shouldn't write to SR-IOV related registers that change the VF devices
number. But PF devices should be able to tolerate some other config
space write, because the VM is not using the PF device. The PF device
state can be changed without impacting the VM.
It is going to be unmaintainable to make a complicated logic inside
pci_config_write_xxx(), depending on which caller and what state, what
is allowed and what is not.

I can discuss and try different approaches to address this problem. I
understand it is a hard problem. I don't have a perfect solution
without cons. This is just the first baby step to demonstrate what is
the resulting effect we want. Then we can shape the code to our
liking. I am happy to explore other approaches as well.

>
> Especially as there is no concept behind this and zero documentation how
> any of this should work or even be remotely correct.

I hope the above description can help you understand better why we
want to do it and the approach we take. I am happy to answer questions
if you have any. Mind you that I don't have all the answers. It is
part of the journey to find the best solution.
>
> Before you start the next hackery, please sit down and write up coherent
> explanations:
>
>   What is the general concept of this?

See above.

>
>   What is the exact state in which a device is left when the old kernel
>   jumps into the new kernel?

The device allows DMA to the mapping region during PREPARE and raise
interrupt. The interrupt handler will not be able to run during kexec
black out period (between freeze and finish). Other than the state
store in the device, there is also a PCI subsystem and device driver
state serialized in the preserved folio for the next kernel to
interrupt.

>   What is the state of the MSI[-X] or legacy PCI interrupts at this
>   point?

The current approach is that, just drop the interrupt during black out
period (between freeze and finish) then deliver a spurious interrupt
to the device at finish(), that gives the device driver a chance to
perform the interrupt handler action which can't happen in black out.

>
>   Can the device raise interrupts during the transition from the old to
>   the new kernel?

Yes, can raise interrupt but interrupt handle won't able to run during
black out.
After finish() it is business as usual.
>
>   How is the "live" state of the device reflected and restored
>   throughout the interrupt subsystem?

Those are very good questions. Current approach just drop them and use
the spurious interrupt to catch up in the end.
>
>   How is the device driver supposed to attach to the same interrupt
>   state as before?

We can't if we did not save the interrupt state changed during black
out. Current approach is just using a spurious interrupt to catch up
in finish().

>
>   How are the potentially different Linux interrupt numbers mapped to
>   the previous state?
The IRQ number will remain the state cross kexec. However the
interrupt descriptor address might have changed in the new kernel. We
need to save some of the interrupt descriptor and interrupt state into
the preserved folio for the next kernel to rebuild. To be continued in
the interrupt series. Not covered by this patch series yet.
>
> Before this materializes and is agreed on, this is not going anywhere.

Those are very good questions. Hopefully I have answered some of it.
Please let me know if you have more questions I can clarify.

Again this is just an RFC to show what was the resulting effect we
want to get from the PCI device livedupate. It is not complete nor
perfect. I am happy to explore different approaches.

Thanks for the questions. I still owe you a write up document for the
PCI device liveupdate. I will work on that.

Hope that helps explain some of the background and approach. It is not
a substitution of the document. I am working on that and will include
it in the next version.

Chris

Chris