Re: [PATCH v5 07/16] kexec: add Kexec HandOver (KHO) generation helpers

Jason Gunthorpe <jgg@xxxxxxxxxx> · Mon, 24 Mar 2025 23:20:51 -0300

On Mon, Mar 24, 2025 at 05:21:45PM -0700, Changyuan Lyu wrote:

> Thanks for the suggestions! I am a little bit concerned about assuming
> every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
> larger than PAGE_SIZE, I would like to turn the single u64 in the parent
> FDT into a u64 list to record all the underlying pages of the child FDT.

Maybe, but I'd suggest leaving some accommodation for this in the API
without implementing it until we see proof it is needed. 4k is a lot of
space for an FDT, and if you are doing per-object FDTs I don't see
any of them exceeding it.

For instance, the vfio, memfd, and iommufd object FDTs would not get
close.

> In this way we assume that most FDT fragments are smaller than 1 page so
> "kho,recursive-fdt" is usually just 1 u64, but we can also handle
> larger fragments if that really happens.

Yes, this is close to what I imagine.

You have to decide if the child FDT top pointers will be stored
directly in parent FDTs like you sketched above, or if they should be
stored in some dedicated, allocated and preserved data structure, the
way memory preservation works. There are tradeoffs in each
direction.
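
To make the first option concrete, here is a rough sketch of how a
"kho,recursive-fdt" style property payload could hold one or more page
addresses as big-endian u64 cells, which is how FDT property values are
laid out. The helper names (kho_encode_child_pages,
kho_decode_child_pages) are hypothetical and not taken from the patch
series; this is plain userspace C for illustration only, not kernel
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FDT property cells are big-endian; store one u64 that way. */
static void put_be64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (56 - 8 * i));
}

static uint64_t get_be64(const uint8_t *src)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | src[i];
	return v;
}

/*
 * Hypothetical helper: encode the page addresses backing a child FDT
 * into a property payload. The common case is npages == 1.
 * Returns the payload length in bytes, or 0 if it does not fit.
 */
static size_t kho_encode_child_pages(uint8_t *buf, size_t buf_len,
				     const uint64_t *pages, size_t npages)
{
	size_t need = npages * sizeof(uint64_t);

	assert(buf && pages);
	if (npages == 0 || need > buf_len)
		return 0;
	for (size_t i = 0; i < npages; i++)
		put_be64(buf + i * sizeof(uint64_t), pages[i]);
	return need;
}

/*
 * Hypothetical helper: decode the payload again. The page count is
 * implied by the property length. Returns the number of pages, or 0
 * on a malformed payload or if the output array is too small.
 */
static size_t kho_decode_child_pages(const uint8_t *prop, size_t prop_len,
				     uint64_t *pages, size_t max_pages)
{
	size_t n = prop_len / sizeof(uint64_t);

	assert(prop && pages);
	if (prop_len == 0 || prop_len % sizeof(uint64_t) || n > max_pages)
		return 0;
	for (size_t i = 0; i < n; i++)
		pages[i] = get_be64(prop + i * sizeof(uint64_t));
	return n;
}
```

With a layout like this the single-page case costs 8 bytes in the
parent, and a multi-page child only changes the property length, so the
API can leave room for it without anyone implementing it yet.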

> I also allow KHO users to add sub nodes in-place, instead of forcing
> to create a new FDT fragment for every sub node, if the KHO user is
> confident that those subnodes are small enough to fit in the parent
> node's page. In this way we do not need to waste a full page for a small
> sub node. An example is the "memblock" node above.

Well, I think that sort of misses the bigger picture. What we want is
to run serialization of everything in parallel. So merging like you
say will complicate that.

Really, I think we will have on the order of tens of objects to
serialize, so I don't really care if they use partial pages if that
makes the serialization faster. As long as the memory is freed once
the live update is done, the waste doesn't matter.

> Finally, the KHO top level FDT may also be larger than 1 page, this can
> be handled using the anchor-page method discussed in the previous mails.

This is one of the tradeoffs I mentioned. If you inline the objects
as FDT nodes then you have to scale the FDT and make it multi-page.

If you do a binary structure like memory preservation then you have to
serialize to something that is inherently scalable and 4k granular.

The 4k FDT limit really only works if you make liberal use of pointers
to binary data. Anything that is not of a predictable size limit would
be in some related binary structure.
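
As a rough illustration of what "inherently scalable and 4k granular"
could look like, here is a sketch of a chained list of page-sized
chunks, where the FDT would only need a single u64 pointing at the
first chunk. The struct and function names are hypothetical and are not
the layout used by the memory preservation code; the "next" field
stands in for a physical address and is filled with a plain pointer
value here because this is userspace C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 4096u	/* assumed page size */

struct chunk_hdr {
	uint64_t next;	/* address of the next chunk, 0 terminates */
	uint64_t count;	/* number of valid entries in this chunk */
};

#define ENTRIES_PER_CHUNK \
	((CHUNK_SIZE - sizeof(struct chunk_hdr)) / sizeof(uint64_t))

struct chunk {
	struct chunk_hdr hdr;
	uint64_t entries[ENTRIES_PER_CHUNK];
};

/* Each chunk must occupy exactly one page. */
static_assert(sizeof(struct chunk) == CHUNK_SIZE, "chunk must be one page");

/* Number of page-sized chunks needed for nentries values. */
static size_t chunks_needed(size_t nentries)
{
	return (nentries + ENTRIES_PER_CHUNK - 1) / ENTRIES_PER_CHUNK;
}

/*
 * Spread vals over caller-provided chunks and link them together.
 * Returns the number of chunks used, or 0 if there is nothing to
 * store or not enough chunks were provided.
 */
static size_t chunks_fill(struct chunk *chunks, size_t nchunks,
			  const uint64_t *vals, size_t nvals)
{
	size_t need = chunks_needed(nvals);

	if (need == 0 || need > nchunks)
		return 0;
	for (size_t c = 0; c < need; c++) {
		size_t base = c * ENTRIES_PER_CHUNK;
		size_t n = nvals - base;

		if (n > ENTRIES_PER_CHUNK)
			n = ENTRIES_PER_CHUNK;
		chunks[c].hdr.count = n;
		chunks[c].hdr.next = (c + 1 < need) ?
			(uint64_t)(uintptr_t)&chunks[c + 1] : 0;
		for (size_t i = 0; i < n; i++)
			chunks[c].entries[i] = vals[base + i];
	}
	return need;
}
```

The point of the sketch is that the unbounded part grows by adding
pages to the chain, while whatever sits in the FDT stays a fixed,
predictable size.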

So I'd suggest thinking about how to make multi-page FDT work in the
memory description, but not implementing it now. When we reach the
point where we know we need multi-page FDT, someone would have to
implement a growable FDT through vmap or something like that to make
it work.

Keep this initial step simple. We clearly don't need more than a 4k FDT
at this point, and we aren't doing a stable kexec ABI either. So
simplify, simplify, simplify to get a very thin, minimal piece of
functionality merged that the fdbox step can be put on top of.

Jason