Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface

David Matlack <dmatlack@xxxxxxxxxx> · Thu, 17 Jul 2025 09:17:17 -0700

On Mon, Jul 14, 2025 at 7:56 AM Pratyush Yadav <pratyush@xxxxxxxxxx> wrote:
> On Thu, Jun 26 2025, David Matlack wrote:
> > On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@xxxxxxxxxx> wrote:
> >> On Wed, Jun 25 2025, David Matlack wrote:
> >> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >> >> >
> >> >> > While I agree that a filesystem offers superior introspection and
> >> >> > integration with standard tools, building this complex, stateful
> >> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> >> > direct and explicit way to command the state machine and manage these
> >> >> > complex lifecycle and dependency rules.
> >> >>
> >> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> >> but...
> >> >>
> >> >> You're using a character device that's tied to devtmpfs. In other words,
> >> >> you're already using a filesystem interface. Literally the whole code
> >> >> here is built on top of filesystem APIs. So this argument is just very
> >> >> wrong imho. If you can build it on top of a character device using VFS
> >> >> interfaces you can do it as a minimal filesystem.
> >> >>
> >> >> You're free to define the filesystem interface any way you like it. We
> >> >> have a ton of examples there. All your ioctls would just be tied to the
> >> >> filesystem instance instead of the /dev/somethingsomething character
> >> >> device. The state machine could just be implemented the same way.
> >> >>
> >> >> One of my points is that with an fs interface you can have easy state
> >> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> >> machines running as services or some networking services or whatever.
> >> >> You could just bind-mount an instance of kexecfs into the service and
> >> >> the service can persist state into the instance and easily recover it
> >> >> after kexec.
> >> >
> >> > This approach sounds worth exploring more. It would avoid the need for
> >> > a centralized daemon to mediate the preservation and restoration of
> >> > all file descriptors.
> >>
> >> One of the jobs of the centralized daemon is to decide the _policy_ of
> >> who gets to preserve things and more importantly, make sure the right
> >> party unpreserves the right FDs after a kexec. I don't see how this
> >> interface fixes this problem. You would still need a way to identify
> >> which kexecfs instance belongs to who and enforce that. The kernel
> >> probably shouldn't be the one doing this kind of policy so you still
> >> need some userspace component to make those decisions.
> >
> > The main benefit I see of kexecfs is that it avoids needing to send
> > all FDs over UDS to/from liveupdated and therefore the need for
> > dynamic cross-process communication (e.g. RPCs).
> >
> > Instead, something just needs to set up a kexecfs for each VM when it
> > is created, and give the same kexecfs back to each VM after kexec.
> > Then VMs are free to save/restore any FDs in that kexecfs without
> > cross-process communication or transferring file descriptors.
>
> Isn't giving back the right kexecfs instance to the right VMM the main
> problem? After a kexec, you need a way to make that policy decision. You
> would need a userspace agent to do that.
>
> I think what you are suggesting does make a lot of sense -- the agent
> should be handing out sessions instead of FDs, which would make FD
> save/restore simpler for applications. But that can be done using the
> ioctl interface as well. Each time you open() the /dev/liveupdate, you
> get a new session. Instead of file FDs like memfd or iommufd, we can
> have the agent hand out these session FDs and anything that was saved
> using this session would be ready for restoring.
>
> My main point is that this can be done with the current interface as
> well as kexecfs. I think there is very much a reason for considering
> kexecfs (like not being dependent on devtmpfs), but I don't think this
> is necessarily the main one.

The main problem I'd like solved is requiring all FDs to be preserved and
restored in the context of a central daemon, since I think this will
inevitably cause problems for KVM. I agree with you that this problem
can also be solved in other ways, such as session FDs (good idea!).
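
To make the session FD idea concrete, here is a rough userspace sketch
of an agent handing one session FD to a VMM over a Unix socket with
SCM_RIGHTS. It is illustration only: the function names are made up, and
a pipe stands in for the session FD, on the assumption that open() on
/dev/liveupdate would return one as suggested above. Only the single
session FD crosses the process boundary; everything saved through it
would stay with the VMM.

```python
import os
import socket


def hand_out_session(agent_sock: socket.socket, session_fd: int, vm_name: str) -> None:
    """Agent side: send one session FD, tagged with the VM it belongs to."""
    # socket.send_fds() wraps sendmsg() with an SCM_RIGHTS control message.
    socket.send_fds(agent_sock, [vm_name.encode()], [session_fd])


def receive_session(vmm_sock: socket.socket) -> tuple[str, int]:
    """VMM side: receive the tag and the session FD."""
    data, fds, _flags, _addr = socket.recv_fds(vmm_sock, 64, 1)
    return data.decode(), fds[0]


def demo() -> tuple[str, bytes]:
    """Round trip inside one process, using a pipe as a stand-in session FD."""
    agent_sock, vmm_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Stand-in for a session FD; a real one would come from the kernel.
    read_end, write_end = os.pipe()
    os.write(write_end, b"preserved-state")
    os.close(write_end)
    try:
        hand_out_session(agent_sock, read_end, "vm-a")
        os.close(read_end)  # the agent no longer needs its copy
        name, received_fd = receive_session(vmm_sock)
        try:
            payload = os.read(received_fd, 64)
        finally:
            os.close(received_fd)
        return name, payload
    finally:
        agent_sock.close()
        vmm_sock.close()
```

The policy question (which VMM gets which session) still lives in the
agent; the sketch only shows that the handoff itself is a single FD
transfer rather than one transfer per preserved FD.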

>
> >
> > Policy can be enforced by controlling access to kexecfs mounts. This
> > naturally fits into the standard architecture of running untrusted VMs
> > (e.g. using chroots and containers to enforce security and isolation).
>
> How? After a kexec, how do you tell which process can get which kexecfs
> mount/instance? If any of them can get any, then we lose all sorts of
> policy enforcement.

I was imagining that whatever process/daemon creates the kexecfs
instances before kexec is also responsible for reassociating them with
the right processes after kexec.

If you are asking how that association would be done mechanically, I
was imagining it would be through a combination of filesystem
permissions, mounts, and chroots. For example, the kexecfs instance
for VM A would be mounted in VM A's chroot. VM A would then only have
access to its own kexecfs instance.
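
As a rough model of that chroot-based association (not real kexecfs
code, since no such filesystem exists yet), the sketch below computes
where a per-VM instance would be mounted and shows that a path resolved
from inside VM A's chroot cannot name VM B's instance. The base
directory, mount point name, and function names are all assumptions
made for illustration, and symlinks are not modeled.

```python
import posixpath

# Hypothetical layout: each VM has its own chroot, and its kexecfs
# instance is mounted at a fixed path inside that chroot.
CHROOT_BASE = "/srv/vms"        # assumed location of per-VM chroots
KEXECFS_MOUNTPOINT = "kexecfs"  # assumed mount point name inside a chroot


def mount_target(vm_name: str) -> str:
    """Host path where the kexecfs instance for vm_name would be mounted."""
    return posixpath.join(CHROOT_BASE, vm_name, KEXECFS_MOUNTPOINT)


def resolve_in_chroot(vm_name: str, path: str) -> str:
    """Map a path as seen from inside the VM's chroot to a host path.

    As with chroot(2), ".." cannot climb above the chroot root:
    normalizing an absolute path drops any leading ".." components.
    """
    inside = posixpath.normpath(posixpath.join("/", path))
    return posixpath.join(CHROOT_BASE, vm_name, inside.lstrip("/"))
```

With this layout, resolve_in_chroot("vm-a", "/kexecfs") is the same host
path as mount_target("vm-a"), while resolve_in_chroot("vm-a",
"../vm-b/kexecfs") stays under /srv/vms/vm-a/ and never reaches VM B's
mount. The daemon that created the instances only has to redo the same
mounts after kexec.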

> >> > I'm not sure that we can get rid of the machine-wide state machine
> >> > though, as there is some kernel state that will necessarily cross
> >> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> >> > need /dev/liveupdate for that.
> >>
> >> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> >> are more or less equally expressive/powerful. Most of the ioctl
> >> operations can be translated to a VFS operation and vice versa.
> >>
> >> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> >> both would create a live update session which auto closes when the FD is
> >> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> >> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> >> replaced with a fd_preserve file where you write() the FD number.
> >> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> >> replaced by a "state" file where you can read() or write() the state.
> >>
> >> I think the main benefit of the VFS-based interface is ease of use.
> >> There already exist a bunch of utilities and libraries that we can use to
> >> interact with files. When we have ioctls, we would need to write
> >> everything ourselves. For example, instead of
> >> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> >> easier to do.
> >>
> >> As for downsides, I think we might end up with a bit more boilerplate
> >> code, but beyond that I am not sure.
> >
> > I agree we can more or less get to the same end state with either
> > approach. And also, I don't think we have to do one or the other. I
> > think kexecfs is something that we can build on top of this series.
> > For example, kexecfs would be a new kernel subsystem that registers
> > with LUO.
>
> Yeah, fair point. Though I'd rather we agree on one and go with that.
> Having two interfaces for the same thing isn't the best.

Agreed, it would be better to have a single way to preserve FDs rather
than two (LUO ioctl and kexecfs).