On Mon, Jul 14, 2025 at 7:56 AM Pratyush Yadav <pratyush@xxxxxxxxxx> wrote:
> On Thu, Jun 26 2025, David Matlack wrote:
> > On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@xxxxxxxxxx> wrote:
> >> On Wed, Jun 25 2025, David Matlack wrote:
> >> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >> >> >
> >> >> > While I agree that a filesystem offers superior introspection and
> >> >> > integration with standard tools, building this complex, stateful
> >> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> >> > direct and explicit way to command the state machine and manage these
> >> >> > complex lifecycle and dependency rules.
> >> >>
> >> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> >> but...
> >> >>
> >> >> You're using a character device that's tied to devtmpfs. In other words,
> >> >> you're already using a filesystem interface. Literally the whole code
> >> >> here is built on top of filesystem APIs. So this argument is just very
> >> >> wrong imho. If you can build it on top of a character device using VFS
> >> >> interfaces you can do it as a minimal filesystem.
> >> >>
> >> >> You're free to define the filesystem interface any way you like it. We
> >> >> have a ton of examples there. All your ioctls would just be tied to the
> >> >> filesystem instance instead of the /dev/somethingsomething character
> >> >> device. The state machine could just be implemented the same way.
> >> >>
> >> >> One of my points is that with an fs interface you can have easy state
> >> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> >> machines running as services or some networking services or whatever.
> >> >> You could just bind-mount an instance of kexecfs into the service and
> >> >> the service can persist state into the instance and easily recover it
> >> >> after kexec.
> >> >
> >> > This approach sounds worth exploring more. It would avoid the need for
> >> > a centralized daemon to mediate the preservation and restoration of
> >> > all file descriptors.
> >>
> >> One of the jobs of the centralized daemon is to decide the _policy_ of
> >> who gets to preserve things and more importantly, make sure the right
> >> party unpreserves the right FDs after a kexec. I don't see how this
> >> interface fixes this problem. You would still need a way to identify
> >> which kexecfs instance belongs to who and enforce that. The kernel
> >> probably shouldn't be the one doing this kind of policy so you still
> >> need some userspace component to make those decisions.
> >
> > The main benefit I see of kexecfs is that it avoids needing to send
> > all FDs over UDS to/from liveupdated and therefore the need for
> > dynamic cross-process communication (e.g. RPCs).
> >
> > Instead, something just needs to set up a kexecfs for each VM when it
> > is created, and give the same kexecfs back to each VM after kexec.
> > Then VMs are free to save/restore any FDs in that kexecfs without
> > cross-process communication or transferring file descriptors.
>
> Isn't giving back the right kexecfs instance to the right VMM the main
> problem? After a kexec, you need a way to make that policy decision. You
> would need a userspace agent to do that.
>
> I think what you are suggesting does make a lot of sense -- the agent
> should be handing out sessions instead of FDs, which would make FD
> save/restore simpler for applications. But that can be done using the
> ioctl interface as well. Each time you open() the /dev/liveupdate, you
> get a new session. Instead of file FDs like memfd or iommufs, we can
> have the agent hand out these session FDs and anything that was saved
> using this session would be ready for restoring.
>
> My main point is that this can be done with the current interface as
> well as kexecfs.
> I think there is very much a reason for considering
> kexecfs (like not being dependent on devtmpfs), but I don't think this
> is necessarily the main one.

The main problem I'd like solved is requiring all FDs to be preserved and
restored in the context of a central daemon, since I think this will
inevitably cause problems for KVM. I agree with you that this problem
can also be solved in other ways, such as session FDs (good idea!).

>
> >
> > Policy can be enforced by controlling access to kexecfs mounts. This
> > naturally fits into the standard architecture of running untrusted VMs
> > (e.g. using chroots and containers to enforce security and isolation).
>
> How? After a kexec, how do you tell which process can get which kexecfs
> mount/instance? If any of them can get any, then we lose all sorts of
> policy enforcement.

I was imagining that whatever process/daemon creates the kexecfs
instances before kexec is also responsible for reassociating them with
the right processes after kexec.

If you are asking how that association would be done mechanically, I was
imagining it would be through a combination of filesystem permissions,
mounts, and chroots. For example, the kexecfs instance for VM A would be
mounted in VM A's chroot. VM A would then only have access to its own
kexecfs instance.

> >> > I'm not sure that we can get rid of the machine-wide state machine
> >> > though, as there is some kernel state that will necessarily cross
> >> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> >> > need /dev/liveupdate for that.
> >>
> >> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> >> are more or less equally expressive/powerful. Most of the ioctl
> >> operations can be translated to a VFS operation and vice versa.
> >>
> >> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> >> both would create a live update session which auto closes when the FD is
> >> closed or FS unmounted.
> >> Similarly, each ioctl can be replaced with a
> >> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> >> replaced with a fd_preserve file where you write() the FD number.
> >> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> >> replaced by a "state" file where you can read() or write() the state.
> >>
> >> I think the main benefit of the VFS-based interface is ease of use.
> >> There already exist a bunch of utilities and libraries that we can use to
> >> interact with files. When we have ioctls, we would need to write
> >> everything ourselves. For example, instead of
> >> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> >> easier to do.
> >>
> >> As for downsides, I think we might end up with a bit more boilerplate
> >> code, but beyond that I am not sure.
> >
> > I agree we can more or less get to the same end state with either
> > approach. And also, I don't think we have to do one or the other. I
> > think kexecfs is something that we can build on top of this series.
> > For example, kexecfs would be a new kernel subsystem that registers
> > with LUO.
>
> Yeah, fair point. Though I'd rather we agree on one and go with that.
> Having two interfaces for the same thing isn't the best.

Agreed, it would be better to have a single way to preserve FDs rather
than two (LUO ioctl and kexecfs).
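
To make the file-based variant quoted above a bit more concrete, here is
a rough userspace sketch of what talking to a kexecfs instance might look
like. It is only an illustration under assumptions: the "state" and
"fd_preserve" file names come from the description earlier in the thread,
but the helper names, the state strings, and the newline-terminated
decimal format are all made up here and are not a proposed ABI. The
helpers only do plain file I/O on a directory, so they can be tried
against any scratch directory.

```python
import os


def read_state(instance_dir):
    """Read the hypothetical 'state' file (the 'cat state' case)."""
    with open(os.path.join(instance_dir, "state"), "r") as f:
        return f.read().strip()


def request_state(instance_dir, new_state):
    """Request a transition by writing a state name to 'state'.

    The state names are placeholders; a real interface would define them.
    """
    with open(os.path.join(instance_dir, "state"), "w") as f:
        f.write(new_state + "\n")


def preserve_fd(instance_dir, fd):
    """Ask for an FD to be preserved by writing its number to 'fd_preserve'.

    Assumes the number is passed as newline-terminated decimal text.
    """
    with open(os.path.join(instance_dir, "fd_preserve"), "w") as f:
        f.write(f"{fd}\n")
```

A VMM confined to its own chroot would call these on its own mount only,
e.g. preserve_fd("/kexecfs", fd) followed by
request_state("/kexecfs", "prepare"), where the mount point and the
"prepare" string are again just placeholders. Nothing here needs an ioctl
wrapper, which is the ease-of-use argument in a nutshell.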