On Thu, Jun 26, 2025 at 05:42:28PM +0200, Pratyush Yadav wrote:
> On Wed, Jun 25 2025, David Matlack wrote:
>
> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >> >
> >> > While I agree that a filesystem offers superior introspection and
> >> > integration with standard tools, building this complex, stateful
> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> > direct and explicit way to command the state machine and manage these
> >> > complex lifecycle and dependency rules.
> >>
> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> but...
> >>
> >> You're using a character device that's tied to devtmpfs. In other words,
> >> you're already using a filesystem interface. Literally the whole code
> >> here is built on top of filesystem APIs. So this argument is just very
> >> wrong imho. If you can build it on top of a character device using VFS
> >> interfaces you can do it as a minimal filesystem.
> >>
> >> You're free to define the filesystem interface any way you like it. We
> >> have a ton of examples there. All your ioctls would just be tied to the
> >> filesystem instance instead of the /dev/somethingsomething character
> >> device. The state machine could just be implemented the same way.
> >>
> >> One of my points is that with an fs interface you can have easy state
> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> machines running as services or some networking services or whatever.
> >> You could just bind-mount an instance of kexecfs into the service and
> >> the service can persist state into the instance and easily recover it
> >> after kexec.
> >
> > This approach sounds worth exploring more. It would avoid the need for
> > a centralized daemon to mediate the preservation and restoration of
> > all file descriptors.
>
> One of the jobs of the centralized daemon is to decide the _policy_ of
> who gets to preserve things and more importantly, make sure the right
> party unpreserves the right FDs after a kexec. I don't see how this
> interface fixes this problem. You would still need a way to identify
> which kexecfs instance belongs to who and enforce that. The kernel
> probably shouldn't be the one doing this kind of policy so you still
> need some userspace component to make those decisions.
>
> >
> > I'm not sure that we can get rid of the machine-wide state machine
> > though, as there is some kernel state that will necessarily cross
> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> > need /dev/liveupdate for that.
>
> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> are more or less equally expressive/powerful. Most of the ioctl
> operations can be translated to a VFS operation and vice versa.
>
> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> both would create a live update session which auto closes when the FD is
> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> replaced with a fd_preserve file where you write() the FD number.
> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> replaced by a "state" file where you can read() or write() the state.
>
> I think the main benefit of the VFS-based interface is ease of use.
> There already exist a bunch of utilities and libraries that we can use to
> interact with files. When we have ioctls, we would need to write
> everything ourselves. For example, instead of
> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> easier to do.
>
> As for downsides, I think we might end up with a bit more boilerplate
> code, but beyond that I am not sure.
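
To make the ioctl-to-file mapping above a bit more concrete, here is a
rough userspace sketch of what a client of such a filesystem might look
like. Nothing like this exists yet: the mount point is whatever the
caller passes in, the "fd_preserve" and "state" file names are taken
from the description above, and the on-disk text format (decimal FD
number or a state string, followed by a newline) as well as any state
names are purely assumptions for illustration.

```python
import os


def preserve_fd(mount_point, fd):
    """Ask the (hypothetical) kexecfs instance to preserve `fd`.

    Assumption: the FD number is written as decimal ASCII text and is
    interpreted relative to the writing process's FD table.
    """
    with open(os.path.join(mount_point, "fd_preserve"), "w") as f:
        f.write(f"{fd}\n")


def get_state(mount_point):
    """Read the current live update state; the equivalent of "cat state"."""
    with open(os.path.join(mount_point, "state")) as f:
        return f.read().strip()


def set_state(mount_point, state):
    """Request a state transition by writing a state name.

    Assumption: transitions are requested by writing a single word;
    the actual set of valid state names is not defined here.
    """
    with open(os.path.join(mount_point, "state"), "w") as f:
        f.write(f"{state}\n")
```

A service that has an instance bind-mounted at some path would then call
preserve_fd(path, fd) for each FD it wants to keep and poll
get_state(path), without needing any ioctl wrappers of its own.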

One of the points in Christian's suggestion was that the ioctl doesn't
have to be bound to a misc device. Even if we don't use
read()/write()/link() etc., we can have a filesystem that exposes, say,
a "control" file, and that file has the same liveupdate_ioctl() in its
fops as we have now in the miscdev.

The cost is indeed a bit of boilerplate code to create the filesystem,
but it would be easier to extend for per-service and container support.
And we won't need a sysfs entry for status, since it can also be
pre-populated in kexecfs (or whatever it'll be called).

> --
> Regards,
> Pratyush Yadav

--
Sincerely yours,
Mike.