Re: [PATCH v2 00/11] fhandle, pidfs: allow open_by_handle_at() purely based on file handle

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 24 Jun 2025 12:59:26 +0200

> For pidfs this means a file handle can function as full replacement for
> storing a pid in a file. Instead a file handle can be stored and
> reopened purely based on the file handle.

One thing I want to comment on generally. I know we document that a file
handle is an opaque thing and userspace shouldn't rely on the layout or
format (Propably so that we're free to redefine it.).

Realistically though that's just not what's happening. I've linked Amir
to that code already a few times but I'm doing it here for all of you
again:

[1]: https://github.com/systemd/systemd/blob/7e1647ae4e33dd8354bd311a7f7f5eb701be2391/src/basic/cgroup-util.c#L62-L77

     Specifically:

     /* The structure to pass to name_to_handle_at() on cgroupfs2 */
     typedef union {
             struct file_handle file_handle;
             uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
     } cg_file_handle;

     #define CG_FILE_HANDLE_INIT                                     \
             (cg_file_handle) {                                      \
                     .file_handle.handle_bytes = sizeof(uint64_t),   \
                     .file_handle.handle_type = FILEID_KERNFS,       \
             }

     #define CG_FILE_HANDLE_CGROUPID(fh) (*CAST_ALIGN_PTR(uint64_t, (fh).file_handle.f_handle))

     cg_file_handle fh = CG_FILE_HANDLE_INIT;
     CG_FILE_HANDLE_CGROUPID(fh) = id;

     return RET_NERRNO(open_by_handle_at(cgroupfs_fd, &fh.file_handle, O_DIRECTORY|O_CLOEXEC));

Another example where the layout is assumed to be uapi/uabi is:

[2]: https://github.com/systemd/systemd/blob/7e1647ae4e33dd8354bd311a7f7f5eb701be2391/src/basic/pidfd-util.c#L232-L259

     int pidfd_get_inode_id_impl(int fd, uint64_t *ret) {
     <snip>
                     union {
                             struct file_handle file_handle;
                             uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
                     } fh = {
                             .file_handle.handle_bytes = sizeof(uint64_t),
                             .file_handle.handle_type = FILEID_KERNFS,
                     };
                     int mnt_id;

                     r = RET_NERRNO(name_to_handle_at(fd, "", &fh.file_handle, &mnt_id, AT_EMPTY_PATH));
                     if (r >= 0) {
                             if (ret)
                                     *ret = *CAST_ALIGN_PTR(uint64_t, fh.file_handle.f_handle);
                             return 0;
                     }

In (1) you can see that the layout is assumed to be uabi by
reconstructing the handle. In (2) you can see that the layout is assumed
to be uabi by extrating the inode number.

So both points mean that the "don't rely on it"-ship has already sailed.
If we get regressions reports for this (and we surely would) because we
changed it we're bound by the no-regression rule.

So, for pidfs I'm very tempted to explicitly give the guarantee that
systemd just assumes currently.

The reason is that it will allow userspace to just store the 64-bit
pidfs inode number in a file or wherever they want and then reconstruct
the file handle without ever having to involve name_to_handle_at(). That
means you can just read the pid file and see the inode number you're
dealing with and not some binary gunk.