Re: [PATCH v2 00/11] fhandle, pidfs: allow open_by_handle_at() purely based on file handle

Jan Kara <jack@xxxxxxx> · Tue, 24 Jun 2025 16:15:37 +0200

On Tue 24-06-25 12:59:26, Christian Brauner wrote:
> > For pidfs this means a file handle can function as full replacement for
> > storing a pid in a file. Instead a file handle can be stored and
> > reopened purely based on the file handle.
> 
> One thing I want to comment on generally. I know we document that a file
> handle is an opaque thing and userspace shouldn't rely on the layout or
> format (Propably so that we're free to redefine it.).
> 
> Realistically though that's just not what's happening. I've linked Amir
> to that code already a few times but I'm doing it here for all of you
> again:
> 
> [1]: https://github.com/systemd/systemd/blob/7e1647ae4e33dd8354bd311a7f7f5eb701be2391/src/basic/cgroup-util.c#L62-L77
> 
>      Specifically:
>      
>      /* The structure to pass to name_to_handle_at() on cgroupfs2 */
>      typedef union {
>              struct file_handle file_handle;
>              uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
>      } cg_file_handle;
>      
>      #define CG_FILE_HANDLE_INIT                                     \
>              (cg_file_handle) {                                      \
>                      .file_handle.handle_bytes = sizeof(uint64_t),   \
>                      .file_handle.handle_type = FILEID_KERNFS,       \
>              }
>      
>      #define CG_FILE_HANDLE_CGROUPID(fh) (*CAST_ALIGN_PTR(uint64_t, (fh).file_handle.f_handle))
>      
>      cg_file_handle fh = CG_FILE_HANDLE_INIT;
>      CG_FILE_HANDLE_CGROUPID(fh) = id;
>      
>      return RET_NERRNO(open_by_handle_at(cgroupfs_fd, &fh.file_handle, O_DIRECTORY|O_CLOEXEC));
> 
> Another example where the layout is assumed to be uapi/uabi is:
> 
> [2]: https://github.com/systemd/systemd/blob/7e1647ae4e33dd8354bd311a7f7f5eb701be2391/src/basic/pidfd-util.c#L232-L259
> 
>      int pidfd_get_inode_id_impl(int fd, uint64_t *ret) {
>      <snip>
>                      union {
>                              struct file_handle file_handle;
>                              uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
>                      } fh = {
>                              .file_handle.handle_bytes = sizeof(uint64_t),
>                              .file_handle.handle_type = FILEID_KERNFS,
>                      };
>                      int mnt_id;
>      
>                      r = RET_NERRNO(name_to_handle_at(fd, "", &fh.file_handle, &mnt_id, AT_EMPTY_PATH));
>                      if (r >= 0) {
>                              if (ret)
>                                      *ret = *CAST_ALIGN_PTR(uint64_t, fh.file_handle.f_handle);
>                              return 0;
>                      }

Thanks for sharing. Sigh... Personal note for the future: If something
should be opaque blob for userspace, don't forget to encrypt the data
before handing it over to userspace. :-P

> In (1) you can see that the layout is assumed to be uabi by
> reconstructing the handle. In (2) you can see that the layout is assumed
> to be uabi by extrating the inode number.
> 
> So both points mean that the "don't rely on it"-ship has already sailed.
> If we get regressions reports for this (and we surely would) because we
> changed it we're bound by the no-regression rule.

Yep, FILEID_KERNFS is pretty much set in stone. OTOH I don't expect these
kinds of hacks to be very widespread (I guess I'm naive ;) so if we really
need to change it we could talk to systemd folks.

> So, for pidfs I'm very tempted to explicitly give the guarantee that
> systemd just assumes currently.
> 
> The reason is that it will allow userspace to just store the 64-bit
> pidfs inode number in a file or wherever they want and then reconstruct
> the file handle without ever having to involve name_to_handle_at(). That
> means you can just read the pid file and see the inode number you're
> dealing with and not some binary gunk.

Well, you could just fprintf() the fhandle into the pid file if you don't
like binary gunk. Those numbers would be telling about as much as the pidfs
inode number tells you, don't they? I mean I'm still not at the point where
I would *encourage* userspace to decode what's supposed to be opaque blob
;). But maybe I'll get used to that idea...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR