Re: [PATCH v2 00/11] fhandle, pidfs: allow open_by_handle_at() purely based on file handle

Amir Goldstein <amir73il@xxxxxxxxx> · Tue, 24 Jun 2025 16:34:04 +0200

On Tue, Jun 24, 2025 at 4:15 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 24-06-25 12:59:26, Christian Brauner wrote:
> > > For pidfs this means a file handle can function as full replacement for
> > > storing a pid in a file. Instead a file handle can be stored and
> > > reopened purely based on the file handle.
> >
> > One thing I want to comment on generally. I know we document that a file
> > handle is an opaque thing and userspace shouldn't rely on the layout or
> > format (Propably so that we're free to redefine it.).
> >
> > Realistically though that's just not what's happening. I've linked Amir
> > to that code already a few times but I'm doing it here for all of you
> > again:
> >
> > [1]: https://github.com/systemd/systemd/blob/7e1647ae4e33dd8354bd311a7f7f5eb701be2391/src/basic/cgroup-util.c#L62-L77
> >
> >      Specifically:
> >
> >      /* The structure to pass to name_to_handle_at() on cgroupfs2 */
> >      typedef union {
> >              struct file_handle file_handle;
> >              uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
> >      } cg_file_handle;
> >
> >      #define CG_FILE_HANDLE_INIT                                     \
> >              (cg_file_handle) {                                      \
> >                      .file_handle.handle_bytes = sizeof(uint64_t),   \
> >                      .file_handle.handle_type = FILEID_KERNFS,       \
> >              }
> >
> >      #define CG_FILE_HANDLE_CGROUPID(fh) (*CAST_ALIGN_PTR(uint64_t, (fh).file_handle.f_handle))
> >
> >      cg_file_handle fh = CG_FILE_HANDLE_INIT;
> >      CG_FILE_HANDLE_CGROUPID(fh) = id;
> >
> >      return RET_NERRNO(open_by_handle_at(cgroupfs_fd, &fh.file_handle, O_DIRECTORY|O_CLOEXEC));
> >
> > Another example where the layout is assumed to be uapi/uabi is:
> >
> > [2]: https://github.com/systemd/systemd/blob/7e1647ae4e33dd8354bd311a7f7f5eb701be2391/src/basic/pidfd-util.c#L232-L259
> >
> >      int pidfd_get_inode_id_impl(int fd, uint64_t *ret) {
> >      <snip>
> >                      union {
> >                              struct file_handle file_handle;
> >                              uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
> >                      } fh = {
> >                              .file_handle.handle_bytes = sizeof(uint64_t),
> >                              .file_handle.handle_type = FILEID_KERNFS,
> >                      };
> >                      int mnt_id;
> >
> >                      r = RET_NERRNO(name_to_handle_at(fd, "", &fh.file_handle, &mnt_id, AT_EMPTY_PATH));
> >                      if (r >= 0) {
> >                              if (ret)
> >                                      *ret = *CAST_ALIGN_PTR(uint64_t, fh.file_handle.f_handle);
> >                              return 0;
> >                      }
>
> Thanks for sharing. Sigh... Personal note for the future: If something
> should be opaque blob for userspace, don't forget to encrypt the data
> before handing it over to userspace. :-P
>
> > In (1) you can see that the layout is assumed to be uabi by
> > reconstructing the handle. In (2) you can see that the layout is assumed
> > to be uabi by extrating the inode number.
> >
> > So both points mean that the "don't rely on it"-ship has already sailed.
> > If we get regressions reports for this (and we surely would) because we
> > changed it we're bound by the no-regression rule.
>
> Yep, FILEID_KERNFS is pretty much set in stone. OTOH I don't expect these
> kinds of hacks to be very widespread (I guess I'm naive ;) so if we really
> need to change it we could talk to systemd folks.
>
> > So, for pidfs I'm very tempted to explicitly give the guarantee that
> > systemd just assumes currently.
> >
> > The reason is that it will allow userspace to just store the 64-bit
> > pidfs inode number in a file or wherever they want and then reconstruct
> > the file handle without ever having to involve name_to_handle_at(). That
> > means you can just read the pid file and see the inode number you're
> > dealing with and not some binary gunk.
>
> Well, you could just fprintf() the fhandle into the pid file if you don't
> like binary gunk. Those numbers would be telling about as much as the pidfs
> inode number tells you, don't they? I mean I'm still not at the point where
> I would *encourage* userspace to decode what's supposed to be opaque blob
> ;). But maybe I'll get used to that idea...

As I was also writing a similar response, I will share my somewhat
duplicate feedback ;)

I still hold the option that as a rule, we should try not to encourage
crafting file handles in userspace.
Exceptions to the rule are easier to accept for non-persistent pseudo
filesystems and especially in the case that the handle format
is simply the u64 inode number.

But if you want to hand craft the pidfs file handles, I get that
the addition of FILEID_IS_AUTONOMOUS flag is a nuisance,
because you'd want something as simple as the examples you quoted
and not more complicated encodings.

Therefore, see my suggestion to toss the FILEID_IS_AUTONOMOUS flag
as well as the new FILEID_PIDFS type and opt-in to opening fd
without a path fd with FD_PIDFS_ROOT instead, so systemd can continue
to use the same simple hand crafted file handles.

Thanks,
Amir.