Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Mon, 21 Jul 2025 07:54:25 -0700

On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
>
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> This has historically meant that userspace was required to do some
> special dances in order to configure the pidns of a procfs mount as
> desired. Examples include:
>
>  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
>    a procfs mount from an empty pidns so that user namespaced containers
>    can be nested (without this, the nested containers would fail to
>    mount procfs). But this requires forking off a helper process because
>    you cannot just one-shot this using mount(2).
>
>  * Container runtimes in general need to fork into a container before
>    configuring its mounts, which can lead to security issues in the case
>    of shared-pidns containers (a privileged process in the pidns can
>    interact with your container runtime process). While
>    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
>    strict need for this due to a minor uAPI wart is kind of unfortunate.
>
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
>
>     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
>
> or classic mount(2) / mount(8):
>
>     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
>
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of. This fulfils the
> requirements of container runtimes, but I suspect that this may be too
> strict for some use cases.
>
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions.
>
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
>
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
>
> It's not quite clear what the correct security model for this API is,
> but the current approach I've taken is to:
>
>  * Make the ioctl only valid on the root (meaning that a process without
>    access to the procfs root -- such as only having an fd to a procfs
>    file or some open_tree(2)-like subset -- cannot use this API).
>
>  * Require that the requesting process either has access to
>    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
>    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
>    access to it and can join it if they had a handle), or is in a pidns
>    that is a direct ancestor of the target pidns (i.e. all of the pids
>    are already visible in the procfs for the current process's pidns).

What's the motivation for the ptrace-read option?  While I don't see
an attack off the top of my head, it seems like creating a procfs
mount may give write-ish access to things in the pidns (because the
creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
access to namespace-wide things that aren't inherently visible to
PID1.

Even the ancestor check seems dicey.  Imagine that uid 1000 makes an
unprivileged container complete with a userns.  Then uid 1001 (outside
the container) makes its own userns and mountns but stays in the init
pidns, and then mounts procfs for the container's pidns (and owns that
mount, with all filesystem-related capabilities).  Is this really safe?
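
To make the concern concrete: as I understand it, an ancestor rule
looks only at pidns parentage, so anything sitting in the init pidns
passes it for every pidns on the system, whoever owns the target.  A
toy userspace model of that walk follows.  This is not kernel code:
struct toy_pidns, its owner_uid field and both function names are made
up for illustration, and I'm assuming "ancestor" includes the
namespace itself, as it does for setns(2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pid namespace. Hypothetical, not the kernel struct. */
struct toy_pidns {
    const struct toy_pidns *parent; /* NULL for the init pidns */
    unsigned int level;             /* 0 for the init pidns */
    unsigned int owner_uid;         /* illustration only: who created it */
};

/*
 * Return true if `ancestor` is `ns` itself or one of its parents.
 * Walk up from `ns` until we reach the ancestor's level, then compare.
 * Note that owner_uid is never consulted.
 */
static bool toy_pidns_is_ancestor(const struct toy_pidns *ns,
                                  const struct toy_pidns *ancestor)
{
    assert(ancestor != NULL);
    while (ns != NULL && ns->level > ancestor->level)
        ns = ns->parent;
    return ns == ancestor;
}

/*
 * The scenario above: uid 1000 owns a container pidns, uid 1001 is
 * still in the init pidns. Returns whether uid 1001 passes the
 * ancestor rule for the container's pidns.
 */
static bool uid1001_passes_ancestor_check(void)
{
    const struct toy_pidns init_ns = { NULL, 0, 0 };
    const struct toy_pidns container_ns = { &init_ns, 1, 1000 };
    const struct toy_pidns *caller_ns = &init_ns; /* uid 1001's pidns */

    return toy_pidns_is_ancestor(&container_ns, caller_ns);
}
```

In this model uid1001_passes_ancestor_check() returns true even though
uid 1001 has no relationship to the container beyond living in the
init pidns, which is the part that worries me.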

CAP_SYS_ADMIN seems about right.

--Andy