Re: [PATCH RFC 0/4] procfs: make reference pidns more user-visible

Aleksa Sarai <cyphar@xxxxxxxxxx> · Thu, 24 Jul 2025 09:55:05 +1000

On 2025-07-22, Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> On 2025-07-21, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> > >
> > > Ever since the introduction of pid namespaces, procfs has had very
> > > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > > auto-selected based on the mounting process's active pidns, and the
> > > pidns itself is basically hidden once the mount has been constructed).
> > > This has historically meant that userspace was required to do some
> > > special dances in order to configure the pidns of a procfs mount as
> > > desired. Examples include:
> > >
> > >  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> > >    a procfs mount from an empty pidns so that user namespaced containers
> > >    can be nested (without this, the nested containers would fail to
> > >    mount procfs). But this requires forking off a helper process because
> > >    you cannot just one-shot this using mount(2).
> > >
> > >  * Container runtimes in general need to fork into a container before
> > >    configuring its mounts, which can lead to security issues in the case
> > >    of shared-pidns containers (a privileged process in the pidns can
> > >    interact with your container runtime process). While
> > >    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> > >    strict need for this due to a minor uAPI wart is kind of unfortunate.
> > >
> > > Things would be much easier if there was a way for userspace to just
> > > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > > which can be set using fsconfig(2):
> > >
> > >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> > >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> > >
> > > or classic mount(2) / mount(8):
> > >
> > >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> > >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > >
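For concreteness, here is a rough sketch of how the fsconfig(2) variant might be driven end to end. It is only a sketch under assumptions: the "pidns" parameter exists only with this RFC applied (I would expect an unpatched kernel to reject it with EINVAL), the helper name proc_mount_for_pidns() is made up for illustration, and raw syscall(2) is used so the example does not depend on glibc having wrappers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>   /* FSOPEN_CLOEXEC, FSCONFIG_*, FSMOUNT_CLOEXEC */

/*
 * Hypothetical helper (not part of the patchset): build a detached procfs
 * mount whose pidns is the namespace referenced by pidns_path.
 * Returns a mount fd on success or a negative errno value on failure.
 */
int proc_mount_for_pidns(const char *pidns_path)
{
    int nsfd, fsfd, ret;

    assert(pidns_path != NULL);

    /* Grab the namespace handle first; this step needs no privileges. */
    nsfd = open(pidns_path, O_RDONLY | O_CLOEXEC);
    if (nsfd < 0)
        return -errno;

    fsfd = (int)syscall(SYS_fsopen, "proc", FSOPEN_CLOEXEC);
    if (fsfd < 0) {
        ret = -errno;
        close(nsfd);
        return ret;
    }

    /* "pidns" is the parameter proposed in this RFC (assumption: patched kernel). */
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd) < 0 ||
        syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
        ret = -errno;
    } else {
        /* Turn the configured superblock into a detached mount fd. */
        ret = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
        if (ret < 0)
            ret = -errno;
    }

    close(fsfd);
    close(nsfd);
    return ret;
}
```

The returned fd is a detached mount, so the caller would still need move_mount(2) to attach it somewhere. The point is that no helper process has to be forked into the target pidns.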
> > > The initial security model I have in this RFC is to be as conservative
> > > as possible and just mirror the security model for setns(2) -- which
> > > means that you can only set pidns=... to pid namespaces that your
> > > current pid namespace is a direct ancestor of. This fulfils the
> > > requirements of container runtimes, but I suspect that this may be too
> > > strict for some usecases.
> > >
> > > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > > what value it would make sense to show (maybe we could just use ns_dname
> > > to provide an identifier for the namespace, but this number would be
> > > fairly useless to userspace). I'm open to suggestions.
> > >
> > > In addition, being able to figure out what pid namespace is being used
> > > by a procfs mount is quite useful when you have an administrative
> > > process (such as a container runtime) which wants to figure out the
> > > correct way of mapping PIDs between its own namespace and the namespace
> > > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > > alternative ways to do this, but they all rely on ancillary information
> > > that third-party libraries and tools do not necessarily have access to.
> > >
> > > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > > can be used to get a reference to the pidns that a procfs is using.
> > >
> > > It's not quite clear what is the correct security model for this API,
> > > but the current approach I've taken is to:
> > >
> > >  * Make the ioctl only valid on the root (meaning that a process without
> > >    access to the procfs root -- such as only having an fd to a procfs
> > >    file or some open_tree(2)-like subset -- cannot use this API).
> > >
> > >  * Require that the process requesting either has access to
> > >    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
> > >    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
> > >    access to it and can join it if they had a handle), or is in a pidns
> > >    that is a direct ancestor of the target pidns (i.e. all of the pids
> > >    are already visible in the procfs for the current process's pidns).
> > 
> > What's the motivation for the ptrace-read option?  While I don't see
> > an attack off the top of my head, it seems like creating a procfs
> > mount may give write-ish access to things in the pidns (because the
> > creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
> > access to namespace-wide things that aren't inherently visible to
> > PID1.
> 
> This latter section is about the privilege model for
> ioctl(PROCFS_GET_PID_NAMESPACE), not the pidns= mount flag. pidns=
> requires CAP_SYS_ADMIN for pidns->user_ns, in addition to the same
> restrictions as pidns_install() (must be a direct ancestor). Maybe I
> should add some headers in this cover letter for v2...
> 
> For the ioctl -- if the user can ptrace-read pid1 in the pidns, they can
> open a handle to /proc/1/ns/pid which is exactly the same thing they'd
> get from PROCFS_GET_PID_NAMESPACE.
> 
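To illustrate the equivalence, this is roughly what a tool can already do today without the new ioctl, given an fd to the procfs root. The helper names are made up for illustration. Comparing st_dev and st_ino is the usual userspace way to check whether two namespace fds refer to the same namespace. The openat(2) below is subject to the same ptrace-read check on pid1 mentioned above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: get a handle to the pidns of a procfs mount by
 * opening <procfs root>/1/ns/pid. Returns an fd or a negative errno value.
 */
int procfs_pidns_via_pid1(int procrootfd)
{
    int nsfd = openat(procrootfd, "1/ns/pid", O_RDONLY | O_CLOEXEC);

    return nsfd < 0 ? -errno : nsfd;
}

/*
 * Hypothetical helper: returns 1 if both fds refer to the same namespace,
 * 0 if they do not, or a negative errno value if fstat(2) fails.
 */
int same_namespace(int nsfd_a, int nsfd_b)
{
    struct stat a, b;

    if (fstat(nsfd_a, &a) < 0 || fstat(nsfd_b, &b) < 0)
        return -errno;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

This only works if pid1 of the target pidns is alive and readable by the caller, which is one reason a dedicated ioctl on the procfs root is more convenient for third-party tools.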
> > Even the ancestor check seems dicey.  Imagine that uid 1000 makes an
> > unprivileged container complete with a userns.  Then uid 1001 (outside
> > the container) makes its own userns and mountns but stays in the init
> > pidns and then mounts (and owns, with all filesystem-related
> > capabilities) that mount.  Is this really safe?
> 
> As for the ancestor check (for the ioctl), the logic I had was that
> being in an ancestor pidns means that you already can see all of the
> subprocesses in your own pidns, so it seems strange to not be able to
> get a handle to their pidns. Maybe this isn't quite right, idk.
> 
> Ultimately there isn't too much you can do with a pidns fd if you don't
> have privileges to join it (the only thing I can think of is that you
> could bind-mount it, which could maybe be used to trick an
> administrative process if they trusted your mountns for some reason).
> 
> > CAP_SYS_ADMIN seems about right.
> 
> For pidns=, sure. For the ioctl, I think this is overkill.

My bad, I forgot to add you to Cc for v2, Andy. PTAL:

 <https://lore.kernel.org/all/20250723-procfs-pidns-api-v2-0-621e7edd8e40@xxxxxxxxxx/>

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/