On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
>
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> This has historically meant that userspace was required to do some
> special dances in order to configure the pidns of a procfs mount as
> desired. Examples include:
>
>  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
>    a procfs mount from an empty pidns so that user namespaced containers
>    can be nested (without this, the nested containers would fail to
>    mount procfs). But this requires forking off a helper process because
>    you cannot just one-shot this using mount(2).
>
>  * Container runtimes in general need to fork into a container before
>    configuring its mounts, which can lead to security issues in the case
>    of shared-pidns containers (a privileged process in the pidns can
>    interact with your container runtime process). While
>    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
>    strict need for this due to a minor uAPI wart is kind of unfortunate.
>
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
>
>     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
>
> or classic mount(2) / mount(8):
>
>     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
>
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of. This fulfils the
> requirements of container runtimes, but I suspect that this may be too
> strict for some usecases.
>
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions.
>
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
>
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
>
> It's not quite clear what is the correct security model for this API,
> but the current approach I've taken is to:
>
>  * Make the ioctl only valid on the root (meaning that a process without
>    access to the procfs root -- such as only having an fd to a procfs
>    file or some open_tree(2)-like subset -- cannot use this API).
>
>  * Require that the process requesting either has access to
>    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
>    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
>    access to it and can join it if they had a handle), or is in a pidns
>    that is a direct ancestor of the target pidns (i.e. all of the pids
>    are already visible in the procfs for the current process's pidns).

What's the motivation for the ptrace-read option? While I don't see an
attack off the top of my head, it seems like creating a procfs mount may
give write-ish access to things in the pidns (because the creator is
likely to have CAP_DAC_OVERRIDE, etc) and possibly even access to
namespace-wide things that aren't inherently visible to PID1.

Even the ancestor check seems dicey. Imagine that uid 1000 makes an
unprivileged container complete with a userns. Then uid 1001 (outside
the container) makes its own userns and mountns but stays in the init
pidns and then mounts (and owns, with all filesystem-related
capabilities) that mount. Is this really safe?

CAP_SYS_ADMIN seems about right.

--Andy
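
To make the ancestor-check concern concrete, here is a toy userspace
model in C. It is not kernel code: pidns_model, caller_model,
model_is_ancestor, rfc_policy_allows and cap_only_policy_allows are all
hypothetical names, and the two booleans merely stand in for checks the
kernel would perform itself. The sketch also assumes that the ancestor
test accepts the caller's own pid namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pid namespace: only the parent link matters. */
struct pidns_model {
    const struct pidns_model *parent; /* NULL for the initial pidns */
};

/* Hypothetical caller description; the fields are illustrative only. */
struct caller_model {
    const struct pidns_model *active_pidns;
    bool ptrace_read_on_pid1;   /* could open /proc/1/ns/pid of the target */
    bool cap_sys_admin_over_ns; /* holds CAP_SYS_ADMIN over the target pidns */
};

/* True if 'ancestor' is 'ns' itself or appears on its parent chain. */
static bool model_is_ancestor(const struct pidns_model *ancestor,
                              const struct pidns_model *ns)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == ancestor)
            return true;
    return false;
}

/* The three alternatives listed in the quoted text, OR-ed together. */
bool rfc_policy_allows(const struct caller_model *c,
                       const struct pidns_model *target)
{
    assert(c != NULL && target != NULL);
    return c->ptrace_read_on_pid1 ||
           c->cap_sys_admin_over_ns ||
           model_is_ancestor(c->active_pidns, target);
}

/* The stricter alternative suggested in the reply: CAP_SYS_ADMIN only. */
bool cap_only_policy_allows(const struct caller_model *c,
                            const struct pidns_model *target)
{
    assert(c != NULL && target != NULL);
    return c->cap_sys_admin_over_ns;
}
```

In this model, a caller sitting in the initial pidns with neither boolean
set (the uid 1001 case) passes rfc_policy_allows() for any container
pidns whose parent chain reaches the initial pidns, purely through the
ancestor branch, while cap_only_policy_allows() rejects it.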