On 2025-07-22, Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> On 2025-07-21, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > On Mon, Jul 21, 2025 at 1:44 AM Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> > >
> > > Ever since the introduction of pid namespaces, procfs has had very
> > > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > > auto-selected based on the mounting process's active pidns, and the
> > > pidns itself is basically hidden once the mount has been constructed).
> > > This has historically meant that userspace was required to do some
> > > special dances in order to configure the pidns of a procfs mount as
> > > desired. Examples include:
> > >
> > >  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> > >    a procfs mount from an empty pidns so that user namespaced containers
> > >    can be nested (without this, the nested containers would fail to
> > >    mount procfs). But this requires forking off a helper process because
> > >    you cannot just one-shot this using mount(2).
> > >
> > >  * Container runtimes in general need to fork into a container before
> > >    configuring its mounts, which can lead to security issues in the case
> > >    of shared-pidns containers (a privileged process in the pidns can
> > >    interact with your container runtime process). While
> > >    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> > >    strict need for this due to a minor uAPI wart is kind of unfortunate.
> > >
> > > Things would be much easier if there was a way for userspace to just
> > > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > > which can be set using fsconfig(2):
> > >
> > >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> > >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> > >
> > > or classic mount(2) / mount(8):
> > >
> > >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> > >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > >
> > > The initial security model I have in this RFC is to be as conservative
> > > as possible and just mirror the security model for setns(2) -- which
> > > means that you can only set pidns=... to pid namespaces that your
> > > current pid namespace is a direct ancestor of. This fulfils the
> > > requirements of container runtimes, but I suspect that this may be too
> > > strict for some usecases.
> > >
> > > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > > what value it would make sense to show (maybe we could just use ns_dname
> > > to provide an identifier for the namespace, but this number would be
> > > fairly useless to userspace). I'm open to suggestions.
> > >
> > > In addition, being able to figure out what pid namespace is being used
> > > by a procfs mount is quite useful when you have an administrative
> > > process (such as a container runtime) which wants to figure out the
> > > correct way of mapping PIDs between its own namespace and the namespace
> > > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > > alternative ways to do this, but they all rely on ancillary information
> > > that third-party libraries and tools do not necessarily have access to.
> > >
> > > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > > can be used to get a reference to the pidns that a procfs is using.
> > >
> > > It's not quite clear what is the correct security model for this API,
> > > but the current approach I've taken is to:
> > >
> > >  * Make the ioctl only valid on the root (meaning that a process without
> > >    access to the procfs root -- such as only having an fd to a procfs
> > >    file or some open_tree(2)-like subset -- cannot use this API).
> > >
> > >  * Require that the process requesting either has access to
> > >    /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
> > >    pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
> > >    access to it and can join it if they had a handle), or is in a pidns
> > >    that is a direct ancestor of the target pidns (i.e. all of the pids
> > >    are already visible in the procfs for the current process's pidns).
> >
> > What's the motivation for the ptrace-read option? While I don't see
> > an attack off the top of my head, it seems like creating a procfs
> > mount may give write-ish access to things in the pidns (because the
> > creator is likely to have CAP_DAC_OVERRIDE, etc) and possibly even
> > access to namespace-wide things that aren't inherently visible to
> > PID1.
>
> This latter section is about the privilege model for
> ioctl(PROCFS_GET_PID_NAMESPACE), not the pidns= mount flag. pidns=
> requires CAP_SYS_ADMIN for pidns->user_ns, in addition to the same
> restrictions as pidns_install() (must be a direct ancestor). Maybe I
> should add some headers in this cover letter for v2...
>
> For the ioctl -- if the user can ptrace-read pid1 in the pidns, they can
> open a handle to /proc/1/ns/pid which is exactly the same thing they'd
> get from PROCFS_GET_PID_NAMESPACE.
>
> > Even the ancestor check seems dicey. Imagine that uid 1000 makes an
> > unprivileged container complete with a userns. Then uid 1001 (outside
> > the container) makes its own userns and mountns but stays in the init
> > pidns and then mounts (and owns, with all filesystem-related
> > capabilities) that mount. Is this really safe?
>
> As for the ancestor check (for the ioctl), the logic I had was that
> being in an ancestor pidns means that you already can see all of the
> subprocesses in your own pidns, so it seems strange to not be able to
> get a handle to their pidns. Maybe this isn't quite right, idk.
>
> Ultimately there isn't too much you can do with a pidns fd if you don't
> have privileges to join it (the only thing I can think of is that you
> could bind-mount it, which could maybe be used to trick an
> administrative process if they trusted your mountns for some reason).
>
> > CAP_SYS_ADMIN seems about right.
>
> For pidns=, sure. For the ioctl, I think this is overkill.

My bad, I forgot to add you to Cc for v2 Andy. PTAL:

  <https://lore.kernel.org/all/20250723-procfs-pidns-api-v2-0-621e7edd8e40@xxxxxxxxxx/>

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/