On 2025-07-24, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > /proc has historically had very opaque semantics about PID namespaces,
> > which is a little unfortunate for container runtimes and other programs
> > that deal with switching namespaces very often. One common issue is that
> > of converting between PIDs in the process's namespace and PIDs in the
> > namespace of /proc.
> >
> > In principle, it is possible to do this today by opening a pidfd with
> > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > contain a PID value translated to the pid namespace associated with that
> > procfs superblock). However, allocating a new file for each PID to be
> > converted is less than ideal for programs that may need to scan procfs,
> > and it is generally useful for userspace to be able to finally get this
> > information from procfs.
> >
> > So, add a new API for this in the form of an ioctl(2) you can call on
> > the root directory of procfs. The returned file descriptor will have
> > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > option, finally allowing userspace full control of the pid namespaces
> > associated with procfs instances.
> >
> > The permission model for this is a bit looser than that of the "pidns"
> > mount option, but this is mainly because /proc/1/ns/pid provides the
> > same information, so as long as you have access to that magic-link (or
> > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > or being in an ancestor pid namespace) it makes sense to allow userspace
> > to grab a handle. setns(2) will still have their own permission checks,
> > so being able to open a pidns handle doesn't really provide too many
> > other capabilities.
> >
> > Signed-off-by: Aleksa Sarai <cyphar@xxxxxxxxxx>
> > ---
> >  Documentation/filesystems/proc.rst |  4 +++
> >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/fs.h            |  3 +++
> >  3 files changed, 59 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index c520b9f8a3fd..506383273c9d 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> >  will be used by the procfs instance when translating pids. By default, procfs
> >  will use the calling process's active pid namespace.
> >
> > +Processes can check which pid namespace is used by a procfs instance by using
> > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > +instance.
> > +
> >  Chapter 5: Filesystem behavior
> >  ==============================
> >
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index 057c8a125c6e..548a57ec2152 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -23,8 +23,10 @@
> >  #include <linux/cred.h>
> >  #include <linux/magic.h>
> >  #include <linux/slab.h>
> > +#include <linux/ptrace.h>
> >
> >  #include "internal.h"
> > +#include "../internal.h"
> >
> >  struct proc_fs_context {
> >  	struct pid_namespace	*pid_ns;
> > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> >  	return proc_pid_readdir(file, ctx);
> >  }
> >
> > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > +{
> > +	switch (cmd) {
> > +#ifdef CONFIG_PID_NS
> > +	case PROCFS_GET_PID_NAMESPACE: {
> > +		struct pid_namespace *active = task_active_pid_ns(current);
> > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > +		bool can_access_pidns = false;
> > +
> > +		/*
> > +		 * If we are in an ancestors of the pidns, or have join
> > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > +		 * would be able to grab a handle to the pidns.
> > +		 *
> > +		 * Otherwise, if there is a root process, then being able to
> > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > +		 * we should probably match the permission model. For empty
> > +		 * namespaces it seems unlikely for there to be a downside to
> > +		 * allowing unprivileged users to open a handle to it (setns
> > +		 * will fail for unprivileged users anyway).
> > +		 */
> > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
>
> This seems to imply that if @ns is a descendant of @active that the
> caller holds privileges over it. Is that actually always true?
>
> IOW, why is the check different from the previous pidns= mount option
> check. I would've expected:
>
> ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
>
> and then the ptrace check as a fallback.

That would mirror pidns_install(), and I did think about it. The primary
(mostly handwave-y) reasoning I had for making it less strict was that:

 * If you are in an ancestor pidns, then you can already see those
   processes in your own /proc. In theory that means that you will be
   able to access /proc/$pid/ns/pid for at least some subprocess there
   (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
   cleared on ).

   Though hypothetically if they are all running as a different user,
   this does not apply (and you could create scenarios where a child
   pidns is owned by a userns that you do not have privileges over -- if
   you deal with setuid binaries). Maybe that risk means we should just
   combine them, I'm not sure.

 * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
   strange to disallow access even if it is not in an ancestor
   namespace. This is distinct from pidns_install(), where you want to
   ensure you cannot escape to a parent pid namespace; this is about
   getting a handle to do other operations (i.e.
   NS_GET_{P,TG}ID_*_PIDNS).

Maybe they should be combined to match pidns_install(), but then I would
expect the ptrace_may_access() check to apply to all processes in the
pidns to make it less restrictive, which is not something you can
practically do (and there is a higher chance that pid1 will have
SUID_DUMP_DISABLE than some random subprocess, which almost certainly
will not be SUID_DUMP_DISABLE).

Fundamentally, I guess I'm still trying to see what the risk is of
allowing a process to get a handle to a pidns that it has some kind of
privilege over (whether it's CAP_SYS_ADMIN, or by virtue of being able
to see and address all processes in the namespace, or by being able to
open /proc/$pidns_pid1/ns/pid anyway) but cannot join.

Then again, maybe the fact that it is kind of strange to explain is
enough of a reason to just make it simpler...

> > +		if (!can_access_pidns) {
> > +			bool cannot_ptrace_pid1 = false;
> > +
> > +			read_lock(&tasklist_lock);
> > +			if (ns->child_reaper)
> > +				cannot_ptrace_pid1 = ptrace_may_access(ns->child_reaper,
> > +								       PTRACE_MODE_READ_FSCREDS);
> > +			read_unlock(&tasklist_lock);
> > +			can_access_pidns = !cannot_ptrace_pid1;
> > +		}
> > +		if (!can_access_pidns)
> > +			return -EPERM;
> > +
> > +		/* open_namespace() unconditionally consumes the reference. */
> > +		get_pid_ns(ns);
> > +		return open_namespace(to_ns_common(ns));
> > +	}
> > +#endif /* CONFIG_PID_NS */
> > +	default:
> > +		return -ENOIOCTLCMD;
> > +	}
> > +}
> > +
> >  /*
> >   * The root /proc directory is special, as it has the
> >   * <pid> directories. Thus we don't use the generic
> >   * directory handling functions for that..
> >   */
> >  static const struct file_operations proc_root_operations = {
> > -	.read		= generic_read_dir,
> > -	.iterate_shared	= proc_root_readdir,
> > +	.read		= generic_read_dir,
> > +	.iterate_shared	= proc_root_readdir,
> >  	.llseek		= generic_file_llseek,
> > +	.unlocked_ioctl = proc_root_ioctl,
> > +	.compat_ioctl	= compat_ptr_ioctl,
> >  };
> >
> >  /*
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 0bd678a4a10e..aa642cb48feb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
> >
> >  #define PROCFS_IOCTL_MAGIC 'f'
> >
> > +/* procfs root ioctls */
> > +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
> > +
> >  /* Pagemap ioctl */
> >  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
> >
> >
> > --
> > 2.50.0
> >

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
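
To make the two permission models discussed in this thread easier to
compare, here is a small userspace model of them. It is only a sketch: the
struct and function names are hypothetical, each boolean stands in for the
kernel check named in its comment, and the ptrace fallback follows the
intent described in the quoted code comment (allow if pid1 is ptrace-readable
or the namespace has no pid1), not the exact statements of the hunk.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the results of the kernel-side checks. */
struct pidns_access_facts {
	bool is_ancestor;       /* pidns_is_ancestor(ns, active) */
	bool has_cap_sys_admin; /* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool has_pid1;          /* ns->child_reaper != NULL */
	bool may_ptrace_pid1;   /* ptrace_may_access(pid1, PTRACE_MODE_READ_FSCREDS) */
};

/* Fallback shared by both models: empty namespaces are allowed. */
static bool ptrace_fallback(const struct pidns_access_facts *f)
{
	return !f->has_pid1 || f->may_ptrace_pid1;
}

/* Model of the check as posted: ancestor OR capability, then fallback. */
bool access_as_posted(const struct pidns_access_facts *f)
{
	return f->is_ancestor || f->has_cap_sys_admin || ptrace_fallback(f);
}

/* Model of the suggested check: capability AND ancestor, then fallback. */
bool access_as_suggested(const struct pidns_access_facts *f)
{
	return (f->has_cap_sys_admin && f->is_ancestor) || ptrace_fallback(f);
}
```

In this model the two variants only differ when exactly one of the ancestor
and capability conditions holds and pid1 exists but is not ptrace-readable:
the posted check grants a handle there, the suggested one returns -EPERM.
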