On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote: > On 2025-07-24, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote: > > > /proc has historically had very opaque semantics about PID namespaces, > > > which is a little unfortunate for container runtimes and other programs > > > that deal with switching namespaces very often. One common issue is that > > > of converting between PIDs in the process's namespace and PIDs in the > > > namespace of /proc. > > > > > > In principle, it is possible to do this today by opening a pidfd with > > > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will > > > contain a PID value translated to the pid namespace associated with that > > > procfs superblock). However, allocating a new file for each PID to be > > > converted is less than ideal for programs that may need to scan procfs, > > > and it is generally useful for userspace to be able to finally get this > > > information from procfs. > > > > > > So, add a new API for this in the form of an ioctl(2) you can call on > > > the root directory of procfs. The returned file descriptor will have > > > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount > > > option, finally allowing userspace full control of the pid namespaces > > > associated with procfs instances. > > > > > > The permission model for this is a bit looser than that of the "pidns" > > > mount option, but this is mainly because /proc/1/ns/pid provides the > > > same information, so as long as you have access to that magic-link (or > > > something equivalently reasonable such as privileges with CAP_SYS_ADMIN > > > or being in an ancestor pid namespace) it makes sense to allow userspace > > > to grab a handle. setns(2) will still have their own permission checks, > > > so being able to open a pidns handle doesn't really provide too many > > > other capabilities. > > > > > > Signed-off-by: Aleksa Sarai <cyphar@xxxxxxxxxx> > > > --- > > > Documentation/filesystems/proc.rst | 4 +++ > > > fs/proc/root.c | 54 ++++++++++++++++++++++++++++++++++++-- > > > include/uapi/linux/fs.h | 3 +++ > > > 3 files changed, 59 insertions(+), 2 deletions(-) > > > > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > > > index c520b9f8a3fd..506383273c9d 100644 > > > --- a/Documentation/filesystems/proc.rst > > > +++ b/Documentation/filesystems/proc.rst > > > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like > > > will be used by the procfs instance when translating pids. By default, procfs > > > will use the calling process's active pid namespace. > > > > > > +Processes can check which pid namespace is used by a procfs instance by using > > > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs > > > +instance. > > > + > > > Chapter 5: Filesystem behavior > > > ============================== > > > > > > diff --git a/fs/proc/root.c b/fs/proc/root.c > > > index 057c8a125c6e..548a57ec2152 100644 > > > --- a/fs/proc/root.c > > > +++ b/fs/proc/root.c > > > @@ -23,8 +23,10 @@ > > > #include <linux/cred.h> > > > #include <linux/magic.h> > > > #include <linux/slab.h> > > > +#include <linux/ptrace.h> > > > > > > #include "internal.h" > > > +#include "../internal.h" > > > > > > struct proc_fs_context { > > > struct pid_namespace *pid_ns; > > > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx) > > > return proc_pid_readdir(file, ctx); > > > } > > > > > > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) > > > +{ > > > + switch (cmd) { > > > +#ifdef CONFIG_PID_NS > > > + case PROCFS_GET_PID_NAMESPACE: { > > > + struct pid_namespace *active = task_active_pid_ns(current); > > > + struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb); > > > + bool can_access_pidns = false; > > > + > > > + /* > > > + * If we are in an ancestors of the pidns, or have join > > > + * privileges (CAP_SYS_ADMIN), then it makes sense that we > > > + * would be able to grab a handle to the pidns. > > > + * > > > + * Otherwise, if there is a root process, then being able to > > > + * access /proc/$pid/ns/pid is equivalent to this ioctl and so > > > + * we should probably match the permission model. For empty > > > + * namespaces it seems unlikely for there to be a downside to > > > + * allowing unprivileged users to open a handle to it (setns > > > + * will fail for unprivileged users anyway). > > > + */ > > > + can_access_pidns = pidns_is_ancestor(ns, active) || > > > + ns_capable(ns->user_ns, CAP_SYS_ADMIN); > > > > This seems to imply that if @ns is a descendant of @active that the > > caller holds privileges over it. Is that actually always true? > > > > IOW, why is the check different from the previous pidns= mount option > > check. I would've expected: > > > > ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active) > > > > and then the ptrace check as a fallback. > > That would mirror pidns_install(), and I did think about it. The primary > (mostly handwave-y) reasoning I had for making it less strict was that: > > * If you are in an ancestor pidns, then you can already see those > processes in your own /proc. In theory that means that you will be > able to access /proc/$pid/ns/pid for at least some subprocess there > (even if some subprocesses have SUID_DUMP_DISABLE, that flag is > cleared on ). > > Though hypothetically if they are all running as a different user, > this does not apply (and you could create scenarios where a child > pidns is owned by a userns that you do not have privileges over -- if > you deal with setuid binaries). Maybe that risk means we should just > combine them, I'm not sure. > > * If you have CAP_SYS_ADMIN permissions over the pidns, it seems > strange to disallow access even if it is not in an ancestor > namespace. This is distinct to pidns_install(), where you want to > ensure you cannot escape to a parent pid namespace, this is about > getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS). > > Maybe they should be combined to match pidns_install(), but then I would > expect the ptrace_may_access() check to apply to all processes in the > pidns to make it less restrictive, which is not something you can > practically do (and there is a higher chance that pid1 will have > SUID_DUMP_DISABLE than some random subprocess, which almost certainly > will not be SUID_DUMP_DISABLE). > > Fundamentally, I guess I'm still trying to see what the risk is of > allowing a process to get a handle to a pidns that they have some kind > of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being There shouldn't be. For example, you kinda implicitly do that with a pidfd, no? Because you can pass the pidfd to setns() instead of a namespace fd itself. Maybe that's the argument you're lookin for?