Re: [PATCH] fhandle: use more consistent rules for decoding file handle from userns

Amir Goldstein <amir73il@xxxxxxxxx> · Mon, 1 Sep 2025 19:47:51 +0200

On Mon, Sep 1, 2025 at 11:44 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Fri 29-08-25 14:55:13, Amir Goldstein wrote:
> > On Fri, Aug 29, 2025 at 12:50 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Wed 27-08-25 21:43:09, Amir Goldstein wrote:
> > > > Commit 620c266f39493 ("fhandle: relax open_by_handle_at() permission
> > > > checks") relaxed the conditions for decoding a file handle from non init
> > > > userns.
> > > >
> > > > The conditions are that the decoded dentry is accessible from the user
> > > > provided mountfd (or to fs root) and that all the ancestors along the
> > > > path have a valid id mapping in the userns.
> > > >
> > > > These conditions are intentionally more strict than the condition that
> > > > the decoded dentry should be "lookable" by path from the mountfd.
> > > >
> > > > For example, the path /home/amir/dir/subdir is lookable by path from
> > > > unpriv userns of user amir, because /home perms is 755, but the owner of
> > > > /home does not have a valid id mapping in unpriv userns of user amir.
> > > >
> > > > The current code did not check that the decoded dentry itself has a
> > > > valid id mapping in the userns.  There is no security risk in that,
> > > > because that final open still performs the needed permission checks,
> > > > but this is inconsistent with the checks performed on the ancestors,
> > > > so the behavior can be a bit confusing.
> > > >
> > > > Add the check for the decoded dentry itself, so that the entire path,
> > > > including the last component has a valid id mapping in the userns.
> > > >
> > > > Fixes: 620c266f39493 ("fhandle: relax open_by_handle_at() permission checks")
> > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > >
> > > Yeah, probably it's less surprising this way. Feel free to add:
> > >
> >
> > BTW, Jan, I was trying to think about whether we could do
> > something useful with privileged_wrt_inode_uidgid() for filtering
> > events that we queue by group->user_ns.
> >
> > Then users could allow something like:
> > 1. Admin sets up privileged fanotify fd and filesystem watch on
> >     /home filesystem
> > 2. Enters userns of amir and does ioctl to change group->user_ns
> >     to user ns of amir
> > 3. Hands over fanotify fd to monitor process running in amir's userns
> > 4. amir's monitor process gets all events on filesystem /home
> >     whose directory and object uid/gid are mappable to amir's userns
> > 5. With properly configured systems, that will be all the files/dirs under
> >     /home/amir
> >
> > I have posted several POCs in the past trying different approaches
> > for filtering by userns, but I have never tried to take this approach.
> >
> > Compared to subtree filtering, this could be quite pragmatic? Hmm?
>
> This is definitely relatively easy to implement in the kernel. I'm just not
> sure about two things:
>
> 1) Will this be easy enough to use from userspace so that it will get used?
> Mount watches have been created as a "partial" solution for subtree watches
> as well. But in practice it didn't get very widespread use as subtree watch
> replacement because setting up a mountpoint for subtree you want to watch is
> not flexible enough. Setting up userns and id mappings and proper inode
> ownership seems like a similar hassle for anything else than a full home
> dir as well...

I would not suggest this if it were not for systemd-mountfsd which is
designed to allow non-root users to mount "trusted" images (e.g. ext4).

I don't think this feature is already implemented, but an image
auto-generated for the user on demand by mkfs should also be "trusted".

In theory, as user jack, you should be able to spawn an unpriv userns
wherein user jack is uid 0 and get a mount of a freshly formatted ext4 fs
idmapped in a way that only uids from the userns private range could
write to that fs.

*if* this is possible and useful to users, then we will start seeing in the wild
filesystems where all the inodes are owned by a private range of uids,
all mappable to a specific userns.

But TBH, I am not sure whether this is already a reality, a likely future, or neither.
I need to dig some more to understand the future plans for
systemd-mountfsd use cases.

>
> 2) Filtering all events on the fs only by inode owner being mappable to
> user ns looks somewhat dangerous to me. Sure you offload the responsibility
> of the safe setup to userspace but the fact that this completely bypasses
> any permission checks means that configuring the system so that it does not
> leak any unintended information (like filenames or the fact that some things
> have changed, which the user otherwise wouldn't be able to see) might be difficult.
> Consider if e.g. maildir is on your monitored fs and for some reason the
> UID of the postfix is mapped to your user ns (e.g. because the user needs
> access to some file/dir managed by postfix). Then you could monitor all
> fs activity of postfix possibly learning about emails to other persons in
> the system.
>

Well, the rule should be that the user setting group->user_ns is ADMIN
in that userns.

If someone has created a userns where user amir is uid 0 and also
mapped user postfix into the userns of amir, then that gives user amir
full privs to access and modify user postfix owned files, so the privilege
escalation, to the best of my understanding, has already happened way
before user amir started the fanotify monitor.
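
To make the proposed rule concrete, here is a rough userspace model of the
filter, not kernel code. It assumes a userns can be described by uid_map/gid_map
style extents and that an event is delivered only when both the parent
directory and the object have a uid and gid that fall inside the mapped
"outside" ranges. All struct and function names below (id_extent,
userns_model, event_model, event_visible) are hypothetical and exist only for
this illustration; an in-kernel version would presumably build on
privileged_wrt_inode_uidgid() or similar helpers instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one extent, as in a line of /proc/<pid>/uid_map. */
struct id_extent {
	uint32_t inside;   /* first id as seen inside the userns */
	uint32_t outside;  /* first id as seen from the parent userns */
	uint32_t count;    /* number of consecutive ids covered */
};

/* Hypothetical model of a userns: just its uid and gid extents. */
struct userns_model {
	const struct id_extent *uid_map;
	size_t nr_uid;
	const struct id_extent *gid_map;
	size_t nr_gid;
};

/* Hypothetical model of an event: owners of the parent dir and the object. */
struct event_model {
	uint32_t dir_uid, dir_gid;
	uint32_t obj_uid, obj_gid;
};

/* True if the parent-ns id falls inside any mapped "outside" range. */
bool id_is_mapped(const struct id_extent *map, size_t nr, uint32_t id)
{
	assert(map != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (id >= map[i].outside && id - map[i].outside < map[i].count)
			return true;
	}
	return false;
}

/* True if both the uid and the gid are mappable into the userns. */
bool owner_is_mapped(const struct userns_model *ns, uint32_t uid, uint32_t gid)
{
	return id_is_mapped(ns->uid_map, ns->nr_uid, uid) &&
	       id_is_mapped(ns->gid_map, ns->nr_gid, gid);
}

/* Proposed rule: deliver only if directory AND object owners are mappable. */
bool event_visible(const struct userns_model *ns, const struct event_model *ev)
{
	return owner_is_mapped(ns, ev->dir_uid, ev->dir_gid) &&
	       owner_is_mapped(ns, ev->obj_uid, ev->obj_gid);
}
```

With a map that covers only amir's own uid/gid (say 1000, an example value),
an event on a file owned by amir inside a directory owned by amir passes,
while an event on /home/amir itself does not if /home is owned by root. Once
a second extent covering the postfix uid/gid is added to the map, events on
postfix-owned files start passing as well, which is the situation Jan
describes: the exposure comes from the id mapping, not from the monitor.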

> > The difference from subtree filtering is that it shifts the responsibility
> > of making sure that /home/amir and /home/jack have files with uid,gid
> > in different ranges to the OS/runtime, which is a responsibility that
> > some systems are already taking care of anyway.
>
> At this point I'm not convinced there are that many systems where this way
> of filtering would be useful but I could be wrong. The fact that some ID is
> mappable in a namespace looks as kind of weak restriction because you may
> need to map into the namespace various external "system" ids AFAIU. But I
> can see that e.g. for containers the idea of restricting events to inodes
> whose owners are in a range of UIDs may be attractive.

I think that for "system containers" (i.e. a nested OS) this could be
attractive, but I don't feel that I know enough to make an authoritative
statement about this.

Thanks,
Amir.