Re: [PATCH] fhandle: use more consistent rules for decoding file handle from userns

Jan Kara <jack@xxxxxxx> · Mon, 1 Sep 2025 11:44:02 +0200

On Fri 29-08-25 14:55:13, Amir Goldstein wrote:
> On Fri, Aug 29, 2025 at 12:50 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Wed 27-08-25 21:43:09, Amir Goldstein wrote:
> > > Commit 620c266f39493 ("fhandle: relax open_by_handle_at() permission
> > > checks") relaxed the conditions for decoding a file handle from non init
> > > userns.
> > >
> > > The conditions are that the decoded dentry is accessible from the user
> > > provided mountfd (or to fs root) and that all the ancestors along the
> > > path have a valid id mapping in the userns.
> > >
> > > These conditions are intentionally more strict than the condition that
> > > the decoded dentry should be "lookable" by path from the mountfd.
> > >
> > > For example, the path /home/amir/dir/subdir is lookable by path from
> > > unpriv userns of user amir, because /home perms is 755, but the owner of
> > > /home does not have a valid id mapping in unpriv userns of user amir.
> > >
> > > The current code did not check that the decoded dentry itself has a
> > > valid id mapping in the userns.  There is no security risk in that,
> > > because that final open still performs the needed permission checks,
> > > but this is inconsistent with the checks performed on the ancestors,
> > > so the behavior can be a bit confusing.
> > >
> > > Add the check for the decoded dentry itself, so that the entire path,
> > > including the last component has a valid id mapping in the userns.
> > >
> > > Fixes: 620c266f39493 ("fhandle: relax open_by_handle_at() permission checks")
> > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> >
> > Yeah, probably it's less surprising this way. Feel free to add:
> >
> 
> BTW, Jan, I was trying to think about whether we could do
> something useful with privileged_wrt_inode_uidgid() for filtering
> events that we queue by group->user_ns.
> 
> Then users could allow something like:
> 1. Admin sets up privileged fanotify fd and filesystem watch on
>     /home filesystem
> 2. Enters userns of amir and does ioctl to change group->user_ns
>     to user ns of amir
> 3. Hands over fanotify fd to monitor process running in amir's userns
> 4. amir's monitor process gets all events on filesystem /home
>     whose directory and object uid/gid are mappable to amir's userns
> 5. With properly configured systems, that will be all the files/dirs under
>     /home/amir
> 
> I have posted several POCs in the past trying different approaches
> for filtering by userns, but I have never tried to take this approach.
> 
> Compared to subtree filtering, this could be quite pragmatic? Hmm?

This is definitely relatively easy to implement in the kernel. I'm just not
sure about two things:

1) Will this be easy enough to use from userspace so that it will get used?
Mount watches were created as a "partial" solution for subtree watches as
well. But in practice they didn't get very widespread use as a subtree watch
replacement because setting up a mountpoint for the subtree you want to
watch is not flexible enough. Setting up a userns, id mappings and proper
inode ownership seems like a similar hassle for anything other than a full
home dir as well...

2) Filtering all events on the fs only by the inode owner being mappable to
the user ns looks somewhat dangerous to me. Sure, you offload the
responsibility for a safe setup to userspace, but the fact that this
completely bypasses any permission checks means that configuring the system
so that it does not leak any unintended information (like filenames, or the
fact that something has changed which the user otherwise wouldn't be able
to see) might be difficult. Consider e.g. that a maildir is on your
monitored fs and for some reason the UID of postfix is mapped to your user
ns (e.g. because the user needs access to some file/dir managed by
postfix). Then you could monitor all fs activity of postfix, possibly
learning about emails to other people on the system.
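
To make the concern concrete, here is a small userspace model of such a
filter. It is only a sketch: the struct and function names are made up for
illustration, the extents mimic the lines of /proc/<pid>/uid_map, and it is
not how the kernel helpers are actually implemented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One extent of a userns id map, the same triple as a line of
 * /proc/<pid>/uid_map: <id inside ns> <id outside ns> <length>.
 */
struct id_extent {
	uint32_t ns_first;
	uint32_t host_first;
	uint32_t count;
};

struct id_map {
	const struct id_extent *extents;
	size_t nr_extents;
};

/* True if the host id falls inside some extent, i.e. is mappable. */
static bool id_has_mapping(const struct id_map *map, uint32_t host_id)
{
	assert(map != NULL);
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		/* Subtraction form avoids overflow of host_first + count. */
		if (host_id >= e->host_first &&
		    host_id - e->host_first < e->count)
			return true;
	}
	return false;
}

/*
 * Hypothetical filter: the event is queued iff both owner ids are
 * mappable. No mode bits or path permissions are consulted at all.
 */
static bool event_visible(const struct id_map *uids,
			  const struct id_map *gids,
			  uint32_t owner_uid, uint32_t owner_gid)
{
	return id_has_mapping(uids, owner_uid) &&
	       id_has_mapping(gids, owner_gid);
}
```

With extents { 0, 1000, 1 } and { 89, 89, 1 } in both maps (89 standing in
for postfix, an assumed number), an inode owned by 89:89 passes the filter
regardless of whose maildir it sits in, while an inode owned by 1001:1001
does not.
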

> The difference from subtree filtering is that it shifts the responsibility
> of making sure that /home/amir and /home/jack have files with uid,gid
> in different ranges to the OS/runtime, which is a responsibility that
> some systems are already taking care of anyway.

At this point I'm not convinced there are that many systems where this way
of filtering would be useful, but I could be wrong. The fact that some ID is
mappable in a namespace looks like a rather weak restriction because you may
need to map various external "system" ids into the namespace AFAIU. But I
can see that e.g. for containers the idea of restricting events to inodes
whose owners are in a range of UIDs may be attractive.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR