Re: [PATCH] fhandle: use more consistent rules for decoding file handle from userns

Jan Kara <jack@xxxxxxx> · Thu, 4 Sep 2025 13:23:12 +0200

On Mon 01-09-25 19:47:51, Amir Goldstein wrote:
> On Mon, Sep 1, 2025 at 11:44 AM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Fri 29-08-25 14:55:13, Amir Goldstein wrote:
> > > On Fri, Aug 29, 2025 at 12:50 PM Jan Kara <jack@xxxxxxx> wrote:
> > > >
> > > > On Wed 27-08-25 21:43:09, Amir Goldstein wrote:
> > > > > Commit 620c266f39493 ("fhandle: relax open_by_handle_at() permission
> > > > > > checks") relaxed the conditions for decoding a file handle from non init
> > > > > userns.
> > > > >
> > > > > > The conditions are that the decoded dentry is accessible from the user
> > > > > provided mountfd (or to fs root) and that all the ancestors along the
> > > > > path have a valid id mapping in the userns.
> > > > >
> > > > > These conditions are intentionally more strict than the condition that
> > > > > the decoded dentry should be "lookable" by path from the mountfd.
> > > > >
> > > > > For example, the path /home/amir/dir/subdir is lookable by path from
> > > > > unpriv userns of user amir, because /home perms is 755, but the owner of
> > > > > /home does not have a valid id mapping in unpriv userns of user amir.
> > > > >
> > > > > The current code did not check that the decoded dentry itself has a
> > > > > valid id mapping in the userns.  There is no security risk in that,
> > > > > because that final open still performs the needed permission checks,
> > > > > but this is inconsistent with the checks performed on the ancestors,
> > > > > so the behavior can be a bit confusing.
> > > > >
> > > > > Add the check for the decoded dentry itself, so that the entire path,
> > > > > including the last component has a valid id mapping in the userns.
> > > > >
> > > > > Fixes: 620c266f39493 ("fhandle: relax open_by_handle_at() permission checks")
> > > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > > >
> > > > Yeah, probably it's less surprising this way. Feel free to add:
> > > >
> > >
> > > BTW, Jan, I was trying to think about whether we could do
> > > something useful with privileged_wrt_inode_uidgid() for filtering
> > > events that we queue by group->user_ns.
> > >
> > > Then users could allow something like:
> > > 1. Admin sets up privileged fanotify fd and filesystem watch on
> > >     /home filesystem
> > > 2. Enters userns of amir and does ioctl to change group->user_ns
> > >     to user ns of amir
> > > 3. Hands over fanotify fd to monitor process running in amir's userns
> > > 4. amir's monitor process gets all events on filesystem /home
> > >     whose directory and object uid/gid are mappable to amir's userns
> > > 5. With properly configured systems, that would be all the files/dirs under
> > >     /home/amir
> > >
> > > I have posted several POCs in the past trying different approaches
> > > for filtering by userns, but I have never tried to take this approach.
> > >
> > > Compared to subtree filtering, this could be quite pragmatic? Hmm?
> >
> > This is definitely relatively easy to implement in the kernel. I'm just not
> > sure about two things:
> >
> > 1) Will this be easy enough to use from userspace so that it will get used?
> > Mount watches have been created as a "partial" solution for subtree watches
> > as well. But in practice it didn't get very widespread use as subtree watch
> > replacement because setting up a mountpoint for the subtree you want to watch is
> > not flexible enough. Setting up userns and id mappings and proper inode
> > ownership seems like a similar hassle for anything other than a full home
> > dir as well...
> 
> I would not suggest this if it were not for systemd-mountfsd which is
> designed to allow non-root users to mount "trusted" images (e.g. ext4).
> 
> I don't think this feature is already implemented, but an image
> auto-generated for the user on demand by mkfs should also be "trusted".
> 
> In theory, as user jack, you should be able to spawn an unpriv userns
> wherein user jack is uid 0 and get a mount of a freshly formatted ext4 fs
> idmapped in a way that only uids from the userns private range could
> write to that fs.

Ah, I see. Yes, I've heard of similar plans in systemd land.

> *if* this is possible and useful to users, then we will start seeing in
> the wild filesystems where all the inodes are owned by a private range of
> uids, all mappable to a specific userns.

Right. But I expect that the sb->s_user_ns will point to the user's
namespace in that case? So that all the possibly preexisting fs content
gets properly mapped to ids available to the user? If that's the case we'd
already allow placing a filesystem mark on such superblocks and there's no
need for filtering?

But I think your original usecase mentioned a different situation with a
filesystem shared by multiple users (/home) but additional idmapping set in
the user namespace where the process is running.

> But TBH, I am not sure if this is already a reality or a likely future or not.
> I need to dig some more to understand the future plans for
> systemd-mountfsd use cases.
> 
> > 2) Filtering all events on the fs only by inode owner being mappable to
> > user ns looks somewhat dangerous to me. Sure you offload the responsibility
> > of the safe setup to userspace but the fact that this completely bypasses
> > any permission checks means that configuring the system so that it does not
> > leak any unintended information (like filenames or the fact that some things
> > have changed, which the user otherwise wouldn't be able to see) might be difficult.
> > Consider if e.g. maildir is on your monitored fs and for some reason the
> > UID of the postfix is mapped to your user ns (e.g. because the user needs
> > access to some file/dir managed by postfix). Then you could monitor all
> > fs activity of postfix possibly learning about emails to other persons in
> > the system.
> 
> Well, the rule should be that the user setting group->user_ns is ADMIN
> in that userns.
> 
> If someone has created a userns where user amir is uid 0 and also
> mapped user postfix into the userns of amir, then that gives user amir
> full privs to access and modify user postfix owned files, so the privilege
> escalation, to the best of my understanding, has already happened way
> before user amir started the fanotify monitor.

Right, sorry, I didn't quite think this through. Indeed, as far as I've
checked, Kubernetes for example uses disjoint ranges of UIDs to map into the
user namespaces of different containers. So filtering filesystem events to
inodes whose ids are mappable in such a user NS should be OK. But it would be
good to verify with somebody who has more experience with this namespacing
stuff than me :)
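
To make the rule discussed above concrete, here is a rough userspace model
in Python of filtering events by whether the owner ids are mappable in a
user namespace. It is only a sketch and not kernel code: Extent,
map_into_ns() and event_visible() are made-up names, the extents mimic the
three columns of /proc/<pid>/uid_map, and the two id ranges are invented
examples of disjoint per-container ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """One mapping line, modelled on the columns of /proc/<pid>/uid_map."""
    inside: int   # first id as seen inside the user namespace
    outside: int  # first id as seen outside (here: the initial namespace)
    count: int    # number of consecutive ids covered


def map_into_ns(extents, host_id):
    """Return the id as seen inside the namespace, or None if unmapped."""
    for e in extents:
        if e.outside <= host_id < e.outside + e.count:
            return e.inside + (host_id - e.outside)
    return None


def event_visible(uid_map, gid_map, dir_owner, obj_owner):
    """Hypothetical filter: deliver the event only if both the directory
    and the object have a uid and a gid that map into the namespace."""
    return all(
        map_into_ns(uid_map, uid) is not None
        and map_into_ns(gid_map, gid) is not None
        for uid, gid in (dir_owner, obj_owner)
    )


# Two containers with disjoint host id ranges (invented numbers).
MAP_A = [Extent(inside=0, outside=100000, count=65536)]
MAP_B = [Extent(inside=0, outside=200000, count=65536)]

# (name, (dir uid, dir gid), (object uid, object gid)), all host ids.
EVENTS = [
    ("a/file", (100000, 100000), (100001, 100001)),
    ("b/file", (200000, 200000), (200005, 200005)),
    ("shared/file", (0, 0), (100001, 100001)),  # parent dir owned by root
]


def visible_names(id_map):
    """Names of the events a group bound to this namespace would receive."""
    return [name for name, d, o in EVENTS
            if event_visible(id_map, id_map, d, o)]


print(visible_names(MAP_A))
print(visible_names(MAP_B))
```

In this model each namespace only receives events for objects whose owner
and parent directory owner both fall in its own range, and the object under
the root-owned directory is delivered to neither. It says nothing about
what additional permission checks the kernel would or should perform.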

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR