Re: [PATCH v2 2/2] fanotify: support watching filesystems and mounts inside userns

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 16 May 2025 20:52:57 +0200

On Fri, May 16, 2025 at 7:28 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Wed, May 14, 2025 at 8:39 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > On Wed, May 14, 2025 at 5:49 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Sat 19-04-25 12:06:57, Amir Goldstein wrote:
> > > > An unprivileged user is allowed to create an fanotify group and add
> > > > inode marks, but not filesystem, mntns and mount marks.
> > > >
> > > > Add limited support for setting up filesystem, mntns and mount marks by
> > > > an unprivileged user under the following conditions:
> > > >
> > > > 1.   User has CAP_SYS_ADMIN in the user ns where the group was created
> > > > 2.a. User has CAP_SYS_ADMIN in the user ns where the filesystem was
> > > >      mounted (implies FS_USERNS_MOUNT)
> > > >   OR (in case setting up a mntns or mount mark)
> > > > 2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> > >
> > > I'm sorry for the delay. Both patches look good to me but I'd like somebody
> > > more versed with user namespaces to double check namespace checks in this
> > > patch are indeed sound (so that we don't introduce some security issue).
> >
> > Good idea!
> >
> > > Christian, can you have a look please?
> > >
> >
> > Christian,
> >
> > Please note that the checks below are loosely modeled after the tests in
> > may_decode_fh(), with some differences:
> >
> > > >
> > > >       /*
> > > > -      * An unprivileged user is not allowed to setup mount nor filesystem
> > > > -      * marks.  This also includes setting up such marks by a group that
> > > > -      * was initialized by an unprivileged user.
> > > > +      * A user is allowed to setup sb/mount/mntns marks only if it is
> > > > +      * capable in the user ns where the group was created.
> > > >        */
> > > > -     if ((!capable(CAP_SYS_ADMIN) ||
> > > > -          FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
> > > > +     if (!ns_capable(group->user_ns, CAP_SYS_ADMIN) &&
> > > >           mark_type != FAN_MARK_INODE)
> > > >               return -EPERM;
> > > >
> >
> > 1. This is an extra restriction. Not sure is need to remain forever,
> > but it reduces
> >     attack surface and does not limit the common use cases,
> >     so I think it's worth to have.
> >
> > > > @@ -1987,12 +1988,27 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> > > >               obj = inode;
> > > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> > > >               obj = path.mnt;
> > > > +             user_ns = real_mount(obj)->mnt_ns->user_ns;
> > > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_SB) {
> > > >               obj = path.mnt->mnt_sb;
> > > > +             user_ns = path.mnt->mnt_sb->s_user_ns;
> > > >       } else if (obj_type == FSNOTIFY_OBJ_TYPE_MNTNS) {
> > > >               obj = mnt_ns_from_dentry(path.dentry);
> > > > +             user_ns = ((struct mnt_namespace *)obj)->user_ns;
> > > >       }
> > > >
> > > > +     /*
> > > > +      * In addition to being capable in the user ns where group was created,
> > > > +      * the user also needs to be capable in the user ns associated with
> > > > +      * the marked filesystem (for FS_USERNS_MOUNT filesystems) or in the
> > > > +      * user ns associated with the mntns (when marking a mount or mntns).
> > > > +      * This is aligned with the required permissions to open_by_handle_at()
> > > > +      * a directory fid provided with the events.
> > > > +      */
> > > > +     ret = -EPERM;
> > > > +     if (user_ns && !ns_capable(user_ns, CAP_SYS_ADMIN))
> > > > +             goto path_put_and_out;
> > > > +
> >
> > 2. In may_decode_fh() we know the mount that resulting file will be
> >     opened from so we accept
> >     Either capable mnt->mnt_sb->s_user_ns
> >     OR capable real_mount(mnt)->mnt_ns->user_ns
> > whereas here we only check capable mnt->mnt_sb->s_user_ns
> >    when subscribing to fs events on sb
> >    and only check capable real_mount(mnt)->mnt_ns->user_ns
> >    when subscribing to fs events on a specific mount
> >
> > I am not sure if there is a use case to allow watching events on a
> > specific mount for capable mnt->mnt_sb->s_user_ns where
> > real_mount(mnt)->mnt_ns->user_ns is not capable
> > or if that setup is even possible?
> >
> > 3. Unlike may_decode_fh(), we do not check has_locked_children()
> >     Not sure how bad it is to allow receiving events for fs changes in
> >     obstructed paths (with file handles that cannot be opened).
> > If this case does needs to be checked  then perhaps we should
> > check for capable mnt->mnt_sb->s_user_ns also when subscribing
> > to fs events on a mount.
> >
>
> OK, after some thinking I think it is best to align the logic to match
> may_decode_fh() more closely, like this:
>
> @@ -1986,13 +1987,40 @@ static int do_fanotify_mark(int fanotify_fd,
> unsigned int flags, __u64 mask,
>                 inode = path.dentry->d_inode;
>                 obj = inode;
>         } else if (obj_type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
> +               struct mount *mnt = real_mount(path.mnt);
> +
>                 obj = path.mnt;
> +               user_ns = path.mnt->mnt_sb->s_user_ns;
> +               /*
> +                * Do not allow watching a mount with locked mounts on top
> +                * that could be hiding access to paths, unless user is also
> +                * capable on the user ns that created the sb.
> +                */
> +               if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
> +                   !has_locked_children(mnt, path.mnt->mnt_root))
> +                       user_ns = mnt->mnt_ns->user_ns;

No scratch that.
I will send v2 that is much simpler.

Thanks,
Amir.