On Tue, Aug 12, 2025 at 2:50 AM Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
>
> On Wed, 2025-08-06 at 14:23 +0530, Kotresh Hiremath Ravishankar wrote:
> > On Sat, Aug 2, 2025 at 1:31 AM Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
> > > On Fri, 2025-08-01 at 22:59 +0530, Kotresh Hiremath Ravishankar wrote:
> > > >
> > > > Hi,
> > > >
> > > > 1. I will modify the commit message to clearly explain the issue in the next revision.
> > > > 2. The maximum possible length for the fsname is not defined on the mds side. I didn't
> > > > find any restriction imposed on the length, so we need to live with it.
> > >
> > > We have two constants in the Linux kernel [1]:
> > >
> > > #define NAME_MAX 255 /* # chars in a file name */
> > > #define PATH_MAX 4096 /* # chars in a path name including nul */
> > >
> > > I don't think that fsname can be bigger than PATH_MAX.
> >
> > As I mentioned earlier, the CephFS server-side code does not restrict the filesystem
> > name during creation. I validated the creation of a filesystem name with a length of
> > 5000. Please try the following.
> >
> > [kotresh@fedora build]$ alpha_str=$(< /dev/urandom tr -dc 'a-zA-Z' | head -c 5000)
> > [kotresh@fedora build]$ ceph fs new $alpha_str cephfs_data cephfs_metadata
> > [kotresh@fedora build]$ bin/ceph fs ls
> >
> > So restricting the fsname length in the kclient would likely cause issues. If we need
> > to enforce the limitation, I think it should be done on the server side first, and
> > it's a separate effort.
>
> I am not sure that the Linux kernel is capable of digesting any name bigger than
> NAME_MAX. Are you sure that we can pass xfstests with a filesystem name bigger
> than NAME_MAX? Another point here is that we could put a buffer with the name inline
> into struct ceph_mdsmap if the name cannot be bigger than NAME_MAX, for example.
> In this case we wouldn't need to allocate fs_name memory for every
> ceph_mdsmap_decode() call.

Well, I haven't tried xfstests with a filesystem name bigger than NAME_MAX. But I did
try mounting a ceph filesystem whose name is bigger than NAME_MAX, and it works.
Mounting a ceph filesystem whose name is bigger than PATH_MAX didn't work. Note that
creating a ceph filesystem with a name bigger than PATH_MAX works, and mounting it
using the fuse client works as well.

I was going through the ceph kernel client code; historically, the filesystem name is
stored as a char pointer. The filesystem name from the mount options is stored in
'struct ceph_mount_options' in 'ceph_parse_new_source', and the same is used to compare
against the fsmap received from the mds in 'ceph_mdsc_handle_fsmap':

struct ceph_mount_options {
        ...
        char *mds_namespace;  /* default NULL */
        ...
};

I am not sure which is the right approach to choose here. In the kclient, assuming
that a ceph fsname cannot be bigger than PATH_MAX seems safe, as the kclient today is
not able to mount a ceph fsname bigger than PATH_MAX. I also observed that the kclient
failed to mount a ceph fsname with a length a little less than PATH_MAX (4090), so
it's breaking somewhere with the entire path component being considered. Anyway, I
will open the discussion to everyone here. If we are restricting the max fsname
length, we need to restrict it at creation time, in my opinion, and fix it across the
project in both the kclient and fuse.
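
For the sake of discussion, if we did cap it in the kclient, the guard could be as
simple as something like the following at mount-option parse time (a rough, untested
sketch; the helper name is made up, and PATH_MAX is picked only because that is where
mounting already breaks today):

/* Untested sketch: reject oversized fsnames while parsing the mount
 * source, assuming PATH_MAX is agreed upon as the cap. */
static int ceph_check_fsname_len(const char *fs_name)
{
        if (strnlen(fs_name, PATH_MAX) >= PATH_MAX)
                return -ENAMETOOLONG;
        return 0;
}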
> > > > 3. I will fix up doutc in the next revision.
> > > > 4. The fs_name is part of the mdsmap on the server side [1]. The kernel client
> > > > decodes only the necessary fields from the mdsmap sent by the server. Until now,
> > > > the fs_name was not being decoded; as part of this fix, it's required and is being
> > > > decoded.
> > >
> > > Correct me if I am wrong. I can create a Ceph cluster with several MDS servers.
> > > In this cluster, I can create multiple file system volumes. And every file
> > > system volume will have some name (fs_name). So, if we store fs_name into
> > > mdsmap, then which name do we imply here? Do we imply the cluster name as fs_name
> > > or the name of a particular file system volume?
> >
> > In CephFS, we mainly deal with two maps, MDSMap [1] and FSMap [2]. The MDSMap
> > represents the state of one particular filesystem, so the 'fs_name' in the MDSMap is
> > the name of the file system that the MDSMap represents. Each filesystem has a
> > distinct MDSMap. The FSMap was introduced to support multiple filesystems in the
> > cluster; it holds all the filesystems in the cluster using the MDSMap of each file
> > system. The clients subscribe to these maps. So when the kclient receives an mdsmap,
> > it corresponds to the filesystem the client is dealing with.
>
> So, it sounds to me like the MDS keeps multiple MDSMaps for multiple file systems,
> and the kernel side receives only the MDSMap for operations. The FSMap is kept on the
> MDS side and the kernel never receives it. Am I right here?

No, not really. The kclient decodes the FSMap as well. The fsname and monitor ip are
passed in the mount command; the kclient contacts the monitor and receives the list of
the file systems in the cluster via the FSMap. The fsname passed in the mount command
is compared against the list of file systems in the decoded FSMap. If the fsname is
found, the kclient fetches the fscid and requests the corresponding mdsmap from the
respective mds using the fscid.
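
In other words, the lookup is conceptually just the following (an illustrative sketch
only; the names below are made up, and the real logic lives in
ceph_mdsc_handle_fsmap()):

/* Illustrative sketch of the mount-time resolution described above. */
for (i = 0; i < num_fs; i++) {
        if (!strcmp(fs_entries[i].name, fsc->mount_options->mds_namespace)) {
                /* found it: remember the fscid and subscribe to
                 * "mdsmap.<fscid>" for this filesystem's MDSMap */
                fscid = fs_entries[i].fscid;
                break;
        }
}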
>
> Thanks,
> Slava.
>
> > [1] https://github.com/ceph/ceph/blob/main/src/mds/MDSMap.h
> > [2] https://github.com/ceph/ceph/blob/main/src/mds/FSMap.h
> >
> > Thanks,
> > Kotresh H R
> >
> > > Thanks,
> > > Slava.
> > >
> > > [1] https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/limits.h#L12
> > >
> > > > [1] https://github.com/ceph/ceph/blob/main/src/mds/MDSMap.h#L596
> > > >
> > > > On Tue, Jul 29, 2025 at 11:57 PM Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
> > > > > On Tue, 2025-07-29 at 22:32 +0530, khiremat@xxxxxxxxxx wrote:
> > > > > > From: Kotresh HR <khiremat@xxxxxxxxxx>
> > > > > >
> > > > > > The mds auth caps check should also validate the
> > > > > > fsname along with the associated caps. Not doing
> > > > > > so would result in applying the mds auth caps of
> > > > > > one fs onto the other fs in a multifs ceph cluster.
> > > > > > The patch fixes the same.
> > > > > >
> > > > > > Steps to Reproduce (on vstart cluster):
> > > > > > 1. Create two file systems in a cluster, say 'a' and 'b'
> > > > > > 2. ceph fs authorize a client.usr / r
> > > > > > 3. ceph fs authorize b client.usr / rw
> > > > > > 4. ceph auth get client.usr >> ./keyring
> > > > > > 5. sudo bin/mount.ceph usr@.a=/ /kmnt_a_usr/
> > > > > > 6. touch /kmnt_a_usr/file1 (SHOULD NOT BE ALLOWED)
> > > > > > 7. sudo bin/mount.ceph admin@.a=/ /kmnt_a_admin
> > > > > > 8. echo "data" > /kmnt_a_admin/admin_file1
> > > > > > 9. rm -f /kmnt_a_usr/admin_file1 (SHOULD NOT BE ALLOWED)
> > > > >
> > > > > I think we are missing an explanation here of which problem or
> > > > > symptoms the user that has this issue will see. I assume that
> > > > > it will be seen as in this issue reproduction:
> > > > >
> > > > > With client_3, which has only 1 filesystem in caps, it is working as expected:
> > > > > mkdir /mnt/client_3/test_3
> > > > > mkdir: cannot create directory ‘/mnt/client_3/test_3’: Permission denied
> > > > >
> > > > > Am I correct here?
> > > > >
> > > > > > URL: https://tracker.ceph.com/issues/72167
> > > > > > Signed-off-by: Kotresh HR <khiremat@xxxxxxxxxx>
> > > > > > ---
> > > > > >  fs/ceph/mds_client.c | 10 ++++++++++
> > > > > >  fs/ceph/mdsmap.c     | 11 ++++++++++-
> > > > > >  fs/ceph/mdsmap.h     |  1 +
> > > > > >  3 files changed, 21 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > > > index 9eed6d73a508..ba91f3360ff6 100644
> > > > > > --- a/fs/ceph/mds_client.c
> > > > > > +++ b/fs/ceph/mds_client.c
> > > > > > @@ -5640,11 +5640,21 @@ static int ceph_mds_auth_match(struct ceph_mds_client *mdsc,
> > > > > >          u32 caller_uid = from_kuid(&init_user_ns, cred->fsuid);
> > > > > >          u32 caller_gid = from_kgid(&init_user_ns, cred->fsgid);
> > > > > >          struct ceph_client *cl = mdsc->fsc->client;
> > > > > > +        const char *fs_name = mdsc->mdsmap->fs_name;
> > > > > >          const char *spath = mdsc->fsc->mount_options->server_path;
> > > > > >          bool gid_matched = false;
> > > > > >          u32 gid, tlen, len;
> > > > > >          int i, j;
> > > > > >
> > > > > > +        if (auth->match.fs_name && strcmp(auth->match.fs_name, fs_name)) {
> > > > >
> > > > > Should we consider using strncmp() here?
> > > > > We should have a limit on the maximum possible name length.
> > > > >
> > > > > > +                doutc(cl, "fsname check failed fs_name=%s match.fs_name=%s\n",
> > > > > > +                      fs_name, auth->match.fs_name);
> > > > > > +                return 0;
> > > > > > +        } else {
> > > > > > +                doutc(cl, "fsname check passed fs_name=%s match.fs_name=%s\n",
> > > > > > +                      fs_name, auth->match.fs_name);
> > > > >
> > > > > I assume that we could call the doutc with auth->match.fs_name == NULL. So, I am
> > > > > expecting to have a crash here.
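
Yes, as mentioned in point 3 above, I will fix up the doutc calls in the next revision.
A NULL-safe form could look roughly like this (untested sketch):

                doutc(cl, "fsname check passed fs_name=%s match.fs_name=%s\n",
                      fs_name,
                      auth->match.fs_name ? auth->match.fs_name : "(null)");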
> > > > > > > > > > > + } > > > > > > + > > > > > > doutc(cl, "match.uid %lld\n", auth->match.uid); > > > > > > if (auth->match.uid != MDS_AUTH_UID_ANY) { > > > > > > if (auth->match.uid != caller_uid) > > > > > > diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c > > > > > > index 8109aba66e02..f1431ba0b33e 100644 > > > > > > --- a/fs/ceph/mdsmap.c > > > > > > +++ b/fs/ceph/mdsmap.c > > > > > > @@ -356,7 +356,15 @@ struct ceph_mdsmap *ceph_mdsmap_decode(struct ceph_mds_client *mdsc, void **p, > > > > > > /* enabled */ > > > > > > ceph_decode_8_safe(p, end, m->m_enabled, bad_ext); > > > > > > /* fs_name */ > > > > > > - ceph_decode_skip_string(p, end, bad_ext); > > > > > > + m->fs_name = ceph_extract_encoded_string(p, end, NULL, GFP_NOFS); > > > > > > + if (IS_ERR(m->fs_name)) { > > > > > > + err = PTR_ERR(m->fs_name); > > > > > > + m->fs_name = NULL; > > > > > > + if (err == -ENOMEM) > > > > > > + goto out_err; > > > > > > + else > > > > > > + goto bad; > > > > > > + } > > > > > > } > > > > > > /* damaged */ > > > > > > if (mdsmap_ev >= 9) { > > > > > > @@ -418,6 +426,7 @@ void ceph_mdsmap_destroy(struct ceph_mdsmap *m) > > > > > > kfree(m->m_info); > > > > > > } > > > > > > kfree(m->m_data_pg_pools); > > > > > > + kfree(m->fs_name); > > > > > > kfree(m); > > > > > > } > > > > > > > > > > > > diff --git a/fs/ceph/mdsmap.h b/fs/ceph/mdsmap.h > > > > > > index 1f2171dd01bf..acb0a2a3627a 100644 > > > > > > --- a/fs/ceph/mdsmap.h > > > > > > +++ b/fs/ceph/mdsmap.h > > > > > > @@ -45,6 +45,7 @@ struct ceph_mdsmap { > > > > > > bool m_enabled; > > > > > > bool m_damaged; > > > > > > int m_num_laggy; > > > > > > + char *fs_name; > > > > > > > > > > The ceph_mdsmap structure describes servers in the mds cluster [1]. > > > > > Semantically, I don't see any relation of fs_name with this structure. > > > > > As a result, I don't see the point to keep this pointer in this structure. > > > > > Why the fs_name has been placed in this structure? > > > > > > > > > > Thanks, > > > > > Slava. > > > > > > > > > > > }; > > > > > > > > > > > > static inline struct ceph_entity_addr * > > > > > > > > > > [1] https://elixir.bootlin.com/linux/v6.16/source/fs/ceph/mdsmap.h#L11 > > > > >