Re: [PATCH] ceph: Fix multifs mds auth caps issue

On Wed, Aug 13, 2025 at 11:52 PM Viacheslav Dubeyko
<Slava.Dubeyko@xxxxxxx> wrote:
>
> On Wed, 2025-08-13 at 12:58 +0530, Kotresh Hiremath Ravishankar wrote:
> > On Wed, Aug 13, 2025 at 1:22 AM Viacheslav Dubeyko
> > <Slava.Dubeyko@xxxxxxx> wrote:
> > >
> > > On Tue, 2025-08-12 at 14:37 +0530, Kotresh Hiremath Ravishankar wrote:
> > > > On Tue, Aug 12, 2025 at 2:50 AM Viacheslav Dubeyko
> > > > <Slava.Dubeyko@xxxxxxx> wrote:
> > > > >
> > > > > On Wed, 2025-08-06 at 14:23 +0530, Kotresh Hiremath Ravishankar wrote:
> > > > > >
> > > > > > On Sat, Aug 2, 2025 at 1:31 AM Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
> > > > > > >
> > > > > > > On Fri, 2025-08-01 at 22:59 +0530, Kotresh Hiremath Ravishankar wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > 1. I will modify the commit message to clearly explain the issue in the next revision.
> > > > > > > > 2. The maximum possible length for the fsname is not defined on the MDS side. I didn't find any restriction imposed on the length, so we need to live with it.
> > > > > > >
> > > > > > > We have two constants in the Linux kernel [1]:
> > > > > > >
> > > > > > > #define NAME_MAX         255    /* # chars in a file name */
> > > > > > > #define PATH_MAX        4096    /* # chars in a path name including nul */
> > > > > > >
> > > > > > > I don't think that fsname can be bigger than PATH_MAX.
> > > > > >
> > > > > > As I had mentioned earlier, the CephFS server side code does not restrict the
> > > > > > filesystem name length during creation. I validated creating a filesystem name with a length of 5000.
> > > > > > Please try the following.
> > > > > >
> > > > > > [kotresh@fedora build]$ alpha_str=$(< /dev/urandom tr -dc 'a-zA-Z' | head -c 5000)
> > > > > > [kotresh@fedora build]$ ceph fs new $alpha_str cephfs_data cephfs_metadata
> > > > > > [kotresh@fedora build]$ bin/ceph fs ls
> > > > > >
> > > > > > So restricting the fsname length in the kclient would likely cause issues. If we need to enforce the limitation, I think it should be done on the server side first, and it’s a separate effort.
> > > > > >
> > > > >
> > > > > I am not sure that the Linux kernel is capable of digesting any name bigger
> > > > > than NAME_MAX. Are you sure that we can pass xfstests with a filesystem name
> > > > > bigger than NAME_MAX? Another point here is that we could put an inline name
> > > > > buffer into struct ceph_mdsmap if the name cannot be bigger than NAME_MAX,
> > > > > for example. In this case we wouldn't need to allocate fs_name memory for
> > > > > every ceph_mdsmap_decode() call.
> > > >
> > > > Well, I haven't tried xfstests with a filesystem name bigger than
> > > > NAME_MAX. But I did try mounting a ceph filesystem with a name bigger than
> > > > NAME_MAX and it works.
> > > > But mounting a ceph filesystem with a name bigger than PATH_MAX didn't work.
> > > > Note that creating a ceph filesystem with a name bigger than PATH_MAX works,
> > > > and mounting the same with the fuse client works as well.
> > > >
> > >
> > > The mount operation creates only the root folder. So, probably, the kernel can
> > > survive creating the root folder if the filesystem name fits into PATH_MAX.
> > > However, if we then try to create more folders and files in the root folder,
> > > the path becomes bigger and bigger. And I think that the total path name length
> > > should be less than PATH_MAX. So, I could say that it is much safer to assume
> > > that the filesystem name should fit into NAME_MAX.
> >
> > I didn't spend time root-causing this issue. But logically, it makes
> > sense to fit into NAME_MAX in order to adhere to PATH_MAX.
> > But where does the filesystem name get used as a path component
> > internally such that it affects functionality? I can think of
> > /etc/fstab and /proc/mounts, but can that affect functionality?
> >
>
> Usually, file systems don't have a file system name. :) The volume can be
> identified by a UUID (128 bits) that is usually stored in the superblock. So,
> CephFS could be slightly special if it adds the file system name into the path
> for file system operations.
>
> But I cannot imagine anyone who needs to create a Ceph filesystem with a name
> longer than NAME_MAX. :)
>
> I had hoped that the file system name is used by the CephFS kernel client
> internally and never involved in path operations. If we have some issues here,
> then we need to take a deeper look.
>
> > >
> > > > I was going through the ceph kernel client code; historically, the
> > > > filesystem name is stored as a char pointer. The filesystem name from
> > > > the mount options is stored in 'struct ceph_mount_options' in
> > > > 'ceph_parse_new_source', and the same is used to compare against the
> > > > fsmap received from the mds in 'ceph_mdsc_handle_fsmap':
> > > >
> > > > struct ceph_mount_options {
> > > >     ...
> > > >     char *mds_namespace;  /* default NULL */
> > > >     ...
> > > > };
> > > >
> > >
> > > There are no historical traditions here. :) It is only a question of efficiency.
> > > If we know the limit on the name length, then it could be more efficient to have
> > > a static name buffer embedded into the structure instead of a dynamic memory
> > > allocation, because we allocate memory for frequently accessed objects from a
> > > kmem_cache or memory pool. Allocating memory from the SLUB allocator could be
> > > not only inefficient, but the allocation request could also fail if the system
> > > is under memory pressure.
> >
> > Yes, absolutely, correctness and efficiency are what matter. On the
> > correctness part, there are multiple questions/points to be considered.
> >
> > 1. What happens to existing filesystems whose name length is greater
> > than NAME_MAX?
>
> As I mentioned, usually, file systems don't have a file system name.
>
> > 2. We should restrict the creation of filesystem names longer than NAME_MAX.
> > 3. We should also enforce the same on fuse clients.
> > 4. We should do it in all the places in the kernel code where the
> > fsname is stored and used.
> >
> > Thinking along the above lines, my opinion is that enforcing an fsname
> > length limitation should be a separate effort and is outside the scope
> > of this patch.
> >
> >
>
> I had two points from the beginning of the discussion: (1) consider using an
> inline buffer for the file system name, (2) struct ceph_mdsmap is not the
> proper place for the file system name.
>
> OK. I agree that restricting the creation of filesystem names longer than
> NAME_MAX should be considered an independent task.
>

Are you suggesting using an inline buffer for the fsname, with NAME_MAX as the
limit, in this patch?
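If so, something like the following rough sketch is what I imagine (illustrative
only and untested; it assumes NAME_MAX is an acceptable cap and reuses the
existing bad_ext label and decode helpers already used in ceph_mdsmap_decode()):

struct ceph_mdsmap {
	...
	bool m_enabled;
	bool m_damaged;
	int m_num_laggy;
	char fs_name[NAME_MAX + 1];	/* inline buffer, no kmalloc()/kfree() */
};

and in ceph_mdsmap_decode():

	u32 len;	/* assumed declared with the other locals */
	...
	/* fs_name: bounded copy into the inline buffer */
	ceph_decode_32_safe(p, end, len, bad_ext);
	if (len > NAME_MAX)
		goto bad_ext;
	ceph_decode_copy_safe(p, end, m->fs_name, len, bad_ext);
	m->fs_name[len] = '\0';

With that, ceph_mdsmap_destroy() wouldn't need the extra kfree(), and an
oversized name would be rejected at decode time.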

> So, if we are receiving the filesystem name as a mount command option, then I
> think we need to consider another structure for fs_name, and we can store it
> during the mount phase. Potentially, we can consider struct ceph_fs_client [1],
> struct ceph_mount_options [2], or struct ceph_client [3]. But I think that
> struct ceph_mount_options looks like the more proper place. What do you think?
>

We do this already. The fsname is saved in 'struct ceph_mount_options' as we
parse it from the mount options, and I think it is used only during the mount.
The mds server does send the mdsmap, and the fsname is part of it; this is what
will be used for the mds auth caps validation. Are you suggesting not decoding
the fsname from the mdsmap?
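To make that alternative concrete, here is a minimal sketch of what the check in
ceph_mds_auth_match() could look like if it compared against the fsname already
parsed from the mount options instead of one decoded from the mdsmap (untested;
mds_namespace can be NULL when no fsname is given on the command line, hence the
extra check):

	/* sketch only: use the mount-time fsname for the caps match */
	const char *fs_name = mdsc->fsc->mount_options->mds_namespace;

	if (auth->match.fs_name && fs_name &&
	    strcmp(auth->match.fs_name, fs_name)) {
		doutc(cl, "fsname mismatch fs_name=%s match.fs_name=%s\n",
		      fs_name, auth->match.fs_name);
		return 0;	/* this caps entry is for a different fs */
	}

The trade-off is that this relies on the mount-time string matching whatever the
MDS encodes in its caps, whereas decoding fs_name from the mdsmap compares
against the name the server itself reports.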


> Thanks,
> Slava.
>
> [1] https://elixir.bootlin.com/linux/v6.16/source/fs/ceph/super.h#L120
> [2] https://elixir.bootlin.com/linux/v6.16/source/fs/ceph/super.h#L80
> [3]
> https://elixir.bootlin.com/linux/v6.16/source/include/linux/ceph/libceph.h#L115
>
> >
> > >
> > > > I am not sure what's the right approach to choose here. In the kclient,
> > > > assuming the ceph fsname is not bigger than PATH_MAX seems safe, as the
> > > > kclient today is not able to mount a ceph fsname bigger than PATH_MAX.
> > > > I also observed that the kclient failed to mount a ceph fsname with a
> > > > length a little less than PATH_MAX (4090). So it's breaking somewhere
> > > > where the entire path component is being considered. Anyway, I will open
> > > > the discussion to everyone here.
> > > > If we are restricting the max fsname length, in my opinion we need to
> > > > restrict it at creation time and fix it across the project, both in the
> > > > kclient and fuse.
> > > >
> > > >
> > >
> > > I could say that it is much safer to assume that the filesystem name should
> > > fit into NAME_MAX.
> > >
> > >
> > > Thanks,
> > > Slava.
> > >
> > > > >
> > > > > > >
> > > > > > > > 3. I will fix up doutc in the next revision.
> > > > > > > > 4. The fs_name is part of the mdsmap on the server side [1]. The kernel client decodes only the necessary fields from the mdsmap sent by the server. Until now, the fs_name
> > > > > > > >     was not being decoded; as part of this fix, it's required and is now decoded.
> > > > > > > >
> > > > > > >
> > > > > > > Correct me if I am wrong. I can create a Ceph cluster with several MDS servers.
> > > > > > > In this cluster, I can create multiple file system volumes. And every file
> > > > > > > system volume will have some name (fs_name). So, if we store fs_name in the
> > > > > > > mdsmap, then which name do we mean here? Do we mean the cluster name as
> > > > > > > fs_name or the name of a particular file system volume?
> > > > > >
> > > > > > In CephFS, we mainly deal with two maps, MDSMap [1] and FSMap [2]. The MDSMap
> > > > > > represents the state of a particular single filesystem. So the ‘fs_name’ in the
> > > > > > MDSMap is the name of the file system that the MDSMap represents. Each filesystem
> > > > > > has a distinct MDSMap. The FSMap was introduced to support multiple filesystems
> > > > > > in the cluster; it holds all the filesystems in the cluster using the MDSMap of
> > > > > > each file system. The clients subscribe to these maps. So when the kclient
> > > > > > receives an mdsmap, it corresponds to the filesystem it is dealing with.
> > > > > >
> > > > >
> > > > > So, it sounds to me that the MDS keeps multiple MDSMaps for multiple file
> > > > > systems, and the kernel side receives only the MDSMap for operations. The
> > > > > FSMap is kept on the MDS side and the kernel never receives it. Am I right here?
> > > >
> > > > No, not really. The kclient decodes the FSMap as well. The fsname and
> > > > monitor IP are passed in the mount command; the kclient contacts the
> > > > monitor and receives the list of file systems in the cluster via the
> > > > FSMap. The fsname passed from the mount command is compared against the
> > > > list of file systems in the decoded FSMap. If the fsname is found, the
> > > > kclient fetches the fscid and requests the corresponding mdsmap from the
> > > > respective mds using the fscid.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Slava.
> > > > >
> > > > > > [1] https://github.com/ceph/ceph/blob/main/src/mds/MDSMap.h
> > > > > > [2] https://github.com/ceph/ceph/blob/main/src/mds/FSMap.h
> > > > > >
> > > > > > Thanks,
> > > > > > Kotresh H R
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Slava.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > [1]
> > > > > > > https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/limits.h#L12
> > > > > > >
> > > > > > > > [1] https://github.com/ceph/ceph/blob/main/src/mds/MDSMap.h#L596
> > > > > > > >
> > > > > > > > On Tue, Jul 29, 2025 at 11:57 PM Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
> > > > > > > > > On Tue, 2025-07-29 at 22:32 +0530, khiremat@xxxxxxxxxx wrote:
> > > > > > > > > > From: Kotresh HR <khiremat@xxxxxxxxxx>
> > > > > > > > > >
> > > > > > > > > > The mds auth caps check should also validate the
> > > > > > > > > > fsname along with the associated caps. Not doing
> > > > > > > > > > so would result in applying the mds auth caps of
> > > > > > > > > > one fs onto the other fs in a multifs ceph cluster.
> > > > > > > > > > The patch fixes the same.
> > > > > > > > > >
> > > > > > > > > > Steps to Reproduce (on vstart cluster):
> > > > > > > > > > 1. Create two file systems in a cluster, say 'a' and 'b'
> > > > > > > > > > 2. ceph fs authorize a client.usr / r
> > > > > > > > > > 3. ceph fs authorize b client.usr / rw
> > > > > > > > > > 4. ceph auth get client.usr >> ./keyring
> > > > > > > > > > 5. sudo bin/mount.ceph usr@.a=/ /kmnt_a_usr/
> > > > > > > > > > 6. touch /kmnt_a_usr/file1 (SHOULD NOT BE ALLOWED)
> > > > > > > > > > 7. sudo bin/mount.ceph admin@.a=/ /kmnt_a_admin
> > > > > > > > > > 8. echo "data" > /kmnt_a_admin/admin_file1
> > > > > > > > > > 9. rm -f /kmnt_a_usr/admin_file1 (SHOULD NOT BE ALLOWED)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think we are missing an explanation here of which problem or
> > > > > > > > > symptoms the user who has this issue will see. I assume that
> > > > > > > > > it will be seen as in this reproduction of the issue:
> > > > > > > > >
> > > > > > > > > With client_3, which has only 1 filesystem in its caps, it works as expected:
> > > > > > > > > mkdir /mnt/client_3/test_3
> > > > > > > > > mkdir: cannot create directory ‘/mnt/client_3/test_3’: Permission denied
> > > > > > > > >
> > > > > > > > > Am I correct here?
> > > > > > > > >
> > > > > > > > > > URL: https://tracker.ceph.com/issues/72167
> > > > > > > > > > Signed-off-by: Kotresh HR <khiremat@xxxxxxxxxx>
> > > > > > > > > > ---
> > > > > > > > > >   fs/ceph/mds_client.c | 10 ++++++++++
> > > > > > > > > >   fs/ceph/mdsmap.c     | 11 ++++++++++-
> > > > > > > > > >   fs/ceph/mdsmap.h     |  1 +
> > > > > > > > > >   3 files changed, 21 insertions(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > > > > > > > index 9eed6d73a508..ba91f3360ff6 100644
> > > > > > > > > > --- a/fs/ceph/mds_client.c
> > > > > > > > > > +++ b/fs/ceph/mds_client.c
> > > > > > > > > > @@ -5640,11 +5640,21 @@ static int ceph_mds_auth_match(struct ceph_mds_client *mdsc,
> > > > > > > > > >        u32 caller_uid = from_kuid(&init_user_ns, cred->fsuid);
> > > > > > > > > >        u32 caller_gid = from_kgid(&init_user_ns, cred->fsgid);
> > > > > > > > > >        struct ceph_client *cl = mdsc->fsc->client;
> > > > > > > > > > +     const char *fs_name = mdsc->mdsmap->fs_name;
> > > > > > > > > >        const char *spath = mdsc->fsc->mount_options->server_path;
> > > > > > > > > >        bool gid_matched = false;
> > > > > > > > > >        u32 gid, tlen, len;
> > > > > > > > > >        int i, j;
> > > > > > > > > >
> > > > > > > > > > +     if (auth->match.fs_name && strcmp(auth->match.fs_name, fs_name)) {
> > > > > > > > >
> > > > > > > > > Should we consider using strncmp() here?
> > > > > > > > > We should have a limit on the maximum possible name length.
> > > > > > > > >
> > > > > > > > > > +             doutc(cl, "fsname check failed fs_name=%s  match.fs_name=%s\n",
> > > > > > > > > > +                   fs_name, auth->match.fs_name);
> > > > > > > > > > +             return 0;
> > > > > > > > > > +     } else {
> > > > > > > > > > +             doutc(cl, "fsname check passed fs_name=%s  match.fs_name=%s\n",
> > > > > > > > > > +                   fs_name, auth->match.fs_name);
> > > > > > > > >
> > > > > > > > > I assume that we could call doutc() with auth->match.fs_name == NULL, so I am
> > > > > > > > > expecting a crash here.
> > > > > > > > >
> > > > > > > > > > +     }
> > > > > > > > > > +
> > > > > > > > > >        doutc(cl, "match.uid %lld\n", auth->match.uid);
> > > > > > > > > >        if (auth->match.uid != MDS_AUTH_UID_ANY) {
> > > > > > > > > >                if (auth->match.uid != caller_uid)
> > > > > > > > > > diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c
> > > > > > > > > > index 8109aba66e02..f1431ba0b33e 100644
> > > > > > > > > > --- a/fs/ceph/mdsmap.c
> > > > > > > > > > +++ b/fs/ceph/mdsmap.c
> > > > > > > > > > @@ -356,7 +356,15 @@ struct ceph_mdsmap *ceph_mdsmap_decode(struct ceph_mds_client *mdsc, void **p,
> > > > > > > > > >                /* enabled */
> > > > > > > > > >                ceph_decode_8_safe(p, end, m->m_enabled, bad_ext);
> > > > > > > > > >                /* fs_name */
> > > > > > > > > > -             ceph_decode_skip_string(p, end, bad_ext);
> > > > > > > > > > +             m->fs_name = ceph_extract_encoded_string(p, end, NULL, GFP_NOFS);
> > > > > > > > > > +             if (IS_ERR(m->fs_name)) {
> > > > > > > > > > +                     err = PTR_ERR(m->fs_name);
> > > > > > > > > > +                     m->fs_name = NULL;
> > > > > > > > > > +                     if (err == -ENOMEM)
> > > > > > > > > > +                             goto out_err;
> > > > > > > > > > +                     else
> > > > > > > > > > +                             goto bad;
> > > > > > > > > > +             }
> > > > > > > > > >        }
> > > > > > > > > >        /* damaged */
> > > > > > > > > >        if (mdsmap_ev >= 9) {
> > > > > > > > > > @@ -418,6 +426,7 @@ void ceph_mdsmap_destroy(struct ceph_mdsmap *m)
> > > > > > > > > >                kfree(m->m_info);
> > > > > > > > > >        }
> > > > > > > > > >        kfree(m->m_data_pg_pools);
> > > > > > > > > > +     kfree(m->fs_name);
> > > > > > > > > >        kfree(m);
> > > > > > > > > >   }
> > > > > > > > > >
> > > > > > > > > > diff --git a/fs/ceph/mdsmap.h b/fs/ceph/mdsmap.h
> > > > > > > > > > index 1f2171dd01bf..acb0a2a3627a 100644
> > > > > > > > > > --- a/fs/ceph/mdsmap.h
> > > > > > > > > > +++ b/fs/ceph/mdsmap.h
> > > > > > > > > > @@ -45,6 +45,7 @@ struct ceph_mdsmap {
> > > > > > > > > >        bool m_enabled;
> > > > > > > > > >        bool m_damaged;
> > > > > > > > > >        int m_num_laggy;
> > > > > > > > > > +     char *fs_name;
> > > > > > > > >
> > > > > > > > > The ceph_mdsmap structure describes the servers in the mds cluster [1].
> > > > > > > > > Semantically, I don't see any relation between fs_name and this structure.
> > > > > > > > > As a result, I don't see the point of keeping this pointer in this structure.
> > > > > > > > > Why has fs_name been placed in this structure?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Slava.
> > > > > > > > >
> > > > > > > > > >   };
> > > > > > > > > >
> > > > > > > > > >   static inline struct ceph_entity_addr *
> > > > > > > > >
> > > > > > > > > [1] https://elixir.bootlin.com/linux/v6.16/source/fs/ceph/mdsmap.h#L11
> > > > > > > > >
> > >
>
