Re: [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This fix comes after an analysis of trace from a client bug:
RCA:
Usually two operations cant interleave due to extensive VFS locking.
But CEPH maintains leases for cached dentry structures, when these
leases expire the client may create a lookup request without taking
VFS locks, this is done by choice. But it does leave a window for this
type of a race that we have witnessed. The reply handler in one
specific place doesn't make sure that the appropriate locks are held
but still trusts the r_parent pointer that now holds an incorrect
value due to the rename operation. I prepared a fix that now compares
the cached r_parent and the ino from the reply. I'll ask @André to
prepare a new image; this image should also resolve the boot issue for
the previous debug version.

It's the message handling code, so the request is always valid but not
every request will have a parent. R_PARENT_LOCKED may not be set by
design, it's a valid state.

On Thu, Jul 31, 2025 at 1:39 AM Viacheslav Dubeyko
<Slava.Dubeyko@xxxxxxx> wrote:
>
> On Wed, 2025-07-30 at 15:19 +0000, Alex Markuze wrote:
> > When the parent directory's i_rwsem is not locked, req->r_parent may become
> > stale due to concurrent operations (e.g. rename) between dentry lookup and
> > message creation. Validate that r_parent matches the encoded parent inode
> > and update to the correct inode if a mismatch is detected.
>
> Could we share any description of crash or workload misbehavior that can
> illustrate the symptoms of the issue?
>
> > ---
> >  fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 814f9e9656a0..49fb1e3a02e8 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
> >       return 0;
> >  }
> >
> > +/*
> > + * Validate that the directory inode referenced by @req->r_parent matches the
> > + * inode number and snapshot id contained in the reply's directory record.  If
> > + * they do not match – which can theoretically happen if the parent dentry was
> > + * moved between the time the request was issued and the reply arrived – fall
> > + * back to looking up the correct inode in the inode cache.
> > + *
> > + * A reference is *always* returned.  Callers that receive a different inode
> > + * than the original @parent are responsible for dropping the extra reference
> > + * once the reply has been processed.
> > + */
> > +static struct inode *ceph_get_reply_dir(struct super_block *sb,
> > +                                       struct inode *parent,
> > +                                       struct ceph_mds_reply_info_parsed *rinfo)
> > +{
> > +    struct ceph_vino vino;
> > +
> > +    if (unlikely(!rinfo->diri.in))
> > +        return parent; /* nothing to compare against */
>
> If we could receive parent == NULL, then is it possible to receive rinfo ==
> NULL?
>
> > +
> > +    /* If we didn't have a cached parent inode to begin with, just bail out. */
> > +    if (!parent)
> > +        return NULL;
>
> Is it normal workflow that parent == NULL? Should we complain about it here?
>
> > +
> > +    vino.ino  = le64_to_cpu(rinfo->diri.in->ino);
> > +    vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
> > +
> > +    if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
>
> It looks like long line. Could we introduce a inline static function with good
> name here?
>
> We already checked that parent is not NULL above. Does it makes sense to have
> this duplicated check here?
>
> > +        return parent; /* matches – use the original reference */
> > +
> > +    /* Mismatch – this should be rare.  Emit a WARN and obtain the correct inode. */
> > +    WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
> > +         parent ? "valid" : "NULL",
>
> How parent can be NULL here? We already checked this pointer. And this
> construction looks pretty complicated.
>
> > +         parent ? ceph_ino(parent) : 0ULL,
> > +         parent ? ceph_snap(parent) : 0ULL,
> > +         vino.ino, vino.snap);
> > +
> > +    return ceph_get_inode(sb, vino, NULL);
> > +}
> > +
> >  /**
> >   * ceph_new_inode - allocate a new inode in advance of an expected create
> >   * @dir: parent directory for new inode
> > @@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> >       }
> >
> >       if (rinfo->head->is_dentry) {
> > -             struct inode *dir = req->r_parent;
> > -
> > +             /* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
>
> Comment is too long for one line. We could have multi-line comment here. If
> R_PARENT_LOCKED is not set, then, is it normal execution flow or not? Should we
> fix the use-case of not setting R_PARENT_LOCKED?
>
> Thanks,
> Slava.
>
> > +             struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
> >               if (dir) {
> >                       err = ceph_fill_inode(dir, NULL, &rinfo->diri,
> >                                             rinfo->dirfrag, session, -1,






[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux