This fix comes after an analysis of a trace from a client bug.

RCA: Normally the two operations cannot interleave because of the extensive
VFS locking. Ceph, however, maintains leases for cached dentries, and when
such a lease expires the client may issue a lookup request without taking
the VFS locks. That is intentional, but it leaves a window for the kind of
race we have witnessed here: in one specific place the reply handler does
not ensure that the appropriate locks are held and still trusts the
r_parent pointer, which by then holds a stale value because of the
concurrent rename. The fix compares the cached r_parent against the ino
carried in the reply.

I'll ask @André to prepare a new image; it should also resolve the boot
issue seen with the previous debug version.

This is message-handling code, so the request is always valid, but not
every request has a parent. R_PARENT_LOCKED may be left unset by design;
it is a valid state.

On Thu, Jul 31, 2025 at 1:39 AM Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
>
> On Wed, 2025-07-30 at 15:19 +0000, Alex Markuze wrote:
> > When the parent directory's i_rwsem is not locked, req->r_parent may become
> > stale due to concurrent operations (e.g. rename) between dentry lookup and
> > message creation. Validate that r_parent matches the encoded parent inode
> > and update to the correct inode if a mismatch is detected.
>
> Could we share any description of a crash or workload misbehavior that can
> illustrate the symptoms of the issue?
>
> > ---
> >  fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 814f9e9656a0..49fb1e3a02e8 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
> >          return 0;
> >  }
> >
> > +/*
> > + * Validate that the directory inode referenced by @req->r_parent matches the
> > + * inode number and snapshot id contained in the reply's directory record. If
> > + * they do not match – which can theoretically happen if the parent dentry was
> > + * moved between the time the request was issued and the reply arrived – fall
> > + * back to looking up the correct inode in the inode cache.
> > + *
> > + * A reference is *always* returned. Callers that receive a different inode
> > + * than the original @parent are responsible for dropping the extra reference
> > + * once the reply has been processed.
> > + */
> > +static struct inode *ceph_get_reply_dir(struct super_block *sb,
> > +                                        struct inode *parent,
> > +                                        struct ceph_mds_reply_info_parsed *rinfo)
> > +{
> > +        struct ceph_vino vino;
> > +
> > +        if (unlikely(!rinfo->diri.in))
> > +                return parent; /* nothing to compare against */
>
> If we could receive parent == NULL, then is it possible to receive rinfo ==
> NULL?
>
> > +
> > +        /* If we didn't have a cached parent inode to begin with, just bail out. */
> > +        if (!parent)
> > +                return NULL;
>
> Is it a normal workflow that parent == NULL? Should we complain about it here?
>
> > +
> > +        vino.ino = le64_to_cpu(rinfo->diri.in->ino);
> > +        vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
> > +
> > +        if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
>
> It looks like a long line. Could we introduce an inline static function with a
> good name here?
>
> We already checked that parent is not NULL above. Does it make sense to have
> this duplicated check here?
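A minimal sketch of the kind of helper suggested here, assuming only the
ceph_ino()/ceph_snap() accessors and struct ceph_vino already used in the
patch (the helper name is hypothetical and not part of the posted change):

static inline bool ceph_vino_matches_dir(struct inode *dir, struct ceph_vino vino)
{
        /* Compare a cached directory inode against the (ino, snap) pair
         * decoded from the MDS reply. */
        return ceph_ino(dir) == vino.ino && ceph_snap(dir) == vino.snap;
}

The long condition above would then read something like
if (likely(parent && ceph_vino_matches_dir(parent, vino))), and the extra
parent check could be dropped there since it was already done earlier.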
>
> > +                return parent; /* matches – use the original reference */
> > +
> > +        /* Mismatch – this should be rare. Emit a WARN and obtain the correct inode. */
> > +        WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
> > +             parent ? "valid" : "NULL",
>
> How can parent be NULL here? We already checked this pointer. And this
> construction looks pretty complicated.
>
> > +             parent ? ceph_ino(parent) : 0ULL,
> > +             parent ? ceph_snap(parent) : 0ULL,
> > +             vino.ino, vino.snap);
> > +
> > +        return ceph_get_inode(sb, vino, NULL);
> > +}
> > +
> >  /**
> >   * ceph_new_inode - allocate a new inode in advance of an expected create
> >   * @dir: parent directory for new inode
> > @@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> >          }
> >
> >          if (rinfo->head->is_dentry) {
> > -                struct inode *dir = req->r_parent;
> > -
> > +                /* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
>
> The comment is too long for one line. We could have a multi-line comment here.
> If R_PARENT_LOCKED is not set, is that a normal execution flow or not? Should
> we fix the use case of not setting R_PARENT_LOCKED?
>
> Thanks,
> Slava.
>
> > +                struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
> >                  if (dir) {
> >                          err = ceph_fill_inode(dir, NULL, &rinfo->diri,
> >                                                rinfo->dirfrag, session, -1,
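As an illustration of the reference contract documented above
ceph_get_reply_dir() (this is not the remainder of the posted patch), the
caller side would eventually have to drop the extra reference whenever an
inode other than req->r_parent was returned, along these lines:

        /* Illustration only: release the extra reference taken by
         * ceph_get_reply_dir() when it returned an inode other than the
         * cached req->r_parent. */
        if (dir && dir != req->r_parent)
                iput(dir);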