[RFC PATCH 0/6] fs/9p: Reuse inode based on path (in addition to qid)

Hi 9p-fs maintainers,

(CC'ing Landlock and Fanotify/inotify people as this affects both use
cases, although most of my investigation has been on Landlock)

Previously [1], I noticed that when using 9pfs filesystems, the Landlock
LSM is blocking access even for files / directories allowed by rules, and
that this has something to do with 9pfs creating new inodes despite
Landlock holding a reference to the existing one.  Because Landlock uses
inodes' in-memory state (i_security) to identify allowed fs
objects/hierarchies, this causes Landlock to partially break on 9pfs, at
least in uncached mode, which is the default:

    # mount -t 9p -o trans=virtio test /mnt
    # env LL_FS_RO=/etc:/usr:/bin:/lib:/mnt/readme LL_FS_RW= /sandboxer bash
    Executing the sandboxed command...
    bash: cannot set terminal process group (164): Inappropriate ioctl for device
    bash: no job control in this shell
    # cat /mnt/readme
    cat: /mnt/readme: Permission denied

This, however, works if somebody is holding onto the dentry (and it also
works with cache=loose), as in both cases the inode is reused:

    # tail -f /mnt/readme &
    [1] 196
    # env LL_FS_RO=/etc:/usr:/bin:/lib:/mnt/readme LL_FS_RW= /sandboxer bash
    Executing the sandboxed command...
    bash: cannot set terminal process group (164): Inappropriate ioctl for device
    bash: no job control in this shell
    # cat /mnt/readme
    aa

It also works on directories if a shell has cd'd into the directory.
Note that this means only certain usages of Landlock are affected - for
example, sandboxing applications that take a list of files to allow, apply
Landlock to themselves, then execve.  On the other hand, this does not
affect applications that open a file, then apply Landlock to themselves
while keeping the file they need open.

While the above is a very simple example, this is problematic in
real-world use cases if Landlock is used to sandbox applications on
systems that have files mounted via 9pfs, or that use 9pfs as the root
filesystem.  In addition, this also affects fanotify / inotify when using
inode marks (for local access):

    root@d8c28a676d72:/# ./fanotify-basic-open /readme & # on virtiofs
    [1] 173
    root@d8c28a676d72:/# cat readme
    aa
    FAN_OPEN: File /readme
    root@d8c28a676d72:/# mount -t 9p -o trans=virtio test /mnt
    root@d8c28a676d72:/# ./fanotify-basic-open /mnt/readme & # on 9pfs
    [2] 176
    root@d8c28a676d72:/# cat /mnt/readme
    aa
    root@d8c28a676d72:/#

The same can be demonstrated with inotifywait.  The source code for
fanotify-basic-open, adapted from the fanotify man page, is available at
the end of this email.

Note that this is not a security bug for Landlock, since it can only cause
legitimate access to be denied, but it might be a problem for fanotify
permission events (although I do recognize that using permission marks on
individual inodes is already perhaps a bit unreliable?)

It seems that there was an attempt at making 9pfs reuse inodes based on
qid.path; however, it was reverted [2] due to issues with servers that
present duplicate qids, for example on a QEMU host that has multiple
filesystems mounted under a single 9pfs export without multidevs=remap, or
in the case of other servers that don't necessarily support remapping
qids ([3] and more).  I've done some testing on v6.12-rc4, which has the
simplified 9pfs inode code before it was reverted, and found that Landlock
works (however, we of course then hit the issue demonstrated in [2]).

Unrelated to the above problem, it also seems that even with the revert in
[2], because in cached mode inodes are still reused based on qid (and
type, version (aka mtime), etc.), the setup mentioned in [2] still causes
problems in the latest kernel with cache=loose:

    host # cd /tmp/linux-test
    host # mkdir m1 m2
    host # mount -t tmpfs tmpfs m1
    host # mount -t tmpfs tmpfs m2
    host # mkdir m1/dir m2/dir  # needs to be done together so that they have the same mtime
    host # echo foo > m1/dir/foo
    host # echo bar > m2/dir/bar

    guest # mount -t 9p -o trans=virtio,cache=loose test /mnt
    guest # cd /mnt/m1/dir
    qemu-system-x86_64: warning: 9p: Multiple devices detected in same VirtFS export, which might lead to file ID collisions and severe misbehaviours on guest! You should either use a separate export for each device shared from host or use virtfs option 'multidevs=remap'!
    guest # ls
    foo
    guest # ls /mnt/m2/dir
    foo # <- should be bar
    guest # uname -a
    Linux d8c28a676d72 6.14.0-dev #92 SMP PREEMPT_DYNAMIC Sun Apr  6 18:47:54 BST 2025 x86_64 GNU/Linux

With the above in mind, I have a proposal for 9pfs to:
1. Reuse inodes even in uncached mode
2. However, reuse them based on qid.path AND the actual pathname, by doing
   the appropriate testing in v9fs_test(_new)?_inode(_dotl)?
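To illustrate point 2, here is a minimal userspace sketch of the proposed
matching logic.  The struct and function names (mock_qid,
mock_v9fs_inode, v9fs_test_inode_sketch) are illustrative mocks, not the
real kernel types; the real change would live in the
v9fs_test(_new)?_inode(_dotl)? callbacks passed to iget5_locked():

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mock of the server-provided qid (real kernel: struct p9_qid). */
struct mock_qid {
	uint8_t type;
	uint32_t version;
	uint64_t path;
};

/* Mock of the per-inode 9p state (real kernel: struct v9fs_inode),
 * extended with the proposed client-side pathname. */
struct mock_v9fs_inode {
	struct mock_qid qid;
	const char *pathname;	/* path the inode was first looked up at */
};

/* Sketch of the proposed test callback: a cached inode matches only if
 * both the server's qid.path AND the client-side pathname agree, so
 * duplicate qids from distinct host filesystems no longer collide. */
static int v9fs_test_inode_sketch(const struct mock_v9fs_inode *cached,
				  const struct mock_qid *qid,
				  const char *pathname)
{
	if (cached->qid.path != qid->path)
		return 0;
	return strcmp(cached->pathname, pathname) == 0;
}
```

With this, the m1/dir vs m2/dir collision above would be avoided: the two
directories share qid.path on a misbehaving server but differ in
pathname, so they get distinct inodes.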

The main problem here is how to store the pathname in a sensible way and
tie it to the inode.  For now I opted for an array of names acquired with
take_dentry_name_snapshot, which reuses the same memory as the dcache to
store the actual strings but doesn't tie the lifetime of the dentry to
the inode (I thought about holding a reference to the dentry in the
v9fs_inode, but that seemed like the wrong approach and would prevent
dentries from being evicted/released).

Maybe ideally there could be a general way for filesystems to tell the
VFS/dcache to "pin" a dentry for as long as the inode is alive, while
still allowing the dentry and inode to be evicted under memory pressure?
In fact, if the dentry is alive, we might not even need to do this in
9pfs, as we will automatically get the same inode pointed to by the
dentry.

Currently this pathname is immutable once attached to an inode, and is
therefore not protected by locks or RCU, but this might have to change
for us to support renames that preserve the inode on next access.  That
is not in this patchset yet.  Also, let me know if I've missed any locks
etc. (I'm new to VFS, or for that matter, the kernel).

Storing one pathname per inode also means we don't reuse the same inode
for hardlinks -- maybe this can be fixed as well in a future version, if
this approach sounds good?

