Hi!

Mickaël Salaün <mic@xxxxxxxxxxx> writes:

> On Thu, Jul 24, 2025 at 04:49:24PM +0200, Günther Noack wrote:
>> On Wed, Jul 23, 2025 at 11:01:42PM +0200, Mickaël Salaün wrote:
>> > On Tue, Jul 22, 2025 at 07:04:02PM +0100, Tingmao Wang wrote:
>> > > On the other hand, I'm still a bit uncertain about the domain check
>> > > semantics. While it would not cause a rename to be allowed if it is
>> > > otherwise not allowed by any rules on or above the mountpoint, this gets a
>> > > bit weird if we have a situation where renames are allowed on the
>> > > mountpoint or everywhere, but not read/writes, however read/writes are
>> > > allowed directly on a file, but the dir containing that file gets
>> > > disconnected so the sandboxed application can't read or write to it.
>> > > (Maybe someone would set up such a policy where renames are allowed,
>> > > expecting Landlock to always prevent renames where additional permissions
>> > > would be exposed?)
>> > >
>> > > In the above situation, if the file is then moved to a connected
>> > > directory, it will become readable/writable again.
>> >
>> > We can generalize this issue to not only the end file but any component
>> > of the path: disconnected directories. In fact, the main issue is the
>> > potential inconsistency of access checks over time (e.g. between two
>> > renames). This could be exploited to bypass the security checks done
>> > for FS_REFER.
>> >
>> > I see two solutions:
>> >
>> > 1. *Always* walk down to the IS_ROOT directory, and then jump to the
>> >    mount point. This makes it possible to have consistent access checks
>> >    for renames and open/use. The first downside is that that would
>> >    change the current behavior for bind mounts that could get more
>> >    access rights (if the policy explicitly sets rights for the hidden
>> >    directories). The second downside is that we'll do more walk.
>> >
>> > 2. Return -EACCES (or -ENOENT) for actions involving disconnected
>> >    directories, or renames of disconnected opened files. This second
>> >    solution is simpler and safer but completely disables the use of
>> >    disconnected directories and the rename of disconnected files for
>> >    sandboxed processes.
>> >
>> > It would be much better to be able to handle opened directories as
>> > (object) capabilities, but that is not currently possible because of the
>> > way paths are handled by the VFS and LSM hooks.
>> >
>> > Tingmao, Günther, Jann, what do you think?
>>
>> I have to admit that so far, I still failed to wrap my head around the
>> full patch set and its possible corner cases. I hope I did not
>> misunderstand things all too badly below:
>>
>> As far as I understand the proposed patch, we are "checkpointing" the
>> intermediate results of the path walk at every mount point boundary,
>> and in the case where we run into a disconnected directory in one of
>> the nested mount points, we restore from the intermediate result at
>> the previous mount point directory and skip to the next mount point.
>
> Correct
>
>>
>> Visually speaking, if the layout is this (where ":" denotes a
>> mountpoint boundary between the mountpoints MP1, MP2, MP3):
>>
>>                  dirfd
>>                    |
>>        :           V                :
>>        :  ham <-- spam <-- eggs <-- x.txt
>>        : (disconn.)                 :
>>        :                            :
>>   / <-- foo <-- bar <-- baz         :
>>        :                            :
>>    MP1            MP2              MP3
>>
>> When a process holds a reference to the "spam" directory, which is now
>> disconnected, and invokes openat(dirfd, "eggs/x.txt", ...), then we
>> would:
>>
>> * traverse x.txt
>> * traverse eggs (checkpointing the intermediate result)  <-.
>> * traverse spam                                            |
>> * traverse ham                                             |
>> * discover that ham is disconnected:                       |
>> * restore the intermediate result from "eggs"  ------------'
>> * continue the walk at foo
>> * end up at the root
>>
>> So effectively, since the results from "spam" and "ham" are discarded,
>> we would traverse only the inodes in the outer and inner mountpoints
>> MP1 and MP3, but effectively return a result that looks like we did
>> not traverse MP2?
>
> We'd still check MP2's inode, but otherwise yes.
>

I don't know if it makes sense, but can access rights be cached as part
of the inode security blob? Although I am not sure if the LSM blob would
exist after unlinking. But if it does, maybe during unlink, keep the
cached rights for MP2, and during openat():

1. Start at disconnected "spam" inode
2. Check spam->i_security->allowed_access  <- Cached MP2 rights
3. Continue normal path walk with preserved access context

>>
>> Maybe (likely) I misread the code. :) It's not clear to me what the
>> thinking behind this is. Also, if there was another directory in
>> between "spam" and "eggs" in MP2, wouldn't we be missing the access
>> rights attached to this directory?
>
> Yes, we would ignore this access right because we don't know that the
> path was resolved from spam.
>
>>
>> Regarding the capability approach:
>>
>> I agree that a "capability" approach would be the better solution, but
>> it seems infeasible with the existing LSM hooks at the moment. I
>> would be in favor of it though.
>
> Yes, it would be a new feature with potential important changes.
>
> In the meantime, we still need a fix for disconnected directories, and
> this fix needs to be backported. That's why the capability approach is
> not part of the two solutions. ;)
>
>>
>> To spell it out a bit more explicitly what that would mean in my mind:
>>
>> When a path is looked up relative to a dirfd, the path walk upwards
>> would terminate at the dirfd and use previously calculated access
>> rights stored in the associated struct file. These access rights
>> would be determined at the time of opening the dirfd, similar to how we
>> are already storing the "truncate" access right today for regular
>> files.
>>
>> (Remark: There might still be corner cases where we have to solve it
>> the hard way, if someone uses ".." together with a dirfd-relative
>> lookup.)
>
> Yep, real capabilities don't have ".." in their design. On Linux (and
> Landlock), we need to properly handle "..", which is challenging.
>
>>
>> I also looked at what it would take to change the LSM hooks to pass
>> the directory that the lookup was done relative to, but it seems that
>> this would have to be passed through a bunch of VFS callbacks as well,
>> which seems like a larger change. I would be curious whether that
>> would be deemed an acceptable change.
>>
>> --Günther
>>
>>
>> P.S. Related to relative directory lookups, there is some movement in
>> the BSDs as well to use dirfds as capabilities, by adding a flag to
>> open directories that enforces O_BENEATH on subsequent opens:
>>
>> * <https://undeadly.org/cgi?action=article;sid=20250529080623>
>> * <https://reviews.freebsd.org/D50371>
>>
>> (both found via <https://news.ycombinator.com/item?id=44575361>)
>>
>> If a dirfd had such a flag, that would get rid of the corner case
>> above.
>
> This would be nice but it would not solve the current issue because we
> cannot force all processes to use this flag (which breaks some use
> cases).
>
> FYI, Capsicum is a more complete implementation:
> <https://man.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4>
> See the vfs.lookup_cap_dotdot sysctl too.

Also, my apologies, as this may be tangential to the current
conversation, but since object-based capabilities were mentioned, I had
some design ideas around this while working on the memfd feature [1].
I don't know whether the design for object-based capabilities has been
internally formalized yet, but since we're at this juncture, I would be
glad if any of this is helpful in any way :)

If I understand things correctly, the domain currently applies to ALL
file operations via paths and persists until the process exits.
Therefore, with disconnected directories, once a path component is
unlinked, security policies can be bypassed, as access checks on
previously visible ancestors might get skipped.

Current Landlock Architecture:
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

  Process -> Landlock Domain -> Access Decision
                {Filesystem Rules, Network Rules, Scope Restrictions}
                 Path/Port Resolution + Domain Boundary Checks

Enhanced Architecture with Object Capabilities:
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

  Process -> Enhanced Landlock Domain -> Access Decision
                ━━                          ━━
    {Path Rules, Network Rules,   (AND)   {FD Capabilities}
     Scope Restrictions}            |
    ━━━━━━━━━━━━━━━                        Per-FD Rights
                                           ━━━━━━━━━━━━━━━
    Traditional Resolution                 (calculated)

Unlike SCOPE, which provides coarse-grained blocking, object
capabilities would make it possible to grant fine-grained, per-FD
rights within a domain. So we would have:

  Child Domain = Parent Domain & New Restrictions
               = {
                   path_rules: Parent.path_rules & Child.path_rules,
                   net_rules:  Parent.net_rules  & Child.net_rules,
                   scope:      Parent.scope      | Child.scope,  /* Additive */
                   fd_caps:    path_rules & net_rules & scope
                               & Child.allowed_fd_operations
                 }

where the Child domain *must* be more restrictive than the parent.
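To make that intersection a bit more concrete, below is a minimal C
sketch of how a child's per-FD rights could be derived. Everything in
it is hypothetical: the ll_* names, the collapsing of per-path and
per-port rules into single bitmasks, and the omission of scope are all
simplifications for illustration, not existing Landlock code or UAPI.

/*
 * Hypothetical sketch only: none of these types, constants, or helpers
 * exist in Landlock today.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t ll_access_t;

#define LL_FS_READ_FILE   (1ULL << 0)
#define LL_FS_READ_DIR    (1ULL << 1)
#define LL_FS_WRITE_FILE  (1ULL << 2)
#define LL_NET_BIND_TCP   (1ULL << 3)

struct ll_domain {
	ll_access_t fs;   /* filesystem rights (simplified to one mask) */
	ll_access_t net;  /* network rights */
};

/* Child domain = parent AND the child's new restrictions. */
static struct ll_domain ll_child_domain(struct ll_domain parent,
					struct ll_domain restrictions)
{
	return (struct ll_domain){
		.fs = parent.fs & restrictions.fs,
		.net = parent.net & restrictions.net,
	};
}

/*
 * An FD capability inherited from the parent is re-masked by the
 * child's effective rights and by the FD operations the child still
 * allows, so it can only shrink.
 */
static ll_access_t ll_child_fd_cap(ll_access_t parent_fd_cap,
				   struct ll_domain child,
				   ll_access_t allowed_fd_ops)
{
	return parent_fd_cap & (child.fs | child.net) & allowed_fd_ops;
}

int main(void)
{
	struct ll_domain parent = {
		.fs = LL_FS_READ_FILE | LL_FS_READ_DIR | LL_FS_WRITE_FILE,
		.net = LL_NET_BIND_TCP,
	};
	/* The child keeps READ_FILE and WRITE_FILE, drops READ_DIR and
	 * all network rights. */
	struct ll_domain restrictions = {
		.fs = LL_FS_READ_FILE | LL_FS_WRITE_FILE,
	};
	struct ll_domain child = ll_child_domain(parent, restrictions);

	/* A directory FD opened with READ_DIR loses its capability... */
	printf("dir fd:    %#llx\n", (unsigned long long)
	       ll_child_fd_cap(LL_FS_READ_DIR, child, ~0ULL));
	/* ...and so does a socket FD, since the child has no net rights. */
	printf("socket fd: %#llx\n", (unsigned long long)
	       ll_child_fd_cap(LL_NET_BIND_TCP, child, ~0ULL));
	return 0;
}

The key property is that an inherited FD capability can only lose bits,
mirroring how Landlock domains themselves only become more restrictive.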
Here is a fuller example with concrete values:

/* Example */
Parent Domain = {
    path_rules: ["/var/www" -> READ_FILE|READ_DIR,
                 "/var/log" -> WRITE_FILE],
    net_rules:  ["80"  -> BIND_TCP,
                 "443" -> BIND_TCP],
    scope:      [SIGNAL, ABSTRACT_UNIX],

    /* Auto-derived FD capabilities */
    fd_caps: {
        3:  READ_FILE,           /* /var/www/index.html */
        7:  READ_DIR,            /* /var/www directory */
        12: WRITE_FILE,          /* /var/log/access.log */
        15: BIND_TCP,            /* socket bound to port 80 */
        20: READ_FILE|READ_DIR   /* /var/www/images/ */
    }
}

/* Child creates new domain with additional restrictions */
Child.new_restrictions = {
    path_rules: ["/var/www" -> READ_FILE only],         /* Remove READ_DIR */
    net_rules:  [],                                     /* Remove all network */
    scope:      [SIGNAL, ABSTRACT_UNIX, MEMFD_EXEC],    /* Add MEMFD restriction */
}

/* Child FD capabilities = Parent & Child restrictions */
Child.fd_caps = {
    3:  READ_FILE,   /* READ_FILE & READ_FILE = READ_FILE */
    7:  0,           /* READ_DIR & READ_FILE = none (no access) */
    12: WRITE_FILE,  /* WRITE_FILE unchanged (not restricted) */
    15: 0,           /* BIND_TCP & none = none (network blocked) */
    20: READ_FILE    /* (READ_FILE|READ_DIR) & READ_FILE = READ_FILE */
}

API Design: Reusing Existing Flags
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

/* Extended ruleset - reuse existing flags where possible */
struct landlock_ruleset_attr {
    __u64 handled_access_fs;   /* Existing: also applies to FDs */
    __u64 handled_access_net;  /* Existing: also applies to FDs */
    __u64 scoped;              /* Existing: domain boundaries */
    __u64 handled_access_fd;   /* NEW: FD-specific operations only */
};

/* New syscall */
long landlock_set_fd_capability(int fd, __u64 access_rights, __u32 flags);

/* Reuse existing filesystem/network flags for FD operations */
landlock_set_fd_capability(file_fd, LANDLOCK_ACCESS_FS_READ_FILE, 0);
landlock_set_fd_capability(dir_fd, LANDLOCK_ACCESS_FS_READ_DIR, 0);
landlock_set_fd_capability(sock_fd, LANDLOCK_ACCESS_NET_BIND_TCP, 0);

With object capabilities, we assign access rights to file descriptors
directly, at open/alloc time, eliminating the need for path resolution
during future use. This solves the core issue because:

* FDs remain valid even when disconnected, and
* rights are bound to the object rather than its pathname.

Therefore, openat() with a dirfd should still work:

int dirfd = open("/tmp/work", O_RDONLY);   /* connected at open time */
/* ... "/tmp/work" is later disconnected (unlinked or moved out of the
 * bind-mounted subtree it was reached through) ... */
openat(dirfd, "file.txt", O_RDONLY);       /* still works, rights bound to the FD */

Moreover, no path resolution is needed at a later stage, so sandboxed
processes get no opportunity to bypass their restrictions.

Would love to hear any feedback and thoughts on this.

Best,
Abhinav

[1] - <https://lore.kernel.org/all/20250719-memfd-exec-v1-0-0ef7feba5821@xxxxxxxxx/>
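P.S. Since "rights bound to the FD, not the path" is the property the
whole design leans on, here is a small self-contained userspace demo of
it. It is not Landlock-specific and does not create a truly
disconnected directory (that would need mount manipulation); it only
shows that relative lookups through a held dirfd keep working after the
directory's original path stops resolving, which is the behavior a
per-FD capability would attach rights to. The file name (fdcap_demo.c)
and the /tmp scratch path are just illustrative.

/* fdcap_demo.c: FD-relative lookups survive the loss of the original path. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char base[] = "/tmp/fdcap-demo-XXXXXX";
	char workdir[PATH_MAX], moved[PATH_MAX], oldpath[PATH_MAX];
	int fd, dirfd, f;

	if (!mkdtemp(base)) {
		perror("mkdtemp");
		return 1;
	}
	snprintf(workdir, sizeof(workdir), "%s/work", base);
	snprintf(moved, sizeof(moved), "%s/elsewhere", base);
	snprintf(oldpath, sizeof(oldpath), "%s/file.txt", workdir);

	if (mkdir(workdir, 0700)) {
		perror("mkdir");
		return 1;
	}
	fd = open(oldpath, O_CREAT | O_WRONLY, 0600);
	if (fd < 0) {
		perror("create");
		return 1;
	}
	close(fd);

	/* Hold the directory open: this is the would-be "capability". */
	dirfd = open(workdir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) {
		perror("open dir");
		return 1;
	}

	/* Invalidate the original path. */
	if (rename(workdir, moved)) {
		perror("rename");
		return 1;
	}

	/* Path-based access through the old name now fails... */
	if (open(oldpath, O_RDONLY) < 0)
		printf("open(\"%s\") failed: %s\n", oldpath, strerror(errno));

	/* ...but relative lookups through the held dirfd keep working. */
	f = openat(dirfd, "file.txt", O_RDONLY);
	printf("openat(dirfd, \"file.txt\") = %d\n", f);
	return 0;
}

Compiled with a plain "cc fdcap_demo.c", the first open() reports
ENOENT while the openat() succeeds.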