Hi!

Mickaël Salaün <mic@xxxxxxxxxxx> writes:

> On Thu, Jul 24, 2025 at 04:49:24PM +0200, Günther Noack wrote:
>> On Wed, Jul 23, 2025 at 11:01:42PM +0200, Mickaël Salaün wrote:
>> > On Tue, Jul 22, 2025 at 07:04:02PM +0100, Tingmao Wang wrote:
>> > > On the other hand, I'm still a bit uncertain about the domain check
>> > > semantics. While it would not cause a rename to be allowed if it is
>> > > otherwise not allowed by any rules on or above the mountpoint, this gets a
>> > > bit weird if we have a situation where renames are allowed on the
>> > > mountpoint or everywhere, but not read/writes, however read/writes are
>> > > allowed directly on a file, but the dir containing that file gets
>> > > disconnected so the sandboxed application can't read or write to it.
>> > > (Maybe someone would set up such a policy where renames are allowed,
>> > > expecting Landlock to always prevent renames where additional permissions
>> > > would be exposed?)
>> > >
>> > > In the above situation, if the file is then moved to a connected
>> > > directory, it will become readable/writable again.
>> >
>> > We can generalize this issue to not only the end file but any component
>> > of the path: disconnected directories. In fact, the main issue is the
>> > potential inconsistency of access checks over time (e.g. between two
>> > renames). This could be exploited to bypass the security checks done
>> > for FS_REFER.
>> >
>> > I see two solutions:
>> >
>> > 1. *Always* walk down to the IS_ROOT directory, and then jump to the
>> >    mount point. This makes it possible to have consistent access checks
>> >    for renames and open/use. The first downside is that that would
>> >    change the current behavior for bind mounts that could get more
>> >    access rights (if the policy explicitly sets rights for the hidden
>> >    directories). The second downside is that we'll do more walk.
>> >
>> > 2. Return -EACCES (or -ENOENT) for actions involving disconnected
>> >    directories, or renames of disconnected opened files. This second
>> >    solution is simpler and safer but completely disables the use of
>> >    disconnected directories and the rename of disconnected files for
>> >    sandboxed processes.
>> >
>> > It would be much better to be able to handle opened directories as
>> > (object) capabilities, but that is not currently possible because of the
>> > way paths are handled by the VFS and LSM hooks.
>> >
>> > Tingmao, Günther, Jann, what do you think?
>>
>> I have to admit that so far, I still failed to wrap my head around the
>> full patch set and its possible corner cases. I hope I did not
>> misunderstand things all too badly below:
>>
>> As far as I understand the proposed patch, we are "checkpointing" the
>> intermediate results of the path walk at every mount point boundary,
>> and in the case where we run into a disconnected directory in one of
>> the nested mount points, we restore from the intermediate result at
>> the previous mount point directory and skip to the next mount point.
>
> Correct
>
>>
>> Visually speaking, if the layout is this (where ":" denotes a
>> mountpoint boundary between the mountpoints MP1, MP2, MP3):
>>
>>                  dirfd
>>                    |
>>        :           V                :
>>        :  ham <-- spam <-- eggs <-- x.txt
>>        : (disconn.)                 :
>>        :                            :
>>   / <-- foo <-- bar <-- baz         :
>>        :                            :
>>    MP1            MP2              MP3
>>
>> When a process holds a reference to the "spam" directory, which is now
>> disconnected, and invokes openat(dirfd, "eggs/x.txt", ...), then we
>> would:
>>
>> * traverse x.txt
>> * traverse eggs (checkpointing the intermediate result)  <-.
>> * traverse spam                                            |
>> * traverse ham                                             |
>> * discover that ham is disconnected:                       |
>> * restore the intermediate result from "eggs"  ------------'
>> * continue the walk at foo
>> * end up at the root
>>
>> So effectively, since the results from "spam" and "ham" are discarded,
>> we would traverse only the inodes in the outer and inner mountpoints
>> MP1 and MP3, but effectively return a result that looks like we did
>> not traverse MP2?
>
> We'd still check MP2's inode, but otherwise yes.
>

I don't know if it makes sense, but can access rights be cached as part
of the inode security blob? Although I am not sure if the LSM blob would
exist after unlinking. But if it does, maybe during unlink, keep the
cached rights for MP2, and during openat():

1. Start at disconnected "spam" inode
2. Check spam->i_security->allowed_access  <- Cached MP2 rights
3. Continue normal path walk with preserved access context

>>
>> Maybe (likely) I misread the code. :) It's not clear to me what the
>> thinking behind this is. Also, if there was another directory in
>> between "spam" and "eggs" in MP2, wouldn't we be missing the access
>> rights attached to this directory?
>
> Yes, we would ignore this access right because we don't know that the
> path was resolved from spam.
>
>>
>> Regarding the capability approach:
>>
>> I agree that a "capability" approach would be the better solution, but
>> it seems infeasible with the existing LSM hooks at the moment. I
>> would be in favor of it though.
>
> Yes, it would be a new feature with potential important changes.
>
> In the meantime, we still need a fix for disconnected directories, and
> this fix needs to be backported. That's why the capability approach is
> not part of the two solutions. ;)
>
>>
>> To spell it out a bit more explicitly what that would mean in my mind:
>>
>> When a path is looked up relative to a dirfd, the path walk upwards
>> would terminate at the dirfd and use previously calculated access
>> rights stored in the associated struct file. These access rights
>> would be determined at the time of opening the dirfd, similar to how we
>> are already storing the "truncate" access right today for regular
>> files.
>>
>> (Remark: There might still be corner cases where we have to solve it
>> the hard way, if someone uses ".." together with a dirfd-relative
>> lookup.)
>
> Yep, real capabilities don't have ".." in their design. On Linux (and
> Landlock), we need to properly handle "..", which is challenging.
>
>>
>> I also looked at what it would take to change the LSM hooks to pass
>> the directory that the lookup was done relative to, but it seems that
>> this would have to be passed through a bunch of VFS callbacks as well,
>> which seems like a larger change. I would be curious whether that
>> would be deemed an acceptable change.
>>
>> --Günther
>>
>>
>> P.S. Related to relative directory lookups, there is some movement in
>> the BSDs as well to use dirfds as capabilities, by adding a flag to
>> open directories that enforces O_BENEATH on subsequent opens:
>>
>> * <https://undeadly.org/cgi?action=article;sid=20250529080623>
>> * <https://reviews.freebsd.org/D50371>
>>
>> (both found via <https://news.ycombinator.com/item?id=44575361>)
>>
>> If a dirfd had such a flag, that would get rid of the corner case
>> above.
>
> This would be nice but it would not solve the current issue because we
> cannot force all processes to use this flag (which breaks some use
> cases).
>
> FYI, Capsicum is a more complete implementation:
> <https://man.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4>
> See the vfs.lookup_cap_dotdot sysctl too.

Also, my apologies, as this may be tangential to the current
conversation, but since object-based capabilities were mentioned, I had
some design ideas around this while working on the memfd feature [1].
I don't know whether the design for object-based capabilities has been
internally formalized yet, but since we're at this juncture, I would be
glad if any of this is helpful in any way :)

If I understand things correctly, the domain currently applies to ALL
file operations via paths and persists until the process exits.
Therefore, with disconnected directories, once a path component is
unlinked, security policies can be bypassed, as access checks on
previously visible ancestors might get skipped.

Current Landlock Architecture:
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

  Process -> Landlock Domain -> Access Decision
                {Filesystem Rules, Network Rules, Scope Restrictions}
                 Path/Port Resolution + Domain Boundary Checks

Enhanced Architecture with Object Capabilities:
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

  Process -> Enhanced Landlock Domain -> Access Decision
                ━━                          ━━
    {Path Rules, Network Rules,   (AND)   {FD Capabilities}
     Scope Restrictions}            |
    ━━━━━━━━━━━━━━━                        Per-FD Rights
                                           ━━━━━━━━━━━━━━━
    Traditional Resolution                 (calculated)

Unlike SCOPE, which provides coarse-grained blocking, object
capabilities would make it possible to grant fine-grained, per-FD
rights within a domain. So we would have:

  Child Domain = Parent Domain & New Restrictions
               = {
                   path_rules: Parent.path_rules & Child.path_rules,
                   net_rules:  Parent.net_rules  & Child.net_rules,
                   scope:      Parent.scope      | Child.scope,  /* Additive */
                   fd_caps:    path_rules & net_rules & scope
                               & Child.allowed_fd_operations
                 }

where the Child domain *must* be more restrictive than the parent.
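To make that intersection a bit more concrete, below is a minimal C
sketch of how a child's per-FD rights could be derived. Everything in
it is hypothetical: the ll_* names, the collapsing of per-path and
per-port rules into single bitmasks, and the omission of scope are all
simplifications for illustration, not existing Landlock code or UAPI.

/*
 * Hypothetical sketch only: none of these types, constants, or helpers
 * exist in Landlock today.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t ll_access_t;

#define LL_FS_READ_FILE   (1ULL << 0)
#define LL_FS_READ_DIR    (1ULL << 1)
#define LL_FS_WRITE_FILE  (1ULL << 2)
#define LL_NET_BIND_TCP   (1ULL << 3)

struct ll_domain {
	ll_access_t fs;   /* filesystem rights (simplified to one mask) */
	ll_access_t net;  /* network rights */
};

/* Child domain = parent AND the child's new restrictions. */
static struct ll_domain ll_child_domain(struct ll_domain parent,
					struct ll_domain restrictions)
{
	return (struct ll_domain){
		.fs = parent.fs & restrictions.fs,
		.net = parent.net & restrictions.net,
	};
}

/*
 * An FD capability inherited from the parent is re-masked by the
 * child's effective rights and by the FD operations the child still
 * allows, so it can only shrink.
 */
static ll_access_t ll_child_fd_cap(ll_access_t parent_fd_cap,
				   struct ll_domain child,
				   ll_access_t allowed_fd_ops)
{
	return parent_fd_cap & (child.fs | child.net) & allowed_fd_ops;
}

int main(void)
{
	struct ll_domain parent = {
		.fs = LL_FS_READ_FILE | LL_FS_READ_DIR | LL_FS_WRITE_FILE,
		.net = LL_NET_BIND_TCP,
	};
	/* The child keeps READ_FILE and WRITE_FILE, drops READ_DIR and
	 * all network rights. */
	struct ll_domain restrictions = {
		.fs = LL_FS_READ_FILE | LL_FS_WRITE_FILE,
	};
	struct ll_domain child = ll_child_domain(parent, restrictions);

	/* A directory FD opened with READ_DIR loses its capability... */
	printf("dir fd:    %#llx\n", (unsigned long long)
	       ll_child_fd_cap(LL_FS_READ_DIR, child, ~0ULL));
	/* ...and so does a socket FD, since the child has no net rights. */
	printf("socket fd: %#llx\n", (unsigned long long)
	       ll_child_fd_cap(LL_NET_BIND_TCP, child, ~0ULL));
	return 0;
}

The key property is that an inherited FD capability can only lose bits,
mirroring how Landlock domains themselves only become more restrictive.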
Here is a fuller example with concrete values:

/* Example */
Parent Domain = {
    path_rules: ["/var/www" -> READ_FILE|READ_DIR,
                 "/var/log" -> WRITE_FILE],
    net_rules:  ["80"  -> BIND_TCP,
                 "443" -> BIND_TCP],
    scope:      [SIGNAL, ABSTRACT_UNIX],

    /* Auto-derived FD capabilities */
    fd_caps: {
        3:  READ_FILE,           /* /var/www/index.html */
        7:  READ_DIR,            /* /var/www directory */
        12: WRITE_FILE,          /* /var/log/access.log */
        15: BIND_TCP,            /* socket bound to port 80 */
        20: READ_FILE|READ_DIR   /* /var/www/images/ */
    }
}

/* Child creates new domain with additional restrictions */
Child.new_restrictions = {
    path_rules: ["/var/www" -> READ_FILE only],         /* Remove READ_DIR */
    net_rules:  [],                                     /* Remove all network */
    scope:      [SIGNAL, ABSTRACT_UNIX, MEMFD_EXEC],    /* Add MEMFD restriction */
}

/* Child FD capabilities = Parent & Child restrictions */
Child.fd_caps = {
    3:  READ_FILE,   /* READ_FILE & READ_FILE = READ_FILE */
    7:  0,           /* READ_DIR & READ_FILE = none (no access) */
    12: WRITE_FILE,  /* WRITE_FILE unchanged (not restricted) */
    15: 0,           /* BIND_TCP & none = none (network blocked) */
    20: READ_FILE    /* (READ_FILE|READ_DIR) & READ_FILE = READ_FILE */
}

API Design: Reusing Existing Flags
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

/* Extended ruleset - reuse existing flags where possible */
struct landlock_ruleset_attr {
    __u64 handled_access_fs;   /* Existing: also applies to FDs */
    __u64 handled_access_net;  /* Existing: also applies to FDs */
    __u64 scoped;              /* Existing: domain boundaries */
    __u64 handled_access_fd;   /* NEW: FD-specific operations only */
};

/* New syscall */
long landlock_set_fd_capability(int fd, __u64 access_rights, __u32 flags);

/* Reuse existing filesystem/network flags for FD operations */
landlock_set_fd_capability(file_fd, LANDLOCK_ACCESS_FS_READ_FILE, 0);
landlock_set_fd_capability(dir_fd, LANDLOCK_ACCESS_FS_READ_DIR, 0);
landlock_set_fd_capability(sock_fd, LANDLOCK_ACCESS_NET_BIND_TCP, 0);

With object capabilities, we assign access rights to file descriptors
directly, at open/alloc time, eliminating the need for path resolution
during future use. This solves the core issue because:

* FDs remain valid even when disconnected, and
* rights are bound to the object rather than its pathname.

Therefore, openat() with a dirfd should still work:

int dirfd = open("/tmp/work", O_RDONLY);   /* connected at open time */
/* ... "/tmp/work" is later disconnected (unlinked or moved out of the
 * bind-mounted subtree it was reached through) ... */
openat(dirfd, "file.txt", O_RDONLY);       /* still works, rights bound to the FD */

Moreover, no path resolution is needed at a later stage, so sandboxed
processes get no opportunity to bypass their restrictions.

Would love to hear any feedback and thoughts on this.

Best,
Abhinav

[1] - <https://lore.kernel.org/all/20250719-memfd-exec-v1-0-0ef7feba5821@xxxxxxxxx/>
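P.S. Since "rights bound to the FD, not the path" is the property the
whole design leans on, here is a small self-contained userspace demo of
it. It is not Landlock-specific and does not create a truly
disconnected directory (that would need mount manipulation); it only
shows that relative lookups through a held dirfd keep working after the
directory's original path stops resolving, which is the behavior a
per-FD capability would attach rights to. The file name (fdcap_demo.c)
and the /tmp scratch path are just illustrative.

/* fdcap_demo.c: FD-relative lookups survive the loss of the original path. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char base[] = "/tmp/fdcap-demo-XXXXXX";
	char workdir[PATH_MAX], moved[PATH_MAX], oldpath[PATH_MAX];
	int fd, dirfd, f;

	if (!mkdtemp(base)) {
		perror("mkdtemp");
		return 1;
	}
	snprintf(workdir, sizeof(workdir), "%s/work", base);
	snprintf(moved, sizeof(moved), "%s/elsewhere", base);
	snprintf(oldpath, sizeof(oldpath), "%s/file.txt", workdir);

	if (mkdir(workdir, 0700)) {
		perror("mkdir");
		return 1;
	}
	fd = open(oldpath, O_CREAT | O_WRONLY, 0600);
	if (fd < 0) {
		perror("create");
		return 1;
	}
	close(fd);

	/* Hold the directory open: this is the would-be "capability". */
	dirfd = open(workdir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) {
		perror("open dir");
		return 1;
	}

	/* Invalidate the original path. */
	if (rename(workdir, moved)) {
		perror("rename");
		return 1;
	}

	/* Path-based access through the old name now fails... */
	if (open(oldpath, O_RDONLY) < 0)
		printf("open(\"%s\") failed: %s\n", oldpath, strerror(errno));

	/* ...but relative lookups through the held dirfd keep working. */
	f = openat(dirfd, "file.txt", O_RDONLY);
	printf("openat(dirfd, \"file.txt\") = %d\n", f);
	return 0;
}

Compiled with a plain "cc fdcap_demo.c", the first open() reports
ENOENT while the openat() succeeds.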