On Sat, Apr 05, 2025 at 06:54:28AM +0200, Mateusz Guzik wrote: > On Fri, Apr 4, 2025 at 12:39 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > This allows the VFS to not trip over anonymous inodes and we can add > > asserts based on the mode into the vfs. When we report it to userspace > > we can simply hide the mode to avoid regressions. I've audited all > > direct callers of alloc_anon_inode() and only secretmen overrides i_mode > > and i_op inode operations but it already uses a regular file. > > > [snip] > > diff --git a/fs/libfs.c b/fs/libfs.c > > index 6393d7c49ee6..0ad3336f5b49 100644 > > --- a/fs/libfs.c > > +++ b/fs/libfs.c > > @@ -1647,7 +1647,7 @@ struct inode *alloc_anon_inode(struct super_block *s) > > * that it already _is_ on the dirty list. > > */ > > inode->i_state = I_DIRTY; > > - inode->i_mode = S_IRUSR | S_IWUSR; > > + inode->i_mode = S_IFREG | S_IRUSR | S_IWUSR; > > inode->i_uid = current_fsuid(); > > inode->i_gid = current_fsgid(); > > inode->i_flags |= S_PRIVATE; > > Switching the mode to S_IFREG imo can open a can of worms (perhaps a > dedicated in-kernel only mode would be better? S_IFANON?), but that's No, I don't think we should have weird kernel-only things in i_mode. That's what i_flags is for. > a long and handwave-y subject, so I'm going to stop at this remark. > > I think the key here is to provide an invariant that anon inodes don't > pass MAY_EXEC in may_open() regardless of their specific i_mode and > which the kernel fails to provide at the moment afaics. The current way of relying on IS_REG() in do_open_execat() is broken with current kernel semantics ever since commit 633fb6ac3980 ("exec: move S_ISREG() check earlier"). That commit introduced a WARN_ON_ONCE() around the S_ISREG() check which indicates that it wasn't clear to the authors that this is very likely a reachable WARN_ON_ONCE() via e.g.: int main(int argc, char *argv[]) { int ret, sfd; sigset_t mask; struct signalfd_siginfo fdsi; sigemptyset(&mask); sigaddset(&mask, SIGINT); sigaddset(&mask, SIGQUIT); ret = sigprocmask(SIG_BLOCK, &mask, NULL); if (ret < 0) _exit(1); sfd = signalfd(-1, &mask, 0); if (sfd < 0) _exit(2); ret = fchown(sfd, 5555, 5555); if (ret < 0) _exit(3); ret = fchmod(sfd, 0777); if (ret < 0) _exit(3); /* trigger the WARN_ON_ONCE() */ execveat(sfd, "", NULL, NULL, AT_EMPTY_PATH); _exit(4); } Because the mode and owner of the single anonymous inode in the kernel can be changed by anonyone with suitable privileges. That of course it itself a bug although not a really meaningful one because anonymous inodes don't really figure into path lookup. They cannot be reopened via /proc/<pid>/fd/<nr> and can't be used for lookup itself. So they can only ever serve as direct references. But I'm going to fix that. Similar to pidfs we should simply fail setattr [1]. Given that noone has been hitting the execveat() WARN_ON_ONCE() it's safe to say that no one ever bothered chowning/chmoding anonymous inodes. With that removed it's not possible to exec anonymous inodes but we should harden this by setting SB_I_NOEXEC as well (Secretmem is another very fun example because it uses anonymous inodes but does set S_ISREG already. So chmod() on such an inode and passing it to execveat() could be fun.) Anyway, I'm finishing the patch and testing tomorrow and will send out with all the things I mentioned (unless I find out I'm wrong). [1]: That's been a constant sore spot for me. We currently simply fallback to simple_setattr() if the fs doesn't provide a dedicated ->setattr() function. Imho that's broken and we need to remove that. All fses that need it need to set ->setattr() and if it isn't set EOPNOTSUPP will be returned. So that would need some code audit which fses definitely don't need or expect it. Otherwise we risk setting attributes on things that don't want them changed. That's patch I had on my backlog for quite a while ever since the big xattr/acl rework.