On Sat, Apr 12, 2025 at 01:22:38PM -0700, Linus Torvalds wrote:
> On Sat, 12 Apr 2025 at 09:26, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >
> > I plopped your snippet towards the end of __ext4_iget:
>
> That's literally where I did the same thing, except I put it right after the
>
>           brelse(iloc.bh);
>
> line, rather than before as you did.
>
> And it made no difference for me, but I didn't try to figure out why.
> Maybe some environment differences? Or maybe I just screwed up my
> testing...
>
> As mentioned earlier in the thread, I had this bi-modal distribution
> of results, because if I had a load where the *non*-owner of the inode
> looked up the pathnames, then the ACL information would get filled in
> when the VFS layer would do the lookup, and then once the ACLs were
> cached, everything worked beautifully.
>
> But if the only lookups of a path were done by the owner of the inodes
> (which is typical for at least my normal kernel build tree - nothing
> but my build will look at the files, and they are obviously always
> owned by me) then the ACL caches will never be filled because there
> will never be any real ACL lookups.
>
> And then rather than doing the nice efficient "no ACLs anywhere, no
> need to even look", it ends up having to actually do the vfsuid
> comparison for the UID equality check.
>
> Which then does the extra accesses to look up the idmap etc, and is
> visible in the profiles due to that whole dance:
>
>         /* Are we the owner? If so, ACL's don't matter */
>         vfsuid = i_uid_into_vfsuid(idmap, inode);
>         if (likely(vfsuid_eq_kuid(vfsuid, current_fsuid()))) {
>
> even when idmap is 'nop_mnt_idmap' and it is reasonably cheap. Just
> because it ends up calling out to different functions and does extra
> D$ accesses to the inode and the superblock (ie i_user_ns() is this
>
>         return inode->i_sb->s_user_ns;

I think we can improve this. Right now multiple mounts from different
superblocks can share the same struct mnt_idmap.
But I can change the code so that struct mnt_idmap can only be shared
between mounts from the same superblock. With that we could do:

diff --git a/fs/mnt_idmapping.c b/fs/mnt_idmapping.c
index a37991fdb194..a5ec15c8c754 100644
--- a/fs/mnt_idmapping.c
+++ b/fs/mnt_idmapping.c
@@ -20,6 +20,7 @@
 struct mnt_idmap {
        struct uid_gid_map uid_map;
        struct uid_gid_map gid_map;
+       struct user_namespace *s_user_ns;
        refcount_t count;
 };

And then stuff like:

static inline vfsuid_t i_uid_into_vfsuid(struct mnt_idmap *idmap,
                                         const struct inode *inode)
{
        return make_vfsuid(idmap, i_user_ns(inode), inode->i_uid);
}

just becomes:

static inline vfsuid_t i_uid_into_vfsuid(struct mnt_idmap *idmap,
                                         const struct inode *inode)
{
        return make_vfsuid(idmap, inode->i_uid);
}

which means:

vfsuid_t make_vfsuid(struct mnt_idmap *idmap, kuid_t kuid)
{
        uid_t uid;

        if (idmap == &nop_mnt_idmap)
                return VFSUIDT_INIT(kuid);

        <snip>
}

will only have to verify nop_mnt_idmap and we never have to access the
inode->i_sb->s_user_ns at all.

I'll whip up a patch for this.

>
> so just to *see* that it's nop_mnt_idmap takes effort.
>
> One improvement might be to cache that 'nop_mnt_idmap' thing in the
> inode as a flag.
>
> But it would be even better if the filesystem just initializes the
> inode at inode read time to say "I have no ACL's for this inode" and
> none of this code will even trigger.

Yes, let's please do this.
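
To make the make_vfsuid() change above a bit more concrete, here is a
rough userspace model of it. This is only a sketch, not kernel code and
not the eventual patch: the types are simplified stand-ins, and the
_old/_new suffixes, uid_shift, sb_derefs and owner_check_cost() are
hypothetical names made up for illustration. The "mapping" for a
non-nop idmap is a placeholder addition, nothing like the real
uid_gid_map lookup. It only shows that, with the user namespace stored
in struct mnt_idmap, the nop_mnt_idmap case no longer has to load
inode->i_sb->s_user_ns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel types; not the real layouts. */
typedef struct { uint32_t val; } kuid_t;
typedef struct { uint32_t val; } vfsuid_t;

struct user_namespace { int placeholder; };
struct super_block { struct user_namespace *s_user_ns; };
struct inode { struct super_block *i_sb; kuid_t i_uid; };

struct mnt_idmap {
	uint32_t uid_shift;               /* hypothetical stand-in for uid_map/gid_map */
	struct user_namespace *s_user_ns; /* the field the mail proposes adding */
};

static struct mnt_idmap nop_mnt_idmap;

/* Hypothetical counter: how often inode->i_sb->s_user_ns was loaded. */
static unsigned long sb_derefs;

static struct user_namespace *i_user_ns(const struct inode *inode)
{
	sb_derefs++;
	return inode->i_sb->s_user_ns;
}

/* Current shape: the caller has to fetch the fs user namespace up front. */
static vfsuid_t make_vfsuid_old(struct mnt_idmap *idmap,
				struct user_namespace *fs_userns, kuid_t kuid)
{
	(void)fs_userns; /* already loaded by the caller, even if unused here */
	if (idmap == &nop_mnt_idmap)
		return (vfsuid_t){ kuid.val };
	return (vfsuid_t){ kuid.val + idmap->uid_shift }; /* placeholder mapping */
}

/* Proposed shape: only the idmap pointer is inspected on the fast path. */
static vfsuid_t make_vfsuid_new(struct mnt_idmap *idmap, kuid_t kuid)
{
	if (idmap == &nop_mnt_idmap)
		return (vfsuid_t){ kuid.val };
	/* A real mapping would consult idmap->s_user_ns here. */
	return (vfsuid_t){ kuid.val + idmap->uid_shift }; /* placeholder mapping */
}

static vfsuid_t i_uid_into_vfsuid_old(struct mnt_idmap *idmap,
				      const struct inode *inode)
{
	return make_vfsuid_old(idmap, i_user_ns(inode), inode->i_uid);
}

static vfsuid_t i_uid_into_vfsuid_new(struct mnt_idmap *idmap,
				      const struct inode *inode)
{
	return make_vfsuid_new(idmap, inode->i_uid);
}

/*
 * Run one "are we the owner?" check against nop_mnt_idmap and return how
 * many times the superblock's user namespace had to be loaded for it.
 */
static unsigned long owner_check_cost(bool proposed)
{
	struct user_namespace ns = { 0 };
	struct super_block sb = { .s_user_ns = &ns };
	struct inode inode = { .i_sb = &sb, .i_uid = { 1000 } };
	kuid_t fsuid = { 1000 }; /* stand-in for current_fsuid() */
	vfsuid_t vfsuid;

	sb_derefs = 0;
	vfsuid = proposed ? i_uid_into_vfsuid_new(&nop_mnt_idmap, &inode)
			  : i_uid_into_vfsuid_old(&nop_mnt_idmap, &inode);
	assert(vfsuid.val == fsuid.val); /* the owner check still succeeds */
	return sb_derefs;
}
```

In this model owner_check_cost(false) counts one superblock load and
owner_check_cost(true) counts none, which is the saving described
above. It says nothing about the actual D$ behaviour in the kernel.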