On Fri, Aug 22, 2025 at 12:51:29PM +0200, Christian Brauner wrote: > On Thu, Aug 21, 2025 at 04:18:11PM -0400, Josef Bacik wrote: > > Hello, > > > > This series is the first part of a larger body of work geared towards solving a > > variety of scalability issues in the VFS. > > > > We have historically had a variety of foot-guns related to inode freeing. We > > have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the > > different stages of being reclaimed. This lead to confusion, and bugs in cases > > where one was checked but the other wasn't. Additionally, it's frankly > > confusing to have both of these flags and to deal with them in practice. > > Agreed. > > > However, this exists because we have an odd behavior with inodes, we allow them > > to have a 0 reference count and still be usable. This again is a pretty unfun > > footgun, because generally speaking we want reference counts to be meaningful. > > Agreed. > > > The problem with the way we reference inodes is the final iput(). The majority > > of file systems do their final truncate of a unlinked inode in their > > ->evict_inode() callback, which happens when the inode is actually being > > evicted. This can be a long process for large inodes, and thus isn't safe to > > happen in a variety of contexts. Btrfs, for example, has an entire delayed iput > > infrastructure to make sure that we do not do the final iput() in a dangerous > > context. We cannot expand the use of this reference count to all the places the > > inode is used, because there are cases where we would need to iput() in an IRQ > > context (end folio writeback) or other unsafe context, which is not allowed. > > > > To that end, resolve this by introducing a new i_obj_count reference count. This > > will be used to control when we can actually free the inode. We then can use > > this reference count in all the places where we may reference the inode. This > > removes another huge footgun, having ways to access the inode itself without > > having an actual reference to it. The writeback code is one of the main places > > where we see this. Inodes end up on all sorts of lists here without a proper > > reference count. This allows us to protect the inode from being freed by giving > > this an other code mechanisms to protect their access to the inode. > > > > With this we can separate the concept of the inode being usable, and the inode > > being freed. The next part of the patch series is to stop allowing for inodes > > to have an i_count of 0 and still be viable. This comes with some warts. The > > biggest wart is now if we choose to cache inodes in the LRU list we have to > > remove the inode from the LRU list if we access it once it's on the LRU list. > > This will result in more contention on the lru list lock, but in practice we > > rarely have inodes that do not have a dentry, and if we do that inode is not > > long for this world. > > > > With not allowing inodes to hit a refcount of 0, we can take advantage of that > > common pattern of using refcount_inc_not_zero() in all of the lockless places > > where we do inode lookup in cache. From there we can change all the users who > > check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they > > aren't allowed to do their work, othrwise they can proceed as normal. > > > > With all of that in place we can finally remove these two flags. > > > > This is a large series, but it is mostly mechanical. I've kept the patches very > > small, to make it easy to review and logic about each change. I have run this > > through fstests for btrfs and ext4, xfs is currently going. I wanted to get this > > out for review to make sure this big design changes are reasonable to everybody. > > > > The series is based on vfs/vfs.all branch, which is based on 6.9-rc1. Thanks, > > I so hope you meant 6.17-rc1 because otherwise I did something very very > wrong. :) Stupid AI hallucination... Josef