On Tue, Aug 26, 2025 at 11:39:00AM -0400, Josef Bacik wrote: Hi Josef, I read through the entire patchset and I think I got the hang of it. Bottom line is I disagree with the core idea of the patchset and majority of the justification raised in the cover letter. :) I'll be very to the point, trying to be as clear as possible and consequently lacking in soft-speak. Based on your name I presume you are also of Slavic descent, hopefully making it fine ;-) I don't have a vote per se so this is not really a NAK. Instead I'm making a case to you and VFS maintaienrs to not include this. ACHTUNG: this is *really* long and I probably forgot to mention something. Frankly the patchset seems to be a way to help btrfs by providing a new refcount (but not in a generic-friendly manner) while taking issue with refcount 0 having a "the inode is good to go if need be" meaning. I provide detailed reasoning below. It warrants noting there is a lot of plain crap in the VFS layer. Between the wtf flags, bad docs for them, poor assert coverage, open-coded & repeated access to stuff (including internal state), I have to say someone(tm) needs to take a hammer to it. However, as far as I can tell, modulo the quality of how things are expressed in the code (so to speak), the crux of what the layer is doing in terms of inode management follows idiomatic behavior I would expect to see, I just needs to be done better. While there are perfectly legitimate reasons to introduce a "hold" reference counter, I pose the patchset at hand does not justify its introduction. If anything I will argue it would be a regression to do it the way it is proposed here, even if some variant of the new counter will find a use case. > This series is the first part of a larger body of work geared towards solving a > variety of scalability issues in the VFS. > Elsewhere in the thread it is mentioned that there is a plan to remove the inode LRU and replace the inode hash with xarray after these changes. I don't understand how this patchset paves the way for either of those things. If anything, per notes from other people, it would probably be best if the inode LRU got removed first and this patchset got rebased on it (if it is to land at all). For the inode hash the real difficulty is not really in terms of implementing something, but evaluating available options. Even if the statically-allocated hash should go (it probably should), the hashing function is not doing a good job (read: the hash is artificially underperforming) and merely replacing it with something else might not give an accurate picture whether the new pick for the data structure is genuinely the right choice (due to skewed comparison as the hash is gimped, both in terms of hashing func and global locking). The minor technical problem which is there in the stock kernel and which remains unaddressed by your patchset is the need to take ->i_lock. Some of later commentary in this cover letter claims this is sorted out, but that's only true if someone already has a ref (as in the lock is only optionally ommitted). In particular, if one was to implement fine-grained locking for the hash with bitlocks, I'm told the resulting ordering of bitlock -> spinlock would be problematic on RT kernels as the former type is a hack which literally only spins and does not support any form of preemption. The ordering can be swapped around to spinlock -> bitlock thanks to RCU (e.g., for deletion from the hash you would find the inode using RCU traversal, lock it, lock the chain and only then delete etc.). Since your patchset keeps the lock in place, the kernel is in the same boat in both cases (also if the new thing only uses spinlocks). As far as I know the other non-fs specific bottlenecks for inode handling are the super block list and dentry LRU, neither of which benefit from the patchset either. So again I don't see how scalability work is facilitated by this patchset. > We have historically had a variety of foot-guns related to inode freeing. We > have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the > different stages of being reclaimed. This lead to confusion, and bugs in cases > where one was checked but the other wasn't. Additionally, it's frankly > confusing to have both of these flags and to deal with them in practice. > Per my opening remark I agree this situation is very poorly handled in the current code. If my grep is right the only real consumer of I_WILL_FREE is ocfs2. In your patchset your just remove the usage. Given that other filesystems manage without it, I suspect the real solution is to change its ->drop_inode to generic_delete_inode() and handle the write in ->evict_inode. The doc for the flag is most unhelpful, documenting how the flag is used but not explaining what for. If I understood things correctly the flag is only there to prevent ->i_count acquire by other threads while the spin lock is dropped during inode write out. Whether your ocfs patch lands or this bit gets reworked as described above, the flag is gone and we are only left with I_FREEING. Hiding this behind a proper accessor (letting you know what's up with the inode) should cover your concern (again see bottom of the e-mail for a longer explanation). > However, this exists because we have an odd behavior with inodes, we allow them > to have a 0 reference count and still be usable. This again is a pretty unfun > footgun, because generally speaking we want reference counts to be meaningful. > This is not an odd behavior. This in fact the idiomatic handling of objects which remain cached if there are no active users. I don't know about the entirety of the Linux kernel, but dentries are also handled the same way. I come from the BSD land but I had also seen my share of Solaris and I can tell you all of these also follow this core idea in places I looked. If anything deviating from this should raise eyebrows. I can however agree that the current magic flags + refcount do make for a buggy combination, but that's not an inherent property of using this method. > The problem with the way we reference inodes is the final iput(). The majority > of file systems do their final truncate of a unlinked inode in their > ->evict_inode() callback, which happens when the inode is actually being > evicted. This can be a long process for large inodes, and thus isn't safe to > happen in a variety of contexts. Btrfs, for example, has an entire delayed iput > infrastructure to make sure that we do not do the final iput() in a dangerous > context. We cannot expand the use of this reference count to all the places the > inode is used, because there are cases where we would need to iput() in an IRQ > context (end folio writeback) or other unsafe context, which is not allowed. > I don't believe ->i_obj_count is needed to facilitate this. Suppose iput() needs to become callable from any context, just like fput(). What it can do is atomically drop the ref it is not the last one or punt all of it to task_work/a dedicated task queue. Basically same thing as fput(), except the ref is expected to be dropped by the code doing deferred processing if ->i_count == 1. Note that with your patchset iput() still takes spinlocks, which prevents it from being callable from IRQs at least. But suppose ->i_obj_count makes sense to add. Below I explain why I disagree with the way it is done. > To that end, resolve this by introducing a new i_obj_count reference count. This > will be used to control when we can actually free the inode. We then can use > this reference count in all the places where we may reference the inode. This > removes another huge footgun, having ways to access the inode itself without > having an actual reference to it. The writeback code is one of the main places > where we see this. Inodes end up on all sorts of lists here without a proper > reference count. This allows us to protect the inode from being freed by giving > this an other code mechanisms to protect their access to the inode. > I read through writeback vs iput() handling and it is very oddly written, indeed looking fishy. I don't know the history here, given the state of the code I 300% believe there were bugs in terms of lifetime management/racing against iput(). But the crux of what the code is doing is perfectly sane and in fact what I would expect to happen unless there is a good reason not to. The crucial point here is setting up the inode for teardown (and thus preventing new refs from showing up) and stalling it as long as there are pending consumers. That way they can still safely access everything they need. For this work the code needs a proper handshake (if you will), which *is* arranged with locking -- writeback (or other code with similar needs) either wins against teardown and does the write or loses and pretends the inode is not there (or fails to see it). If writeback wins, teardown waits. This only needs readable helpers to not pose a problem, which is not hard to implement. Note your patchset does not remove the need to do this, it merely possibly simplifies clean up after (but see below). This brings me to the problem with how ->i_obj_count is proposed. In this patchset it merely gates the actual free of the inode, allowing all other teardown to progress. Suppose one was to use ->i_obj_count in writeback to guarantee inode liveness -- worst case iobj_put() from writeback ends up freeing the inode. As mentioned above, the first side of the problem is still there with your patchset: you still need to synchronize against writeback starting to work on the inode. But let's assume the other side -- just the freeing -- is now sorted out with the count. The problem with it is the writeback code historically was able to access the entire of the inode. With teardown progressing in parallel this is no longer true an what is no longer accessible depends entirely on timing. If there are "bad" accesses, you are going to find the hard way. In order to feel safe here one would need to audit the entire of writeback code to make sure it does not do anything wrong here and probably do quite a bit of fuzzing with KMSAN et al. Furthermore, imagine some time in the future one would need to add something which needs to remain valid for the duration of writeback in progress. Then you are back to the current state vs waiting on writeback or you need to move more things around after i_obj_count drops to 0. Or you can make sure iput() can safely wait for a wakeup from writeback and not worry about a thorough audit of all inode accessess nor any future work adding more. This is the current approach. General note is that a hold count merely gating the actual free invites misuse where consumers race against teardown thinking something is still accessible and only crapping out when they get unlucky. The ->i_obj_count refs/puts around hash and super block list manipulation only serve as overhead. Suppose they are not there. With the rest of your proposal it is an invariant that i_obj_count is at least 1 when iput() is being called. Meaning whatever refs are present or not on super block or the hash literally play no role. In fact, if they are there, it is an invariant they are not the last refs to drop. Even in the btrfs case you are just trying to defer actual free of the inode, which is not necessarily all that safe in the long run given the remarks above. But suppose for whatever reason you really want to punt ->evict_inodes() processing. My suggestion would be the following: The hooks for ->evict_inodes() can start returning -EAGAIN. Then if you conclude you can't do the work in context you got called from, evict() can defer you elsewhere and then you get called from a spot where you CAN do it, after which the rest of evict() is progressing. Something like: the_rest_of_evict() { if (S_ISCHR(inode->i_mode) && inode->i_cdev) cd_forget(inode); remove_inode_hash(inode); .... } /* runs from task_work, some task queue or whatever applicable */ evict_deferred() { ret = op->evict_inode(inode); BUG_ON(ret == -EAGAIN); the_rest_of_evict(inode); } evict() { .... if (op->evict_inode) { ret = op->evict_inode(inode); if (ret == -EAGAIN) { evict_defer(inode); return; } } else { truncate_inode_pages_final(&inode->i_data); clear_inode(inode); } the_rest_of_evict(inode); } Optionally ->evict_inodes() func can get gain an argument denoting who is doing the call (evict() or evict_deferred()). > With this we can separate the concept of the inode being usable, and the inode > being freed. [snip] > With not allowing inodes to hit a refcount of 0, we can take advantage of that > common pattern of using refcount_inc_not_zero() in all of the lockless places > where we do inode lookup in cache. From there we can change all the users who > check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they > aren't allowed to do their work, othrwise they can proceed as normal. But this is already doable, just avoidably open-coded. In your patchset this is open-coded with icount_read() == 0, which is also leaking state it should not. You could hide this behind can_you_grab_a_ref(). On the current kernel the new helper would check the count + flags instead. Your consumers which no longer openly do it in this patchset would look the same. So here is an outline of what I suggest. First I'm going to talk about sorting out ->i_state and then about inode transition tracking. Accesses to ->i_state are open-coded everywhere, some places use READ_ONCE/WRITE_ONCE while others use plain loads/stores. None of this validates whether ->i_lock is held and for cases where the caller is fine with unstable flags, there is no way to validate this is what they are signing up for (for example maybe the place assumes ->i_lock is in fact held?). As an absolute minimum this should hide behind 3 accessors: 1. istate_store, asserting the lock is held. WRITE_ONCE 2. istate_load, asserting the lock is held. READ_ONCE or plain load 3. istate_load_unlocked, no asserts. the consumer explicitly spells out they understand the value can change from under them. another READ_ONCE to prevent the compiler from fucking with reloads. Maybe hide the field behind a struct so that spelled out i_state access fails to compile (similarly to how atomics are handled). Suppose the I_WILL_FREE flag got sorted out. Then the kernel is left with I_NEW, I_CLEAR, I_FREEING and maybe something extra. I think this is much more manageable but still primitive. An equivalent can be done with enums in a way which imo is much more handy. Then various spots all over the VFS layer can validate they got a state which can be legally observed for their usage. Note mere refcount being 0 or not does not provide that granularity as a collection of flags or an enum. For illustrative purposes, suppose: DEAD -- either hanging out after rcu freed or never used to begin with UNDER_CONSTRUCTION -- handed out by the allocator, still being created. invalid (equivalent to I_NEW?) CONSTRUCTED -- all done (equivalent to no flags?) DESTROYING -- equivalent to I_FREEING? With this in place it is handy to validate that for example you are transitionting from CONSTRUCTED to DESTROYING, but not from CONSTRUCTED to DEAD. You can also assert no UNDER_CONSTRUCTION inode escaped into the wild (this would happen in various vfs primitives, e.g., prior to taking the inode rwsem) This is all equivalent to the flag manipulation, except imo clearer. Suppose the flags are to stay. They can definitely hide behind helpers, there is no good reason for anyone outside of fs.h or inode.c to know about their meaning. I claim the enums *can* escape as they can be easily reasoned about. So... I don't offer to do any of this, I hope I made a convincing case against the patchset at least. Cheers.