Re: [PATCH v2 00/54] fs: rework inode reference counting

Mateusz Guzik <mjguzik@xxxxxxxxx> · Tue, 2 Sep 2025 12:06:01 +0200

On Tue, Aug 26, 2025 at 11:39:00AM -0400, Josef Bacik wrote:

Hi Josef,

I read through the entire patchset and I think I got the hang of it.

Bottom line is I disagree with the core idea of the patchset and
majority of the justification raised in the cover letter. :)

I'll be very to the point, trying to be as clear as possible and
consequently lacking in soft-speak. Based on your name I presume you are
also of Slavic descent, hopefully making it fine ;-)

I don't have a vote per se so this is not really a NAK. Instead I'm
making a case to you and VFS maintaienrs to not include this.

ACHTUNG: this is *really* long and I probably forgot to mention
something.

Frankly the patchset seems to be a way to help btrfs by providing a new
refcount (but not in a generic-friendly manner) while taking issue with
refcount 0 having a "the inode is good to go if need be" meaning. I
provide detailed reasoning below.

It warrants noting there is a lot of plain crap in the VFS layer.
Between the wtf flags, bad docs for them, poor assert coverage,
open-coded & repeated access to stuff (including internal state), I have
to say someone(tm) needs to take a hammer to it.

However, as far as I can tell, modulo the quality of how things are
expressed in the code (so to speak), the crux of what the layer is doing
in terms of inode management follows idiomatic behavior I would expect
to see, I just needs to be done better.

While there are perfectly legitimate reasons to introduce a "hold"
reference counter, I pose the patchset at hand does not justify its
introduction. If anything I will argue it would be a regression to do it
the way it is proposed here, even if some variant of the new counter
will find a use case.

> This series is the first part of a larger body of work geared towards solving a
> variety of scalability issues in the VFS.
> 

Elsewhere in the thread it is mentioned that there is a plan to remove
the inode LRU and replace the inode hash with xarray after these changes.

I don't understand how this patchset paves the way for either of those
things.

If anything, per notes from other people, it would probably be best if
the inode LRU got removed first and this patchset got rebased on it (if
it is to land at all).

For the inode hash the real difficulty is not really in terms of
implementing something, but evaluating available options. Even if the
statically-allocated hash should go (it probably should), the hashing
function is not doing a good job (read: the hash is artificially
underperforming) and merely replacing it with something else might not
give an accurate picture whether the new pick for the data structure is
genuinely the right choice (due to skewed comparison as the hash is
gimped, both in terms of hashing func and global locking).

The minor technical problem which is there in the stock kernel and which
remains unaddressed by your patchset is the need to take ->i_lock. Some
of later commentary in this cover letter claims this is sorted out,
but that's only true if someone already has a ref (as in the lock is
only optionally ommitted).

In particular, if one was to implement fine-grained locking for the hash
with bitlocks, I'm told the resulting ordering of bitlock -> spinlock
would be problematic on RT kernels as the former type is a hack which
literally only spins and does not support any form of preemption. The
ordering can be swapped around to spinlock -> bitlock thanks to RCU
(e.g., for deletion from the hash you would find the inode using RCU
traversal, lock it, lock the chain and only then delete etc.).

Since your patchset keeps the lock in place, the kernel is in the same
boat in both cases (also if the new thing only uses spinlocks).

As far as I know the other non-fs specific bottlenecks for inode
handling are the super block list and dentry LRU, neither of which
benefit from the patchset either.

So again I don't see how scalability work is facilitated by this patchset.

> We have historically had a variety of foot-guns related to inode freeing.  We
> have I_WILL_FREE and I_FREEING flags that indicated when the inode was in the
> different stages of being reclaimed.  This lead to confusion, and bugs in cases
> where one was checked but the other wasn't.  Additionally, it's frankly
> confusing to have both of these flags and to deal with them in practice.
> 

Per my opening remark I agree this situation is very poorly handled in
the current code.

If my grep is right the only real consumer of I_WILL_FREE is ocfs2. In
your patchset your just remove the usage. Given that other filesystems
manage without it, I suspect the real solution is to change its
->drop_inode to generic_delete_inode() and handle the write in
->evict_inode.

The doc for the flag is most unhelpful, documenting how the flag is used
but not explaining what for.

If I understood things correctly the flag is only there to prevent
->i_count acquire by other threads while the spin lock is dropped during
inode write out.

Whether your ocfs patch lands or this bit gets reworked as described
above, the flag is gone and we are only left with I_FREEING.

Hiding this behind a proper accessor (letting you know what's up with
the inode) should cover your concern (again see bottom of the e-mail for
a longer explanation).

> However, this exists because we have an odd behavior with inodes, we allow them
> to have a 0 reference count and still be usable. This again is a pretty unfun
> footgun, because generally speaking we want reference counts to be meaningful.
> 

This is not an odd behavior. This in fact the idiomatic handling of
objects which remain cached if there are no active users. I don't know
about the entirety of the Linux kernel, but dentries are also handled
the same way.

I come from the BSD land but I had also seen my share of Solaris and I
can tell you all of these also follow this core idea in places I looked.

If anything deviating from this should raise eyebrows.

I can however agree that the current magic flags + refcount do make for
a buggy combination, but that's not an inherent property of using this
method.

> The problem with the way we reference inodes is the final iput(). The majority
> of file systems do their final truncate of a unlinked inode in their
> ->evict_inode() callback, which happens when the inode is actually being
> evicted. This can be a long process for large inodes, and thus isn't safe to
> happen in a variety of contexts. Btrfs, for example, has an entire delayed iput
> infrastructure to make sure that we do not do the final iput() in a dangerous
> context. We cannot expand the use of this reference count to all the places the
> inode is used, because there are cases where we would need to iput() in an IRQ
> context  (end folio writeback) or other unsafe context, which is not allowed.
> 

I don't believe ->i_obj_count is needed to facilitate this.

Suppose iput() needs to become callable from any context, just like
fput().

What it can do is atomically drop the ref it is not the last one or punt
all of it to task_work/a dedicated task queue.

Basically same thing as fput(), except the ref is expected to be dropped
by the code doing deferred processing if ->i_count == 1.

Note that with your patchset iput() still takes spinlocks, which
prevents it from being callable from IRQs at least.

But suppose ->i_obj_count makes sense to add. Below I explain why I
disagree with the way it is done.

> To that end, resolve this by introducing a new i_obj_count reference count. This
> will be used to control when we can actually free the inode. We then can use
> this reference count in all the places where we may reference the inode. This
> removes another huge footgun, having ways to access the inode itself without
> having an actual reference to it. The writeback code is one of the main places
> where we see this. Inodes end up on all sorts of lists here without a proper
> reference count. This allows us to protect the inode from being freed by giving
> this an other code mechanisms to protect their access to the inode.
> 

I read through writeback vs iput() handling and it is very oddly
written, indeed looking fishy.  I don't know the history here, given the
state of the code I 300% believe there were bugs in terms of lifetime
management/racing against iput().

But the crux of what the code is doing is perfectly sane and in fact
what I would expect to happen unless there is a good reason not to.

The crucial point here is setting up the inode for teardown (and thus
preventing new refs from showing up) and stalling it as long as there
are pending consumers. That way they can still safely access everything
they need.

For this work the code needs a proper handshake (if you will), which
*is* arranged with locking -- writeback (or other code with similar
needs) either wins against teardown and does the write or loses and
pretends the inode is not there (or fails to see it). If writeback wins,
teardown waits. This only needs readable helpers to not pose a problem,
which is not hard to implement.

Note your patchset does not remove the need to do this, it merely
possibly simplifies clean up after (but see below).

This brings me to the problem with how ->i_obj_count is proposed. In
this patchset it merely gates the actual free of the inode, allowing all
other teardown to progress.

Suppose one was to use ->i_obj_count in writeback to guarantee inode
liveness -- worst case iobj_put() from writeback ends up freeing the
inode.

As mentioned above, the first side of the problem is still there with
your patchset: you still need to synchronize against writeback starting
to work on the inode.

But let's assume the other side -- just the freeing -- is now sorted out
with the count.

The problem with it is the writeback code historically was able to
access the entire of the inode. With teardown progressing in parallel
this is no longer true an what is no longer accessible depends entirely
on timing. If there are "bad" accesses, you are going to find the hard
way.

In order to feel safe here one would need to audit the entire of
writeback code to make sure it does not do anything wrong here and
probably do quite a bit of fuzzing with KMSAN et al.

Furthermore, imagine some time in the future one would need to add
something which needs to remain valid for the duration of writeback in
progress. Then you are back to the current state vs waiting on writeback
or you need to move more things around after i_obj_count drops to 0.

Or you can make sure iput() can safely wait for a wakeup from writeback
and not worry about a thorough audit of all inode accessess nor any
future work adding more. This is the current approach.

General note is that a hold count merely gating the actual free invites
misuse where consumers race against teardown thinking something is still
accessible and only crapping out when they get unlucky.

The ->i_obj_count refs/puts around hash and super block list
manipulation only serve as overhead. Suppose they are not there. With
the rest of your proposal it is an invariant that i_obj_count is at
least 1 when iput() is being called. Meaning whatever refs are present
or not on super block or the hash literally play no role. In fact, if
they are there, it is an invariant they are not the last refs to drop.

Even in the btrfs case you are just trying to defer actual free of the
inode, which is not necessarily all that safe in the long run given the
remarks above.

But suppose for whatever reason you really want to punt ->evict_inodes()
processing.

My suggestion would be the following:

The hooks for ->evict_inodes() can start returning -EAGAIN. Then if you
conclude you can't do the work in context you got called from, evict()
can defer you elsewhere and then you get called from a spot where you
CAN do it, after which the rest of evict() is progressing.

Something like:

the_rest_of_evict() {
        if (S_ISCHR(inode->i_mode) && inode->i_cdev)
                cd_forget(inode);

        remove_inode_hash(inode);
	....
}

/* runs from task_work, some task queue or whatever applicable */
evict_deferred() {
	ret = op->evict_inode(inode);
	BUG_ON(ret == -EAGAIN);
	the_rest_of_evict(inode);
}

evict() {
	....
        if (op->evict_inode) {
                ret = op->evict_inode(inode);
		if (ret == -EAGAIN) {
			evict_defer(inode);
			return;
		}
        } else {
                truncate_inode_pages_final(&inode->i_data);
                clear_inode(inode);
        }

	the_rest_of_evict(inode);
}

Optionally ->evict_inodes() func can get gain an argument denoting who
is doing the call (evict() or evict_deferred()).

> With this we can separate the concept of the inode being usable, and the inode
> being freed. 
[snip]
> With not allowing inodes to hit a refcount of 0, we can take advantage of that
> common pattern of using refcount_inc_not_zero() in all of the lockless places
> where we do inode lookup in cache.  From there we can change all the users who
> check I_WILL_FREE or I_FREEING to simply check the i_count. If it is 0 then they
> aren't allowed to do their work, othrwise they can proceed as normal.

But this is already doable, just avoidably open-coded.

In your patchset this is open-coded with icount_read() == 0, which is
also leaking state it should not.

You could hide this behind can_you_grab_a_ref().

On the current kernel the new helper would check the count + flags
instead.

Your consumers which no longer openly do it in this patchset would look
the same.

So here is an outline of what I suggest. First I'm going to talk about
sorting out ->i_state and then about inode transition tracking.

Accesses to ->i_state are open-coded everywhere, some places use
READ_ONCE/WRITE_ONCE while others use plain loads/stores. None of this
validates whether ->i_lock is held and for cases where the caller is
fine with unstable flags, there is no way to validate this is what they
are signing up for (for example maybe the place assumes ->i_lock is in
fact held?).

As an absolute minimum this should hide behind 3 accessors:

1. istate_store, asserting the lock is held. WRITE_ONCE
2. istate_load, asserting the lock is held. READ_ONCE or plain load
3. istate_load_unlocked, no asserts. the consumer explicitly spells out
they understand the value can change from under them. another READ_ONCE
to prevent the compiler from fucking with reloads.

Maybe hide the field behind a struct so that spelled out i_state access
fails to compile (similarly to how atomics are handled).

Suppose the I_WILL_FREE flag got sorted out.

Then the kernel is left with I_NEW, I_CLEAR, I_FREEING and maybe
something extra.

I think this is much more manageable but still primitive.

An equivalent can be done with enums in a way which imo is much more
handy.

Then various spots all over the VFS layer can validate they got a state
which can be legally observed for their usage. Note mere refcount being
0 or not does not provide that granularity as a collection of flags or
an enum.

For illustrative purposes, suppose:
DEAD -- either hanging out after rcu freed or never used to begin with
UNDER_CONSTRUCTION -- handed out by the allocator, still being created.
invalid (equivalent to I_NEW?)
CONSTRUCTED -- all done (equivalent to no flags?)
DESTROYING -- equivalent to I_FREEING?

With this in place it is handy to validate that for example you are
transitionting from CONSTRUCTED to DESTROYING, but not from CONSTRUCTED
to DEAD.

You can also assert no UNDER_CONSTRUCTION inode escaped into the wild
(this would happen in various vfs primitives, e.g., prior to taking the
inode rwsem)

This is all equivalent to the flag manipulation, except imo clearer.

Suppose the flags are to stay. They can definitely hide behind helpers,
there is no good reason for anyone outside of fs.h or inode.c to know
about their meaning.

I claim the enums *can* escape as they can be easily reasoned about.

So... I don't offer to do any of this, I hope I made a convincing case
against the patchset at least.

Cheers.