Re: [PATCH RFC 5/6] fs: introduce a shutdown_bdev super block operation

On 2025/6/24 19:45, Christian Brauner wrote:
On Tue, Jun 24, 2025 at 07:21:50PM +0930, Qu Wenruo wrote:


On 2025/6/24 18:43, Christian Brauner wrote:
[...]
It's not hard for btrfs to provide it; we already have a check function
btrfs_check_rw_degradable() to do that.

Although I'd say, that will be something way down the road.

Yes, for sure. I think long-term we should hoist at least the bare
infrastructure for multi-device filesystem management into the VFS.

Just want to mention that, "multi-device filesystem" already includes fses
with external journal.

Yes, that's what I meant below by "We've already done a bit of that".
It's now possible to actually reach all devices associated with a
filesystem from the block layer. It works for xfs and ext4 filesystems
with external journals. So for example, you can freeze the log device and
the main device, as the block layer is now able to find both, and the fs
stays frozen until both have been unfrozen. This wasn't possible before
the rework we did.
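
For anyone following along, the shape of that interface, from my reading
of fs/super.c after the rework, is roughly the following (a sketch, not a
verbatim copy; treat details as approximate):

```c
/* Sketch: every block device a filesystem opens uses the super_block
 * as holder and registers holder ops, so a freeze request against,
 * say, the external journal device resolves to the same super_block
 * as one against the main device. */
static const struct blk_holder_ops fs_holder_ops = {
	.mark_dead	= fs_bdev_mark_dead,
	.sync		= fs_bdev_sync,
	.freeze		= fs_bdev_freeze,
	.thaw		= fs_bdev_thaw,
};

/* On open, the super_block is passed as the holder: */
bdev_file = bdev_file_open_by_dev(dev, mode, sb /* holder */,
				  &fs_holder_ops);
```

That holder registration is what lets the block layer find every device
of a multi-device filesystem and keep the freeze state consistent across
all of them.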

Now follows a tiny rant not targeted at you specifically but something
that still bugs me in general:

We had worked on extending this to btrfs so that it's all integrated
properly with the block layer. And we heard long promises of how you
would make that switch happen, but refused to let us make that switch.
So now it's 2 years later and nothing has happened in that area.

I believe we (the btrfs community) did not intentionally reject this; bad luck and a lack of review bandwidth were involved, as at that time we were focused on migrating to the new fsconfig mount interface.

That delayed review of the original patchset from Christoph; then the old series no longer applied due to the fsconfig change, until Johannes revived the series for the first time.

Then there were even more (minor) conflicts with the recent ro/rw mount fixes, and although Johannes tried his best to refresh the series, those conflicts eventually resulted in test failures.


And I wasn't even following all those updates, until one day I was eventually freed from btrfs large folio support and had time to attack the long-standing generic/730 failure caused by the lack of shutdown support.

Then I was dragged down the rabbit hole, and finally we're here.


Also I have to admit that I, at least, do not have much experience in the block/VFS field, and sometimes we still assume the existing infrastructure mostly targets single-ish block device filesystems, but that's not true anymore.

We're improving this, and have received quite a lot of help from Christoph; e.g. he contributed the btrfs bio layer to do all the bio splitting/chaining inside btrfs.

I hope this remove_bdev() callback can be a good starting point to bring the btrfs and block communities closer together.
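
As a rough sketch of what the btrfs side of such a callback might look
like (btrfs_handle_dev_gone() is a hypothetical helper I made up for
illustration; btrfs_check_rw_degradable() is the existing check):

```c
/* Rough sketch only: how btrfs might wire up the proposed
 * ->remove_bdev() super block operation. */
static void btrfs_remove_bdev(struct super_block *sb,
			      struct block_device *bdev)
{
	struct btrfs_fs_info *fs_info = btrfs_sb(sb);

	/* Mark the matching btrfs_device missing and stop issuing
	 * new I/O to it (hypothetical helper). */
	btrfs_handle_dev_gone(fs_info, bdev);

	/* If the remaining devices cannot sustain rw operation,
	 * force the filesystem into an error state. */
	if (!btrfs_check_rw_degradable(fs_info, NULL))
		btrfs_handle_fs_error(fs_info, -EIO,
				      "device lost, not rw degradable");
}
```

The point is just that the degradable-or-shutdown decision already has a
natural home once the block layer delivers the notification.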


That also means block device freezing on btrfs is broken. If you freeze
a block device used by btrfs via the dm (even though unlikely) layer you
freeze the block device without btrfs being informed about that.

Yes, you're totally right, and I also believe that may be the reason for btrfs corruption after hibernation/suspend.


It also means that block device removal is likely a bit yanky because
btrfs won't be notified when any device other than the main device is
suddenly yanked. You probably have logic there but the block layer can
easily inform the filesystem about such an event nowadays and let it
take appropriate action.

Yep, btrfs doesn't handle removal of devices at runtime at all, but still tries to do I/O on the removed device, only saved by the extra mirrors.

Meaning that unless a user is monitoring dmesg, they won't notice the problem, which is a huge degradation of availability happening silently.
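
For reference, the notification path the block layer already provides for
exactly this case, from my reading of fs/super.c (heavily abbreviated and
approximate, not verbatim):

```c
/* Sketch: when a device is yanked, the block layer invokes the
 * holder's ->mark_dead(), which for filesystems ends up calling the
 * super block's ->shutdown() operation. */
static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
{
	struct super_block *sb = bdev->bd_holder;

	/* ... locking and holder-validity checks elided ... */
	if (!surprise)
		sync_filesystem(sb);	/* orderly removal: flush first */
	if (sb->s_op->shutdown)
		sb->s_op->shutdown(sb);
}
```

So once btrfs registers its devices with these holder ops, it would be
told about the removal instead of discovering it via failed I/O.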


And fwiw, you also don't restrict writing to mounted block devices.
That's another thing you blocked us from implementing even though we
sent the changes for that already and so we disabled that in
ead622674df5 ("btrfs: Do not restrict writes to btrfs devices"). So
you're also still vulnerable to that stuff.

Oh, that's something new to me; let me explore it after all the remove_bdev() callback work.

Thanks,
Qu



Thus the new callback may be a good chance for those mature fses to explore
some corner-case availability improvements, e.g. the loss of the external
journal device while there is no live journal on it.

Already handled cleanly for xfs and ext4 since our rework, iiuc.

(I have to admit it's super niche, and live migration to an internal
journal may be way more complex than my uneducated guess.)

Thanks,
Qu

Or we should at least explore whether that's feasible and if it's
overall advantageous to maintenance and standardization. We've already
done a bit of that and imho it's now a lot easier to reason about the
basics already.


We don't even have a proper way to let end users configure the device-loss
behavior.
E.g. some end users may prefer a full shutdown, to be extra cautious,
rather than continuing degraded.

Right.
