Re: [GIT PULL] bcachefs fixes for 6.16-rc3

Thanks for the report.

I'd like to add that bcachefs is the work of a _lot_ of people investing
a ton of time QAing this thing. There are a lot of users like Jérôme
that I've worked
with for extended periods tracking down all sorts of crazy stuff, and I
thank them for their patience and all their help.

It's been a true community effort.

On Sat, Jun 21, 2025 at 05:07:51PM -0400, Jérôme Poulin wrote:
> The filesystem is very resilient at being rebooted anywhere, anytime.
> It went through many random resets during any of: fsck repairs, fsck
> rebuilding the btree from scratch, upgrades, in the middle of snapshot
> operations, while replaying the journal.  It just always recovers at
> places I wouldn't expect to be able to hit the power switch. Worst
> case, it mounted read-only and needed fsck but could always be mounted
> read-only.

That's the dream :)

I don't think the filesystem should ever fail in a way that leads to
data loss, and I think this is a more than achievable goal.

> Where things get a bit more touchy is when combining all those
> features together; operations tend to be a bit "racy" between each
> other and tend to lock up when there are multiple features running or
> being used in parallel.  I think this is where we get to the "move
> fast, break things" part of the filesystem.  The foundation is solid:
> read, write, inode creation/deletion, bucket management, all basic
> POSIX operations, checksums, scrub, device addition. Many of the
> bcachefs-specific operations are stable; being able to set compression,
> replication level, and data target per folder is awesome stuff and
> works well.

It's not "move fast and break things", we haven't had a problem with
regressions that I've seen.

It's just a project with massive scope, and it takes awhile to find all
the corner cases and make sure there's no pathalogical behaviour in any
scenario.

> From my experience, what is less polished are: snapshots and snapshot
> operations, reflink, nocow, and multiprocess-heavy workloads; those
> seem to be where the "experimental" part of the filesystem goes into
> the spotlight.

This mostly fits with what I've been seeing; the exception being that I
haven't seen any major issues with reflink in ages (you mentioned a
reflink corruption earlier - are you sure that was reflink?).

And rebalance (background data movement) has taken a while to polish,
and we're still not done - I think as of 6.16 all the outright bugs I
know of are fixed, but there's still behaviour that's less than ideal
(charitably) - if you ask it to move more data to a target than fits,
it'll spin (no longer wasting IO, though). That one needs some real work
to fix properly - another auxiliary index of "pending" extents: extents
that rebalance would like to move but can't until something changes.
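
To make the idea concrete, here's a toy userspace sketch of what a
"pending" list would do - park extents that can't currently move instead
of rescanning them on every pass. This is not bcachefs code; every name
in it is made up purely for illustration.

/* Toy model of the "pending" extents idea: extents that rebalance wants
 * to move but can't yet get parked on a separate list, to be revisited
 * only when something changes (e.g. free space on the target), instead
 * of being rescanned on every pass.  Not bcachefs code. */
#include <stdio.h>

struct extent { unsigned long long sectors; };

int main(void)
{
	struct extent work[] = { { 128 }, { 4096 }, { 64 } };
	struct extent pending[3];
	unsigned nr_pending = 0;
	unsigned long long free_sectors = 200;	/* room left on the target */

	for (unsigned i = 0; i < 3; i++) {
		if (work[i].sectors <= free_sectors) {
			free_sectors -= work[i].sectors;	/* "moved" */
		} else {
			/* Park it rather than spinning on it forever. */
			pending[nr_pending++] = work[i];
		}
	}

	printf("%u extent(s) parked as pending\n", nr_pending);
	return 0;
}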

Re: multiprocess workloads, those livelock-ish behaviours have been the
most problematic to track down - but we've made some recent progress on
understanding where they're coming from, and the new btree iterator
tracepoints should help.

The new error_throw tracepoint is also already proving useful for
tracking down wonky behaviour (just not the one you're talking about).
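
For anyone who wants to capture these, here's a minimal sketch of
enabling the bcachefs tracepoint group through tracefs and streaming the
output. It assumes tracefs is mounted at /sys/kernel/tracing and that
the kernel exposes a "bcachefs" event group, and it needs root; it's an
illustration, not a supported tool.

/* Minimal sketch: enable all bcachefs tracepoints via tracefs and
 * stream events as they arrive.  Assumes tracefs is mounted at
 * /sys/kernel/tracing and a "bcachefs" event group exists; run as root. */
#include <stdio.h>

static int write_str(const char *path, const char *s)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(s, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[4096];
	FILE *pipe;

	if (write_str("/sys/kernel/tracing/events/bcachefs/enable", "1")) {
		perror("enable bcachefs events");
		return 1;
	}

	pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!pipe) {
		perror("trace_pipe");
		return 1;
	}

	/* Print events as they arrive; Ctrl-C to stop. */
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);

	fclose(pipe);
	return 0;
}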

> I've been running rotating snapshots on many machines; it
> works well until it doesn't and I need to reboot or fsck. Reflink
> before 6.14 seemed a bit hacky and could result in errors. Nocow tends
> to lock up but isn't really useful with bcachefs anyway. Maybe
> casefolding, which might not be fully tested yet. Those are the true
> experimental features and aren't really labelled as such.

Casefolding still has a strange rename bug. Some of the recent
self-healing work was partly to make it easier to track down - we will
now notice that something went wrong and pop a fsck error on the first
'ls' of an improperly renamed dirent.

> We can always say "yes, this is fixed in master, this is fixed in
> 6.XX-rc4" but it is still experimental and tends to be what causes the
> most pain right now.  I think this needs to be communicated more
> clearly. If the filesystem goes off experimental, I think a subset of
> features should be gated by filesystem options to reduce the need for
> big and urgent rc patches.

Yeah, this is coming up more as the userbase grows.

For the moment, doing more backports is infeasible due to sheer volume,
but I expect that to change soon - 6.17 is when I expect to start.

> The problem is...  when the experimental label is removed, it needs to
> be very clear that users aren't expected to be running the latest rc
> and master branch.  All the features marked as stable should have
> settled enough that there won't be 6 users requiring a developer to
> mount their filesystem read-write or recover files from a catastrophic
> race condition.

Correct. Stable backports will start happening _before_ the experimental
label is lifted.

> This is where communication needs to be clear: the bcachefs website,
> tools, and options should all clearly label features that might require
> someone to ask for a developer's help or to run the latest release
> candidate or a debug version of the kernel.

Everything just needs to be solid before the experimental label is
lifted. I don't want users to have to check a website to know what's
safe to use.

