Re: [GIT PULL] bcachefs fixes for 6.16-rc3

Jérôme Poulin <jeromepoulin@xxxxxxxxx> · Sat, 21 Jun 2025 17:07:51 -0400

As a bcachefs user who has been following this discussion, I'd like to
share my perspective on the current state of the filesystem and the
path forward.

I'm currently using this filesystem for a backup staging server, so it
is easy for me to make sure data isn't getting lost, and I can verify
checksums at the application level from time to time. The solution
uses snapshots extensively, as well as replication, reflinks and
background compression.

I really like this filesystem for multiple reasons: it fills the gap
left by features missing from traditional filesystems, it allows
integrating cache devices almost seamlessly, and it allows having
metadata on local devices while still pushing to slow HDD, SMR or
network devices without having to set up something like Ceph or a
stack like Btrfs, mdadm and nbd/iSCSI.  I've seen all the features
appear one by one in Bcachefs and it is growing fast.

I migrated from Btrfs after an incident with a RAID controller losing
its cache that caused the filesystem to be unmountable and
unrepairable.  Btrfs restore was able to recover *most* of the files
on that server, except a couple of subvolumes which had to be
recreated by the backup system.  And again, since this is a staging
area for backups, I don't need 100% uptime or a guarantee that my
files won't be lost, so I felt pretty confident in using Bcachefs to
speed up operations there.

Bcachefs was able to triple the speed of the backup system by having
metadata stored on NVMe and passively caching all writes to NVMe.  The
last part of the backup is now blazing fast since everything is on
NVMe.

At this point in time, I do believe Bcachefs has solid foundations.
As of now, the only data corruption that lost me some files was
related to a snapshot deletion bug in a feature that was not yet
published to mainline.

It hasn't been without its downsides.  Many times I had to take the
filesystem offline for repair, and Kent was always able to figure out
the root cause of the issue keeping the FS from mounting read-write
and issue a patch for the FS and for fsck.  We found many weird bugs
together: ARM-specific bugs, reflink causing corruption, resize not
allocating buckets, many races and lockups, upgrades not finishing
correctly, corruption from weird interactions, data not staying cached
when there's no promote_target.  All of this was fixed without much
more damage than the last operations being lost, and most were fixed
really quickly from cat'ing a couple of diagnostic files, using perf
or, in the worst case, a metadata image.

The filesystem is very resilient to being rebooted anywhere, anytime.
It went through many random resets during fsck repairs, fsck
rebuilding the btree from scratch, upgrades, snapshot operations and
journal replay.  It just always recovers, even at points where I
wouldn't expect to be able to hit the power switch.  Worst case, it
needed fsck before going read-write again, but it could always be
mounted read-only.

It also went through losing 6 devices and the write-back cache (that
defective controller, again).  Fsck could repair it with minimal loss
of recent data.  There were a lot of scary messages in fsck, but it
finished and I could run scrub+rereplicate to finish it off (which
fixed a couple more files).

Where things get a bit more touchy is when combining all those
features together: operations tend to be a bit "racy" with each other
and tend to lock up when multiple features are running or being used
in parallel.  I think this is where we get to the "move fast, break
things" part of the filesystem.  The foundation is solid: read, write,
inode creation/deletion, bucket management, all basic POSIX
operations, checksums, scrub, device addition.  Many of the
bcachefs-specific operations are stable; being able to set
compression, replication level and data target per folder is awesome
stuff and works well.